Archive e-mails system-independently with PDF/A

pdfaPilot e-mail archiving

Unlike an Office document or any other file generated by application software, an e-mail consists of several components: a header, a body and, optionally, attachments.

  • The header of an e-mail is metadata. It is essentially the counterpart to a letterhead and contains at least the sender's address and the date the e-mail was created. The header can also contain further optional information such as the subject or the recipient. To assess an e-mail and the reliability of its header information correctly, it is important to know that the actual routing is independent of the header data and takes place via the Simple Mail Transfer Protocol (SMTP).
  • The body, i.e. the actual mail content, is displayed differently depending on the user-defined settings in the e-mail software. An e-mail can use multiple, parallel content areas: plain text (ASCII) without umlauts, simply formatted text (such as bold or italics) with support for country-specific encodings (umlauts), or comprehensive HTML formatting with embedded images and the like. However, there is no guarantee that these areas carry equivalent content; it is easy to place different texts in the parallel content areas. Often, for example, the text part contains only a note that an HTML-capable e-mail client is required for display.
  • The third, optional part consists of attachments. These can be documents or images, possibly combined in a ZIP file, or executable programs or scripts.
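
The three components described above can be made visible with Python's standard `email` package. The following is a minimal sketch, not part of pdfaPilot: the sample addresses, subject and file name are invented for illustration, and the helper names `build_sample` and `summarize` are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build an illustrative e-mail with header, two parallel bodies and one attachment."""
    msg = EmailMessage()
    # Header: metadata such as sender, recipient, subject and date.
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Quarterly report"
    msg["Date"] = "Mon, 04 Mar 2024 10:15:00 +0100"
    # Body: two parallel content areas whose texts deliberately differ.
    msg.set_content("Please use an HTML-capable e-mail client.")
    msg.add_alternative("<p>The report is <b>attached</b>.</p>", subtype="html")
    # Attachment: placeholder bytes standing in for a real document.
    msg.add_attachment(b"placeholder content", maintype="application",
                       subtype="pdf", filename="report.pdf")
    return msg.as_bytes()


def summarize(raw: bytes) -> dict:
    """Split a raw e-mail into header fields, body variants and attachment names."""
    msg = message_from_bytes(raw, policy=policy.default)
    headers = {name: str(msg[name])
               for name in ("From", "To", "Subject", "Date")
               if msg[name] is not None}
    bodies = {}
    attachments = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no content of their own
        if part.is_attachment():
            attachments.append(part.get_filename())
        elif part.get_content_maintype() == "text":
            bodies[part.get_content_type()] = part.get_content().strip()
    return {"headers": headers, "bodies": bodies, "attachments": attachments}


if __name__ == "__main__":
    print(summarize(build_sample()))
```

In this sample the plain-text and HTML bodies carry different texts, which is exactly the situation in which an archive has to decide which representation to preserve.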

Against this background, the question of which file format e-mails should be kept in leads to the realization that no original format exists for e-mails at all. A sent e-mail will always differ from the received one in its header information. In addition, the sender's e-mail client will usually save the message in a different format than the recipient's. The servers responsible for the actual transmission also store e-mails in their own formats. All these formats are essentially proprietary and not standardized, so their long-term readability is not guaranteed. The situation becomes thoroughly confusing once attachments are included. The e-mail applications and attachment viewers would have to be kept available for as long as access to the electronic messages is required, which is a critical and time-consuming task. To remove this dependency, it is advisable to archive all e-mails and, where possible, all attachments system-independently in PDF/A, the reliable standard format that has long been established for general archive material. Exceptions should only be made for formats that cannot be converted to the ISO standard, such as audio or video files.

The undisputed advantages of PDF/A for archiving apply equally to e-mails. The format is self-contained: all fonts, for example, are embedded, as is the relevant metadata. In addition, PDF/A ensures an unambiguous, system-independent appearance and prohibits dynamic content. Finally, and this is the most important argument, the ISO standard is designed for long-term archiving and thus ensures that files stored in PDF/A format remain reproducible and readable for decades. The availability of a PDF/A viewer can be regarded as assured for the periods relevant to archiving.

So that a specific e-mail can be searched for, the header information is read out during conversion to PDF/A and stored in the XMP metadata area of the PDF/A file. The e-mail body is converted to PDF/A and saved as the main PDF/A file. For attachments, the flexibility offered by PDF/A's three available standard parts can be exploited, all of which are suitable for e-mail archiving: With PDF/A-1, which does not allow attachments, any attachments are added to the e-mail as additional pages. PDF/A-2 allows PDF/A files to be embedded, so the attachments can be integrated as PDF/A files. The third and newest standard part, PDF/A-3, goes one step further: since any kind of attachment is permitted here, attachments can be integrated into the archivable file both as PDF/A and in their original format. Optionally, the e-mail itself can also be embedded in its original format; it could then even be opened in the e-mail client and answered, for example. It is important that the relationship between archive documents, e-mails and attachments is retained within the PDF/A file and is therefore independent of the archive system. The complete system independence achieved in this way is a very important advantage, which can, for example, make complex migration projects unnecessary when changing systems.
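
The choice between the three standard parts can be written down as a small decision function. This is only a sketch of the rules stated above, not pdfaPilot's logic: the names `PdfaPart` and `plan_attachment` are hypothetical, and the handling of non-convertible attachments under PDF/A-1 and PDF/A-2 (keeping them outside the PDF/A file) is an assumption.

```python
from __future__ import annotations

from enum import Enum


class PdfaPart(Enum):
    """The three PDF/A standard parts discussed in the text."""
    PDFA_1 = 1
    PDFA_2 = 2
    PDFA_3 = 3


def plan_attachment(part: PdfaPart, convertible: bool) -> list[str]:
    """Return the handling steps for one attachment under the given PDF/A part."""
    if not convertible:
        # Formats such as audio or video cannot be converted to PDF/A.
        # Only PDF/A-3 may carry arbitrary embedded files; for the other
        # parts this sketch ASSUMES the file is archived separately.
        if part is PdfaPart.PDFA_3:
            return ["embed original"]
        return ["keep outside the PDF/A file"]
    if part is PdfaPart.PDFA_1:
        # No embedded files allowed: attachments become additional pages.
        return ["convert to PDF/A", "append as pages"]
    if part is PdfaPart.PDFA_2:
        # Embedded files are allowed, but they must themselves be PDF/A.
        return ["convert to PDF/A", "embed PDF/A file"]
    # PDF/A-3: both the PDF/A rendition and the original can be embedded.
    return ["convert to PDF/A", "embed PDF/A file", "embed original"]


if __name__ == "__main__":
    for part in PdfaPart:
        print(part.name, plan_attachment(part, convertible=True))
```

Expressing the rules this way makes the trade-off explicit: the later the standard part, the more of the original material can travel inside the single archived file.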

callas pdfaPilot is powerful software that automatically converts all documents and e-mails into the standard part appropriate for the respective requirements. It is based on the same PDF/A technology that Adobe relies on in Acrobat Professional. During conversion, pdfaPilot ensures that the look and feel of documents is preserved, as are stored links. Depending on the processing result, callas pdfaPilot then stores the corresponding files in a dedicated folder so that the rest of the workflow can be automated further. If necessary, individual reports are created that list the problematic files and inform the user. In addition to PDF/A validation and conversion, specific rules can be defined for how certain document types in e-mail attachments, e.g. audio or film files, should be handled, which metadata should be added to the documents, and much more. Depending on the processing volume, callas software provides pdfaPilot in different variants: The desktop application processes individual documents or, via the batch function, all documents stored in a folder. The pdfaPilot server processes documents automatically via hot folders. The software can also be integrated into existing applications via the command line (CLI) or as a library (C/C++, C# or Java).
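
The folder-based routing described above can be illustrated generically. The sketch below does not call pdfaPilot and does not reproduce its interface: the function `process_folder`, the folder names `success` and `problem`, and the `convert` callback (which stands in for whatever conversion tool is used and returns `True` on success) are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_folder(inbox: Path, out_root: Path,
                   convert: Callable[[Path], bool]) -> List[str]:
    """Run `convert` on every file in `inbox` and sort files by result.

    Files are moved to out_root/success or out_root/problem; the names of
    the problematic files are returned as a simple report.
    """
    ok_dir = out_root / "success"
    problem_dir = out_root / "problem"
    ok_dir.mkdir(parents=True, exist_ok=True)
    problem_dir.mkdir(parents=True, exist_ok=True)

    report: List[str] = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue
        try:
            ok = convert(path)
        except Exception:
            # A crashing converter counts as a failed conversion.
            ok = False
        target = ok_dir if ok else problem_dir
        shutil.move(str(path), str(target / path.name))
        if not ok:
            report.append(path.name)
    return report
```

A downstream step only has to watch the two result folders, which is what allows the rest of the workflow to be automated independently of the conversion itself.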