Making PDF/A conversion easier

PDF/A has been the ISO standard for long-term archiving since 2005. It is designed so that documents can be reproduced reliably for years to come, regardless of any technological, hardware or software innovations that may arise. It also enables homogeneous archives of both born-digital and scanned documents.

How do you tell the PDF/A variants apart?

Today, there are four variants of PDF/A, namely -1, -2, -3 and -4. Among these, PDF/A-3 stands out because, while the PDF file is still subject to design limitations (as with the other variants), it is also possible to embed any other file format into it. PDF/A-4 also permits this, but only in a special level of compliance known as PDF/A-4f. There is no doubt that PDF/A-3 and PDF/A-4f should only be used in very specific contexts, since a diverse range of archive formats is to be avoided. But more on this later. For now, we will concentrate on the other variants.

PDF/A-1 vs. PDF/A-2 and PDF/A-4

The PDF/A-1, PDF/A-2 and PDF/A-4 sub-standards have some features in common: in general, they limit which features can be used in a PDF file. This means, for example, that no external content, JavaScript, encryption or videos are permitted. Fonts must be embedded and colors must be defined independently of any device (using ICC profiles, for instance). But how do these variants differ, and which should you use when? The key factor in this decision is the base PDF standard used in the variant. PDF/A-1 is based on PDF 1.4 (2001), PDF/A-2 on PDF 1.7 (2006), and PDF/A-4 on PDF 2.0 (2017). Each of these base standards introduced new features when it was launched, and a PDF/A variant cannot permit more than its base standard offers. Layers and transparent objects, for example, cannot be used in PDF/A-1. When converting to PDF/A-1, these kinds of objects therefore need to be modified (flattened), and information is permanently lost as a result.
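Since the base PDF version drives the decision, it helps to know which version a source file declares. The following is a minimal sketch, not a full parser: it only reads the `%PDF-x.y` header at the start of the file. The function name `header_version` is chosen for illustration, and the sketch ignores the fact that a `/Version` entry in the document catalog can override the header.

```python
import re

# Base PDF version for each variant, as listed in the paragraph above.
PDFA_BASE_VERSION = {
    "PDF/A-1": (1, 4),
    "PDF/A-2": (1, 7),
    "PDF/A-4": (2, 0),
}

# The header normally sits at the very start of the file, e.g. "%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent.

    Assumption: only the first 1024 bytes are inspected, and a /Version
    entry in the document catalog (which may override the header) is ignored.
    """
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A caller would typically pass the first kilobyte of the file, for example `header_version(open(path, "rb").read(1024))`.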

Another consideration, one that can have even more serious consequences, is that each new version of the base PDF standard also expanded the range of permitted internal values within a PDF file. A PDF 1.4 file had to be processable at a reasonable speed on the kind of hardware in use in 2001, and the specification therefore placed limits on internal structures. Hardware performance has increased enormously since then, which allows much wider ranges for such structures in newer PDF files. However, when converting a contemporary PDF file to PDF/A-1, you need to stay within the values specified for PDF 1.4. In rare cases, that means making changes to the PDF’s low-level internal structure. Such low-level changes are sometimes only possible by replacing all content on a page with an image. This leads, of course, to a loss of information: for example, text stops being text and becomes merely an image of text.
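To make the idea of internal limits concrete, here is a rough sketch of a limit check. The four values are implementation limits as listed in the PDF 1.4 reference (Appendix C); they are quoted here for illustration and should be verified against the specification before being relied on. The `measured` dictionary is a hypothetical input: a real converter would have to walk the file's object tree to collect these numbers.

```python
# A small subset of the PDF 1.4 implementation limits (assumption: verify
# the values against the specification before using them in production).
PDF14_LIMITS = {
    "name_bytes": 127,        # maximum length of a name object
    "string_bytes": 65535,    # maximum length of a string object
    "array_elements": 8191,   # maximum number of elements in an array
    "dict_entries": 4095,     # maximum number of entries in a dictionary
}


def exceeded_limits(measured):
    """Return the sorted keys whose measured maximum exceeds the PDF 1.4 limit.

    `measured` maps a limit name to the largest value found in a file.
    Keys that are not in PDF14_LIMITS are ignored.
    """
    return sorted(
        key
        for key, value in measured.items()
        if key in PDF14_LIMITS and value > PDF14_LIMITS[key]
    )
```

A file for which this returns a non-empty list is one where a PDF/A-1 conversion would have to alter low-level structures, whereas a variant with a newer base standard would likely not.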

A good converter will only take such drastic measures in very, very rare cases. However, the question is: why does this have to happen at all? There is no chance that future hardware will drop back to a level of performance comparable with the year 2001.

So the easy answer to the question “Which archive format is the best?” is to make sure that the base PDF standard for the PDF/A variant is not older than the version of the archived PDF file. In most cases these days, this means PDF/A-2.
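The rule of thumb above can be sketched in a few lines. This is an illustration only: the function name `oldest_suitable_variant` is made up for this example, and it returns the oldest variant that satisfies the rule, whereas in practice you may simply prefer PDF/A-2 for all new files.

```python
# Variants ordered from oldest to newest base standard, as given in the text.
PDFA_VARIANTS = [
    ("PDF/A-1", (1, 4)),
    ("PDF/A-2", (1, 7)),
    ("PDF/A-4", (2, 0)),
]


def oldest_suitable_variant(source_version):
    """Return the oldest PDF/A variant whose base standard is not older
    than the source file's PDF version, given as a (major, minor) tuple."""
    for variant, base_version in PDFA_VARIANTS:
        # Tuples compare element by element, so (1, 7) >= (1, 5) is True.
        if base_version >= source_version:
            return variant
    raise ValueError(f"No PDF/A variant covers PDF version {source_version}")
```

For a PDF 1.6 file this yields PDF/A-2, which matches the recommendation for most of today's files.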

It is worth noting at this point that PDF/A-1 remains a valid sub-standard. It has not been replaced by PDF/A-2, nor will PDF/A-4 ever replace PDF/A-2. PDF/A-1 files can remain in archives forever; they don’t need to be converted to a more recent standard. However, it is strongly advisable to archive new files in PDF/A-2.

It is not yet possible to give a strong recommendation for PDF/A-4, as its base format, PDF 2.0, is still rare.

Regardless of when PDF/A-4 becomes the dominant variant, though, we don’t consider it a good goal to have an archive made up exclusively of PDF/A-1, PDF/A-2 or PDF/A-4 files. These variants build upon one another, so the better strategy is to adjust the variant as needed, choosing the PDF/A variant whose base standard corresponds to the newest PDF version among the files to be archived.

Use cases for PDF/A-3 and PDF/A-4f

Before we go, one final word on the special cases we mentioned earlier: PDF/A-3 and PDF/A-4f. From an archiving perspective, it is essential to limit the variety of file formats used, so these variants require a framework of additional rules. There are nevertheless compelling use cases for them, and they all have one thing in common: a specific, defined relationship between the actual archive file and the files embedded within it. One example is embedded source files, such as a spreadsheet saved alongside its PDF copy. Another is digital invoices, where a machine-readable dataset is embedded in a human-readable PDF/A-3 file (ZUGFeRD invoices, for example). Email archiving is another classic use case for PDF/A-3 and PDF/A-4f: the original email can be embedded in EML or MSG format, along with its attachments.
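One way to enforce such a "defined relationship" in an archiving workflow is to check each embedded file's declared relationship before accepting the document. The sketch below assumes the relationship names used for associated files in PDF/A-3 (`Source`, `Data`, `Alternative`, `Supplement`, `Unspecified`). The `Attachment` class, the file names and the policy of rejecting `Unspecified` are assumptions made for this example, not part of any standard or product.

```python
from dataclasses import dataclass
from typing import Optional

# Relationship names for associated files in PDF/A-3.
KNOWN_RELATIONSHIPS = {"Source", "Data", "Alternative", "Supplement", "Unspecified"}


@dataclass
class Attachment:
    """Hypothetical description of one file embedded in a PDF/A-3 or -4f."""
    filename: str
    relationship: Optional[str]  # None means no relationship was declared


def undefined_attachments(attachments):
    """Return the names of attachments whose relationship is not defined.

    Assumed archive policy: a missing, unknown or "Unspecified"
    relationship counts as not defined.
    """
    return [
        a.filename
        for a in attachments
        if a.relationship not in KNOWN_RELATIONSHIPS
        or a.relationship == "Unspecified"
    ]


# Usage: a spreadsheet declared as the source of the PDF passes the check,
# while an attachment without any declared relationship is flagged.
files = [
    Attachment("report.xlsx", "Source"),
    Attachment("notes.txt", None),
]
print(undefined_attachments(files))
```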

Conclusion

In short, we can say that:

  • Existing PDF/A-1 files don’t need to be converted to a newer standard.
  • It is a good idea to convert new files to PDF/A-2.
  • PDF/A-4 is worth keeping an eye on.
  • PDF/A-3 and PDF/A-4f should only be used in contexts where the nature of the embedded files is defined.