Publish smart – The Internet Standards Series

Luckily, the days when people sought attention by debating whether HTML or PDF is the better format are over (or should I say, hopefully over?). Such debates are about as useful as asking whether a phone call is better than an email, or a truck better than a racing car. In today’s digital world, with its rich communication capabilities, the question of which channel or format to use has to be asked for each individual type of publication, and it is not always easy to answer.

The RFC Series is the home of internet standards, related best practices and informational documentation developed by the responsible organizations: the Internet Engineering Task Force (IETF), the Internet Research Task Force (IRTF), the Internet Architecture Board (IAB) and independent submission streams. The documents are published by the RFC Editor, which is no longer a person but an organization. The RFC Series had its 50th anniversary in 2019, which was itself marked with an RFC, https://www.rfc-editor.org/info/rfc8700, that also gives a nice overview of how the whole system has evolved.

When you follow the link above, you will notice that you can view or download the RFC in four formats: HTML, text, PDF and XML, and not just in HTML, which would be the most obvious choice for this kind of document. As you might suspect, we would not have mentioned this if it did not involve PDF. We are far from saying that PDF is the better format, which would be silly anyway since we are using web technologies to publish this article. But PDF has certain qualities that web technologies don’t have.

For most of the past 50 years, RFCs were plain texts limited to ASCII character codes, originally rather informal documents published as Requests for Comments. We all know that ASCII text is not the most powerful format: it restricts the character set and rules out umlauts and other diacritical characters, which makes it difficult, for example, to write a specification about how to encode umlauts. Graphics are only possible as ASCII art, which also limits how they can be positioned, not to mention pagination. So the organization that publishes the documents, the RFC Editor, had been looking for a better solution for almost as long as it has existed.
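The character limitation is easy to see in practice. The following is a minimal Python sketch, not anything taken from the RFC Editor's tooling; the helper name `fits_in_ascii` is made up for this illustration. It simply tries to encode a string as ASCII and reports whether that works.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character of `text` can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character lies outside the 128 ASCII code points.
        return False
    return True


print(fits_in_ascii("Umlaut"))     # True: plain Latin letters only
print(fits_in_ascii("Übergröße"))  # False: Ü, ö and ß are not ASCII
```

A specification that needs to show such characters literally therefore cannot be written in pure ASCII text.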

The natural choice is HTML, and it is no surprise that it is one of the formats in use now. But the organization acknowledged that HTML also has downsides: it can’t easily be downloaded, it has no (working) pagination concept, vector graphics are not natively supported (they require SVG), HTML documents are difficult to manage when it comes to versions and updates and, although HTML has some structure, it can’t be compared with XML in this regard.

The RFC Editor came up with what I believe is a very smart way to publish technical specifications. First of all, new documents can be downloaded in a variety of formats, HTML, TXT, XML and PDF, each with its specific advantages. Side note: graphics are still ASCII art in the PDF, since they are understandably created only once for all formats (see https://www.rfc-editor.org/rfc/rfc8728.pdf).
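For readers who want to fetch all variants of a document, here is a small Python sketch that builds the four download links. It assumes that the other formats follow the same URL pattern as the PDF link above, with only the file extension changing; the function name `rfc_format_urls` is hypothetical, and older RFCs may not be available in every format.

```python
# Assumption: all formats live next to each other and differ only in the
# file extension, as the PDF link in the article suggests.
BASE_URL = "https://www.rfc-editor.org/rfc"
FORMATS = ("html", "txt", "pdf", "xml")


def rfc_format_urls(number: int) -> dict[str, str]:
    """Map each format extension to the assumed download URL of an RFC."""
    return {ext: f"{BASE_URL}/rfc{number}.{ext}" for ext in FORMATS}


for ext, url in rfc_format_urls(8728).items():
    print(f"{ext:>4}: {url}")
```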

Figure: An RFC offered in four file formats

What makes them really stand out from a PDF point of view is that they do not use plain PDF but PDF/A-3u. What does that mean?

PDF/A stands for reliability. Conformance level U ensures that all text has a Unicode representation, which guarantees searchability and text extraction. And part 3 of the standard allows arbitrary file formats to be embedded. PDF/A-3 documents often pair the visual content with structured information, and that is the case here as well: each RFC in PDF/A-3 comes with an embedded XML structure, so that interested parties can easily extract structured content and, for example, put it into their own data repositories.
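To give an idea of what "extracting structured content" can look like once the XML has been pulled out of the PDF, here is a minimal Python sketch using only the standard library. Getting the attachment out of the PDF is not shown, since it needs a PDF library. The sample XML is a made-up, heavily simplified fragment loosely modeled on the RFC XML vocabulary, and the function name `summarize` is an assumption for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the XML embedded in an RFC PDF.
SAMPLE_XML = """\
<rfc>
  <front>
    <title>Example Document</title>
  </front>
  <middle>
    <section><name>Introduction</name></section>
    <section><name>Security Considerations</name></section>
  </middle>
</rfc>
"""


def summarize(xml_text: str) -> dict:
    """Collect the title and top-level section names from the XML."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("front/title"),
        "sections": [s.findtext("name") for s in root.findall("middle/section")],
    }


print(summarize(SAMPLE_XML))
```

A record like this could then be stored in a database or search index without any text scraping from the rendered pages.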

A very smart way to use PDF features for publishing technical specifications, and in fact better than what 'we' do when publishing PDF standards in PDF only (and EPUB), since 'our' ISO procedures do not allow us to do otherwise.