callas Desktop products can 'explore' your PDF documents


Imagine a scenario where you have copy-pasted some text from your PDF to another location and it comes out totally different from what you had copied (like in the screenshot below). Now what? You would want to explore the text inside the PDF, wouldn’t you?

Remember the basic structure of a PDF with Header, Body, xref table and Trailer? The body contains all the object information, such as fonts, images, text, bookmarks, form fields and so on. That means a PDF with text and an embedded font has all related information in the PDF structure. To view this information, callas Desktop products come with a low-level 'Explore PDF' tool. It shows the entire PDF structure in one place and lets you dig into the internal structure of the PDF in different views.
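To make the four parts tangible, here is a minimal Python sketch that assembles a tiny PDF skeleton by hand and splits it back into Header, Body, xref table and Trailer. The function names `build_minimal_pdf` and `split_sections` are hypothetical and have nothing to do with how 'Explore PDF' works internally. The sketch also assumes a classic cross-reference table and a single revision; real files may use cross-reference streams or incremental updates, which this does not handle.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a tiny (page-less) PDF skeleton with correct xref offsets."""
    header = b"%PDF-1.7\n"
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    body = b""
    offsets = []
    for obj in objects:
        offsets.append(len(header) + len(body))  # byte offset of this object
        body += obj
    xref_pos = len(header) + len(body)
    xref = b"xref\n0 %d\n" % (len(objects) + 1)
    xref += b"0000000000 65535 f \n"             # entry for the free object 0
    for off in offsets:
        xref += b"%010d 00000 n \n" % off
    trailer = b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_pos,
    )
    return header + body + xref + trailer


def split_sections(data: bytes) -> dict:
    """Split a single-revision PDF with a classic xref table into its four parts."""
    header_end = data.index(b"\n") + 1
    xref_at = data.rfind(b"\nxref\n") + 1        # does not match 'startxref'
    trailer_at = data.rfind(b"trailer")
    return {
        "header": data[:header_end],
        "body": data[header_end:xref_at],
        "xref": data[xref_at:trailer_at],
        "trailer": data[trailer_at:],
    }


if __name__ == "__main__":
    for name, chunk in split_sections(build_minimal_pdf()).items():
        print(f"--- {name} ({len(chunk)} bytes)")
        print(chunk.decode("latin-1"), end="")
```

Fonts, images and text would all appear as further numbered objects in the body, which is the part a structure explorer spends most of its time in.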

Ideally, any fonts that are used in a layout are also included in the PDF file in their original format, so that the file can be viewed and printed exactly as the designer created it.

When it comes to text and fonts, 'Explore PDF' has the 'Resource View', which summarizes all available information about the embedded font and its glyphs (glyphs are the character outlines that are specific to the font and provided through the font file). A simple worksheet with embedded fonts looks like the screenshot below, which shows several indicators for the embedded font.

If such an indicator is red, the corresponding property applies to the font. This is not necessarily a problem; the indicators simply make the different properties of the font and its glyphs quickly accessible. In our screenshot above, the indicator lookup informs us that 'e' stands for glyphs without contour and 's' for such empty glyphs that have a width; in other words, the respective glyph is a space. The capital 'W' means that the glyph width is used for positioning: the glyph is not positioned using coordinates but using the width of a previous glyph. 'L' stands for ligature, which means that a single glyph (outline) represents two characters. That is not a contradiction to 'space', because the screenshot shows the summary information for all glyphs in the respective font. Further down, the 'Glyph properties' section shows the same information for each individual glyph.
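Width-based positioning can be pictured with a short Python sketch. It assumes the common PDF convention that glyph widths are given in thousandths of a text space unit and ignores character spacing, word spacing and horizontal scaling; the function name `glyph_positions` and the sample widths are made up for illustration and are not taken from the tool.

```python
def glyph_positions(widths, font_size, start_x=0.0):
    """Return the x position of each glyph when every glyph is placed
    directly after the previous one, using only the advance widths.

    widths: glyph widths in 1/1000 text space units (assumed convention)
    font_size: font size in points
    """
    positions = []
    x = start_x
    for width in widths:
        positions.append(x)
        x += width / 1000.0 * font_size  # advance by the scaled glyph width
    return positions


if __name__ == "__main__":
    # Three hypothetical glyphs; the middle one could be an empty glyph
    # with a width, i.e. a space.
    print(glyph_positions([500, 250, 500], font_size=10.0))
```

Note that an empty glyph with a width still moves the following glyph along, which is why a space needs a width even though it has no contour.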

'1' and '2' indicate a potential problem with the Unicode representation of at least one glyph in the font. What does that mean? When you look at text in the PDF, there are actually two different lookups (encodings) taking place: one is for the glyph (outline) and is needed to display the character on the screen; the other is for the meaning (semantics) of the character and is needed to search for text or copy it out of the PDF file. In PDF, we usually say that the text needs to have a Unicode representation, since the Unicode standard defines the semantics for all characters, e.g. it associates the outline 'A' with 'Latin Capital Letter A', which has the Unicode code point U+0041. By the way, there is a Check in pdfToolbox with the name 'Text cannot be mapped to Unicode' that allows you to find out whether there is a Unicode representation for all text in a PDF.
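The two lookups can be modelled in a few lines of Python. This is a toy model under stated assumptions: the character codes, the two dictionaries and the use of U+FFFD as a placeholder for unmapped text are all invented for illustration and do not reflect how any particular viewer or callas product behaves.

```python
import unicodedata

# Hypothetical subset font: character codes as they occur in the page content.
GLYPH_FOR_CODE = {1: "A", 2: "B"}   # lookup 1: code -> outline (named by its shape)
UNICODE_FOR_CODE = {1: "\u0041"}    # lookup 2: code -> Unicode; code 2 is unmapped


def render(codes):
    """What you see on screen: only the glyph lookup is involved."""
    return "".join(GLYPH_FOR_CODE[c] for c in codes)


def extract(codes):
    """What you get when copying or searching: only the Unicode lookup is involved."""
    return "".join(UNICODE_FOR_CODE.get(c, "\ufffd") for c in codes)


def describe(ch):
    """Code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    print(render([1, 2]))    # displays correctly
    print(extract([1, 2]))   # second character cannot be mapped
    print(describe("A"))
```

Because the two lookups are independent, a page can look perfect while its extracted text is incomplete or wrong.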

And now we are back at our initial question: why was the text in the PDF displayed accurately but turned into garbage when copied? The reason is that the glyph lookup worked just fine, but the lookup of the Unicode representation did not. '1' and '2' indicate that there is a mismatch between two ways of resolving to a Unicode code point: via a ToUnicode table in the PDF and via the information present in the font encoding itself. Since the ToUnicode table has priority, this is not necessarily a problem but an indication that there could be one.
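The priority rule and the mismatch check might be sketched as follows. The function `resolve_unicode` and both sample tables are hypothetical; the sketch only captures the idea that ToUnicode wins when present and that a disagreement is worth flagging, not the exact conditions under which the '1' and '2' indicators are raised.

```python
def resolve_unicode(code, to_unicode, font_encoding):
    """Resolve a character code to text and report whether the two sources disagree.

    to_unicode:    mapping taken from the PDF's ToUnicode table (has priority)
    font_encoding: mapping derived from the font's own encoding
    Returns (text or None, mismatch flag).
    """
    via_table = to_unicode.get(code)
    via_font = font_encoding.get(code)
    text = via_table if via_table is not None else via_font
    mismatch = (
        via_table is not None
        and via_font is not None
        and via_table != via_font
    )
    return text, mismatch


if __name__ == "__main__":
    to_unicode = {1: "A", 2: "x"}      # hypothetical ToUnicode entries
    font_encoding = {1: "A", 2: "B", 3: "C"}
    for code in (1, 2, 3, 4):
        print(code, resolve_unicode(code, to_unicode, font_encoding))
```

In this toy data, code 2 is the interesting case: the ToUnicode entry is used, but it disagrees with the font encoding, so one of the two is probably wrong.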

Now that we have successfully explored the font information in the PDF structure, it's time to explore the font itself. Font files can be highly complex and very large, with many glyphs, support for dozens of non-Latin scripts and a rich set of features, or they can be very small, containing just a few icons for a website. Another explorer, the 'Font Explorer', lets you view the internal structure of fonts embedded in a PDF. It is similar to 'Explore PDF' but offers more detail than the preflight results, including a graphical view that shows the outline and coordinates of each glyph. Below, you can see a ligature (f and i), which has the Unicode code point U+FB01 'Latin Small Ligature fi'.
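The Unicode side of that ligature can be checked with Python's standard `unicodedata` module. The helper name `describe_ligature` is made up for this sketch, and the compatibility decomposition (NFKD) shown here is a Unicode feature, not something the 'Font Explorer' is claimed to perform.

```python
import unicodedata


def describe_ligature(ch):
    """Return the code point, the official Unicode name and the
    compatibility decomposition of a single character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch)
    components = unicodedata.normalize("NFKD", ch)  # splits compatibility ligatures
    return code_point, name, components


if __name__ == "__main__":
    print(describe_ligature("\ufb01"))
```

This shows how one glyph outline can stand for two characters: the single code point U+FB01 decomposes into the letters f and i.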
