aXelerate XObject processing in variable data files

This blog is about the output performance of variable data print (VDP) PDFs and what impact form XObjects have. It also introduces a new feature in pdfToolbox 15 that allows you to visualize and manage form XObjects.

Some time ago I wrote a blog on "How to optimally prepare PDF files for variable data printing?". That article was closely related to the PDF Association's publication "Best Practice in Creating Print Files for Variable Data Printing (VDP)", which had been released shortly before it. Some of the investigations that led to this article were inspired by its valuable content. In that blog, I already discussed the relationship between form XObjects and VDP; now I want to dig deeper.

Let us first recap what form XObjects are. In PDF, page objects are defined in ‘content streams’. Content streams reference resource dictionaries for fonts, images, color spaces, etc. A resource dictionary may also contain other content streams that can then be invoked from the main page content stream so that their objects are placed on the page. Form XObjects are such referenced content streams. They are very similar to a page, with a bounding box and (usually) their own resources. Hierarchies of form XObjects are not just possible; they are frequently used.

The term "form XObject" refers to printed forms where the "background" is static (the same in all forms) while the other data is dynamic and changes on each instance of the form in a PDF file. It makes sense to encapsulate the static content in a container that is invoked on each page. That not only saves disk space, it can also speed up output when the static content is processed only once, saved (cached), and reused. That is what happens with many VDP files when they are output.
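To make the caching idea concrete, here is a minimal sketch in Python. It is a toy model and not how any particular RIP works; the class `ToyRip`, its methods, and the form name "Background" are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRip:
    """Toy renderer that caches rendered form XObjects by name (hypothetical)."""
    cache: dict = field(default_factory=dict)
    render_calls: int = 0

    def _render(self, name: str) -> str:
        # Stand-in for the expensive rasterization of the static content.
        self.render_calls += 1
        return f"raster({name})"

    def draw_form(self, name: str) -> str:
        # Render on first use, then reuse the cached result on later pages.
        if name not in self.cache:
            self.cache[name] = self._render(name)
        return self.cache[name]


rip = ToyRip()
for page in range(1000):
    rip.draw_form("Background")  # static content, identical on every page

print(rip.render_calls)
```

In this model the static content is rendered once, no matter how many pages invoke it; only the variable data would have to be processed per page.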

Without form XObjects, variable data print would be limited to very simple page designs that can be processed fast without caching. Layouts using many colored vector objects, images, etc., rely on caching to be output with reasonable performance. Any part of the page description that is to be cached needs to be in a form XObject (resources such as fonts or images are cached as well).

The game gets more complicated since a form XObject may depend on page context, i.e., the graphics state parameters that are in effect when the form XObject is invoked. Examples of such graphics state parameters are line width or color. That means that if a form XObject draws a line and the line width is not specified in the form XObject itself, the width "from outside" will be used. If the same form XObject is invoked while a different line width is in effect, the painted lines will be different (thicker or thinner). That "feature" of a form XObject is not often used in real PDF files. However, a PDF processor needs to be prepared for it, so that a form XObject is only reused if what it paints is independent of context.
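The following sketch illustrates this inheritance with a toy model. The operator names `w` (set line width) and `S` (stroke) are the real PDF content stream operators, but everything else, including the function names and the probing approach, is an assumption made for illustration; a real processor would analyze the content stream rather than try out values.

```python
def stroke_widths(form_ops, inherited_width):
    """Return the line width used by each stroke in a toy form XObject.

    form_ops is a list of (operator, operand) pairs; only 'w' (set line
    width) and 'S' (stroke path) are modeled here.
    """
    width = inherited_width  # graphics state in effect at invocation
    used = []
    for op, operand in form_ops:
        if op == "w":
            width = operand      # the form sets its own line width
        elif op == "S":
            used.append(width)   # the stroke uses whatever width is current
    return used


def is_context_independent(form_ops, probes=(0.5, 1.0, 4.0)):
    """Toy check: does the result change with the inherited line width?"""
    results = {tuple(stroke_widths(form_ops, w)) for w in probes}
    return len(results) == 1


dependent = [("S", None)]                  # strokes with the width "from outside"
independent = [("w", 2.0), ("S", None)]    # sets its own width before stroking

print(is_context_independent(dependent), is_context_independent(independent))
```

Only the second form paints the same result regardless of where it is invoked, so only that one would be a safe candidate for reuse in this model.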

Dependencies on context are not a big issue: if the output RIP recognizes that the context affects what a form XObject paints, it will simply not reuse the cached object. Since that is a rare case anyway, the performance impact is rare as well.

A much more severe issue is transparency: if a form XObject is overlapped by a transparent object, or if it is transparent itself and overlaps other objects, it cannot be reused. Since only form XObjects can be cached, and since other objects impact the appearance of the form XObject, it just does not make sense to put the rendered result into the cache.
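As a rough sketch of that rule, the following toy check decides whether a form XObject could be reused. It is a deliberate simplification: it compares bounding boxes only, ignores painting order, and the `Obj` class and `reusable` function are hypothetical names, not part of any RIP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    bbox: tuple            # (x0, y0, x1, y1)
    transparent: bool = False


def overlaps(a, b):
    """True if two axis-aligned bounding boxes share some area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def reusable(form, page_objects):
    """Toy rule: no reuse if transparency is involved in any overlap."""
    for other in page_objects:
        if other is form or not overlaps(form.bbox, other.bbox):
            continue
        if form.transparent or other.transparent:
            return False
    return True


background = Obj("Background", (0, 0, 100, 100))
text = Obj("Text", (30, 30, 60, 40))
masked = Obj("SoftMasked", (10, 10, 20, 20), transparent=True)

print(reusable(background, [background, text]))
print(reusable(background, [background, text, masked]))
```

A single small transparent object on top of the background is enough to make the whole background non-reusable in this model, which is the situation the example below demonstrates.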

Let us use an example to demonstrate how problematic that can be.

This is a 5-page PDF file. It is derived from the PDF/VT sample suite and has been heavily modified so that effectively only the layout is used. It does not use variable data at all anymore: the 5 pages are the same. We do not need variable data to see the issue. In my test environment, the RIP takes roughly 12.5 seconds to process the 5 pages; no page content is cached, and all pages take the same time: 2.5 seconds. That is way too long for a modern digital printing machine. Obviously, it would not get any better if the pages also contained variable data.

The reason for the slow performance is a soft mask in the file:

[Figure: Test file, with the soft mask indicated by a red rectangle]


The soft mask, indicated by the red rectangle, does not impact the ‘core appearance’. However, as we will see, its impact on performance is substantial. I admit that I have added the soft mask for demonstration purposes. I have done so because I have seen such superfluous soft masks that massively slowed the output process in production PDFs, but I cannot use these files in a public article.

What can you do to get rid of the soft mask? The soft mask is in a form XObject; in order to remove the references to it in the page content streams, you have to modify each page separately. While that is an option in the 5-page example, it is not realistic for a typical VDP file with thousands of pages.

But there is a trick to suppress such form XObjects in output: put them on layers that are (initially) invisible. If you had a tool that put all form XObjects onto layers, that would be a very quick operation that would only modify the layer configuration in the PDF and not the content streams. The new ‘Distribute form XObjects on layers’ feature in pdfToolbox 15 has been designed for this exact purpose.

In short, you open the PDF in pdfToolbox 15 Desktop, go to Switchboard -> Layers -> Enumerate XObjects, and click Execute. The Layer Explorer will open, listing various layers that were not in the original file. The hierarchy of form XObjects has been distributed onto new layers, and each layer name indicates on how many pages it is used and how many child XObjects it has. An additional "Form XObject BBoxes" layer contains red rectangles that indicate where the form XObjects are on the page.
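To illustrate the kind of information shown per layer (on how many pages a form XObject is used and how many child XObjects it has), here is a small sketch on a made-up structure. It is not pdfToolbox code and does not reproduce its layer naming; the `children` and `pages` data and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical hierarchy: form XObject name -> child forms it invokes.
children = {"Frame": ["Logo", "Pattern"], "Logo": [], "Pattern": [], "Mask": []}
# Hypothetical pages: the top-level forms each page invokes.
pages = [["Frame", "Mask"], ["Frame", "Mask"], ["Frame"]]


def reachable(children, root):
    """All forms reachable from root, including root itself."""
    found, stack = set(), [root]
    while stack:
        name = stack.pop()
        if name not in found:
            found.add(name)
            stack.extend(children[name])
    return found


def summarize(children, pages):
    """Map each form to (number of pages using it, number of direct children)."""
    pages_used = Counter()
    for page in pages:
        reached = set()
        for top in page:
            reached |= reachable(children, top)
        pages_used.update(reached)  # count each form at most once per page
    return {name: (pages_used[name], len(children[name])) for name in children}


summary = summarize(children, pages)
for name, (page_count, child_count) in summary.items():
    print(f"{name}: on {page_count} pages, {child_count} child XObjects")
```

A form that appears on every page and has many children is a likely caching candidate; one that appears on every page but blocks caching, like the soft mask here, is the one to look for.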

[Figure: Layerized PDF in pdfToolbox]


If you disable the layer "15-formX on 5 pages" you will see that it contains the soft mask. You can then save this as the (initial) visibility for the PDF by clicking on "Save visibility". An output device (normally) ignores all content on invisible layers. (You should not forget to also switch the “Form XObject BBoxes” layer off.)

In the same environment as before, the resulting file took only about 4.5 seconds in total, and the first page alone took 1 second (already much faster). A 100-page variant took 20.5 seconds in total. Since the first page must still have taken 1 second, each of the remaining 99 pages was processed in only about 0.2 seconds.
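The per-page figure follows directly from the measured numbers. The short calculation below spells it out; note that the 250-second total for an uncached 100-page file is my extrapolation from the 2.5 seconds per page measured for the original file, not a measurement.

```python
first_page = 1.0         # seconds, measured for the first page
total_100 = 20.5         # seconds, measured for the 100-page variant
page_count = 100

# Time left for the pages after the first, spread evenly across them.
per_cached_page = (total_100 - first_page) / (page_count - 1)

# Assumption: without caching every page would take 2.5 s, as in the original file.
uncached_per_page = 2.5
uncached_total = uncached_per_page * page_count
speedup = uncached_total / total_100

print(f"{per_cached_page:.2f} s per cached page")
print(f"estimated speedup: {speedup:.1f}x")
```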

As we have seen, caching is essential for variable data print, and form XObject structures are essential for caching. It is worthwhile to analyze the form XObject structure in VDP files with performance issues to find out whether it can be improved.

But that should only be the starting point: ultimately you would want such optimizations to take place automatically in your PDF workflow, in the same way as many other optimizations that pdfToolbox silently performs for this reason. This is one of those areas where we need to know what doesn’t work in order to make the software substantially better, so please send us any VDP files that cause performance issues (whether or not you found a reason using the new form XObject visualization). You can use our regular support address (support@callassoftware.com), even if you are not a customer, and we will get back to you soon thereafter.
