In today’s complex automated prepress workflows, performance can still be an issue. For us as developers of solutions for automated PDF processing, performance improvement is always an important topic. However, since the PDF format itself is very complex, there usually can’t be a compromise between accuracy and performance. One example: you can very quickly determine the list of color spaces defined on a page. However, if you want to know which color spaces are actually used on the page, you have to fully analyze the content stream that defines its objects. And that gets even more complicated when transparency blend color spaces or DeviceCMYK images that only use one colorant have to be taken into account.
On the other hand, if an automated workflow can quickly decide that a PDF can’t use ICCBased color because its list of color spaces does not contain any such definitions, that may make it possible to avoid starting the more thorough engine. While we do rely on APDFL to rasterize pages, we figured that there may be ways to get results much faster.
And this is where QuickCheck, our easy-to-use PDF information extraction tool, comes in. It has been designed to deliver results in fractions of a second and will in many cases be used as a first step in a more complex workflow. With this example I do not want to say that you need to combine every color space check with a QuickCheck; the point is that you can use it if performance is an issue in a certain step of an automated workflow.
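To make the color space example a bit more tangible, here is a minimal Python sketch of such a first step. It assumes that a fast check has already delivered the color space families declared on each page; the input format, the function names and the route labels are made up for illustration and are not QuickCheck's actual output or API.

```python
from typing import Iterable, Set


def declares_icc_based(declared_per_page: Iterable[Set[str]]) -> bool:
    """Return True if at least one page declares an ICCBased color space.

    `declared_per_page` is a hypothetical input: one set of color space
    family names per page, as found in the page's list of color spaces.
    """
    return any("ICCBased" in spaces for spaces in declared_per_page)


def choose_route(declared_per_page: Iterable[Set[str]]) -> str:
    """Decide whether the thorough (slow) analysis is needed at all.

    A declared color space is not necessarily used on the page, so a
    declaration only means "look closer". A missing declaration means
    the PDF can't use ICCBased color and the slow step can be skipped.
    """
    if declares_icc_based(declared_per_page):
        return "full-analysis"
    return "skip-icc-analysis"
```

The cheap answer is only conclusive in one direction: "not declared" settles the question, while "declared" still requires the content stream to be analyzed.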
QuickCheck is also highly configurable. The point is that you can quickly decide where a PDF should go and how it is processed in more thorough corrections. Currently, QuickCheck needs fractions of a second for the average prepress PDF, and only a couple of seconds to report all the supported aspects for a 1,000-page document like the ISO standard for PDF.
Very early in the project, we worked out the possible areas of interest and candidates for QuickCheck. On the document level: is the file damaged or encrypted? What about document metadata, conformance with PDF standards, the Output Intent, and so on? On the page level, we identified things like page dimensions and number of pages, as well as page content: whether certain color spaces are used (e.g. whether RGB is used), and whether spot colors, fonts, or layers are used.
We were careful to only add aspects to the report that do not negatively impact execution time, so various aspects were kept out of QuickCheck. Examples are the use of transparency (opacity, blend modes, soft masks, and so forth), image resolution (or any other information pertaining to images), information about form fields, and information about markup annotations or other annotations. This does not mean that they will never make it into a future version of pdfToolbox. This blog might be the perfect platform for you to tell us if you want us to add any other retrievable item to QuickCheck.
In the upcoming pdfToolbox 11, we have added the possibility of identifying complex PDFs via QuickCheck by looking at the length of content streams and taking that as an indicator. This addresses the issue of long-running PDFs blocking workflows. This simple and very fast check provides a good indicator of the complexity of a PDF file or page. In time-critical situations, suspects can then be put on hold and processed when there is less pressure.
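How a workflow might act on that indicator can be sketched in a few lines of Python. This is only an illustration under assumptions: the per-page content stream lengths are taken as already reported, and the input structure, the function name and the threshold value are hypothetical, not pdfToolbox settings.

```python
from typing import Dict, List, Tuple


def split_by_content_length(
    stream_lengths: Dict[str, List[int]],
    max_page_bytes: int,
) -> Tuple[List[str], List[str]]:
    """Split files into "process now" and "put on hold".

    `stream_lengths` maps a file name to the content stream length in
    bytes of each of its pages (hypothetical input). A file becomes a
    suspect as soon as one page exceeds `max_page_bytes`.
    """
    process_now: List[str] = []
    on_hold: List[str] = []
    for name, lengths in stream_lengths.items():
        if any(length > max_page_bytes for length in lengths):
            on_hold.append(name)
        else:
            process_now.append(name)
    return process_now, on_hold


# Example: the 2,000,000 byte threshold is an arbitrary value for this sketch.
now, later = split_by_content_length(
    {"flyer.pdf": [12_000, 8_500], "map.pdf": [40_000, 9_300_000]},
    max_page_bytes=2_000_000,
)
```

A sensible threshold depends on the kind of files a workflow typically sees, so it would have to be tuned per installation.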