Berlin, August 25, 2020 – Robotic Process Automation (RPA) is attracting more and more attention in the software industry. A growing number of providers from a diverse range of sectors are introducing applications that can complete tasks previously handled by humans. By adopting these applications, users expect to optimize their processes, avoid errors and eliminate monotonous, repetitive tasks. Dietrich von Seggern, Managing Director at callas software GmbH, explains why the PDF format forms a sound basis for RPA applications.
For RPA applications to work smoothly, processes must both have a standard structure and work with standardized files. This is the only way to maximize the number of files processed by a single automation. In many cases, this means that RPA can only work with data that it has itself generated. But why not use a standard here too? There are many good reasons to use the PDF format whenever such processes need to interact with third-party data. After all, PDF is the lowest common denominator for nearly all document types processed or received in offices. Office files, emails and even images can be easily converted to PDF, giving RPA applications a standardized starting point for processing. Moreover, with the many features it has accumulated over the years, PDF is the most powerful, versatile document format in the world. However, not every PDF meets the prerequisites for automatic processing equally well. The deciding factor is not so much how reliably the document is displayed as the actual data it contains.
Making PDFs RPA-ready
- One of the ‘simplest’ stumbling blocks – in fact, a more or less unavoidable one for document-based RPA – is the password protection that is often applied to documents without thinking. Technical and legal restrictions prevent content from being extracted from password-protected files, which instead just have to be returned to sender.
- PDF files need to meet a few requirements before they can be processed automatically. For example, after a document has been scanned, it is generally necessary to fully index its text using OCR, assigning Unicode characters to the extracted content. Only then can RPA processes evaluate the actual text. Even ‘born digital’ PDFs do not necessarily have full Unicode support. This is where dedicated validation tools (and repair tools, where necessary) come into play. One concrete example is print files exported from an ERP system, used to consolidate outgoing invoices. A PDF software tool searches through the text for keywords or separators, which it then uses to split the single PDF into separate invoices. Naturally, this only works if the software can recognize the keywords – and without OCR, this is only possible if the text has already been ‘translated’ into Unicode.
- By integrating metadata into PDFs, RPA applications can be provided with pointers that show how to process a given file. For instance, it may make sense to extract information before converting the source file, and then add the extracted information to the PDF. Consider the following example: a retail company receives a set of product descriptions from a supplier in PDF format. They can add entries to the metadata, which they then use to classify the files. If their customers need information, these descriptions can then be added to individual product catalogs and used to generate a table of contents.
- Ideally, the PDF files will be ‘tagged’. This means that not just the semantic component of the text is defined in Unicode, but also that headers, paragraphs, image descriptions and tables are described (‘tagged’) in a structured data format. These tags allow the RPA application to tell how to structure text content (particularly in multi-column layouts), extract headers and organize images using their descriptions. Since assigning tags to PDF documents after the fact is a very time-intensive process, AI is generally used for tasks like reading forms correctly at the field level. This makes it even more important to fully index the text of PDF files, as described in the second bullet above.
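The invoice-splitting example from the list above can be illustrated with a minimal Python sketch. It is not callas software's implementation. It assumes that a PDF tool has already extracted the text of each page into a list of strings, and the names `usable_text_ratio` and `split_on_keyword`, as well as the keyword "Invoice No.", are hypothetical. The first function estimates whether the extracted text maps to proper Unicode characters; the second finds the page ranges that would become separate invoices.

```python
import unicodedata


def usable_text_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that look like real text.

    Assumption for this sketch: private-use ('Co'), unassigned ('Cn') and
    control ('Cc') characters, plus the U+FFFD replacement character, hint
    that the text has no proper Unicode mapping and may need OCR or repair.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn", "Cc")
    )
    return 1.0 - bad / len(chars)


def split_on_keyword(page_texts: list[str], keyword: str) -> list[tuple[int, int]]:
    """Return (first_page, last_page) ranges, 0-based and inclusive.

    A new range starts on every page whose text contains the keyword.
    Pages before the first match are not assigned to any range.
    """
    starts = [i for i, text in enumerate(page_texts) if keyword in text]
    ranges = []
    for n, start in enumerate(starts):
        if n + 1 < len(starts):
            end = starts[n + 1] - 1
        else:
            end = len(page_texts) - 1
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # Hypothetical page texts, as a PDF text extractor might return them.
    pages = [
        "Invoice No. 1001\nCustomer A",
        "Page 2 of invoice 1001",
        "Invoice No. 1002\nCustomer B",
        "Invoice No. 1003\nCustomer C",
        "Terms and conditions",
    ]
    if all(usable_text_ratio(p) > 0.9 for p in pages):
        print(split_on_keyword(pages, "Invoice No."))
    else:
        print("Text is not reliably Unicode; run OCR or repair first.")
```

The 0.9 threshold in the usage example is an arbitrary illustration; a production workflow would rely on a dedicated validation tool rather than a character-count heuristic.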
Businesses looking to maximize process automation can and should first establish a framework for seamlessly leveraging RPA-based applications. Part of this is about building a solid foundation, using maximally homogeneous, standardized data. As the common denominator for office files, high-quality PDFs are a good starting point for this foundation.
About callas software
callas software finds simple ways to handle complex PDF challenges. As a technology innovator, callas software develops and markets PDF technology for publishing, print production, document exchange and document archiving. callas software helps agencies, publishing companies and printers to meet the challenges they face by providing software to preflight, correct and repurpose PDF files for print production and electronic publishing. Businesses and government agencies all over the world rely on callas software’s future-proof, fully PDF/A compliant archiving products. In addition, callas software technology is available as a programming library (SDK) for developers with a need for PDF optimization, validation and correction. Software vendors such as Adobe®, Quark®, Xerox® and many others have recognized the quality and flexibility provided by these callas tools and have incorporated them into their solutions.
callas software actively supports international standards and participates in ISO, CIP4, the European Color Initiative, the PDF/A Competence Center and the Ghent PDF Workgroup. callas software is also a founding member of the PDF/A Competence Center; in October 2010, Olaf Drümmer became its chairman.
callas software is based in Berlin, Germany. For more information, visit the callas software website at: www.callassoftware.com.