Sample of the Week:
I’ve written about using the Datalogics PDF Java Toolkit to redact PDF files in previous articles, linked below. Those articles searched for specific words, and they’re worth reviewing if you are unfamiliar with what redaction is or how to create redaction annotations.
Automating PDF Redaction using the Datalogics PDF Java Toolkit
Redaction using the Datalogics PDF Java Toolkit
In this article, we’re going to reproduce one of Acrobat’s features for searching and redacting based on a pattern. In Acrobat, you can use the Search and Redact tool to select the pattern or patterns you want to search for.
What most users don’t know is that this dialog is populated by an XML file stored on the system, a file that can be edited. The patterns in this file are stored as regular expressions (RegEx). We can search and redact PDF files with the Datalogics PDF Java Toolkit using the same regular expressions in exactly the same way.
In the Gist referenced above, we’ll be searching for phone numbers using a pattern. The input file looks something like the image below but without the red marker.
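To make the idea of a phone number pattern concrete, here is a minimal sketch that uses only the standard java.util.regex package. The expression is an assumption written for illustration (US-style numbers such as 555-123-4567 or (555) 123-4567); it is not copied from Acrobat's pattern file or from the Gist, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhonePatternDemo {
    // Assumed, illustrative pattern: an area code either in parentheses or
    // followed by a separator, then 3 digits, a separator, and 4 digits.
    static final Pattern PHONE = Pattern.compile(
            "(?:\\(\\d{3}\\)\\s?|\\b\\d{3}[-.\\s])\\d{3}[-.\\s]\\d{4}\\b");

    // Returns every substring of the text that matches the pattern, in order.
    static List<String> findPhoneNumbers(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Call (555) 123-4567 or 555.987.6543 before 5 pm.";
        for (String hit : findPhoneNumbers(sample)) {
            System.out.println(hit);
        }
    }
}
```

A real pattern would likely need tuning for country codes, extensions, and other local formats.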
Like all of the other automated redaction samples, this one uses the ReadingOrderTextExtractor, which parses the document and returns the text in the order a human reader would read it, hence the name. It does this by retrieving the vertical and horizontal locations of the word bounding boxes and then applying statistical analysis to find common starting points for words and column breaks.
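The sketch below illustrates the general idea of ordering words by their bounding box positions. It is a deliberate simplification and not the toolkit's ReadingOrderTextExtractor: it only groups words into lines by vertical position and sorts each line left to right, with no statistical analysis or column detection. The Word record, the lineTolerance parameter, and the method names are all hypothetical, and the code assumes Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {
    // Hypothetical word model: the text plus the lower-left corner of its
    // bounding box, in PDF user space where y grows upward from the page bottom.
    record Word(String text, double x, double y) {}

    // Orders words top to bottom, then left to right. A word joins the current
    // line when its y differs from the line's first word by at most lineTolerance.
    static List<Word> inReadingOrder(List<Word> words, double lineTolerance) {
        List<Word> byHeight = new ArrayList<>(words);
        byHeight.sort(Comparator.comparingDouble(Word::y).reversed());

        List<Word> ordered = new ArrayList<>();
        List<Word> line = new ArrayList<>();
        for (Word w : byHeight) {
            if (!line.isEmpty() && Math.abs(line.get(0).y() - w.y()) > lineTolerance) {
                flush(line, ordered);
            }
            line.add(w);
        }
        flush(line, ordered);
        return ordered;
    }

    // Sorts the finished line left to right and appends it to the result.
    private static void flush(List<Word> line, List<Word> ordered) {
        line.sort(Comparator.comparingDouble(Word::x));
        ordered.addAll(line);
        line.clear();
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
                new Word("world", 60, 700), new Word("Hello", 10, 700.5),
                new Word("Second", 10, 680), new Word("line", 70, 680.2));
        for (Word w : inReadingOrder(words, 2.0)) {
            System.out.println(w.text());
        }
    }
}
```

Multi-column pages are where this naive approach breaks down, which is why the extractor looks for column breaks as well.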
Once we have the words in order, it’s easy to add the redaction annotations and then apply them to the text.
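One step worth spelling out is how a pattern match gets mapped back to individual words, since a phone number can span more than one word. The following is a rough sketch under stated assumptions: words are plain strings already in reading order, they are joined with single spaces, and any word overlapping a match is marked for redaction. The class and method names are hypothetical. Creating the actual redaction annotations from the marked words' bounding boxes is toolkit-specific and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchToWordsSketch {
    // Returns the indices of the words that overlap any match of the pattern
    // when the words are joined with single spaces.
    static List<Integer> wordsToRedact(List<String> words, Pattern pattern) {
        StringBuilder text = new StringBuilder();
        int[] start = new int[words.size()];
        int[] end = new int[words.size()];
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) {
                text.append(' ');
            }
            start[i] = text.length();   // offset of the word's first character
            text.append(words.get(i));
            end[i] = text.length();     // offset just past the word's last character
        }

        List<Integer> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            for (int i = 0; i < words.size(); i++) {
                boolean overlaps = start[i] < m.end() && end[i] > m.start();
                if (overlaps && !hits.contains(i)) {
                    hits.add(i);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Call", "(555)", "123-4567", "today");
        Pattern phone = Pattern.compile("\\(\\d{3}\\)\\s?\\d{3}-\\d{4}");
        System.out.println(wordsToRedact(words, phone));
    }
}
```

In this example the match covers the second and third words, so those are the two whose bounding boxes would receive redaction annotations.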
To get started working with PDF, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.