Channel: Datalogics PDF Java Toolkit – Datalogics Blog

Search and Redact RegEx Patterns Using the Datalogics PDF Java Toolkit


Sample of the Week:

I’ve written about using the Datalogics PDF Java Toolkit to redact PDF files in previous articles, which you can review at the links below. Those articles searched for specific words. They’re worth reviewing if you are unfamiliar with what redaction is or how to create redaction annotations.

Automating PDF Redaction using the Datalogics PDF Java Toolkit
Redaction using the Datalogics PDF Java Toolkit

In this article, we’re going to reproduce one of Acrobat’s features for searching and redacting based on a pattern. In Acrobat, you can use the Search and Redact tool to select the pattern or patterns you want to search for.

Search and Redact

What most users don’t know is that this dialog is populated by an XML file stored on the system, and that this file can be edited. The patterns in this file are stored as Regular Expressions, or RegEx. With the Datalogics PDF Java Toolkit, we can search and redact PDF files using the same Regular Expressions in exactly the same way.

In the Gist referenced below, we’ll be searching for phone numbers using a pattern. The input file looks something like the image below, but without the red marker.

Input File
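
To make the idea of searching by pattern concrete, here is a minimal sketch that uses only the standard java.util.regex package. The expression is an illustrative assumption for North American style phone numbers; it is not the expression stored in Acrobat’s XML file, and it may differ from the one in the Gist. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhonePatternDemo {
    // Illustrative pattern (an assumption, not Acrobat's): an area code written as
    // "(312) " or "312-", then three digits, a separator, and four digits.
    // The lookarounds keep the match from starting or ending inside a longer digit run.
    static final Pattern PHONE = Pattern.compile(
        "(?<!\\d)(?:\\(\\d{3}\\)\\s?|\\d{3}[-.\\s])\\d{3}[-.\\s]\\d{4}(?!\\d)");

    // Returns every substring of the text that matches the phone pattern.
    static List<String> findPhoneNumbers(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Call (312) 555-0142 or 312-555-0199; order 1234567 is not a phone number.";
        for (String hit : findPhoneNumbers(sample)) {
            System.out.println(hit);
        }
    }
}
```

Swapping in a different expression, such as one for email addresses, only means changing the string passed to Pattern.compile.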

Like all of the other automated redaction samples, this one uses the ReadingOrderTextExtractor, which parses the document and returns the text in the order a human reader would encounter it, hence the name. It does this by retrieving the vertical and horizontal locations of the word bounding boxes and then performing statistical analysis to find common starting points for words and column breaks.
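
To give a feel for ordering words by position, here is a deliberately simplified sketch. It is not the ReadingOrderTextExtractor’s algorithm: it does no statistical analysis and handles only a single column. It groups words into lines by their vertical position and then sorts each line from left to right. The Word class, its fields, and the lineTolerance parameter are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ReadingOrderSketch {
    /** Hypothetical word type: text plus the lower-left corner of its bounding box. */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Orders words top to bottom, then left to right within each line (single column only). */
    static List<Word> inReadingOrder(List<Word> words, double lineTolerance) {
        List<Word> sorted = new ArrayList<>(words);
        // Assumes PDF-style coordinates where y grows upward, so larger y is higher on the page.
        sorted.sort((a, b) -> Double.compare(b.y, a.y));

        List<Word> result = new ArrayList<>();
        List<Word> line = new ArrayList<>();
        for (Word w : sorted) {
            // Start a new line once a word sits more than the tolerance below the line's first word.
            if (!line.isEmpty() && line.get(0).y - w.y > lineTolerance) {
                line.sort((a, b) -> Double.compare(a.x, b.x));
                result.addAll(line);
                line.clear();
            }
            line.add(w);
        }
        // Flush the last line.
        line.sort((a, b) -> Double.compare(a.x, b.x));
        result.addAll(line);
        return result;
    }
}
```

A real page with multiple columns needs the kind of column-break detection described above, which this sketch leaves out.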

Once we have the words in order, it’s easy to add the redaction annotations and then apply them to the text.
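
One step worth spelling out is how a pattern match maps back to the words whose bounding boxes need to be covered. The sketch below shows one way to do that in plain Java, assuming the words arrive as strings in reading order. It joins them with single spaces, remembers each word’s character offsets, and reports the indices of the words each match overlaps. It does not call the Toolkit; the class and method names are hypothetical, and creating and applying the redaction annotations is left to the Gist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchToWords {
    /** Returns, for each regex match, the indices of the words that the match overlaps. */
    static List<List<Integer>> wordsToRedact(List<String> words, Pattern pattern) {
        StringBuilder text = new StringBuilder();
        int[] start = new int[words.size()];
        int[] end = new int[words.size()];
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) {
                text.append(' '); // assumption: words are separated by a single space
            }
            start[i] = text.length();
            text.append(words.get(i));
            end[i] = text.length(); // exclusive end offset
        }

        List<List<Integer>> result = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            List<Integer> covered = new ArrayList<>();
            for (int i = 0; i < words.size(); i++) {
                // A word is covered when its character range overlaps the match range.
                if (start[i] < m.end() && end[i] > m.start()) {
                    covered.add(i);
                }
            }
            result.add(covered);
        }
        return result;
    }
}
```

For the words "Call", "(312)", "555-0142", "today" and a phone pattern that allows a space after the area code, the single match covers the second and third words, so those are the two bounding boxes to redact.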

To get started working with PDF, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.

 
