Forms automation can support mission-critical processes and can extend beyond the company to include vendors, employees, and customers. And, when a form supports a mission critical activity, like hospital admissions or insurance claims, they will often interact with other enterprise applications like ERP or CRM systems. Whether you knew it or not, a PDF can be used to drive data into and out of the document as part of these company-wide processes. However, organizations that lack experience with PDFs can face a tough challenge when attempting to create a workflow where the PDF form is the central component. Fundamentally, you have to be able to get data into and out of the PDF so that it can be used where needed. There are plenty of options out there that rely on converting the PDF to some other file format, or exporting the data to FDF or XFDF. Let’s take a look at how Datalogics PDF Java Toolkit can be used to extract the data from the PDF and get it into a format that is more friendly. In the example that follows, we will export data from a form into a very simple CSV file.
The ExportFormDataToCsv sample takes two arguments, the path to an input PDF and the path to write out the CSV file, the first being the PDF we want to process and the second being where the output should be written. Since the sample is named ExportFormDataToCsv, it should be no surprise that the output from the sample is a CSV file. It will include two lines, one for the field names and one for the field values. In a CSV file, the first line is interpreted as the column headings, so we use the qualified field name (the qualified field name combines the name of the parent field with the name of the child field so that the field name is unique) as the column heading and the second line is the first set of data in the CSV. This is what will be extracted from the fields in the form. Here is a screenshot of the CSV output opened in Microsoft Excel:
To create the CSV file, the sample relies on the Apache Commons CSV library to ensure a proper CSV file is written out that is compatible with Microsoft Excel. The first step the sample takes is to loop through the fields in a form and write out the qualified name of the field. The code in the exportFieldNames method takes care of this for us by using an iterator that is retrieved from the PDFInteractiveForm object passed into the method. In a PDF, there can only ever be one PDFInteractiveForm object so there is no concern over whether we are getting the correct form from the PDF.
final Iterator<PDFField> fieldIterator = form.iterator(); while (fieldIterator.hasNext()) { final PDFField field = fieldIterator.next();
printer.print(field.getQualifiedName()); } printer.println();
The result of the exportFieldNames method is a single line that represents the column headings in the CSV file. The next step is to write out the values from each field into the CSV file. This is done in the exportFieldValues method where we loop through the fields again, this time we ask for the field value.
The exportFieldValues method will try to output the formatted value, if there are formatting scripts in the PDF that control how data is presented (for example displaying phone number as “(312) 853-8200” instead of a string of digits all smashed together “3128538200”). This also works for additional formatting scripts that you might have embedded for things like addresses, email addresses, or currency. If there is no formatted value, then the sample falls back to writing out the non-formatted value so that the data that is extracted is complete. There is an important and easy to overlook line of code here though:
printer.print(value.get(0).toString());
Fields in a PDF may have multiple values depending on their actual type, and the way PDF Java Toolkit provides access to these is by providing a List of Values. For most scenarios, the value will be the first item in the List that is returned – so as long as that List is not null, the sample will write out that first item. Depending on your scenario, you may want to write out a different value or all possible values. The ExportFormDataToCsv sample will use the defined Logger to log warnings when there is no formatted value or the field is entirely empty so if you see statements coming out of the sample like:
May 09, 2017 5:32:54 PM com.datalogics.pdf.samples.forms.ExportFormDataToCsv exportFieldValues
WARNING: accountNumber has no formatting rules! Writing out non formatted value!
or
May 09, 2017 5:32:54 PM com.datalogics.pdf.samples.forms.ExportFormDataToCsv exportFieldValues
WARNING: durian has no value!
Of course, if you know some of the fields are empty or do not have formatting scripts, you can ignore these warnings.
Datalogics PDF Java Toolkit offers a great deal of functionality for working with PDF Forms, to both enable importing data into the form or exporting data out of the form for use in other systems. Recently, we have made updates to these features to support the execution of calculation, formatting, and validation scripts. These improvements are vital for those who want to extract data and analyze it, as they enable you to extract the data exactly as it is presented in the PDF. This is important to note because PDF viewing applications, like Adobe Acrobat, do a lot of work behind the scenes to present the data as the author intended for it to be viewed. For example, data in a PDF is not always stored in the same format as it is displayed. For example, amounts of currency will be stored as plain numbers (115.00) and when the PDF is open you might see “$115.00” or “€115”. Using Datalogics PDF Java Toolkit to extract data from your forms will ensure data will be extracted in the correct format.
If you have any questions about this sample or the PDF Java Toolkit, comment below or contact us!
The post Exporting Form Data for CSV appeared first on Datalogics Blog.