How to export pdf form fields to xml automatically

Question

I have a PDF file with form fields and need to export the field data to an XML file automatically. For testing I created a sample form with two fields, first_name and last_name.

Note: Exporting works fine manually in Acrobat Professional: click Tools > Form > Export Form Data and choose the XML extension for the output file. This is the result I get when I export it manually:

<?xml version="1.0" encoding="UTF-8"?>
<fields>
    <first_name>John</first_name>
    <last_name>Doe</last_name>
</fields>

However, I need to automate it, e.g. with a Python script, a Java implementation or a command-line tool. Any ideas which libraries or tools I could use to export form field data to XML? The tool or library should be open source so that I can integrate it into my workflow.

I already tried the Python pdfminer library, which helped me export the static parts of the PDF file (such as Static form header, First name: and Last name:). But how do I export the form field data (in my case, the contents of the form fields first_name and last_name)?

EDIT: Feel free to download the sample.pdf file here.

pevik · Accepted Answer · 2016-06-30 00:16:50Z

9

+50

How about Apache PDFBox? It is open source and could fit your needs, since the website says "Extract forms data from PDF forms or prefill a PDF form."

EDIT: Check out the PrintFields example.
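
Once the field names and values have been read (with PDFBox or any other library), writing them in the XML shape shown in the question needs nothing beyond the JDK. The following is only a minimal sketch under that assumption: the class name FieldsToXml and the method toXml are made up for illustration, and the map in main stands in for values that would really come from the PDF.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of PDFBox or any other PDF library.
public class FieldsToXml {

    /** Builds <fields><name>value</name>...</fields> from a name-to-value map. */
    public static String toXml(Map<String, String> values) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("fields");
        doc.appendChild(root);
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // Assumes every field name is a valid XML element name;
            // names containing characters such as '[' would need to be mapped first.
            Element field = doc.createElement(entry.getKey());
            field.setTextContent(entry.getValue());
            root.appendChild(field);
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in data; in a real workflow these come from the PDF form.
        Map<String, String> values = new LinkedHashMap<>();
        values.put("first_name", "John");
        values.put("last_name", "Doe");
        System.out.println(toXml(values));
    }
}
```

A LinkedHashMap is used so that the elements appear in the order the fields were added. Using the DOM API rather than string concatenation means special characters in field values are escaped automatically.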

edited Jun 30, 2016 at 0:16

pevik

answered Jan 23, 2014 at 21:00

jimmyp.smith

It looks great! I tried extracting all form fields via command line and it works. I will work on the Java source code example tomorrow, but from what I see it's exactly what I was looking for. I'll keep you updated!
– Michael
Commented Jan 23, 2014 at 22:05
I'm glad it helped a little bit. I forgot to say that the jdom library might be a great way to go for converting objects to xml. Good luck!
– jimmyp.smith
Commented Jan 23, 2014 at 22:20

James Kingsbery · Accepted Answer · 2014-01-22 20:02:46Z

2

In bash, you can do this (at least with my version of these tools, less 444 and cat 8.13):

less ~/Downloads/sample.pdf | cat

I get output that looks like this:

Static form header

First name:   John

Last name:    Doe

Which you can then parse pretty obviously using Java/Python/awk/whatever.

Of course, alternatively, if you don't want to rely on the behavior of particular versions of these (not sure if they always do this or not), you can look up less's source code to see how it does it.

answered Jan 22, 2014 at 20:02

James Kingsbery

any idea how I would do it on a Windows machine ?
– Michael
Commented Jan 22, 2014 at 20:05
You can try cygwin. Or you can, as I added in an edit, look at how less itself does it and try to port that code to Windows. Or you can install VMWare, spin up a VM, have the VM do it, and get the result back. Or you can spin up an EC2 instance, have the EC2 instance do it, and return the result.
– James Kingsbery
Commented Jan 22, 2014 at 20:07
Thanks for the thoughts. I will checkout the source code, to see if I can adapt it. Using a VM is not yet an option. I would rather prefer a solution which runs on a standalone machine.
– Michael
Commented Jan 22, 2014 at 20:29
I filled in the fields of a PDF in Adobe Acrobat DC and could not get the field data out. The answer strings were in there but surrounded by binary junk. I filled in the same form in Google Chrome and printed it to a PDF file, and that file has nicely structured XML that can be retrieved. I need to find a library that understands all forms of PDF fields.
– rob
Commented Jun 20, 2018 at 10:30

Community · Accepted Answer · 2017-05-23 12:17:44Z

1

In Java there are a few libraries for working with PDF, but in general it is hard to get formatted information out of a PDF. I have never implemented this myself, but Qoppa looks good and seems advanced, although it is not free. It includes jPDFFields, which should be useful for extracting values from form fields. There is also a similar thread with some information about a command line tool.

I hope it will be helpful for you.

edited May 23, 2017 at 12:17

CommunityBot

answered Jan 22, 2014 at 19:31

annaskulimowska

Thanks for taking the time. Actually, I was looking for an open source library or tool. Sorry I did not mention it earlier. jPDFFields would do the job. I tried the demo applet and it works, since I can export to XML (XFDF). However, it's not open source :-/
– Michael
Commented Jan 22, 2014 at 19:49

Guy Gavriely · Accepted Answer · 2014-01-22 20:08:03Z

1

I had much success using pdfminer:

pdf2txt.py -o out.xml -t xml sample.pdf

and then parse it using XPath and join the strings. To use it from your own code, follow the code here
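
As a rough illustration of the "XPath and join strings" step, the sketch below assumes the pdfminer XML output wraps each character in a text element inside a textline element; the inline sample is a simplified, hand-written stand-in, not real tool output, and the class name JoinTextLines is hypothetical. Note that this recovers page text only; it does not read AcroForm field values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for post-processing XML produced by a text extractor.
public class JoinTextLines {

    /** Returns one joined string per <textline> element found in the XML. */
    public static List<String> lines(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList textlines = (NodeList) xpath.evaluate(
                "//textline", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < textlines.getLength(); i++) {
            // getTextContent() concatenates the text of all descendant elements,
            // which joins the per-character <text> children into one string.
            result.add(textlines.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for extractor output (assumed structure).
        String sample =
              "<pages><page id=\"1\"><textbox id=\"0\"><textline>"
            + "<text>F</text><text>i</text><text>r</text><text>s</text><text>t</text>"
            + "</textline></textbox></page></pages>";
        System.out.println(lines(sample));
    }
}
```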

Other than that, there is a new kid on the block called tabula, written in Ruby, which I haven't had the chance to use yet but which is supposed to be great.

I understand you are unwilling to use a paid service, but it is still worth mentioning that Adobe has a conversion service that at the time of writing costs $2 a month. Check it out, just saying...

answered Jan 22, 2014 at 20:08

Guy Gavriely

Were you able to export the form fields with pdfminer? Because I was not. I tried converting my sample PDF file (as provided above) using the pdf2text demo page (pdf2html.tabesugi.net:8080/) to extract the form fields, but the export is limited to static fields only. I haven't done anything in Ruby yet, but it might be an option. I will have a look at this. Furthermore, I will test your command line snippet in a second, just to make sure I did nothing wrong when I used it before.
– Michael
Commented Jan 22, 2014 at 20:25
AFAIK there is no notion of Fields on pdfminer, but you can go very far with the right xpaths
– Guy Gavriely
Commented Jan 22, 2014 at 20:27
Would you be able to provide a small example or link, if it's worth? From my point of view, I can not imagine how to use xpath for extracting content, when my output file (converted from pdf to text) does not contain any of the form field data. Did I get this right?
– Michael
Commented Jan 22, 2014 at 20:33
this should be converted to current version stackoverflow.com/questions/3984003/…
– Guy Gavriely
Commented Jan 22, 2014 at 20:45
I already tried this solution, but if I remember right, I was not able to use fields = resolve1(doc.catalog['AcroForm'])['Fields']. However, I will try it again. There must be some way to export the form fields. I would also be satisfied if I could store the form field content in an object without parsing it to xml. I'll keep you updated.
– Michael
Commented Jan 22, 2014 at 21:01

Jonathan · Accepted Answer · 2014-01-23 10:22:26Z

0

For a Java solution, you could use iText to read the fields and then something like jackson-dataformat-xml to write the results as XML. A somewhat basic example of this would be:

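// imports assume iText 5 and jackson-dataformat-xml on the classpath
import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// read fields
final PdfReader reader = new PdfReader("/path/to/my.pdf");

final AcroFields fields = reader.getAcroFields();
final Map<String, Object> values = new HashMap<>();
for (String fieldName : (Set<String>) fields.getFields().keySet()) {
    // getField returns the current value of the named field as a String
    values.put(fieldName, fields.getField(fieldName));
}
reader.close();

// write the name-to-value map as XML
final XmlMapper mapper = new XmlMapper();
final String result = mapper.writeValueAsString(values);

System.out.println(result);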

There is definitely some room for improvement here, but it may be a good enough starting point.

answered Jan 23, 2014 at 10:22

Jonathan

iText is not open source, right? At least I don't see an open source library. If it's not open source, it's not an option, since I would only use the feature to extract form field data.
– Michael
Commented Jan 23, 2014 at 13:31
They claim to be open source, the code can be found here and there are two licenses available, commercial and AGPL.
– Jonathan
Commented Jan 23, 2014 at 14:02
I will double check that with our license management! It could work, since the project is currently planned as an internal project. I'll need to wait for the license experts' answer.
– Michael
Commented Jan 23, 2014 at 20:08
