Tuesday, August 31, 2010

How to parse a PDF

PDFBox is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.
 
JPedal is a Java API for extracting text and images from PDF documents.
 
PDFTextStream is a Java API for extracting text, metadata, and form data from PDF documents. It also comes with an integration module that makes it easier to convert a PDF document into a Lucene document.
 
Xpdf is an open source tool licensed under the GPL. It's not a Java tool, but it includes a command-line utility called pdftotext that converts PDF files into text files on most platforms.
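
Since pdftotext is a command-line program, one way to use it from Java is to launch it as an external process. The following is only a minimal sketch, assuming pdftotext is installed and on the PATH; the class name PdfToTextRunner and its methods are made up for illustration and are not part of Xpdf. The -layout option asks pdftotext to keep the original physical layout of the text as far as it can.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shells out to the pdftotext utility.
// Assumes pdftotext is installed and reachable through the PATH.
public class PdfToTextRunner {

    // Builds the argument list: pdftotext [-layout] <input.pdf> <output.txt>
    public static List<String> buildCommand(Path pdf, Path txt, boolean keepLayout) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftotext");
        if (keepLayout) {
            cmd.add("-layout");
        }
        cmd.add(pdf.toString());
        cmd.add(txt.toString());
        return cmd;
    }

    // Runs pdftotext and returns its exit code (0 means success).
    public static int convert(Path pdf, Path txt, boolean keepLayout)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, txt, keepLayout))
                .inheritIO() // show pdftotext's own messages on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToTextRunner <input.pdf> <output.txt>");
            return;
        }
        int exitCode = convert(Paths.get(args[0]), Paths.get(args[1]), true);
        System.out.println("pdftotext exit code: " + exitCode);
    }
}
```

Passing the arguments as a list, rather than as one concatenated string, avoids quoting problems when file names contain spaces.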
 
pdftohtml is a utility based on Xpdf that converts PDF files into HTML files. It is also not a Java application.
