Knowledge Repository: How to parse a PDF

Tuesday, August 31, 2010

How to parse a PDF

PDFBox is a Java API from Ben Litchfield that lets you access the contents of a PDF document. It comes with integration classes for Lucene that translate a PDF into a Lucene document.

JPedal is a Java API for extracting text and images from PDF documents.

PDFTextStream is a Java API for extracting text, metadata, and form data from PDF documents. It also comes with an integration module that makes it easier to convert a PDF document into a Lucene document.

Xpdf is an open source tool licensed under the GPL. It is not a Java tool, but it includes a command-line utility called pdftotext that converts PDF files into text files on most platforms.
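Since pdftotext is not a Java library, one way to use it from a Java program is to run it as an external process. The following is a minimal sketch, assuming the pdftotext binary is installed and on the PATH. The class name PdfToTextRunner and its methods are hypothetical names chosen for illustration. It relies on two documented pdftotext conventions: the -enc option to select the output encoding, and "-" as the output file name to write the text to standard output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: runs the external pdftotext tool and captures its output.
public class PdfToTextRunner {

    // Builds the argument list for: pdftotext -enc UTF-8 <input.pdf> -
    // The trailing "-" asks pdftotext to write the text to standard output.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns the extracted text.
    // Assumes pdftotext is on the PATH; error output is discarded.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Read all of stdout before waiting, so the child cannot block on a full pipe.
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextRunner <file.pdf>");
            return;
        }
        System.out.print(extractText(Path.of(args[0])));
    }
}
```

The returned string can then be handed to whatever comes next, for example an indexer. Keeping the command construction in its own method makes it easy to add other pdftotext options later, such as -layout to preserve the physical page layout.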

pdftohtml is a utility based on Xpdf that converts PDF files into HTML files. It is also not a Java application.
