PDFBox is a Java API from Ben Litchfield that lets you access the contents of a PDF document. It comes with integration classes for Lucene that translate a PDF into a Lucene document.
JPedal is a Java API for extracting text and images from PDF documents.
PDFTextStream is a Java API for extracting text, metadata, and form data from PDF documents. It also comes with an integration module making it easier to convert a PDF document into a Lucene document.
Xpdf is an open source tool licensed under the GPL. It is not a Java tool, but it includes a command-line utility called pdftotext that converts PDF files into text files on most platforms.
pdftohtml is a utility based on Xpdf that translates PDF files into HTML files. It is also not a Java application.
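Because pdftotext is a command-line program rather than a Java library, one way to use it from Java is to launch it as an external process and read the text file it writes. The following is a minimal sketch, not a definitive integration: the class name `PdfToTextRunner` and its methods are hypothetical, and it assumes the `pdftotext` executable is installed and on the PATH. The `-enc UTF-8` option asks pdftotext to write UTF-8 output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; it is not part of Xpdf or Lucene.
final class PdfToTextRunner {

    // Builds the argument list: pdftotext -enc UTF-8 <pdf> <txt>
    // Assumption: "pdftotext" can be resolved through the PATH.
    static List<String> buildCommand(Path pdf, Path txt) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8",
                pdf.toString(), txt.toString());
    }

    // Runs pdftotext on the given PDF and returns the extracted text.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        // pdftotext writes its output to a file, so use a temporary one.
        Path txt = Files.createTempFile("pdftotext-", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(pdf, txt))
                    .inheritIO() // forward pdftotext's messages to this console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("pdftotext exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }
}
```

The string returned by `extractText` could then be added to a Lucene document as the contents field. Launching a process per file is simple but slow for large collections, so the pure Java APIs listed above may be a better fit when indexing many PDFs.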