Sunday, March 9, 2014

PDF text extraction tools

Building a larger tool out of a collection of smaller tools can be quite a learning experience.  For the past few months, I've been working on a document search engine that stores and indexes a collection of PDF files generated with the PrintFriendly browser app.

Over the past month, I've learned a few things about PDFs and about extracting text from them:

- PDF is not a document language like DocBook or HTML.  Rather, it is more of a typesetting language, in that each letter is positioned individually on the page.
- I found no tool, commercial ones included, that extracts text from a PDF properly.
- Most text extraction tools cannot properly handle the letters "f", "o", "ll", and "t".
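One possible cause of damaged letters like these is ligatures: a PDF may draw "fi" or "ffl" as a single glyph, and some extractors pass that glyph through as one Unicode presentation-form character, or drop it, instead of emitting the separate letters.  That explanation is my assumption, not something verified against the files here.  If it does apply, a small cleanup pass before indexing might look like this sketch (the `expand_ligatures` name is made up for illustration):

```python
# Map Unicode Latin ligature characters back to plain letters.
# Assumption: the extractor emitted these presentation-form code points;
# this does nothing for letters the extractor dropped entirely.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with the ligature characters above expanded."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    # "e\ufb03cient \ufb01le" contains the "ffi" and "fi" ligature characters.
    print(expand_ligatures("e\ufb03cient \ufb01le"))
```

An explicit table is used here rather than `unicodedata.normalize("NFKC", ...)`, which would also expand these characters but rewrites many unrelated ones (superscripts, full-width forms) along the way.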

Of the various tools I tested, Calibre's ebook-convert appears to produce the cleanest plain-text output.  I'm using it for the text-extraction step of the search engine, and I'm including the ability to edit the extracted text (to improve the search results).
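For reference, here is a minimal sketch of how the extraction step could be driven from Python.  It assumes Calibre is installed and `ebook-convert` is on the PATH; the function names and the output directory layout are made up for illustration and are not the search engine's actual code.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, executable="ebook-convert"):
    """Return the argument list that converts one PDF to plain text."""
    # ebook-convert picks the output format from the output file's extension.
    return [executable, str(pdf_path), str(txt_path)]


def extract_text(pdf_path, out_dir):
    """Convert pdf_path to a .txt file in out_dir and return its contents."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / (Path(pdf_path).stem + ".txt")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_command(pdf_path, txt_path), check=True, capture_output=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("article.pdf", "extracted")` would leave `extracted/article.txt` on disk, which is convenient if the text is going to be edited afterwards.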