Neighborhood Techie: PDF text extraction tools

Sunday, March 9, 2014

PDF text extraction tools

Building a larger tool out of a collection of smaller tools can be quite a learning experience. For the past few months, I've been working on a document search engine that stores and indexes a collection of PDF files generated with the PrintFriendly browser app.

In the past month, I learned a few things about PDFs and extracting text from them.

- PDF is not a document language like DocBook or HTML. Rather, it is more of a typesetting language, in that letters are positioned individually on a page.

- I found no tool, commercial ones included, that extracts text from a PDF cleanly.

- Most text extraction tools cannot properly handle the letters "f", "o", "ll", and "t".
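To illustrate the first point, here is a rough Python sketch of what an extractor has to do when all it gets is individually placed letters: group them into lines by baseline, then guess where the spaces go from the horizontal gaps. The `Glyph` class, the `glyphs_to_text` function, and the `space_gap` and `line_tol` thresholds are all made up for this example; real PDFs and real extraction tools are a good deal messier than this.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A single letter placed on the page (hypothetical structure)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # advance width of the glyph


def glyphs_to_text(glyphs, space_gap=1.5, line_tol=2.0):
    """Rebuild lines of text from positioned glyphs.

    space_gap and line_tol are illustrative thresholds, not values
    taken from any real tool.
    """
    # Group glyphs into lines: anything within line_tol of the first
    # glyph's baseline is treated as being on the same line.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A gap wider than space_gap is assumed to be a word break.
            gap = cur.x - (prev.x + prev.width)
            if gap > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

If the threshold is set too low, letters inside a word get split apart; set it too high and words run together. That is my guess at why so much extracted text comes out mangled.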

Of the tools I tested, Calibre's ebook-convert appears to produce the cleanest plain-text output. I'm using it for the text extraction step of the search engine, and I'm including the ability to edit the extracted text by hand (to improve the searches).
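For anyone who wants to try the same approach, here is a minimal sketch of driving ebook-convert from Python. It assumes Calibre is installed and `ebook-convert` is on the PATH; the function names and the output directory layout are my own inventions for this example, not how the search engine is actually wired up. ebook-convert picks the output format from the output file's extension, so a `.txt` name gives plain text.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, txt_path):
    """Return the ebook-convert command line as a list of arguments."""
    return ["ebook-convert", str(pdf_path), str(txt_path)]


def extract_text(pdf_path, out_dir):
    """Convert one PDF to plain text and return the extracted text.

    Hypothetical helper: writes <out_dir>/<pdf name>.txt and reads it back.
    """
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")

    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / (pdf_path.stem + ".txt")

    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(pdf_path, txt_path), check=True)

    # Replace undecodable bytes rather than failing on odd characters.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Keeping the extracted `.txt` file on disk next to the index makes it easy to hand-edit the text later and re-index it.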
