In the past month, I learned a few things about PDFs and extracting text from them.
- PDF is not a document language like DocBook or HTML. Rather, it is closer to a typesetting language: each letter is positioned individually on the page.
- I found no tool, commercial ones included, that extracts text from a PDF reliably.
- Most text extraction tools cannot properly handle the letters "f", "o", "ll", and "t".
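
To illustrate why individually positioned letters make extraction hard, here is a minimal sketch of what an extractor has to do: group glyphs into lines by their vertical position, order them left to right, and guess where the spaces go from the gaps between them. The `Glyph` record, the `glyphs_to_text` function, and both threshold values are hypothetical and chosen for illustration; real tools work from font metrics and are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A single positioned character (hypothetical record, not a real PDF API)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position; larger y is higher on the page
    width: float  # horizontal advance of the glyph


def glyphs_to_text(glyphs, line_tolerance=2.0, space_gap=1.5):
    """Rebuild text lines from individually positioned glyphs.

    line_tolerance and space_gap are assumed thresholds in page units.
    """
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    # Visit glyphs from the top of the page down, grouping similar baselines.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left-to-right reading order
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A gap wider than space_gap is guessed to be a word break.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

Every step here is a guess: a slightly raised letter can land on the wrong line, and tight or loose letter spacing can add or drop spaces, which is one plausible reason extracted text often needs manual cleanup.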
Of the tools I tested, Calibre's ebook-convert appears to produce the cleanest plain-text output. I'm using it for the text-extraction step of the search, and I'm adding the ability to edit the extracted text to improve search results.
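
As a rough sketch of how the extraction step could be scripted, the snippet below calls `ebook-convert` with an input PDF and an output path ending in `.txt`, relying on the tool choosing the output format from the file extension. The helper names `build_convert_command` and `pdf_to_text` are hypothetical, and the sketch assumes Calibre is installed with `ebook-convert` on the PATH and that the output is UTF-8 text.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, txt_path, executable="ebook-convert"):
    """Return the argument list: ebook-convert INPUT OUTPUT."""
    return [executable, str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path, txt_path=None):
    """Convert a PDF to plain text with Calibre and return the text.

    Assumes ebook-convert is on the PATH; raises RuntimeError otherwise.
    """
    pdf_path = Path(pdf_path)
    # Default to a .txt file next to the PDF.
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; install Calibre first")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(pdf_path, txt_path), check=True)
    # Assumed encoding; undecodable bytes are replaced rather than raising.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Writing the result to a file rather than capturing it in memory leaves a copy on disk that can be edited by hand before it is indexed.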