Saturday, May 11, 2013

Building my own search engine

(From the How Hard Could It Be? Department) Google Reader's pending retirement sent me looking for alternative readers. I decided that Tiny Tiny RSS was about the best, so I set about installing it on the house server. Right away, it complained about the installed version of PHP.

I knew that upgrading PHP would cause trouble with the KnowledgeTree instance, so it looked like a general upgrade was needed. Then I discovered a problem: KnowledgeTree no longer offers its Community Edition. Setting aside the fact that this effectively angers anyone who ever contributed code to the project, it left me in a difficult spot: run TT-RSS in a VM or come up with alternatives.

After looking at various other document management system (DMS) packages, I started reading about how search engines index documents. Terms like inverted indexing, relevance scoring, and Soundex have become familiar. After experimenting with various text management tools and a number of databases, I'm more in awe of Google (and Bing, somewhat) now than I was before.
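To make the indexing idea concrete, here is a minimal sketch of an inverted index with a simple term-frequency score, written in Python. It is not what any real search engine does; the function names (`tokenize`, `build_index`, `search`), the sample file names, and the particular TF-IDF weighting are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Rank documents by a basic TF-IDF sum over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Terms that appear in fewer documents get a larger weight.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample documents, for illustration only.
docs = {
    "a.txt": "search engines index documents",
    "b.txt": "documents about cooking",
    "c.txt": "index index index",
}
index = build_index(docs)
results = search(index, len(docs), "index")
```

The point of the inverted layout is that a query only touches the postings for its own terms rather than scanning every document, which is why the structure shows up in nearly every description of full-text search.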

All that being said... In the interim, I'm running the last available Community Edition of KnowledgeTree on the latest version of PHP, with known workarounds in place (there's a growing number of them).

I'm also learning about some of the obstacles that are inherent in indexing documents. Example (and it's a horror): PDF appears to be a typesetting language. Even the good text extractors have issues with it. The letter "f" and bolding cause no end of problems: there's usually a stray space after each "f", and bolding tends to produce "doubled" characters in the extracted text.
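As a rough illustration of cleaning up the "doubled" characters, the Python sketch below collapses a word only when every character in it is doubled (for example "IInnddeexx"). This is a heuristic under stated assumptions, not a general fix: the helper names `undouble_word` and `undouble_text` are made up for this example, words shorter than four characters are left alone, and a genuine word made entirely of doubled letters would be collapsed wrongly. It does nothing about the stray space after "f", which would probably need a dictionary check.

```python
import re


def undouble_word(word):
    """Collapse a word such as 'BBoolldd' to 'Bold'; leave anything else alone."""
    # Only even-length words of four or more characters are candidates.
    if len(word) < 4 or len(word) % 2:
        return word
    pairs = [word[i:i + 2] for i in range(0, len(word), 2)]
    # Every consecutive pair must be the same character repeated.
    if all(pair[0] == pair[1] for pair in pairs):
        return "".join(pair[0] for pair in pairs)
    return word


def undouble_text(text):
    """Apply undouble_word to each whitespace-delimited token."""
    return re.sub(r"\S+", lambda match: undouble_word(match.group()), text)


cleaned = undouble_text("IInnddeexx of books")
```

Requiring the whole word to be doubled keeps ordinary words such as "book" or "noon" intact, at the cost of missing cases where only part of a word was bolded.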

Hopefully, at the end of all this, I'll have a simple program to index all of the documents (PDF, DOC, and TXT) that I've gathered over the years. If "simple" is unrealistic, I'll probably shoot for "portable".