Monday, August 8, 2011

Utilize Full Text Search

About a year ago, I released a new version of a business registration system that had one major new feature: Full Text Search (FTS). In a long series of releases, it was the most successful feature I ever added. It achieved a huge increase in effective system use and user satisfaction - it was a major win all the way around.

The system already had an outstanding search bar that could search on any combination of fields and leverage the database's powerful regular expression engine. I put a lot of work into this fancy search bar, but in practice, its power went largely unused. Frustrating!

Months later, while upgrading the infrastructure of the system, I noticed that PostgreSQL added full text search as part of its standard release. Hmmm... After digging a bit, 'discovered how it tokenizes words so that you can get better search results. For example, searching for 'industry' would match 'industry, industries, industrial' while ignoring case and punctuation. Cool.

I thought, "Hey, maybe I should add this type of search to the search bar." After a week or two of work, wow. The search results were far better (degrees of magnitude better). They were ranked. The matches were highlighted. Heck, it was like I added Google search to the system.

Here's all I had to do:
  • Choose source fields in tables to tokenize
  • Create matching tsvector fields in those same tables
    (a tsvector is a list of tokenized words)
  • Create triggers to update the tsvector fields anytime their source fields are updated
  • Adjust the searches to match against these tsvector fields
  • Tune the database's FTS engine by customizing its dictionaries to the nature of the system's data
    (i.e. additional entries to better match company names: inc = incorporated = incorporation)
Note that the terminology varies by storage engine, but the technique remains largely the same.

Lesson learned: If you're implementing a search function, utilize FTS.

4 Nov 2011 Update: See more recent post regarding using Apache Solr for enterprise search.