About a year ago, I released a new version of a business registration system that had one major new feature:
Full Text Search (FTS). In a long series of releases, it was the most successful feature I ever added. It achieved a huge increase in effective system use and user satisfaction - it was a major win all the way around.
The system already had an outstanding search bar that could search on any combination of fields and leverage the database's powerful
regular expression engine. I put a lot of work into this fancy search bar, but in practice, its power went largely unused. Frustrating!
Months later, while upgrading the infrastructure of the system, I noticed that
PostgreSQL added full text search as part of its standard release. Hmmm... After digging a bit, 'discovered how it tokenizes words so that you can get better search results. For example, searching for 'industry' would match 'industry, industries, industrial' while ignoring case and punctuation. Cool.
I thought, "Hey, maybe I should add this type of search to the search bar." After a week or two of work, wow. The search results were far better (degrees of magnitude better). They were ranked. The matches were highlighted. Heck, it was like I added Google search to the system.
Here's all I had to do:
- Choose source fields in tables to tokenize
- Create matching tsvector fields in those same tables
(a tsvector is a list of tokenized words)
- Create triggers to update the tsvector fields anytime their source fields are updated
- Adjust the searches to match against these tsvector fields
- Tune the database's FTS engine by customizing its dictionaries to the nature of the system's data
(i.e. additional entries to better match company names: inc = incorporated = incorporation)
Note that the terminology varies by storage engine, but the technique remains largely the same.
Lesson learned: If you're implementing a search function, utilize FTS.
4 Nov 2011 Update: See more recent
post regarding using Apache Solr for enterprise search.