Saturday, January 21, 2012

Indexing Documents with Solr

Apache Solr is great at indexing thousands and thousands of office documents (Word, Excel, PDF, etc.). But where's the handy tool you use to upload all those documents to Solr for indexing? You have to write it, friend. Ahh... the joys of open source! So we wrote one. It's working great here at the State. It has been running hourly in production for the last couple of months, and we just completed a substantial upgrade last week.

Here it is: sinject v1.3.

Like Solr, Sinject is under the Apache license. Enjoy. If you find it useful or can offer some advice, drop us a note.

Core of Sinject - Submitting Files to Solr
  1. Get the target file's unique id (inode) and other info:
    @file_info = stat($filepath);
    $inode = $file_info[1];
  2. Stage the file to be indexed (eliminate issues with bad filenames):
    # deal with possible quotes in filepath
    $filestring = $filepath;
    $filestring =~ s/'/'\\''/g;
    # copy the file to a temp file without weird characters in its name
    # (retain the file's mod date)
    $cmdstring = "cp -p '$filestring' " . "tempfile";
    # do it!
    $cmdoutput = `$cmdstring`;
  3. Submit file to Solr indexing engine (remember to escape troublesome literals):
    # Solr upload and index file command
    $cmdstring = "curl \"$SOLR_URL/update/extract?" .
        "literal.id=$inode" .
        "&literal.filename=" . uri_escape($filename) .
        "&literal.filelink=" . uri_escape($filelink) .
        "&literal.filetype=$filetype" .
        "&literal.filedate=$filedate" . uri_escape($timesuff) .
        "&commit=false\" -F \"myfile=\@tempfile\"";
    # do it!
    $cmdoutput = `$cmdstring`;
    # check for successful result
    if ($cmdoutput =~ m/<lst name="responseHeader"><int name="status">0<\/int>/) {
        # success - move on to next file
    } else {
        # fail - log failure, then move on to next file
    }
  4. Don't forget to commit the updates (e.g. every 100 files or so):
    # Solr commit command
    $cmdstring = "curl '$SOLR_URL/update' -H " .
        "\"Content-Type: text/xml\" --data-binary " .
        "'<commit/>'";
    # do it!
    $cmdoutput = `$cmdstring`;
    # use same successful result check as above...
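
Steps 2 and 4 above can be pulled into a couple of small helpers. Here's a minimal sketch, not the actual Sinject code: shell_quote and index_in_batches are hypothetical names, and the $submit and $commit callbacks are stand-ins for the curl calls shown above, so the batching logic can be tried without a Solr server.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a string in single quotes for the shell,
# using the same quote-escaping substitution as step 2.
sub shell_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;
    return "'$s'";
}

# Hypothetical helper: call $submit on each file and call $commit after
# every $batch_size successful submissions, plus once at the end if
# anything is still uncommitted. Returns the number of failed files.
sub index_in_batches {
    my ($files, $batch_size, $submit, $commit) = @_;
    my ($pending, $failed) = (0, 0);
    for my $filepath (@$files) {
        if ($submit->($filepath)) {
            $pending++;
        } else {
            $failed++;              # log the failure here, then move on
            next;
        }
        if ($pending >= $batch_size) {
            $commit->();
            $pending = 0;
        }
    }
    $commit->() if $pending > 0;    # flush the final partial batch
    return $failed;
}

# Demo with stubs: 250 pretend files, commit every 100.
my $commits = 0;
my $failed  = index_in_batches(
    [ map { "file$_.pdf" } 1 .. 250 ], 100,
    sub { return 1 },       # stub: pretend every submit succeeds
    sub { $commits++ },     # stub: count commits instead of calling Solr
);
print "commits=$commits failed=$failed\n";   # prints "commits=3 failed=0"
```

In a real run, $submit would wrap the stage-and-upload commands from steps 2 and 3 and return true only when the status check passes, and $commit would wrap the commit command from step 4.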
Check out the code for further details.

(see earlier post for info about our initial experience with Apache Solr)

Low tech search tags: How to index office documents using Solr, how to upload office documents using Solr, how to index rich text documents using Solr, how to upload PDF documents to Solr, how to upload documents using Solr Cell, Solr document uploader, Solr PDF uploader, Solr rich document crawler, Solr cell sample code, example of using Solr cell.