Here it is: sinject v1.3.
Like Solr, Sinject is released under the Apache License. Enjoy. If you find it useful or have some advice to offer, drop us a note.
Core of Sinject - Submitting Files to Solr
- Get the target file's unique id (inode) and other info:
@file_info = stat($filepath);
$inode = $file_info[1];
- Stage the file to be indexed (eliminate issues with bad filenames):
# deal with possible quotes in filepath
$filestring = $filepath;
$filestring =~ s/'/'\\''/g;
# copy the file to a temp file whose name has no odd characters
# (cp -p retains the file's modification date)
$cmdstring = "cp -p '$filestring' " . "tempfile";
# do it!
$cmdoutput = `$cmdstring`;
- Submit file to Solr indexing engine (remember to escape troublesome literals):
# Solr upload and index file command
# (uri_escape comes from the URI::Escape module)
$cmdstring = "curl \"$SOLR_URL/update/extract?" .
"literal.id=$inode" .
"&literal.filename=" . uri_escape($filename) .
"&literal.filelink=" . uri_escape($filelink) .
"&literal.filetype=$filetype" .
"&literal.filedate=$filedate" . uri_escape($timesuff) .
"&commit=false\" -F \"myfile=\@tempfile\"";
# do it!
$cmdoutput = `$cmdstring`;
# check for successful result
if ($cmdoutput =~ m/<lst name="responseHeader"><int name="status">0<\/int>/) {
# success - move on to next file
} else {
# fail - log failure, then move on to next file
}
- Don't forget to commit the updates (e.g. after every 100 files or so):
# Solr commit command
$cmdstring = "curl '$SOLR_URL/update' -H " .
"\"Content-Type: text/xml\" --data-binary " .
"'<commit/>'";
# do it!
$cmdoutput = `$cmdstring`;
# use same successful result check as above...
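Since the same result check is reused for both the extract call and the commit call, it could be factored into a small helper, along with the quoting and escaping steps. The following is only a sketch, not Sinject's actual code: the names shell_quote, percent_encode, build_extract_url and solr_ok are hypothetical, percent_encode is a pure-Perl stand-in for uri_escape (so the block has no non-core dependencies) and assumes byte strings, and the whitespace tolerance in solr_ok goes beyond the original pattern.

```perl
use strict;
use warnings;

# Wrap a string in single quotes for a POSIX shell command line,
# using the same '\'' trick as the substitution shown earlier.
sub shell_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;
    return "'$s'";
}

# Percent-encode every byte that is not an RFC 3986 unreserved character.
# Hypothetical stand-in for URI::Escape's uri_escape; assumes a byte string.
sub percent_encode {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

# Build the /update/extract URL from an ordered list of
# field => value pairs, each sent as a literal.<field> parameter.
sub build_extract_url {
    my ($base, @pairs) = @_;
    my @params;
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        push @params, "literal.$name=" . percent_encode($value);
    }
    return "$base/update/extract?" . join('&', @params, 'commit=false');
}

# Return 1 when Solr's XML response header reports status 0, else 0.
# Allows whitespace between the two tags (an assumption, in case the
# response is indented).
sub solr_ok {
    my ($response) = @_;
    return $response =~ m{<lst name="responseHeader">\s*<int name="status">0</int>} ? 1 : 0;
}
```

For example, assuming a Solr instance at http://localhost:8983/solr, build_extract_url('http://localhost:8983/solr', id => 42, filename => 'a b.pdf') yields http://localhost:8983/solr/update/extract?literal.id=42&literal.filename=a%20b.pdf&commit=false, which can then be passed through shell_quote before being handed to curl.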
(see earlier post for info about our initial experience with Apache Solr)
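The "commit every 100 or so" step in the walkthrough above can be driven by a simple counter. Here is a minimal sketch of one way to do it, assuming the submit and commit commands are wrapped in caller-supplied subroutines. The name index_in_batches and its parameters are hypothetical and not part of Sinject.

```perl
use strict;
use warnings;

# Submit each file and commit after every $batch_size successful submissions.
# $submit and $commit are caller-supplied code refs (hypothetical):
#   $submit->($path) returns true when Solr accepted the file,
#   $commit->()      issues the Solr commit.
# Returns the number of indexed files and an array ref of failed paths.
sub index_in_batches {
    my ($files, $batch_size, $submit, $commit) = @_;
    my ($pending, $indexed) = (0, 0);
    my @failed;
    for my $path (@$files) {
        if ($submit->($path)) {
            $indexed++;
            $pending++;
        } else {
            push @failed, $path;    # record the failure, then move on
        }
        if ($pending >= $batch_size) {
            $commit->();
            $pending = 0;
        }
    }
    $commit->() if $pending;        # flush the final partial batch
    return ($indexed, \@failed);
}
```

The final commit after the loop matters: without it, the last few files (fewer than a full batch) would be uploaded but never become searchable.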
Low tech search tags: How to index office documents using Solr, how to upload office documents using Solr, how to index rich text documents using Solr, how to upload PDF documents to Solr, how to upload documents using Solr Cell, Solr document uploader, Solr PDF uploader, Solr rich document crawler, Solr cell sample code, example of using Solr cell.