Apache Solr: Basic Set Up and Integration with Apache Nutch

Apache Solr is an open source enterprise search platform built on Apache Lucene. It provides full-text search, database integration, rich document (Word, PDF, etc.) handling, and more. Apache Solr is written in Java and runs within a servlet container such as Tomcat or Jetty. Its REST-like HTTP/XML and JSON APIs make it accessible from almost any programming language. This post covers setting up Apache Solr and an example of using it with web pages crawled by Apache Nutch.

Set Up Apache Solr

0. Download Apache Solr binaries from http://lucene.apache.org/solr/.

1. Uncompress the Solr binaries.

2. The example folder of the uncompressed Solr directory contains a Jetty installation, so we can run the Solr WAR file through start.jar with the commands below.

$ cd ${apache-solr-root}/example
$ java -jar start.jar

3. Verification. If everything is working, we should be able to access the following links.

http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp
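
The same check can be scripted instead of opening the links in a browser. The sketch below assumes curl is installed and that Solr listens on the default port 8983 used above. The function names http_status and check_solr are made up for this post and are not part of Solr.

```shell
#!/bin/sh
# Print the HTTP status code returned for a URL.
# Assumes curl is installed; http_status is a hypothetical helper name.
http_status() {
    curl -s -L -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Report whether each of the two admin pages answers with HTTP 200.
# The base URL defaults to the one used in this post.
check_solr() {
    base=${1:-http://localhost:8983/solr}
    for path in /admin/ /admin/stats.jsp; do
        # Fall back to 000 when curl cannot connect or is not available.
        code=$(http_status "$base$path") || code=000
        if [ "$code" = "200" ]; then
            echo "OK   $base$path"
        else
            echo "FAIL $base$path (HTTP $code)"
        fi
    done
}
```

Calling check_solr with no argument checks the two links listed above; a FAIL line usually means Solr is not running yet or is listening on a different port.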

Integration with Apache Nutch

0. Follow the post Apache Nutch 1.x set up to set up Apache Nutch if you haven’t done so.

1. Copy the schema file ${nutch root directory}/conf/schema.xml to Apache Solr with the command below.

$ cp ${nutch root directory}/conf/schema.xml ${apache-solr-root}/example/solr/conf/
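
Because this copy overwrites the schema.xml that ships with the Solr example, it may be worth keeping a backup of the original first. A minimal sketch, in which the two arguments stand for ${nutch root directory} and ${apache-solr-root}, and install_schema is a hypothetical helper name:

```shell
#!/bin/sh
# install_schema NUTCH_ROOT SOLR_ROOT
# Keep a one-time backup of Solr's own schema.xml, then copy in the Nutch one.
# install_schema is a hypothetical helper, not a Nutch or Solr command.
install_schema() {
    src=$1/conf/schema.xml
    dest_dir=$2/example/solr/conf
    if [ ! -f "$src" ]; then
        echo "missing $src" >&2
        return 1
    fi
    # Back up only once, so a second run does not replace the original
    # backup with the Nutch schema.
    if [ -f "$dest_dir/schema.xml" ] && [ ! -f "$dest_dir/schema.xml.orig" ]; then
        cp "$dest_dir/schema.xml" "$dest_dir/schema.xml.orig"
    fi
    cp "$src" "$dest_dir/schema.xml"
}
```

The backup file name schema.xml.orig is an arbitrary choice; restoring it is a plain cp in the other direction.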

2. Start/restart Apache Solr from ${apache-solr-root}/example with the command below.

$ java -jar start.jar

3. Edit the solrconfig.xml file under ${apache-solr-root}/example/solr/conf/ and change the "df" line under <requestHandler name="/select" class="solr.SearchHandler"> to the line below. Restart Solr afterwards so that the change takes effect.

<str name="df">content</str>

Note that content is chosen to match the defaultSearchField in ${apache-solr-root}/example/solr/conf/schema.xml.

<defaultSearchField>content</defaultSearchField>
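
To confirm that the two files agree, the values can be pulled out with sed. This is only a sketch: it matches line by line rather than parsing XML, so it would also pick up commented-out elements, and the function names default_search_field and df_values are hypothetical.

```shell
#!/bin/sh
# Print the text of the first <defaultSearchField> element in a schema.xml.
# Line-based match, not a real XML parser.
default_search_field() {
    sed -n 's:.*<defaultSearchField>\([^<]*\)</defaultSearchField>.*:\1:p' "$1" | head -n 1
}

# Print the value of every <str name="df"> element in a solrconfig.xml,
# one per line (other request handlers may define their own df).
df_values() {
    sed -n 's:.*<str name="df">\([^<]*\)</str>.*:\1:p' "$1"
}
```

Running default_search_field on schema.xml and df_values on solrconfig.xml in the Solr conf directory should print content for the schema, and content should appear among the df values.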

4. Follow the post Apache Nutch 1.x set up or Apache Nutch 1.x: Crawling HTTPs to crawl some data. Note that the invertlinks step must have been run by the end of this step, since the indexing command in the next step reads crawl/linkdb.

5. Index the crawled data with solrindex using the command below.

$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
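
The solrindex command above expects crawl/crawldb, crawl/linkdb and at least one directory under crawl/segments to exist. A small pre-flight check can catch a missing invertlinks run before indexing starts. The sketch below assumes the crawl output lives in a directory named crawl, as in the command above; check_crawl_dir is a hypothetical helper, not a Nutch command.

```shell
#!/bin/sh
# check_crawl_dir [DIR]
# Succeed only if DIR (default: crawl) holds a crawldb, a linkdb
# and at least one entry under segments/.
check_crawl_dir() {
    crawl=${1:-crawl}
    for d in crawldb linkdb segments; do
        if [ ! -d "$crawl/$d" ]; then
            echo "missing $crawl/$d" >&2
            return 1
        fi
    done
    # If the glob matches nothing it stays unexpanded and -e fails.
    set -- "$crawl"/segments/*
    if [ ! -e "$1" ]; then
        echo "no segments under $crawl/segments" >&2
        return 1
    fi
}
```

It can be chained in front of the indexing command, for example: check_crawl_dir crawl && bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*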

6. Go to http://localhost:8983/solr/admin/ to start searching.

6.1 Set the query string to “*:*” and click search. The request URL becomes “http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on”. All pages should match the query, but by default at most 10 records are shown. We can increase rows=10 to list more records.

6.2 Set the query string to “video” and click search. The request URL becomes “http://localhost:8983/solr/select/?q=video&version=2.2&start=0&rows=10&indent=on”.

If we can see the returned records, we have successfully set up a small search engine with Apache Solr and Apache Nutch.
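
The request URLs from step 6 can also be built from the command line and fetched with a tool such as curl. The sketch below rebuilds them in plain sh, percent-encoding the query string the same way the admin page does for “*:*”. The function names urlencode and solr_select_url are made up for this post, and the host, port and parameters are simply the ones shown in step 6.

```shell
#!/bin/sh
# Percent-encode a string for use as a URL query parameter value.
# Letters, digits and . ~ _ - * are kept; every other byte becomes %XX.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character itself
        case $c in
            [a-zA-Z0-9.~_*-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# solr_select_url QUERY [START] [ROWS]
# Build the select URL shown in step 6 (start defaults to 0, rows to 10).
solr_select_url() {
    q=$(urlencode "$1")
    start=${2:-0}
    rows=${3:-10}
    printf '%s\n' "http://localhost:8983/solr/select/?q=$q&version=2.2&start=$start&rows=$rows&indent=on"
}
```

For example, curl "$(solr_select_url '*:*' 0 50)" would request the first 50 records instead of 10, assuming curl is installed and Solr is running locally.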