Apache Solr: Basic Set Up and Integration with Apache Nutch

Apache Solr is an open source enterprise search platform based on Apache Lucene. It provides full-text search, database integration, rich document (Word, PDF, etc.) handling, and more. Apache Solr is written in Java and runs within a servlet container such as Tomcat or Jetty. Its REST-like HTTP/XML and JSON APIs make it accessible from almost any programming language. This post covers setting up Apache Solr and an example of using it with web pages crawled by Apache Nutch.
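For instance, once Solr is running (see the set up steps below), queries can be issued with plain HTTP requests. The command below is a minimal sketch that assumes the default port 8983 and the standard /select handler; wt=json asks Solr to return the response as JSON.

$ curl 'http://localhost:8983/solr/select/?q=*%3A*&rows=5&wt=json&indent=on'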

Set Up Apache Solr

0. Download Apache Solr binaries from http://lucene.apache.org/solr/.

1. Uncompress the Solr binaries.

2. The example folder of the uncompressed Solr directory contains a Jetty installation, and we can run the Solr WAR file through start.jar with the commands below.

$ cd ${apache-solr-root}/example
$ java -jar start.jar
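By default the bundled Jetty listens on port 8983. If that port is already in use, the port can typically be changed through the jetty.port system property, which the example Jetty configuration reads. For instance:

$ java -Djetty.port=8984 -jar start.jar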

3. Verification. If everything is set up correctly, the following links should be accessible.

http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp

Integration with Apache Nutch

0. Follow the post Apache Nutch 1.x: Set up and Basic Usage to set up Apache Nutch if you haven’t done so.

1. Copy the schema file ${nutch root directory}/conf/schema.xml to Apache Solr with the command below.

$ cp ${nutch root directory}/conf/schema.xml ${apache-solr-root}/example/solr/conf/

2. Start/restart Apache Solr with the command below.

$ java -jar start.jar

3. Edit the solrconfig.xml file under ${apache-solr-root}/example/solr/conf/, and change the "df" line under <requestHandler name="/select" class="solr.SearchHandler"> to the line below.

<str name="df">content</str>

Note that content is set according to the defaultSearchField in ${apache-solr-root}/example/solr/conf/schema.xml, i.e. the schema file copied over from Nutch in step 1.

<defaultSearchField>content</defaultSearchField>
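If you prefer to script the change, a sed one-liner such as the one below could be used. This is a hypothetical convenience only; editing the file by hand works just as well, and note that the pattern rewrites every "df" entry in the file, not only the one under /select.

$ sed -i.bak 's|<str name="df">[^<]*</str>|<str name="df">content</str>|' \
      ${apache-solr-root}/example/solr/conf/solrconfig.xml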

4. Follow the post Apache Nutch 1.x: Set up and Basic Usage or Apache Nutch 1.x: Crawling HTTPS to crawl some data. Make sure the invertlinks step has been run at the end of the crawl, since the indexing step below needs the linkdb.

5. Index the crawled data into Solr with the solrindex command below.

$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
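After indexing finishes, a quick way to confirm that documents made it into the index is to ask Solr how many documents match everything; numFound in the JSON response should be greater than zero. A minimal check:

$ curl 'http://localhost:8983/solr/select/?q=*%3A*&rows=0&wt=json&indent=on'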

6. Go to http://localhost:8983/solr/admin/ to start searching.

6.1 Set the query string to “*:*” and click Search. The request URL becomes “http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on”. All pages should match this query, but by default at most 10 records are shown. We can change rows=10 in the URL to list more records, as in the example below.
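The same query can also be issued from the command line; for example, the request below is the URL above with rows raised to 100 (an illustrative value):

$ curl 'http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=100&indent=on'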

6.2 Set the query to “video” and click Search. The request URL becomes “http://localhost:8983/solr/select/?q=video&version=2.2&start=0&rows=10&indent=on”.

If we can see the returned records, we have successfully set up a small search engine with Apache Solr and Apache Nutch.

Apache Nutch 1.x: Crawling HTTPS

This is a follow-up post to Apache Nutch 1.x: Set up and Basic Usage. Please read that post first if you don’t have Apache Nutch set up on your machine.

The default configuration of Apache Nutch 1.5 doesn’t support HTTPS crawling. However, this can easily be enabled by including protocol-httpclient as a plugin, which is done by adding the following content to conf/nutch-site.xml.

<property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

This overrides the default plugin.includes setting in conf/nutch-default.xml, shown below.

<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
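As a quick sanity check, we can verify that the protocol-httpclient plugin actually ships with the Nutch build before relying on it; the command below assumes the standard Nutch 1.x binary layout with a plugins directory under the Nutch root.

ls plugins | grep protocol-httpclient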

Below we show an example of using Apache Nutch to crawl Google Play pages.

1. Update conf/regex-urlfilter.txt.

1.1 Comment out the -[?*!@=] line. This line filters out URLs containing a few special characters. Since Google Play URLs contain the “?” character, we need to comment this line out, or modify it, so that these URLs can be fetched.

1.2 Change the line below “accept anything else” to something like the line below. This constrains the crawling to Google Play app detail pages (see the edited excerpt after this step).

+^https://play\.google\.com/store/apps/details\?id
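After both edits, the relevant portion of conf/regex-urlfilter.txt would look roughly like the excerpt below (the comment wording may differ between Nutch versions).

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

# accept anything else
+^https://play\.google\.com/store/apps/details\?id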

2. Start crawling. We can use either of the two methods below.

2.1 Use the crawl command, as shown below.

bin/nutch crawl urls -dir crawl -depth 10 -topN 10

2.2 Use the step-by-step commands. We can use the shell script below.

#!/bin/bash
# Seed the crawldb with the URLs in the urls directory
bin/nutch inject crawl/crawldb urls
# Run 10 rounds of generate/fetch/parse/updatedb
for i in {1..10}
do
   bin/nutch generate crawl/crawldb crawl/segments -topN 10
   # Pick the most recently created segment directory (named by timestamp)
   s2=`ls -dtr crawl/segments/2* | tail -1`
   echo $s2
   bin/nutch fetch $s2
   bin/nutch parse $s2
   bin/nutch updatedb crawl/crawldb $s2
done
# Build the linkdb from all fetched segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
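One possible way to run it, assuming the script is saved as crawl.sh (a hypothetical file name) in the Nutch root directory; the final solrindex command is the same one used in the Solr section above.

chmod +x crawl.sh
./crawl.sh
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*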

Apache Nutch 1.x: Set up and Basic Usage

0. Set up

Below are the steps to set up Nutch on Linux.

  • Download the latest 1.x version of Nutch from http://nutch.apache.org/
  • Set the JAVA_HOME environment variable. One can add the following line to the ~/.bashrc file.

export JAVA_HOME=<path to Java jdk>

  • Make sure bin/nutch is executable, using the command below.

chmod +x bin/nutch

  • Add an agent name in conf/nutch-site.xml as below.

<property>
    <name>http.agent.name</name>
    <value>Nutch Test Spider</value>
</property>

1. An Example

Below are the steps to run Nutch on this blog site.

  • Create a directory named urls, and put a file named seed.txt under the directory with the content below.

http://www.roman10.net/

  • Edit the conf/regex-urlfilter.txt file. Change the line below “accept anything else” to something like the line below. This constrains the crawling to a specific domain.

+^http://([a-z0-9]*\.)*roman10.net/

  • Seed the crawldb with URLs. The command below converts the URLs into db entries and puts them into the crawldb.

bin/nutch inject crawl/crawldb urls

  • Generate the fetch list. A segment directory named after the creation timestamp will be created, and the URLs to be fetched will be stored in its crawl_generate subdirectory.

bin/nutch generate crawl/crawldb crawl/segments

  • Fetch the content. The command below fetches the content. Two more subdirectories will be created under the segment directory: crawl_fetch (the status of fetching each URL) and content (the raw content retrieved).

bin/nutch fetch `ls -d crawl/segments/* | tail -1`

  • Parse the content. The command below parses the fetched content. Three more folders are created: crawl_parse (outlink URLs for updating the crawldb), parse_text (the parsed text) and parse_data (parsed outlinks and metadata).

bin/nutch parse `ls -d crawl/segments/* | tail -1`

  • Update the crawldb database. We can use the command below.

bin/nutch updatedb crawl/crawldb `ls -d crawl/segments/* | tail -1`

  • After updating the database, the crawldb contains updated entries for the initial pages as well as the newly discovered outlinks. We can then fetch a new segment with the top-scoring 10 pages, using the commands below.

bin/nutch generate crawl/crawldb crawl/segments -topN 10
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2

  • We can repeat the above commands to fetch more segments.
  • Invertlinks. We can use the command below. This creates a crawl/linkdb folder, which contains the list of known links to each URL, including both the source URL and the anchor text of the link.

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

  • Indexing. This can be done with Apache Solr. The details are not covered in this post.

2. How It Works…

The essential crawling procedure in Nutch is as follows.

inject URLs
loop
    generate fetch list
    fetch
    parse
    updatedb
invert links
index

References:

1. Nutch Tutorial: http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website

2. Nutch set up and use blog: http://nutch.wordpress.com/