0. Set up
Below are the steps to set up Nutch on Linux.
- Download the latest 1.x version of Nutch from http://nutch.apache.org/.
- Set the JAVA_HOME environment variable, for example by adding the following line to the ~/.bashrc file.
export JAVA_HOME=<path to Java jdk>
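A typical value on a Debian/Ubuntu-style system would be as below; the exact path is an assumption and varies by distribution and JDK version.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64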
- Make sure bin/nutch is executable, using the command below if needed.
chmod +x bin/nutch
- Add an agent name in conf/nutch-site.xml as below. Nutch refuses to crawl if the http.agent.name property is empty.
<property>
<name>http.agent.name</name>
<value>Nutch Test Spider</value>
</property>
1. An Example
Below are the steps to run Nutch on this blog site.
- Create a directory named urls, and put a file named seed.txt under the directory with the content below (the URL of this blog).
http://nutch.wordpress.com/
- Edit the conf/regex-urlfilter.txt file. Change the line below “accept anything else” to something like the line below. This constrains the crawling to a specific domain.
+^http://([a-z0-9]*\.)*nutch.wordpress.com/
- Seed the crawldb with URLs. The command below will convert the URLs into database entries and put them into the crawldb.
bin/nutch inject crawl/crawldb urls
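Optionally, the crawldb statistics can be printed with the readdb tool to confirm that the URLs were injected.
bin/nutch readdb crawl/crawldb -stats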
- Generate the fetch list. A segment directory named after its creation timestamp will be created, and the URLs to be fetched will be stored in its crawl_generate subdirectory.
bin/nutch generate crawl/crawldb crawl/segments
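The new segment can be listed as below; its name is a timestamp such as 20130622134536 (the exact value depends on when the command is run).
ls -d crawl/segments/*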
- Fetch the content. The command below will fetch the content. Two more subdirectories will be created under the segment directory: crawl_fetch (the fetch status of each URL) and content (the raw content retrieved).
bin/nutch fetch `ls -d crawl/segments/* | tail -1`
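The backquoted expression simply picks the newest segment directory. The same value can also be kept in a shell variable and reused across the following steps, as is done later for the second segment.
s1=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s1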
- Parse the content. The command below parses the fetched content. Three more subdirectories are created: crawl_parse (outlink URLs for updating the crawldb), parse_text (the parsed text) and parse_data (parsed outlinks and metadata).
bin/nutch parse `ls -d crawl/segments/* | tail -1`
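To inspect a parsed segment, the readseg tool can dump its data as text; the output directory segdump below is just an example name.
bin/nutch readseg -dump `ls -d crawl/segments/* | tail -1` segdump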
- Update the crawldb database. We can use the command below.
bin/nutch updatedb crawl/crawldb `ls -d crawl/segments/* | tail -1`
- After updating the database, crawldb will contain updated entries for the initial pages as well as the newly discovered outlinks. We can then fetch a new segment with the top-scoring 10 pages, using the commands below.
bin/nutch generate crawl/crawldb crawl/segments -topN 10
s2=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
- We can repeat the above commands to fetch more segments, for example with a loop like the sketch below.
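A minimal sketch of such a loop, assuming the directory layout used above; the round count of 3 and the -topN value of 10 are arbitrary choices.
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 10
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s
done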
- Invertlinks. We can use the command below. This creates a crawl/linkdb folder, which contains the list of known inlinks for each URL, including both the source URL and the anchor text of each link.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
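The linkdb can be dumped to text with the readlinkdb tool for inspection; the output directory linkdump is just an example name.
bin/nutch readlinkdb crawl/linkdb -dump linkdump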
- Indexing. This can be done with Apache Solr. The details are not covered in this post.
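For reference, assuming a Solr server is running at http://localhost:8983/solr, the indexing step in Nutch 1.x looks roughly like the command below; see the official tutorial for the full setup.
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/*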
2. How It Works…
The essential procedure of crawling in Nutch is as below.
- Inject seed URLs into the crawldb.
- Generate a fetch list from the crawldb into a new segment.
- Fetch the content on the list.
- Parse the fetched content.
- Update the crawldb with the newly discovered outlinks.
- Repeat the generate/fetch/parse/update cycle for more segments.
- Invert links and index the content.
3. References
1. Nutch Tutorial: http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
2. Nutch set up and use blog: http://nutch.wordpress.com/