HtmlUnit Memory Leak — A Workaround

HtmlUnit is a programmable headless browser (no GUI). It is written in Java and exposes APIs that let us open a window, load a web page, execute JavaScript, fill in forms, click buttons and links, and so on. HtmlUnit is typically used for website testing or for crawling information from web sites.
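For readers unfamiliar with the library, below is a minimal usage sketch. The target URL and the form/input names are made up for illustration; the API calls (WebClient.getPage(), HtmlPage.getFormByName(), etc.) are standard HtmlUnit.

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitDemo {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Load a page; any JavaScript on it is executed automatically.
        HtmlPage page = webClient.getPage("http://www.example.com/");
        System.out.println(page.getTitleText());

        // Fill in a form and click its submit button (form/input names are hypothetical).
        HtmlForm form = page.getFormByName("searchForm");
        HtmlTextInput queryField = form.getInputByName("q");
        queryField.setValueAttribute("htmlunit");
        HtmlSubmitInput submit = form.getInputByName("go");
        HtmlPage resultPage = submit.click();
        System.out.println(resultPage.getTitleText());

        // Release the windows and pages held by this client.
        webClient.closeAllWindows();
    }
}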

Recently I worked on a task that uses HtmlUnit 2.10 to retrieve information from web pages with fairly complex JavaScript. The JavaScript engine seems to cause a memory leak: after loading a few pages, memory usage grows beyond 1 GB and eventually an OutOfMemoryError is thrown. The HtmlUnit FAQ page suggests calling WebClient.closeAllWindows(). I tried that, but it didn't solve the problem.
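For reference, the clean-up I tried looks roughly like the sketch below: crawl a batch of URLs with one WebClient and call closeAllWindows() when the batch is done. The URL list and per-page processing are simplified here.

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class CleanupAttempt {
    // Crawl a list of URLs and release all windows afterwards, as the FAQ suggests.
    // In my case memory usage still kept growing across batches.
    static void crawl(List<String> urls) throws Exception {
        WebClient webClient = new WebClient();
        try {
            for (String url : urls) {
                HtmlPage page = webClient.getPage(url);
                System.out.println(url + " -> " + page.getTitleText());
            }
        } finally {
            webClient.closeAllWindows();
        }
    }
}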

Instead of digging into the JavaScript engine to find out why the leak happens, I decided on a workaround: a two-process approach. The main process keeps track of which pages have been crawled, what to crawl next, and so on, and creates a child process to do the actual retrieval with HtmlUnit. After the child process has crawled a few pages, it exits, and the main process creates a new child process to crawl the next few pages. To keep things simple, the two processes use file I/O for inter-process communication (IPC): the child process writes out which pages it has crawled, and the main process reads that file to update its record of what has been crawled.

Because all memory allocated to the child process is freed when it terminates, this approach works even with the memory leak unfixed, at the cost of some performance. A sketch of the main process is shown below.
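This is only an illustrative sketch of the idea, not the actual code from the task: the class names, file names, and batch size are made up, and the hypothetical ChildCrawler class is assumed to load each URL with HtmlUnit (as in the earlier snippet) and write the crawled URLs to the result file before exiting.

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class CrawlCoordinator {
    private static final int PAGES_PER_CHILD = 20;   // batch size is arbitrary here

    public static void main(String[] args) throws Exception {
        List<String> pending = new ArrayList<String>(
                Files.readAllLines(new File("pending-urls.txt").toPath(), StandardCharsets.UTF_8));
        List<String> crawled = new ArrayList<String>();

        while (!pending.isEmpty()) {
            int n = Math.min(PAGES_PER_CHILD, pending.size());
            List<String> batch = new ArrayList<String>(pending.subList(0, n));

            // File-based IPC: hand the batch to the child and read its results back.
            File batchFile = new File("batch.txt");
            File resultFile = new File("crawled.txt");
            Files.write(batchFile.toPath(), batch, StandardCharsets.UTF_8);

            // Launch a fresh JVM for the HtmlUnit work (ChildCrawler is hypothetical).
            Process child = new ProcessBuilder(
                    "java", "-cp", System.getProperty("java.class.path"),
                    "ChildCrawler", batchFile.getPath(), resultFile.getPath())
                    .inheritIO().start();
            child.waitFor();   // once the child exits, all of its memory is reclaimed

            // Update the bookkeeping of what has been crawled.
            crawled.addAll(Files.readAllLines(resultFile.toPath(), StandardCharsets.UTF_8));
            pending.subList(0, n).clear();
        }
        System.out.println("Crawled " + crawled.size() + " pages in total.");
    }
}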

References:
1. HtmlUnit: http://htmlunit.sourceforge.net/
2. HtmlUnit FAQ: http://htmlunit.sourceforge.net/faq.html

Apache Solr: Basic Set Up and Integration with Apache Nutch

Apache Solr is an open-source enterprise search platform based on Apache Lucene. It provides full-text search, database integration, rich-document handling (Word, PDF, etc.), and so on. Apache Solr is written in Java and runs within a servlet container such as Tomcat or Jetty. Its REST-like HTTP/XML and JSON APIs make it accessible from almost any programming language. This post covers setting up Apache Solr and gives an example of using it with web pages crawled by Apache Nutch.

Set Up Apache Solr

0. Download Apache Solr binaries from http://lucene.apache.org/solr/.

1. Uncompress the Solr binaries.

2. The example folder of the uncompressed Solr directory contains a Jetty installation, and we can run the Solr WAR file through Jetty with start.jar using the commands below.

$ cd ${apache-solr-root}/example
$ java -jar start.jar

3. Verification. We should be able to access the following links if everything is all right.

http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp

Integration with Apache Nutch

0. Follow the post Apache Nutch 1.x: Set Up and Basic Usage to set up Apache Nutch if you haven't done so.

1. Copy the schema file ${nutch root directory}/conf/schema.xml to Apache Solr with the command below.

$ cp ${nutch root directory}/conf/schema.xml ${apache-solr-root}/example/solr/conf/

2. Start/restart Apache Solr with the command below.

$ java -jar start.jar

3. Edit the solrconfig.xml file under ${apache-solr-root}/example/solr/conf/, and change the "df" line under <requestHandler name="/select" class="solr.SearchHandler"> to the line below.

<str name="df">content</str>

Note that content here matches the defaultSearchField defined in ${apache-solr-root}/example/solr/conf/schema.xml (the schema.xml copied from Nutch):

<defaultSearchField>content</defaultSearchField>

4. Follow the post Apache Nutch 1.x: Set Up and Basic Usage or Apache Nutch 1.x: Crawling HTTPS to crawl some data. Note that the invertlinks step should have been run by the end of this step, since the next step needs the resulting linkdb.

5. Index the crawled data with the solrindex command below.

$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

6. Go to http://localhost:8983/solr/admin/ to start searching.

6.1 Set the query string to "*:*" and click search. The request URL becomes http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on. All pages match this query, but by default at most 10 records are shown; change rows=10 to list more records.

6.2 Set the query to "video" and click search. The request URL becomes http://localhost:8983/solr/select/?q=video&version=2.2&start=0&rows=10&indent=on.
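The same select handler can also be called from code. Below is a small sketch of issuing the "video" query from Java over plain HTTP (the SolrJ client library is another option); the response is XML by default, and adding &wt=json switches it to JSON.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SolrQueryDemo {
    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("video", "UTF-8");
        URL url = new URL("http://localhost:8983/solr/select/?q=" + q
                + "&start=0&rows=10&indent=on");
        // Print the raw response returned by the select handler.
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}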

If we can see the returned records, we have set up a small search engine with Apache Solr and Apache Nutch successfully.

Apache Nutch 1.x: Crawling HTTPS

This is a follow-up to the post Apache Nutch 1.x: Set Up and Basic Usage. Please read that post first if you don't have Apache Nutch set up on your machine.

The default configuration of Apache Nutch 1.5 doesn’t support HTTPS crawling. However, this can be easily enabled by including protocol-httpclient as a plugin. This is done by adding the following content to conf/nutch-site.xml.

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

This overrides the default setting in conf/nutch-default.xml, shown below, by replacing protocol-http with protocol-httpclient.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Below we show an example of using Apache Nutch to crawl Google Play pages.

1. Update conf/regex-urlfilter.txt.

1.1 Comment out the line -[?*!@=]. This line filters out URLs containing a few special characters. Since Google Play URLs contain the "?" character, we need to comment this line out, or modify it, so that those URLs can be fetched.

1.2 Change the line below "accept anything else" to something like the line below. This constrains the crawl to Google Play app detail pages (note that the regex metacharacters "." and "?" are escaped).

+^https://play\.google\.com/store/apps/details\?id

2. Start Crawling. We can use two methods.

2.1 Use the crawl command shown below.

bin/nutch crawl urls -dir crawl -depth 10 -topN 10

2.2 Use the step-by-step commands, for example via the shell script below.

#!/bin/bash
# Inject the seed URLs into the crawl database.
bin/nutch inject crawl/crawldb urls
# Run 10 generate/fetch/parse/update rounds.
for i in {1..10}
do
   # Generate a fetch list of the top 10 URLs and pick the newest segment.
   bin/nutch generate crawl/crawldb crawl/segments -topN 10
   s2=`ls -dtr crawl/segments/2* | tail -1`
   echo $s2
   # Fetch and parse the segment, then update the crawl database.
   bin/nutch fetch $s2
   bin/nutch parse $s2
   bin/nutch updatedb crawl/crawldb $s2
done
# Build the link database from all segments (needed later for Solr indexing).
bin/nutch invertlinks crawl/linkdb -dir crawl/segments