How the Java Garbage Collector Works – A Simple Note

Recently I was asked about how the Java Garbage Collector works. I remember learning about it a few years ago, but I seldom need to think about it in my development. (I simply follow the rule of setting a reference to null when I’m done with the object.) So I couldn’t explain it well.

Today I spent an hour reading about Java GC, and below is a note about it. It covers only the most basic aspects of GC.

————————————Notes Start Here——————————-

A garbage collector does two basic things: it detects garbage objects, and it frees the memory space taken by those objects.

Garbage detection is normally done by defining a set of roots and testing whether objects are reachable from these roots. Objects that are not reachable from the roots are considered no longer in use and can therefore be garbage collected.

The Roots

The root set is implementation-dependent, but it generally includes the following.

  • Local variables, which can be either primitive types or object references. Local variables live inside methods, including method arguments, and are stored on the Java stack. (Note that instance variables belong to objects, not methods.)
  • Object references in the constant pool of loaded classes. The constant pool can refer to strings stored on the Java heap, such as class names, method signatures and field names.

Rules

  • Any object referred to by a root is reachable and considered a live object
  • Any object referred to by a live object is also considered live

Approaches

  • Reference Counting: The garbage collector tracks the number of references to every object. An object is live as long as there is at least one reference to it. One disadvantage of this approach is that it cannot detect cycles: if two or more objects refer to one another but are not accessible from the program, the GC cannot collect these objects because they still have a reference count bigger than 0. Another disadvantage is the overhead of updating the counter for every object.
  • Tracing: Trace out the graph of references from the roots and mark the objects along the graph paths. Any object not marked is considered no longer live. Most JVMs nowadays use tracing for garbage collection. The cycle case is sketched in the example below.
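To make the cycle problem concrete, here is a minimal Java sketch (my own illustration, not from the referenced article). A pure reference counter would never reclaim the two nodes below, while the JVM’s tracing collector will.

public class CycleDemo {
    static class Node {
        Node other;
    }

    public static void main(String[] args) {
        Node a = new Node();
        Node b = new Node();
        a.other = b; // a -> b
        b.other = a; // b -> a: a reference cycle

        // Drop the only root references (local variables on the Java stack).
        a = null;
        b = null;
        // Neither Node is reachable from a root now. Each still holds a
        // reference to the other, so a reference count would stay at 1,
        // but a tracing collector never marks them, making both eligible
        // for collection.
        System.gc(); // only a hint; the JVM may ignore it
    }
}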

References:
Java’s garbage-collected heap: http://www.javaworld.com/javaworld/jw-08-1996/jw-08-gc.html?page=1

Loading Python Pickle Files from Java

Recently I wrote a program in Python which dumps data into a pickle file. The file data is loaded into memory by another Python program for further processing. This worked pretty well until I needed to write a Java program that also loads the data saved in the pickle file.

I searched around but couldn’t find many resources about loading pickle files from Java. A few forum discussions mentioned calling a Python script from Java using the Jython interpreter. I dug deeper into the Jython implementation and found that we can actually load pickle files in Java with the help of the Jython jar library.

Below we describe the steps to load pickle files from Java code.

0. Download the Jython jar file and the sources.jar file from the Jython website. Assuming we are using the Eclipse IDE, add the downloaded Jython jar file as a library. We can also attach sources.jar as the source for the jar library for easier debugging.

1. Create a Python pickle (or cPickle) file. Below is a simple Python script that dumps a dictionary of lists to a pickle file.

#!/usr/bin/python

#import cPickle as pickle    # use either cPickle or pickle
import pickle

# build a dictionary of lists
idToCountries = {}

testa = []
testa.append("us")
testa.append("en")
idToCountries["12345"] = testa

testb = []
testb.append("cn")
testb.append("th")
idToCountries["12346"] = testb

# dump the dictionary to a pickle file and close it
outf = open("test.pkl", "wb")
pickle.dump(idToCountries, outf)
outf.close()

Save the file as test.py, and execute the commands below to create the test.pkl pickle file.

chmod +x test.py
./test.py
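Before moving to Java, we can sanity-check the pickle file by loading it back in Python with a one-liner; it should print the dictionary we just dumped.

python -c "import pickle; print pickle.load(open('test.pkl', 'rb'))"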

2. Load the data from the pickle file test.pkl, and put the data into a HashMap of ArrayLists. This is demonstrated in the Java file below.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;

import org.python.core.PyDictionary;
import org.python.core.PyFile;
import org.python.core.PyList;
import org.python.core.PyObject;
import org.python.core.PyString;
import org.python.modules.cPickle;

public class PickleLoader {

    public static void main(String[] args) {
        PickleLoader loader = new PickleLoader();
        HashMap<String, ArrayList<String>> loadCnt1 = loader.getIdToCountriesStr();
        System.out.println(loadCnt1.toString());
        HashMap<String, ArrayList<String>> loadCnt2 = loader.getIdToCountriesFileStream();
        System.out.println(loadCnt2.toString());
    }

    // Read the entire pickle file into a string, then unpickle it with cPickle.loads.
    public HashMap<String, ArrayList<String>> getIdToCountriesStr() {
        HashMap<String, ArrayList<String>> idToCountries = new HashMap<String, ArrayList<String>>();
        File f = new File("test.pkl");
        System.out.println(f.length());
        BufferedReader bufR;
        StringBuilder strBuilder = new StringBuilder();
        try {
            bufR = new BufferedReader(new FileReader(f));
            String aLine;
            while (null != (aLine = bufR.readLine())) {
                strBuilder.append(aLine).append("\n");
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        PyString pyStr = new PyString(strBuilder.toString());
        PyDictionary idToCountriesObj = (PyDictionary) cPickle.loads(pyStr);
        ConcurrentMap<PyObject, PyObject> aMap = idToCountriesObj.getMap();
        for (Map.Entry<PyObject, PyObject> entry : aMap.entrySet()) {
            String appId = entry.getKey().toString();
            PyList countryIdList = (PyList) entry.getValue();
            List<String> countryList = (List<String>) countryIdList.subList(0, countryIdList.size());
            ArrayList<String> countryArrList = new ArrayList<String>(countryList);
            idToCountries.put(appId, countryArrList);
        }
        return idToCountries;
    }

    // Open the pickle file as a stream and unpickle it with cPickle.load.
    // Preferred for large files, since the data is not buffered in a string first.
    public HashMap<String, ArrayList<String>> getIdToCountriesFileStream() {
        HashMap<String, ArrayList<String>> idToCountries = new HashMap<String, ArrayList<String>>();
        File f = new File("test.pkl");
        InputStream fs = null;
        try {
            fs = new FileInputStream(f);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            return null;
        }
        PyFile pyStr = new PyFile(fs);
        PyDictionary idToCountriesObj = (PyDictionary) cPickle.load(pyStr);
        ConcurrentMap<PyObject, PyObject> aMap = idToCountriesObj.getMap();
        for (Map.Entry<PyObject, PyObject> entry : aMap.entrySet()) {
            String appId = entry.getKey().toString();
            PyList countryIdList = (PyList) entry.getValue();
            List<String> countryList = (List<String>) countryIdList.subList(0, countryIdList.size());
            ArrayList<String> countryArrList = new ArrayList<String>(countryList);
            idToCountries.put(appId, countryArrList);
        }
        return idToCountries;
    }
}

The getIdToCountriesStr method reads all the data from the pickle file and passes it as a string to Jython’s cPickle.loads method. The getIdToCountriesFileStream method opens the file as a FileInputStream and passes it to Jython’s cPickle.load method; this is the preferred method for large pickle files.

In addition to loading the file, we also need to perform data type conversions: PyList to List and then to ArrayList, and PyDictionary to HashMap.

3. Sample Execution

The execution of the Java program will give us the following output.

{12346=[cn, th], 12345=[us, en]}
{12346=[cn, th], 12345=[us, en]}

As expected, both methods load the data from the pickle file into a Java HashMap.

Side note: the Jython jar file used for testing is jython-2.5.3.jar.
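If you are not using Eclipse, compiling and running from the command line should look roughly like the commands below (a sketch, assuming the jar sits next to PickleLoader.java; adjust the paths to your setup).

javac -cp .:jython-2.5.3.jar PickleLoader.java
java -cp .:jython-2.5.3.jar PickleLoader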

References:

Jython website: http://www.jython.org/

Android Audio Editor 1.0.0 Released

Last Saturday, I released a new app Android Audio Editor.

Because I was busy with my master thesis and a few other things, I hadn’t released any new apps for over 10 months. Now my master course is over and I’m back to app development again. It feels good. 🙂

This new app took me about one month to finish. It uses the open source ffmpeg library and the ringdroid application. Below are a few screenshots of the app.

[Screenshots of Android Audio Editor]

In terms of functionality, Android Audio Editor has the following main features.

  • ringtone maker: create a free ringtone/alarm/notification tone from an mp3 or aac file
  • audio conversion: wav to mp3, wma to mp3, m4a to mp3, and mp3/wav/wma/m4a to aac conversion
  • audio extraction: extract mp3 audio from mp4 video

As always, if you have any suggestions, please leave a comment.

Video Converter Android > 1,000,000 Downloads

Last Saturday, Oct 6 2012 to be exact, Video Converter Android reached the 1,000,000 downloads milestone.

The app was first released on Dec 17 2011. It took less than 10 months to reach 1 million downloads, much faster than I expected. 🙂

I checked my blog; below is a list of the download milestones.

  • Jan 21 2012: 50,000 downloads
  • Feb 20 2012: 100,000 downloads
  • Jul 26 2012: 500,000 downloads
  • Oct 6 2012: 1,000,000 downloads

The daily download count is now stable at around 8,000.

Enough good news. The bad news from the stats is that active users are only around 20%. That means the app cannot keep users for long. I personally think there are a few reasons, listed below.

  • Video conversion is generally slow. I cannot do much about this; hopefully it will change as mobile devices get more powerful.
  • People don’t need to convert videos very often. Lots of people may download the app, convert a video, and then uninstall the app.
  • The app size is big, especially the codec (~10MB). People will uninstall the app to free up space.
  • The app doesn’t work for some videos on some devices. I am trying to support more devices and videos, but it is hard.

Anyway, I will continue to refine the app. If you have any ideas of how to make the app better, please leave a comment below.

Write a Search Engine Friendly Android App Description — A Geek’s Approach with Apache Solr, Apache Nutch and AdWords

Many Android users find apps through search, so it is important to write a search engine friendly app description. In this post, we’ll build a search engine to see how well our app description performs.

The basic idea is as follows. First, we crawl descriptions of apps related to ours on Google Play (using Apache Nutch). Second, we get a few search terms that are frequently used by mobile phone users and related to our app. The third step is to use Nutch to crawl our own app description (stored in a text file). The fourth step is to build a search engine over the crawled data (using Apache Solr). The fifth step is to search with the terms obtained in step 2 and see how our app ranks. We can then refine our app description and repeat steps 3 and 5.

Below we describe each step in detail.

1. Crawl Google Play app description data using Apache Nutch.

If you don’t have Apache Nutch set up, please refer to Apache Nutch set up and basic usage. In addition, please follow the instructions at Apache Nutch Crawling HTTPS to enable HTTPS crawling for Apache Nutch. We need this because Google Play data is accessed through HTTPS.

Note that the conf/nutch-site.xml should have the following configuration.

<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Nutch Test Spider</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
</configuration>

We want to crawl only app description data, so we change the conf/regex-urlfilter.txt file as below.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# skip comments and descriptions in other languages
-reviewId
-hl=

# accept app description only
+^https://play.google.com/store/apps/details\?id

Specifically, we filter out URLs containing reviewId and hl=, which correspond to the review pages and description pages in other languages (we only consider the default description here).
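To check that the filter rules behave as intended before a full crawl, Nutch 1.x ships a URL filter checker we can pipe candidate URLs through (a sketch; the exact invocation may vary with the Nutch version). A '+' prefix in the output means the URL is accepted, a '-' means rejected.

echo "https://play.google.com/store/apps/details?id=roman10.media.converter" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined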

We’ll create a urls folder with a file named seed.txt under it using the commands below.

cd ${nutch root folder}
mkdir urls
cd urls
touch seed.txt

The following content should be added to the seed.txt file. (In our example, we’ll crawl data related to the app Video Converter Android)

https://play.google.com/store/apps/details?id=roman10.media.converter

Then we can use the following script to crawl the data.

#!/bin/bash

bin/nutch inject crawl/crawldb urls

for i in {1..10}
do
    bin/nutch generate crawl/crawldb crawl/segments -topN 100
    s2=`ls -dtr crawl/segments/2* | tail -1`
    echo $s2
    bin/nutch fetch $s2
    bin/nutch parse $s2
    bin/nutch updatedb crawl/crawldb $s2
done

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

The number of iterations and the topN value can be changed depending on how much data we want to collect.

Save the content to a script crawl.sh under the Nutch root directory, and start crawling by entering the following commands.

chmod +x crawl.sh
./crawl.sh

2. Obtain a list of keywords using the Google AdWords Keyword Tool.

The Google AdWords Keyword Tool allows us to get keywords and their search frequencies. The advanced options let us constrain the results based on location, language, etc. Since we’re dealing with mobile apps, we set “Show Ideas and Statistics for” to “All mobile devices”. Below is a screenshot of the tool.

Below is a list of 12 keywords and the number of times each was searched within a month.

video converter: 450,000
video to video converter: 450,000
video to audio: 301,000
video audio: 301,000
audio from video: 301,000
video converters: 201,000
video convert: 135,000
free video converter: 74,000
videos converter: 74,000
video converter free: 74,000
free video converters: 40,500
mp4 video converter: 12,100

3. Use Apache Nutch to Crawl App Description in Text

Suppose our app description is saved in the file /home/roman10/desp/app.txt. We will make the following changes to enable Nutch to crawl the text app description.
Firstly, conf/nutch-site.xml should be updated with the following configuration.

<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Nutch Test Spider</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf|tika)|index-basic|query-(basic|site|url)</value>
    </property>
</configuration>

Secondly, we change regex-urlfilter.txt to something like the below.

# skip http, ftp and mailto; crawl file: urls only
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.

Thirdly, we update seed.txt under the urls folder to point to the file.

file:////home/roman10/desp/app.txt

Fourthly, we update the crawl.sh script as below.

#!/bin/bash

bin/nutch inject localfs/crawldb urls
bin/nutch generate localfs/crawldb localfs/segments -topN 1
s2=`ls -dtr localfs/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb localfs/crawldb $s2
bin/nutch invertlinks localfs/linkdb -dir localfs/segments

Lastly, we perform the text crawling by entering the command below.

./crawl.sh

4. Build the Search Engine with Apache Solr

The detailed instructions of how to set up Apache Solr can be found at post Apache Solr: Basic Set Up and Integration with Apache Nutch.

To index the crawled HTTP app descriptions and the text description, we use the commands below.

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
bin/nutch solrindex http://127.0.0.1:8983/solr/ localfs/crawldb -linkdb localfs/linkdb localfs/segments/*

Note that the indexing should be done with Apache Solr running.

5. Search the App using the Keywords and Refine the Description

We go to http://localhost:8983/solr/admin/ to search for the list of keywords we obtained in step 2. Once on the page, click “FULL INTERFACE” under “Make a Query” to switch to the full interface, and check the “Debug: enable” checkbox. We can then start searching. Below is a screenshot of the interface with the query string set to “video converter”.
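Alternatively, the same query can be issued directly as an HTTP request; something like the URL below (parameter names may vary slightly across Solr versions) returns the ranked results together with the debug information.

http://localhost:8983/solr/select/?q=video+converter&debugQuery=on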

Once we hit the Search button, the list of results is returned.

According to the results, our description in the text file ranks at number 1, so we’re confident about it. Note that the descriptions crawled from the web pages may contain some other content besides the description itself, so it may not be an entirely fair comparison. But it should be good enough for a rough estimate of how good our description is.

In case our description does not rank well, we can refine it, then crawl, index and search again until we get a good ranking.

Access Cassandra from Python with pycassa

pycassa is a Python client for Apache Cassandra databases. The source code of pycassa can be found at https://github.com/pycassa/pycassa.

0. Install pycassa

sudo easy_install pycassa

or

sudo pip install pycassa

1. Read and Write

Reading and writing with pycassa is easy: we connect to the database, get the column family, and then read from and write to it. Below is a simple script demonstrating basic reads and writes in pycassa.

#!/usr/bin/python

import datetime
import pycassa

dbHost = "localhost"
dbPort = "9160"
keyspace = "test"
cf = "testcf"
rowKey = "bbb"

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

connectStr = dbHost + ":" + dbPort
print connectStr

pool = pycassa.pool.ConnectionPool(keyspace, [connectStr], timeout=10)
colFam = pycassa.ColumnFamily(pool, cf)

# try to get the row first
try:
    res = colFam.get(rowKey)
except Exception, e:
    res = ""
print res

# insert a column for the row
colFam.insert(rowKey, {'testTime': datetime.datetime.now().strftime(DATETIME_FORMAT)})

# get the row again
res = colFam.get(rowKey)
print res

In the script’s output, the second line indicates the row value before the write and the third line shows the value after the write. Note that the keyspace and column family should have been created before running the script.
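For reference, the keyspace and column family used above can be created from cassandra-cli with something like the commands below (a minimal sketch; depending on the Cassandra version you may need to specify the placement strategy, replication factor and comparator explicitly).

create keyspace test;
use test;
create column family testcf;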

2. More about Read and Batch Write

We can read a specific column or a slice of columns, and we can also perform batch writes. Both are shown in the script below.

#!/usr/bin/python

import datetime
import pycassa

dbHost = "localhost"
dbPort = "9160"
keyspace = "test"
cf = "testcf"
rowKey = "bbb"

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

connectStr = dbHost + ":" + dbPort
print connectStr

pool = pycassa.pool.ConnectionPool(keyspace, [connectStr])
colFam = pycassa.ColumnFamily(pool, cf)

# read a specific column by name
try:
    ret = colFam.get(rowKey, ["testTime"])
except Exception, e:
    ret = ""
print "get with column name: " + str(ret)

# read a slice of columns between two column names
try:
    ret = colFam.get(rowKey, column_start="aa", column_finish="zz")
except Exception, e:
    ret = ""
print "get with column slice: " + str(ret)

# batch write: mutations are queued and sent together
bat = colFam.batch(queue_size=50)
bat.insert(rowKey, {'testTime': datetime.datetime.now().strftime(DATETIME_FORMAT)})
bat.insert(rowKey, {'testTime2': datetime.datetime.now().strftime(DATETIME_FORMAT)})
bat.send()

print "after insert: " + str(colFam.get(rowKey))

Note that the batch write is not necessary in our example since we are writing to the same row; batch writes are usually used to update multiple rows.

3. Unicode

Pycassa automatically detects the character encoding based on the default_validation_class of the column family. We can set this value to pycassa.types.UTF8Type() to instruct pycassa to use UTF-8 encoding. One use case is when the column family’s validation class is set to BytesType and we want to write UTF-8 characters to the column family. Below is a script that shows the usage of UTF-8 in pycassa.

#!/usr/bin/python

import datetime
import pycassa

dbHost = "localhost"
dbPort = "9160"
keyspace = "test"
cf = "testcf"
rowKey = "bbb"

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

connectStr = dbHost + ":" + dbPort
print connectStr

pool = pycassa.pool.ConnectionPool(keyspace, [connectStr])
colFam = pycassa.ColumnFamily(pool, cf)
# instruct pycassa to encode/decode column values as UTF-8
colFam.default_validation_class = pycassa.types.UTF8Type()

try:
    colFam.remove(rowKey, ["money"])
    ret = colFam.get(rowKey, ["money"])
except Exception, e:
    ret = ""
print "get with column name: " + str(ret)

try:
    ret = colFam.get(rowKey, column_start="aa", column_finish="zz")
except Exception, e:
    ret = ""
print "get with column slice: " + str(ret)

# \u20AC is the euro sign
colFam.insert(rowKey, {'money': u'\u20AC0.99'})

print "after insert: " + str(colFam.get(rowKey))



References:

1. Pycassa documentation: http://pycassa.github.com/pycassa/