Ghost Object Causing Crash in Java

A Bad Code Style

Java’s GC has made our life easier, but it also makes things more implicit and sometimes easier to make mistakes. Recently I encountered a bad code style which is easy to introduce bugs.

Imagining you have a class with a reference field and a method. The method creates a new object and assign it to the reference. Every time you call the method, a new object is created and assigned to the reference. The code is essentially like below,


class A {
    SomeClass B;
    void initB() {
         B = new SomeClass();
         …… do something with B … …
    }
}

Just call initB many times, this pattern creates a lot of objects without referencing to them. Since we’re not referencing to them, GC will do us the favor of garbage collecting them. However, things can easily go wrong.

When we do something with B after creation of the object, the object can be referred at somewhere else (e.g.: B is a task, and do something adds it to a system task queue). Now we call initB many times, we’ll end up with lots of objects somewhere and only one referred by B. We may not have control over the non-referenced objects any more. This results in garbage.

Things can go worse. If SomeClass defines some callback method, which is triggered later by some signal. We may ends up having callback method from many instances of SomeClass.

An Example in Android

Below is an example in Android.


package com.example.androidtest;

import android.os.Bundle;
import android.os.Handler;
import android.app.Activity;
import android.util.Log;

public class MainActivity extends Activity {

  private static final String TAG = "MainActivity";
  
  private Handler handler;
  private Runnable runnable;
  private Integer value;
  
  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    value = 10;
    startRunnable();
    startRunnable();
    removeRunnable();
  }
  
  private void startRunnable() {
    handler = new Handler();
    runnable = new Runnable() {
      @Override
      public void run() {
        Log.d(TAG, value.toString());
      }
    }; 
    handler.postDelayed(runnable, 2000);
  }
  
  private void removeRunnable() {
    handler.removeCallbacks(runnable);
    value = null;
  }
  
}

 

The startRunnable method creates a new Handler and Runnable objects, assigns them to the class reference field. We called this method twice, leaving a Handler and Runnable unreferenced by us.

However, these objects are still referenced by some Android system components. The postDelayed method essentially adds the runnable to a message queue associated with the current thread (The actual details are more complicated than this but you get the idea).

We then call removeRunnable to remove the runnable associated with handler, and assign null to value.

However, this doesn’t remove the ghost runnable associated with the ghost handler. And the run() method of the ghost runnable will be triggered, which will cause a crash when it tries to access value.toString().

The Fix

This coding style is bad and easy to introduce bugs. We essentially only want a single instance of Handler and Runnable at any time. If we have to create a new instance, we should ensure the old one is discarded properly, and leaving them referenced by system components is not.

If the system provides some callback functions that are guaranteed to be called once and we can create the objects inside those functions.

If we need to create objects inside some random methods, we can either reuse the old object, or properly reset the old object before creating new one. The code below shows how to do that to fix the example above.


package com.example.androidtest;

import android.app.Activity;
import android.os.Bundle;
import android.os.Handler;
import android.util.Log;

public class MainActivityNew extends Activity {

  private static final String TAG = "MainActivityNew";
  
  private Handler handler;
  private Runnable runnable;
  private Integer value;
  
  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    value = 10;
    startRunnable();
    startRunnable();
    removeRunnable();
  }
  
  private void startRunnable() {
    if (handler == null) {
      handler = new Handler();
    } else {
      handler.removeCallbacks(runnable);
    }
    runnable = new Runnable() {
      @Override
      public void run() {
        Log.d(TAG, value.toString());
      }
    }; 
    handler.postDelayed(runnable, 2000);
  }
  
  private void removeRunnable() {
    handler.removeCallbacks(runnable);
    value = null;
  }
  
}

 

In startRunnable method, if a Handler instance is available, we’ll just reuse it; If there’re potentially old Runnable associated with handler, we’ll remove them.

HtmlUnit Memory Leak — A Workaround

HtmlUnit is a programmable browser without GUI. It’s written in Java and exposes APIs that allow us to open a window, load a web page, execute Javascript, fill in forms, click buttons and clicks etc. HtmlUnit is typically used for website testing or crawling information from web sites.

Recently I worked on a task which uses HtmlUnit 2.10 to retrieve information of some web page with fairly complex Javascript. It seems the Javascript engine is causing some memory leak issues. After loading a few web pages, the memory usage is becoming high (>1GB) and eventually OutOfMemoryError will occur. The HtmlUnit FAQ page suggests that we should call WebClient.closeAllWindows(). I tried but it doesn’t work.

Instead of digging into the JavaScript engine and find out why the error happened. I decided to use a workaround — use a two process approach. The main process will keep track of the pages that has been crawled, what to crawl next etc. and create a child process to do the actual retrieval using HtmlUnit. After the child process finishes crawling for several pages, it will exit. The main process will create a new process to crawl next few pages. To make things simple, the two processes use file IO for Inter-Process Communication (IPC). That is, the child process writes what are the pages have been crawled, the main process reads it to update what have been crawled.

Because all memory allocated to the child process will be freed when it is terminated. This approach can work with the memory leak unfixed, but with performance penalty.

Reference:
1. HtmlUnit website: http://htmlunit.sourceforge.net/
2. http://htmlunit.sourceforge.net/faq.html

A Bug Occurred at malloc

0. Context

I’m developing a program that allocates memory dynamically in a loop. For every iteration of the loop, the code allocates memory using malloc and then operates on the allocated memory. The program crashes at second iteration.

1. Locate the Bug

I added a few printf statements, and found that the program crashes when calling malloc at second iteration.

I used valgrind to run the program, and below is the output,

alloc size: 46104
–4149– VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) – exiting
–4149– si_code=1;  Faulting address: 0x74616874;  sp: 0x62ac7a58
valgrind: the ‘impossible’ happened:
  Killed by fatal signal
==4149==    at 0x3809D201: myvprintf_str (m_debuglog.c:530)
==4149==    by 0x3809D9E2: vgPlain_debugLog_vprintf (m_debuglog.c:877)
==4149==    by 0x380298E5: vprintf_WRK (m_libcprint.c:111)
==4149==    by 0x380299A7: vgPlain_printf (m_libcprint.c:143)
==4149==    by 0x38027A96: vgPlain_assert_fail (m_libcassert.c:259)
==4149==    by 0x656C501F: ???
sched status:
running_tid=1
Thread 1: status = VgTs_Runnable
==4149==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==4149==    by 0x80527B2: ??? (in /home/roman10/Desktop/github/AndZop/jni/andzop_desktop/ffplay)

It can be seen that the segmentation fault is caused by malloc. But why?

2. The Reason and Fix

How come malloc can give segmentation fault?

I searched for a while, and found a discussion at reference 1. Someone else has encountered the same error at valgrind, and it’s caused by some bug at other code lines instead of malloc.

I checked my code carefully, and got the reason. At the first iteration, I used more the memory space allocated. When malloc is called at second iteration, it probably tried to allocate memory from the same memory region that was used illegally at first iteration.

3. Additional Notes

The code that stopped working doesn’t have to be the code that caused it.

Memory related bugs is hard to debug and one should pay special attention to memory manipulations.

Memory overflow may not cause any issues at the place it happens. But it’s likely to cause issues at other places. 

References:
1.
http://comments.gmane.org/gmane.comp.debugging.valgrind/9620

Bug Caused by pthread_create Input Parameters Passing

0. The Context

I was working on a multi-threaded application using pthread library on Linux. One part of the application uses a thread pool with 100 threads. Each thread is supposed to finish its job and exit. The execution noramlly takes less than several seconds (we’ll call it short-living thread).

1. Locate the Bug

The application behaves abnormal, and I added a few printf statements to find what went wrong. It seems the input parameters passed into the short living thread changes inside the thread execution function. I know that usually means I’m not handling the memory correctly.

2. The Reason and Fix

The input parameter is a data structure, I passed the pointer of the data structure to the thread execution function, and the content of the data structure is copied in the thread execution function. Then the data structure instance is reused.

Wait a minute. What if there’re multiple threads created by pthread_create at almost the same time, before the copy operation is carried out within each thread, the content can be changed by some thread already.

I realized that I’m not handling the parameter passing in pthread correctly. I should allocate separate data structure instance for each short living thread in the thread pool, then the data can never be messed up between threads.

3. How to Avoid Such Bugs

Every body who has programmed a multi-threaded application knows how difficult debugging can be sometimes. The shared data, the scheduling, dead-lock avoidance, there’s lots to be careful.

Program slowly, think before changing the code, and don’t leave everything to debugging.

A Bug Caused by Overlooking Operators Precedence in C

Side Note: I’ve been thinking of writing blogs for bugs that have caused me hours to locate and fix for quite some time.

Sometimes I found myself making silly mistakes, sometimes I made same mistakes multiple times. Just like every other programmer, I spent lots of time locating bugs and fixing them. It’s worthwhile to note the mistake down and avoid making them again. So here comes the first one.

0. The Context
I’m developing an application that needs to read files using mmap. I code the part that mmap a file, but whenever I run the program, the mmap operation always give me “No Such Device” error.

1. Locate the Bug
I searched mmap Linux man page, the “No Such Device” error code means “The fildes parameter does not refer to a *TYPE2 stream file (*STMF) in the “root” (/), QOpenSys, or user-defined file systems.”

A couple of other articles discussed this bug, but I’m not making the same mistake as discussed in those articles.

As the application contains lots of code and difficult to debug, I decided to write a simple test program (actually I just found a mmap sample online and copied it).

#include <stdio.h>

#include <stdlib.h>

#include <fcntl.h>

#include <unistd.h>

#include <sys/mman.h>

#include <sys/stat.h>

#include <errno.h>

 

int main(int argc, char *argv[])

{

    int fd, offset;

    char *data;

    struct stat sbuf;

    FILE *f;

 

    if (argc != 2) {

        fprintf(stderr, "usage: mmapdemo offsetn");

        exit(1);

    }

 

    f = fopen("./h1_1280_720_5m.mp4_mbstpos_gop1.txt", "w");

    fprintf(f, "sssssssssssssssssssssssssss");

    fflush(f);

    //this line will cause no such device error

    printf("fd: %dn", fd);

    if (fd = open("./h1_1280_720_5m.mp4_mbstpos_gop1.txt", O_RDONLY) == -1) {            //line caused bug

    //if ((fd = open("./h1_1280_720_5m.mp4_mbstpos_gop1.txt", O_RDONLY)) == -1) {            //correct source code

        perror("open");

        exit(1);

    }

    printf("fd: %dn", fd);

    if (stat("./h1_1280_720_5m.mp4_mbstpos_gop1.txt", &sbuf) == -1) {

        perror("stat");

        exit(1);

    }

    printf("file size: %ld", sbuf.st_size);

    offset = atoi(argv[1]);

    if (offset < 0 || offset > sbuf.st_size-1) {

        fprintf(stderr, "mmapdemo: offset must be in the range 0-%ldn", sbuf.st_size-1);

        exit(1);

    }

    

    data = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0)    ;

    if (data == (-1)) {

        perror("mmap");

        exit(1);

    }

 

    printf("byte at offset %d is '%c'n", offset, data[offset]);

 

    return 0;

}

The test code works fine, I then compare the code with the mmap code from the app. I didn’t notice any difference. Then I started to replace the code from test code with the app code line by line (comments out the original code, copy app code over and change parameters. In this way, both the test code and app code are kept and I can compare them easily). The error appears! It’s good that I can reproduce the error. I know that’s the first step to locate the bug in lots of cases.

Then I compare the code line by line, and located the difference.

if (fd = open(“./h1_1280_720_5m.mp4_mbstpos_gop1.txt”, O_RDONLY) == -1)
if ((fd = open(“./h1_1280_720_5m.mp4_mbstpos_gop1.txt”, O_RDONLY)) == -1)

2. The Reason and Fix

Once I located the bug, I almost know why this occurs. It must due to the precedence of “=” and “==” operators. I don’t remember the detail, so I googled it.

“==” operator has hight precedence than “=” operator. So in the app code, the open method return value is compared with -1 first, and the comparison result is assigned to fd. If the file is opened successfully, open method return values won’t equal to -1, the comparison result would be 0 (false). I then added some printf statement to confirm the reasoning.

Once the reason is found, fixing the bug is easy.

3. How to Avoid Such Bugs

I actually once read a programming suggestion that we should use parenthesis to indicate the order of evaluation in a compound statement. I agreed and followed most of the time.

Sometimes it can be painful to count the number of parenthesis as the statement is really complicated. Maybe that’s why I didn’t follow the practice so strictly.

The bug took me around 1.5 hours to fix and another 45 minutes to note it down. I’ve paid my price and I should remember to use parenthesis to indicate the execution order clearly, even they’re ugly sometimes. If the statement is too complex, we can always break it down to multiple simple statements.

The lesson is: Make precedence explicit by Parenthesis!

References:
1. C Operator Precedence and Associativity: http://www.difranco.net/cop2220/op-prec.htm