Choose the Right Clock on Android

Android framework provides three different time keeping facilities. This post discusses the main differences and usage of them.

System.currentTimeMillis

It returns the number of milliseconds elapsed since the epoch, which is 00:00:00 January 1, 1970 UTC.

Note that this clock is not guaranteed to be monotonic, because of the followings:

  • The clock can be changed by network time sync
  • The clock can be updated by user or program

The clock can jump backwards or forwards unpredictably. We can listen for ACTION_TIME_TICK,ACTION_TIME_CHANGED and ACTION_TIMEZONE_CHANGED Intent broadcasts to find out when the time changes.

When to use

  • Get the wall clock time
  • Inerval/elapsed time that needed to span across device reboot. This is not ideal, but a cheap and OK solution sometimes. Ideally we want to have a server provided timestamp to avoid all the unpredictably jump can happen to this clock.

When not to use

  • Interval or elapsed timing doesn’t need to span across device reboot. E.g.: memory cache expiration time

SystemClock.uptimeMillis

It returns the number of milliseconds since device boot. The clock stops when system enters deep sleep. The clock is guaranteed to be monotonic.

This clock is the basis for Thread.sleep(millls), Object.wait(millis), and System.nanoTime().

When to use

  • Interval or elapsed timing doesn’t need to span across device reboot, and we want to exclude device deep sleep

SystemClock.elapsedRealtime

It returns the number of milliseconds since device boot. The clock don’t stop even when system enters deep sleep. It’s also guaranteed to be monotonic.

elapsedRealtimeNanos(): Similar to method above, but returns number of nano seconds.

When to use

  • Interval or elapsed timing doesn’t need to span across device reboot, and we want to include device deep sleep

Caveats

1. When using System.nanoTime or elapsedRealtimeNanos, differences span greater than approximately 292 years (263 nanoseconds) will not be computed correctly due to numerical overflow.

2. When comparing two timestamps, we should use t2 – t1 > 0 instead of t2 > t1, because of possible numerical overflow.

In order to help us understand why t2 – t1 is better than t2 > t1 in case of numerical overflow, let’s assume that t1 and t2 are bytes within the range of [-128, 127].
When t1 = 126, t2 = t1 + 3 will result in numerical overflow, and t2 = -127. If we use t2 > t1, it will return false. But t2 – t1 = -127 – 126, which will again result in numerical overflow, and the result is 3.

This only works when the difference of t2 and t1 itself can be represented by the data type. In the byte example above, t2 – t1 must be less than or equal to 127.

ArrayMap and its friends in Android

Previous post we discussed about SparseArray and its friends in Android. While they can replace HashMap when integer and long are used as key. It doesn’t work when the key is an arbitrary object.

Actually similar ideas can be applied to create more memory efficient generic map data structure. This post will go through ArrayMap, ArraySet and their friends, which replaces HashMap and HashSet in a more memory efficient way.

Usage

ArrayMap implements Map<K, V> interface, and it’s fully compatible with HashMap. There’a simplified version of ArrayMap: SimpleArrayMap, which doesn’t have the burden of maintaining compatibility with Java Map inteface, but still provides most of the commonly needed methods.

Correspondingly there’s ArraySet implements the Set<V> interface.

We should consider using ArrayMap and ArraySet at the following cases.

1. The number items is less than 1000.
2. Frequent reading, insertion and removal are less frequent.

And if we don’t need to maintain compatibility with Java interface, we can use SimpleArrayMap where ArrayMap is used.

Internals

We have described how HashMap works internally at SparseArray and its friends in Android. Basically it uses an array of linked list, and lots of unused space is allocated in the array.

For ArrayMap, two arrays are used. One is used to store the hash code of all keys, and the other is used to store the key value pairs. Key is stored at the even index, and value is saved at odd index.

Figure 1. Internal arrays in ArrayMap [image from ref 1]

Binary search is applied on the hash code array to locate the index, and the same index is used in the second array to store and retrieve key and value. Specifically,

1. For get, the hash code of key is computed and used in binary search to find the index. Then two searches are performed to find the equal key in the second array. One is searching forward, and the other is search backward. If no equal key is found, null is returned.

2. For put, if an equal key is found, the value is updated; otherwise a new key, value pair is inserted at the end of the equal hash code range. Array resizing might be applied in the process.

3. ArrayMap internally caches the arrays of size 4 and 8, to avoid spamming garbage. This further improves the memory efficiency when number of items in the map is small.

Advantages and Disadvantages

The internals of ArrayMap offers the following advantages compared with HashMap.

1. It’s more memory efficient. There’s no unused space allocated.

2. When there’s no element, ArrayMap doesn’t allocate memory, whereas HashMap allocates unused space.

3. Iteration with keyAt and valueAt is just like iterating an array. It’s very efficient. HashMap uses Iterator, which is slower and takes more memory.

But these advantages are not free,

1. Accessing a value from ArrayMap is O(logn), whereas in HashMap is O(1). This makes not much difference when number of elements in the Map is less than 1000.

2. Insertion and Deletion are relatively expensive, since we need to shift array elements, or allocate new array and copy elements over.

References

1. Android Fun with ArrayMaps performance video.

SparseArray and its friends in Android

One most important factor affecting Android app performance is memory. Android framework provides some speical classes to help developers optimizating memory usage. SparseArray and its friends are some of them.

How to use

SparseArray and its friends are designed to replace HashMap in Android when there isn’t large number of elements in the map. Below we list outs the Java usage and its corresponding Android replacement.

SparseArray          <Integer, Object>
SparseBooleanArray   <Integer, Boolean>
SparseIntArray       <Integer, Integer>
SparseLongArray      <Integer, Long>
LongSparseArray      <Long, Object>
LongSparseLongArray  <Long, Long>

Note that Android support library have two classes named SparseArrayCompat and LongSparseArray for backward compatibility. SparseArrayCompat supports methods like removeAtRange, which is only available at SparseArray from API 19.

When to use

In a word, we should use SparseArray and its friends when both of the following are met,
1. The key of the map is integer or long.
2. The number of elements is less than several hundreds.

This is because of the following advantages offered by SparseArray.
1. The key is primitive type, so putting an element doesn’t require creating an aditional Entry object, while HashMap does.
2. Also because key is primitive, there’s no auto-boxing of keys.

According to Android Memories presentation, for holding 1000 elements, SparseArray uses 8k while HashMap uses 64k.

class HashMap<k, v="">{ //  Class = 12 + 8 * 4 = 48 bytes
    Entry<k, v="">[] table; // Entry = 32 + 16 + 16 = 64 bytes
    Entry<k, v="">[] forNull; //  Array = 20 + 1000 * 64 = 64024 bytes
    int size;
    int modCount;
    int threshold;
    Set keys;
    Set<entry<K,V>> entries; // Total = 64,136 bytes
    Collection values;
}

class SparseIntArray { //  Class = 12 + 3 * 4 = 24 bytes
    int[] keys;  //  Array = 20 + 1000 * 4 = 4024 bytes
    int[] values;
    int size; //  Total = 8.072 bytes
}

But the memory efficiency doesn’t come for free.
1. HashMap computes the hash code to find the index to insert a new element in O(1). SparseArray uses binary search to find the index to insert in O(logn), which is less time efficient. According to Android official documentation, this difference is less than 50% for up to hundreds of items.
2. SparseArray is Android special, meaning less portable.

In addition, SparseArray offers some additional features.
1. The keys put into SparseArray are sorted ascendingly.
2. We can get an individual key or value given an index.

In contrast, HashMap doesn’t guarantee any order of the map. In fact, the order can change over time.

Internals

The pros and cons of SparseArray vs HashMap is due to their implementation.

For SparseArray, keys and values are kept in separate arrays.

1. Adding an element will use a binary search to find the insertion position.
a. If there’s already an element for key, replace it.
b. If there’s not and insertion space is available because the element there is marked as DELETED, insert it there.
c. Otherwise, we need to move array elements and grow the array capacity if necessary, and then insert the element.

2. Remove an element is optimized. The class simply mark it as DELETED. And when we query the size, the array needs to grow size, or do anything required to have a correct index, the array is then actually compacted.

Note that the method to compact the array is called gc, though it doesn’t necessarily will cause Java runtime GC to happen.

3. Because keys and values are kept in separate arrays and inserted using binary search position, the keys are guaranteed to be sorted.

HashMap is implemented using an array of linked list.

1. Adding an element will cause the class to compute the hash value of the key object (another hash is done on top of the hash code to further avoid collision). Based on the hash code, an array index is found to insert the value.

At each array index, there’s a linked list of Entry object, with key, value, hash and a pointer to next Entry object. When the same key (same reference or equals) is used, the old value is updated. Otherwise, a new Entry object is created, and added to the linked list.

2. Remove an element is similar. Finding the index based on hash in the array and traversal the linked list to find the correct Entry object to remove.

For a class with good hashcode implementation, the number of collisions will be small and the linked list at each array index will be short.

Android LayerDrawable and Drawable.Callback

LayerDrawable is a Drawable that manages an array of Drawables, where each Drawable is a layer of it. It can be used to compose fancy visual effects. However, used incorrectly, it can also introduce difficult to catch bugs. This post discusses a bug that can be caused by the Callback mechanism of Drawable.

Chain of Callbacks

In LayerDrawable, every layer/drawable registers the LayerDrawable as its Drawable.callback. This allows the layer to inform the LayerDrawable when it needs to redraw. As shown in the invalidateSelf method from Drawable.java, the DrawableCallback.invalidateDrawable is called to inform the client (the LayerDrawable, in this case) the drawable needs to redraw.

public void invalidateSelf() {
    final Callback callback = getCallback();
    if (callback != null) {
        callback.invalidateDrawable(this);
    }
}

 

Also, a View registers itself as Drawable.callback, so when the drawable needs to redraw, the View can be informed and invalidated. So if we set background of a View to be LayeredDrawable, we have a chain of DrawableCallbacks. This is illustrated in the figure below.

1.jpg

Figure 1. Chain of Callbacks when Setting LayerDrawable as View’s Background

When the Drawable needs to redraw, it calls its callback.invalidateDrawable, which in turns calls LayerDrawable’s callback.invalidateDrawable, which causes the View to redraw the corresponding region if necessary.

View Breaks Callback of Old Background Drawable

So far all work well, but the setBackgroundDrawable of View.java has the following code snippet.

public void setBackgroundDrawable(Drawable background) {
    ...

    /*
     * Regardless of whether we're setting a new background or not, we want
     * to clear the previous drawable.
     */
    if (mBackground != null) {
        mBackground.setCallback(null);
        unscheduleDrawable(mBackground);
    }

    …
    if (background != null) {
        background.setCallback(this);
    }
    ...
}

This means setting a new background will break the old background drawable’s callback by setting it to null, regardless whether the callback is still set to the View.

The Bug

With all the knowledge above, we can “fabricate” a bug by follow the steps below.

  1. Set a Drawable A as a background of a View V. The A.callback is set to V.

  2. Create a LayerDrawable L with A as one layer. Now A.callback is set to L.

  3. Now set another Drawable (or just null) as background of V. V will set callback of its old background (A in our example) to null, which breaks the link between A and L.

  4. Updates A won’t trigger L to update.

The solution to this issue is to update the background of V before creating L. This is illustrated in the code example here.

In the example, we tried animating the actionbar background. The background consists of two layers, the first one is a green ColorDrawable, and second a bitmap drawable. We want to animate the alpha value of the bitmap drawable, so it appears the image is faded in.

The code contains two methods, one works and the other doesn’t due to the bug we described. Each method can be triggered by a button. Below we extract the key part of the code, one can refer to the link for more details.


Button btn1 = (Button) findViewById(R.id.button1);
btn1.setOnClickListener(new View.OnClickListener() {
  @Override
  public void onClick(View v) {
    // 1. This sets launcherIconDrawable.callback to actionBar
    actionBar.setBackgroundDrawable(launcherIconDrawable);
    animateActionBarWorking();
  }
});
Button btn2 = (Button) findViewById(R.id.button2);
btn2.setOnClickListener(new View.OnClickListener() {
  @Override
  public void onClick(View v) {
    // 1. This sets launcherIconDrawable.callback to actionBar
    actionBar.setBackgroundDrawable(launcherIconDrawable);
    animateActionBarNotWorking();
  }
});

private void animateActionBarNotWorking() {
	Drawable[] layers = new Drawable[] { colorLayer, launcherIconDrawable };
	// 2. This sets launcherIconDrawable.callback to layerDrawable
	LayerDrawable layerDrawable = new LayerDrawable(layers);
	// 3. This sets launcherIconDrawable.callback to null
	actionBar.setBackgroundDrawable(layerDrawable);
	ValueAnimator valueAnimator = ValueAnimator.ofInt(0, 255);
	valueAnimator.setDuration(1000);
	valueAnimator.addUpdateListener(new ValueAnimator.AnimatorUpdateListener() {
	  @Override
	  public void onAnimationUpdate(ValueAnimator animation) {
		// 4. Updates launcherIconDrawable will not trigger action bar background to update
		// as launcherIconDrawable.callback is null
		launcherIconDrawable.setAlpha((Integer) animation.getAnimatedValue());
	  }
	});
	valueAnimator.start();
}

private void animateActionBarWorking() {
	actionBar.setBackgroundDrawable(null);
	animateActionBarNotWorking();
}

 

References:

  1. LayerDrawable: http://developer.android.com/reference/android/graphics/drawable/LayerDrawable.html

Support Vector Machine Concept: VC Dimension

[latexpage]

Given a set of n samples $x_1, x_2, …, x_n$, we want to label them as either -1 or 1. In total, there are $2^n$ possible label combinations.

A class of learning machines H can be used to label the samples. If for each label combination, we can always find a learning machine $h in H$ that labels it correctly, we then say that H shatters n points.

VC (Vapnik-Chervonekis) dimension is then defined as the maximum number of points that can be shattered by H, which measures the capacity of the hypothesis class H.

Note that VC(H) = 4 does not mean H can shatter any 4 points in the hyperplane, as long as there’s 4 points can be shattered by H, it’s good enough. And it also infers that for any > 4 (say 5) points in the hyperplane, it cannot be shattered by H.

Now consider a simple example to find the VC dimension of a class of hypothesis H. H is defined as straight line in a two dimensional hyperplane.

We can find the following 3 points a, b and c where all $2^3$ possible label combinations can be classified correctly by H.

1

As mentioned before, it’s possible to find 3 points that H cannot shatter. For example, we just place a, b and c in the same line. There will be combinations that a straight line cannot label them correctly. But this doesn’t affect the VC dimension as it’s defined as the maximum number of points can be shattered.

One can try separate 4 points using a straight line, but no matter where we place those 4 points, there will be combinations that cannot be separated.

 

Naive Bayes

Bayes’ Theorem

Let’s start from Bayes’ theorem, also referred as Bayes’ law or Bayes’ rule.

P(A|B) = P(B, A) / P(B)

= P(B|A) * P(A) / P(B)

= P(B|A) * P(A) / (P(B|A) * P(A) + P(B|^A) * P(^A))

P(A): prior probability. It’s the probability event A happens.

P(^A): the probability that event A not happen.

P(B): evidence, or background. The probability of event B happens.

P(B|A), P(B|^A): conditional probability, or likelihood. The probability of event B happens given A happened or not happened respectively.

P(A|B): posterior probability. The probability of A happens taking into account B for and against A.

Naive Bayes

When used for classification, Bayes’ Theorem can be expressed as below,

P(C|F1, F2, … Fn) = P(C, F1, F2, …, Fn) / P(F1, F2, … , Fn)

= P(F1, F2, … Fn|C) * P(C) / P(F1, F2, …, Fn)

C is some class/label we can classify a sample into, and F1, F2, … Fn represents features of the sample data.

P(F1, F2, …, Fn) doesn’t depend on C and are normally given or can be calculated based on probability of each feature. It’s effectively a constant and can be ignored for classification purpose.

The numerator can be expressed as following,

P(C, F1, F2 … , Fn)

= P(C) * P(F1, F2, … , Fn|C)

= P(C) * P(F1 | C) * P(F2, F3, … Fn | F1, C)

= P(C) * P(F1 | C) * P(F2 | F1, C) * P(F3, … Fn | F1, F2, C)

= P(C) * P(F1 | C) * P(F2 | F1, C) * P(F3 | F1, F2, C) * …  * P(Fn | F1, F2, …, Fn-1, C)

In Naive Bayes, all features are assumed to be independent. Thus Fi is independent from every other feature Fj where j != i. Therefore we have

P(F2 | F1, C) = P(F2 | C)

P(F3 | F1, F2, C) = P(F3 | C)

P(Fn | F1, F2, … Fn-1, C) = P(Fn | C)

Thus,

P(C, F1, F2 … , Fn) = P(C) * P(F1 | C) * P(F2 | C) * P(F3 | C), …, P(Fn | C)

For example, two authors A and B like to use words “love”, “life” and “money”. The probability of these words appears in A’s article is 0.1, 0.1 and 0.8, and in B’s as 0.5, 0.3 and 0.2. Now we have the phrase “love life”, which one of the author is more likely to have written that?

Without any information, there’s 50% percent probability for either A or B. Assuming the words are independent features, we can use Naive Bayes.

P(A | love, life) = P(A) * P(love | A) * P(life | A) / P(love, life) = 0.5 * 0.1 * 0.1 / P(love, life)

P(B | love, life) = P(B) * P(love | B) * P(life | B) / P(love, life) = 0.5 * 0.5 * 0.3 / P(love, life)

Clearly, it’s more likely that the phrase “love life” is written by author B. Note that P(love, life) is independent from the authors and just a scaling factor.

References:

  1. Bayes’ theorem: http://en.wikipedia.org/wiki/Bayes%27_theorem
  2. Naive Bayes classifier: http://en.wikipedia.org/wiki/Naive_Bayes_classifier
  3. Udacity, Intro to Machine Learning. Lesson 1. Naive Bayes.

Performance Metrics for Binary Classification

Binary classification classifies samples as either 0 (negative) or 1(positive). Depending on which class/label the sample data belongs to and its predication result, each sample fits in a cell of the following table.

predicted | actual positive negative
positive true positive (TP) false positive (FP)
negative false negative (FN) true negative (TN)

We can then calculates different metrics, which measures the classification results from different angles.

True Positive Rate (Sensitivity, hit rate, recall): Out of all the positive samples, what fraction is actually detected as positive.

TPR = TP / (TP + FN)

True Negative Rate (Specificity): Out of all the negative samples, what fraction is actually detected as negative.

TNR = TN / (TN + FP)

Positive Predictive Value (Precision): Out of all samples predicted as positive, what fraction is actually positive.

PPV=TP / (TP + FP)

Negative Predictive Value: Out of all samples predicted as negative, what fraction is actually negative.

NPV = TN / (TN + FN)

False Positive Rate (Fall out): Out of all negative samples, what fraction is detected as positive by mistake.

FPR = FP / (FP + TN) = 1 – TNR

False Discovery Rate: Out of all samples predicted as positive, what fraction is actually negative.

FDR = FP / (FP + TP) = 1 – PPV

Accuracy: Out of all samples, what fraction is predicted correctly. That is, positive samples are predicted as positive, negative samples are predicted as negative.

Accuracy = (TP + TN) / (P + N)

The table below gives a nice overall view of the metrics mentioned above,

predicted | actual

positive negative

positive

true positive (TP) false positive (FP)

PPV = TP / (TP + FP)

negative

false negative (FN) true negative (TN)

NPV = TN / (FN + TN)

TPR = TP / (TP + FN) TNR = TN / (FP  + TN)

Accuracy =

(TP + TN) / (P + N)

F1 score: the harmonic mean of precision and recall.

F1 =2 * TPR*PPV / (TPR + PPV) = 2TP / (2TP + FP + FN)

Matthews correlation coefficient: It takes account true and false positives and negatives, and is regarded as balanced measure. It can be used for cases where the number of samples at different classes vary drastically.

MCC = (TP * TN – FP * FN) / √(TP + FP)(TP + FN)(TN + FP)(TN + FN)

MCC returns a value between -1 to 1, where -1 indicates total disagreement between prediction and actual facts, 0 means no better than random guess, and 1 indicates perfect prediction.

References:

  1. Matthews correlation coefficient: http://en.wikipedia.org/wiki/Matthews_correlation_coefficient
  2. Sensitivity and specificity: http://en.wikipedia.org/wiki/Sensitivity_and_specificity

Ghost Object Causing Crash in Java

A Bad Code Style

Java’s GC has made our life easier, but it also makes things more implicit and sometimes easier to make mistakes. Recently I encountered a bad code style which is easy to introduce bugs.

Imagining you have a class with a reference field and a method. The method creates a new object and assign it to the reference. Every time you call the method, a new object is created and assigned to the reference. The code is essentially like below,


class A {
    SomeClass B;
    void initB() {
         B = new SomeClass();
         …… do something with B … …
    }
}

Just call initB many times, this pattern creates a lot of objects without referencing to them. Since we’re not referencing to them, GC will do us the favor of garbage collecting them. However, things can easily go wrong.

When we do something with B after creation of the object, the object can be referred at somewhere else (e.g.: B is a task, and do something adds it to a system task queue). Now we call initB many times, we’ll end up with lots of objects somewhere and only one referred by B. We may not have control over the non-referenced objects any more. This results in garbage.

Things can go worse. If SomeClass defines some callback method, which is triggered later by some signal. We may ends up having callback method from many instances of SomeClass.

An Example in Android

Below is an example in Android.


package com.example.androidtest;

import android.os.Bundle;
import android.os.Handler;
import android.app.Activity;
import android.util.Log;

public class MainActivity extends Activity {

  private static final String TAG = "MainActivity";
  
  private Handler handler;
  private Runnable runnable;
  private Integer value;
  
  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    value = 10;
    startRunnable();
    startRunnable();
    removeRunnable();
  }
  
  private void startRunnable() {
    handler = new Handler();
    runnable = new Runnable() {
      @Override
      public void run() {
        Log.d(TAG, value.toString());
      }
    }; 
    handler.postDelayed(runnable, 2000);
  }
  
  private void removeRunnable() {
    handler.removeCallbacks(runnable);
    value = null;
  }
  
}

 

The startRunnable method creates a new Handler and Runnable objects, assigns them to the class reference field. We called this method twice, leaving a Handler and Runnable unreferenced by us.

However, these objects are still referenced by some Android system components. The postDelayed method essentially adds the runnable to a message queue associated with the current thread (The actual details are more complicated than this but you get the idea).

We then call removeRunnable to remove the runnable associated with handler, and assign null to value.

However, this doesn’t remove the ghost runnable associated with the ghost handler. And the run() method of the ghost runnable will be triggered, which will cause a crash when it tries to access value.toString().

The Fix

This coding style is bad and easy to introduce bugs. We essentially only want a single instance of Handler and Runnable at any time. If we have to create a new instance, we should ensure the old one is discarded properly, and leaving them referenced by system components is not.

If the system provides some callback functions that are guaranteed to be called once and we can create the objects inside those functions.

If we need to create objects inside some random methods, we can either reuse the old object, or properly reset the old object before creating new one. The code below shows how to do that to fix the example above.


package com.example.androidtest;

import android.app.Activity;
import android.os.Bundle;
import android.os.Handler;
import android.util.Log;

public class MainActivityNew extends Activity {

  private static final String TAG = "MainActivityNew";
  
  private Handler handler;
  private Runnable runnable;
  private Integer value;
  
  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    value = 10;
    startRunnable();
    startRunnable();
    removeRunnable();
  }
  
  private void startRunnable() {
    if (handler == null) {
      handler = new Handler();
    } else {
      handler.removeCallbacks(runnable);
    }
    runnable = new Runnable() {
      @Override
      public void run() {
        Log.d(TAG, value.toString());
      }
    }; 
    handler.postDelayed(runnable, 2000);
  }
  
  private void removeRunnable() {
    handler.removeCallbacks(runnable);
    value = null;
  }
  
}

 

In startRunnable method, if a Handler instance is available, we’ll just reuse it; If there’re potentially old Runnable associated with handler, we’ll remove them.

Logistic Regression

This is some notes taken when I summarize the things learned after taking Andrew Ng’s machine learning course at coursera.

Introduction

Linear regression predicts continuous values. At times, we need to categorize things. Logistic regression is a probabilistic statistical classification model does that.

We will examine how logistic regression classify things to two categories (either 0 or 1) first, and then how it is used for multiple categories.

The logistic regression model can be described by the following logistic/sigmoid function below,

1

h(x) an be interpreted as the estimated probability that y = 1 on input x.

If theta’X >= 0, h(x) >= 0.5, we predict output y = 1

If theta’X < 0, h(x) < 0.5, we predict output y = 0

theta’X essentially describes the decision boundary.  Note that we can use  other values instead of 0.5 as the cutoff point if it is more suitable.

Cost Function

The cost function for logistic regression is defined as below,

3

The cost is further defined as,

5

We can merge the functions, and the cost function eventually becomes

4

With regularization, the cost function becomes,

6Note that j starts from 1 as a convention.

Gradient Descent

The gradient descent of logistic regression is identical to linear regression, except that h(x(i)) is different.

Multi-class Classification: One-vs-All

We can use one-vs-all technique to apply logistic regression to multi-class classification. The idea is to train a logistic regression classifier for each class i to predict the probability that y = i. Then we pick the category that has the maximum probability for an input.

Linear Regression

This is some notes taken when I summarize the things learned after taking Andrew Ng’s machine learning course at coursera.

Introduction

Regression is a technique to model relationships among variables. Typically, there’s one dependent variable y and one or many independent variables. This relationship is usually expressed as a regression function.

Linear regression, as the name suggests, models the relationship using a linear regression function. Depending on how many independent variables we have, we have simple linear regression with one independent variable and multivariate linear regression with more than one independent variables.

The hypothesis of linear regression can be described by the following equation,

linear_regression

The X are called features, and theta are the parameters. Given a set of training samples, we’ll need to choose theta to fit the training examples.

To measure how well we fit the training examples, we define the cost function of linear regression as below,

linear_regression_cost_function

m represents the number of training samples, h(x) is the predicted value and y is the sample output value. The cost function measures the average square error of all samples and then divide by 2.

This is essentially an optimization problem where we need to choose parameter theta such that the cost defined by the cost function is minimized.

Over-fitting and Regularization

Fitting the regression parameters minimize the error for training samples, however we can run into the problem of trying too hard such that the regression function doesn’t generalize well. i.e.: The hypothesis produce high error for input outside of the training set. This problem is known as overfitting.

Two commonly used techniques to address overfitting is reducing number of features and regularization.

Regularization adds an additional term to the cost function to penalize having large theta value, which tends to produce much more smooth curves.

linear_regression_cf_regularization

Note that by convention, the regularization term exclude j=0 case, which is theta 0.

Given the hypothesis and its cost function, there’re many ways to fit the parameter theta (i.e., solve the optimization problem), including conjugate gradient, BFGS, L-BFGS etc. The most commonly used technique is Gradient Descent.

Gradient Descent

The idea of gradient descent is to start at some random values, evaluate the cost. And keep iterating on theta value based on the function below to reduce the cost until we reach a minimal.

linear_regression_gradient_descent

The alpha is called the learning rate. It can be proven that if choose a sufficiently small alpha value, the cost will converge at some minimum. However, we don’t want alpha value to be too small in practice because it will take longer time. Typically, we try out a range of alpha values (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1) and plot the cost to see how fast it converges.

For linear regression with regularization, the above equation is essentially the following,gd

The second term can easily be rewritten as,

gd2

Feature Scaling and Mean Normalization

When we do gradient descent, the values for different features normally differ in scale. For example, feature A may have value in the range of [1, 10], feature B varies from [-10000, 10000].

It’s good to have the feature values have similar scales and centered around 0 (i.e.: have approximately mean of 0).

The former can be achieved using feature scaling, just divide every value of that feature by a number such that the range is approximately [-1, 1]. The latter is accomplished using mean normalization (This doesn’t apply to X0). We can usually use (X – mean) to achieve this.

Numerical Analysis

Besides using optimization algorithms to fit theta iteratively, it turns out we can also compute the theta values numerically.

Without regularization, the numerical equation is as below,

linear_regression_numerical_analysis

While this method doesn’t need to choose learning rate and iterate, it is more computationally expensive as n get large because of the matrix multiplication and inverse. In addition, the inverse may not even exist. This is typically due to redundant features (some features are not linearly independent) or too many features too few samples.

With regularization, the numerical solution is the following,

linear_regression_numerical_analysis_2

Note that inverse part will exist even if the equation without regularization is not invertible.