Support Vector Machine Concept: VC Dimension


Given a set of n samples $x_1, x_2, …, x_n$, we want to label them as either -1 or 1. In total, there are $2^n$ possible label combinations.

A class of learning machines H can be used to label the samples. If for each label combination, we can always find a learning machine $h in H$ that labels it correctly, we then say that H shatters n points.

VC (Vapnik-Chervonekis) dimension is then defined as the maximum number of points that can be shattered by H, which measures the capacity of the hypothesis class H.

Note that VC(H) = 4 does not mean H can shatter any 4 points in the hyperplane, as long as there’s 4 points can be shattered by H, it’s good enough. And it also infers that for any > 4 (say 5) points in the hyperplane, it cannot be shattered by H.

Now consider a simple example to find the VC dimension of a class of hypothesis H. H is defined as straight line in a two dimensional hyperplane.

We can find the following 3 points a, b and c where all $2^3$ possible label combinations can be classified correctly by H.


As mentioned before, it’s possible to find 3 points that H cannot shatter. For example, we just place a, b and c in the same line. There will be combinations that a straight line cannot label them correctly. But this doesn’t affect the VC dimension as it’s defined as the maximum number of points can be shattered.

One can try separate 4 points using a straight line, but no matter where we place those 4 points, there will be combinations that cannot be separated.


Naive Bayes

Bayes’ Theorem

Let’s start from Bayes’ theorem, also referred as Bayes’ law or Bayes’ rule.

P(A|B) = P(B, A) / P(B)

= P(B|A) * P(A) / P(B)

= P(B|A) * P(A) / (P(B|A) * P(A) + P(B|^A) * P(^A))

P(A): prior probability. It’s the probability event A happens.

P(^A): the probability that event A not happen.

P(B): evidence, or background. The probability of event B happens.

P(B|A), P(B|^A): conditional probability, or likelihood. The probability of event B happens given A happened or not happened respectively.

P(A|B): posterior probability. The probability of A happens taking into account B for and against A.

Naive Bayes

When used for classification, Bayes’ Theorem can be expressed as below,

P(C|F1, F2, … Fn) = P(C, F1, F2, …, Fn) / P(F1, F2, … , Fn)

= P(F1, F2, … Fn|C) * P(C) / P(F1, F2, …, Fn)

C is some class/label we can classify a sample into, and F1, F2, … Fn represents features of the sample data.

P(F1, F2, …, Fn) doesn’t depend on C and are normally given or can be calculated based on probability of each feature. It’s effectively a constant and can be ignored for classification purpose.

The numerator can be expressed as following,

P(C, F1, F2 … , Fn)

= P(C) * P(F1, F2, … , Fn|C)

= P(C) * P(F1 | C) * P(F2, F3, … Fn | F1, C)

= P(C) * P(F1 | C) * P(F2 | F1, C) * P(F3, … Fn | F1, F2, C)

= P(C) * P(F1 | C) * P(F2 | F1, C) * P(F3 | F1, F2, C) * …  * P(Fn | F1, F2, …, Fn-1, C)

In Naive Bayes, all features are assumed to be independent. Thus Fi is independent from every other feature Fj where j != i. Therefore we have

P(F2 | F1, C) = P(F2 | C)

P(F3 | F1, F2, C) = P(F3 | C)

P(Fn | F1, F2, … Fn-1, C) = P(Fn | C)


P(C, F1, F2 … , Fn) = P(C) * P(F1 | C) * P(F2 | C) * P(F3 | C), …, P(Fn | C)

For example, two authors A and B like to use words “love”, “life” and “money”. The probability of these words appears in A’s article is 0.1, 0.1 and 0.8, and in B’s as 0.5, 0.3 and 0.2. Now we have the phrase “love life”, which one of the author is more likely to have written that?

Without any information, there’s 50% percent probability for either A or B. Assuming the words are independent features, we can use Naive Bayes.

P(A | love, life) = P(A) * P(love | A) * P(life | A) / P(love, life) = 0.5 * 0.1 * 0.1 / P(love, life)

P(B | love, life) = P(B) * P(love | B) * P(life | B) / P(love, life) = 0.5 * 0.5 * 0.3 / P(love, life)

Clearly, it’s more likely that the phrase “love life” is written by author B. Note that P(love, life) is independent from the authors and just a scaling factor.


  1. Bayes’ theorem:
  2. Naive Bayes classifier:
  3. Udacity, Intro to Machine Learning. Lesson 1. Naive Bayes.

Performance Metrics for Binary Classification

Binary classification classifies samples as either 0 (negative) or 1(positive). Depending on which class/label the sample data belongs to and its predication result, each sample fits in a cell of the following table.

predicted | actual positive negative
positive true positive (TP) false positive (FP)
negative false negative (FN) true negative (TN)

We can then calculates different metrics, which measures the classification results from different angles.

True Positive Rate (Sensitivity, hit rate, recall): Out of all the positive samples, what fraction is actually detected as positive.

TPR = TP / (TP + FN)

True Negative Rate (Specificity): Out of all the negative samples, what fraction is actually detected as negative.

TNR = TN / (TN + FP)

Positive Predictive Value (Precision): Out of all samples predicted as positive, what fraction is actually positive.

PPV=TP / (TP + FP)

Negative Predictive Value: Out of all samples predicted as negative, what fraction is actually negative.

NPV = TN / (TN + FN)

False Positive Rate (Fall out): Out of all negative samples, what fraction is detected as positive by mistake.

FPR = FP / (FP + TN) = 1 – TNR

False Discovery Rate: Out of all samples predicted as positive, what fraction is actually negative.

FDR = FP / (FP + TP) = 1 – PPV

Accuracy: Out of all samples, what fraction is predicted correctly. That is, positive samples are predicted as positive, negative samples are predicted as negative.

Accuracy = (TP + TN) / (P + N)

The table below gives a nice overall view of the metrics mentioned above,

predicted | actual

positive negative


true positive (TP) false positive (FP)

PPV = TP / (TP + FP)


false negative (FN) true negative (TN)

NPV = TN / (FN + TN)

TPR = TP / (TP + FN) TNR = TN / (FP  + TN)

Accuracy =

(TP + TN) / (P + N)

F1 score: the harmonic mean of precision and recall.

F1 =2 * TPR*PPV / (TPR + PPV) = 2TP / (2TP + FP + FN)

Matthews correlation coefficient: It takes account true and false positives and negatives, and is regarded as balanced measure. It can be used for cases where the number of samples at different classes vary drastically.

MCC = (TP * TN – FP * FN) / √(TP + FP)(TP + FN)(TN + FP)(TN + FN)

MCC returns a value between -1 to 1, where -1 indicates total disagreement between prediction and actual facts, 0 means no better than random guess, and 1 indicates perfect prediction.


  1. Matthews correlation coefficient:
  2. Sensitivity and specificity:

Logistic Regression

This is some notes taken when I summarize the things learned after taking Andrew Ng’s machine learning course at coursera.


Linear regression predicts continuous values. At times, we need to categorize things. Logistic regression is a probabilistic statistical classification model does that.

We will examine how logistic regression classify things to two categories (either 0 or 1) first, and then how it is used for multiple categories.

The logistic regression model can be described by the following logistic/sigmoid function below,


h(x) an be interpreted as the estimated probability that y = 1 on input x.

If theta’X >= 0, h(x) >= 0.5, we predict output y = 1

If theta’X < 0, h(x) < 0.5, we predict output y = 0

theta’X essentially describes the decision boundary.  Note that we can use  other values instead of 0.5 as the cutoff point if it is more suitable.

Cost Function

The cost function for logistic regression is defined as below,


The cost is further defined as,


We can merge the functions, and the cost function eventually becomes


With regularization, the cost function becomes,

6Note that j starts from 1 as a convention.

Gradient Descent

The gradient descent of logistic regression is identical to linear regression, except that h(x(i)) is different.

Multi-class Classification: One-vs-All

We can use one-vs-all technique to apply logistic regression to multi-class classification. The idea is to train a logistic regression classifier for each class i to predict the probability that y = i. Then we pick the category that has the maximum probability for an input.

Linear Regression

This is some notes taken when I summarize the things learned after taking Andrew Ng’s machine learning course at coursera.


Regression is a technique to model relationships among variables. Typically, there’s one dependent variable y and one or many independent variables. This relationship is usually expressed as a regression function.

Linear regression, as the name suggests, models the relationship using a linear regression function. Depending on how many independent variables we have, we have simple linear regression with one independent variable and multivariate linear regression with more than one independent variables.

The hypothesis of linear regression can be described by the following equation,


The X are called features, and theta are the parameters. Given a set of training samples, we’ll need to choose theta to fit the training examples.

To measure how well we fit the training examples, we define the cost function of linear regression as below,


m represents the number of training samples, h(x) is the predicted value and y is the sample output value. The cost function measures the average square error of all samples and then divide by 2.

This is essentially an optimization problem where we need to choose parameter theta such that the cost defined by the cost function is minimized.

Over-fitting and Regularization

Fitting the regression parameters minimize the error for training samples, however we can run into the problem of trying too hard such that the regression function doesn’t generalize well. i.e.: The hypothesis produce high error for input outside of the training set. This problem is known as overfitting.

Two commonly used techniques to address overfitting is reducing number of features and regularization.

Regularization adds an additional term to the cost function to penalize having large theta value, which tends to produce much more smooth curves.


Note that by convention, the regularization term exclude j=0 case, which is theta 0.

Given the hypothesis and its cost function, there’re many ways to fit the parameter theta (i.e., solve the optimization problem), including conjugate gradient, BFGS, L-BFGS etc. The most commonly used technique is Gradient Descent.

Gradient Descent

The idea of gradient descent is to start at some random values, evaluate the cost. And keep iterating on theta value based on the function below to reduce the cost until we reach a minimal.


The alpha is called the learning rate. It can be proven that if choose a sufficiently small alpha value, the cost will converge at some minimum. However, we don’t want alpha value to be too small in practice because it will take longer time. Typically, we try out a range of alpha values (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1) and plot the cost to see how fast it converges.

For linear regression with regularization, the above equation is essentially the following,gd

The second term can easily be rewritten as,


Feature Scaling and Mean Normalization

When we do gradient descent, the values for different features normally differ in scale. For example, feature A may have value in the range of [1, 10], feature B varies from [-10000, 10000].

It’s good to have the feature values have similar scales and centered around 0 (i.e.: have approximately mean of 0).

The former can be achieved using feature scaling, just divide every value of that feature by a number such that the range is approximately [-1, 1]. The latter is accomplished using mean normalization (This doesn’t apply to X0). We can usually use (X – mean) to achieve this.

Numerical Analysis

Besides using optimization algorithms to fit theta iteratively, it turns out we can also compute the theta values numerically.

Without regularization, the numerical equation is as below,


While this method doesn’t need to choose learning rate and iterate, it is more computationally expensive as n get large because of the matrix multiplication and inverse. In addition, the inverse may not even exist. This is typically due to redundant features (some features are not linearly independent) or too many features too few samples.

With regularization, the numerical solution is the following,


Note that inverse part will exist even if the equation without regularization is not invertible.