Given a set of n samples $x_1, x_2, …, x_n$, we want to label them as either -1 or 1. In total, there are $2^n$ possible label combinations.
A class of learning machines H can be used to label the samples. If for each label combination, we can always find a learning machine $h in H$ that labels it correctly, we then say that H shatters n points.
VC (Vapnik-Chervonekis) dimension is then defined as the maximum number of points that can be shattered by H, which measures the capacity of the hypothesis class H.
Note that VC(H) = 4 does not mean H can shatter any 4 points in the hyperplane, as long as there’s 4 points can be shattered by H, it’s good enough. And it also infers that for any > 4 (say 5) points in the hyperplane, it cannot be shattered by H.
Now consider a simple example to find the VC dimension of a class of hypothesis H. H is defined as straight line in a two dimensional hyperplane.
We can find the following 3 points a, b and c where all $2^3$ possible label combinations can be classified correctly by H.
As mentioned before, it’s possible to find 3 points that H cannot shatter. For example, we just place a, b and c in the same line. There will be combinations that a straight line cannot label them correctly. But this doesn’t affect the VC dimension as it’s defined as the maximum number of points can be shattered.
One can try separate 4 points using a straight line, but no matter where we place those 4 points, there will be combinations that cannot be separated.