Real-Time Video Bilayer Segmentation–Theory–Part 2 Producing the Trimap

Side Note: First draft on Mar 31 2011.

This article is a follow up of the part 1 Bayesian Estimation.

2. Producing the Trimap

Below is an example of a trimap,


Figure 1. Trimap based on Iprob

How do we get this trimap?

Part 1 has covered how to compute the Iprob, once we have the Iprob for every pixel of a video frame, we compare the Iprob with a small real value epsilon.


here B stands for background, F stands for foreground and U refers to unknown. In figure 1, the white area are F, black area are B and gray area are U.

For the white and black areas, it’s confident to decide whether they’re foreground or background based on Iprob alone. The gray regions are further processed through MRF to decide it’s foreground or background.

As a matter of fact, the benefit of this trimap is to reduce the MRF processing. With the trimap, MRF processing is only done for pixels in unknown region.

3. GraphCut for Unkown Regions

There are many problems can be described using MRF in the form of

E = Ed+λEs

where the Ed is the data cost and Es is the smoothness cost. In essence, Ed describes the energy of a pixel being classified as certain category, Es describes the dependence of this classification’s dependency on its neighbors. The solution for these problems are usually finding a classification that results in minimum cost for all pixels.

There’re many ways to solve the cost (some refer as energy) minimization problem, one of them is GraphCut.

For this real-time bilayer segmentation, the Ed and Es function can be expressed as,


here Data(Seg) is the Ed for all pixels under unknown region. And,


here Sth(Seg) is the Es for all neighboring pixels under unknown region. dist(p,q) refers Euclidean distance between pixels p and q.

Note that Seg can be either F or B in the two equations above. For smoothness energy, if the two neighboring pixels p and q are classified as the same category (both F or both B), the energy is 0.

P(Cp|F), P(Cp|B) and Iprob are already calculated in part 1, smoonthness energy is also straightforward to calculated for two given pixels.

The method of calculating L(pi|F) and L(pi|B) will be covered in Part 3.

Real-Time Video Bilayer Segmentation–Theory–Part 1 Bayesian Estimation

Side Note: First draft on Mar 30 2011.

This article is mainly based on the reference paper 1 and 2.

Video bilayer segmentation refers to the process of dividing the video frames into foreground and background. Here we introduce a video bilayer segmentation process which is close to real-time.

The entire process can be illustrated as the figure below,


Figure 1. Process Overview of the real-time Bilayer Segmentation

The input includes the video and the segementation mask for the first frame. The segmentation of the first frame can be done using background subtraction, interactive graph cut, image snapping, lazy snapping and so on.

The segmentation for the rest of the video frames are done one by one automatically by the process illustrated above.

1. Bayesian Estimation

For each pixel p in a video frame, a probability Iprob (p) of a pixel belongs to foreground can be expressed as,


where Cp is the color vector of pixel p, F and B are foreground and background respectively. The likelihood P(Cp|F) and P(Cp|B) are calculated by accumulating background and foreground color historgrams. The prior P (F) and P (B) are computed from the previous segmentation mask.

1.1 Calculation of likelihood P(Cp|F) and P(Cp|B)

The likelihood basically indicates the probability of a color being foreground (as in P(Cp|F)) or background (as in P(Cp|B)) based on color distribution of all previous segmentation results.

To build a color histogram (here we use 2 dimensional histogram as example), we set a two dimentional grid, each bin in the grid with certain value ranges. For example,

[0…10, 0..10][0..10, 11…20]….[0..10, 251..260]

[11..20, 0..10][11..20, 11..20]…[11..20, 251..260]

[251..260, 0..10][251..260, 11..20]…[251..260, 251..260]

And we count the number of pixels that falls into this 2-dimentional grids. As the video frame pixel normally contains 3 components, therefore, a 3-dimensional can be built for it.

There’re two ways of creating the color histograms for likelihood calculation. The first one is the accumlative histograms. As the foreground and background changes, the accumlative histogram can incorporate these changes into the calculation. As segmentation always contain some errors, the accumlation process can be improved by only updating the histogram for the bins which have zero values. In this case, the error pixels doesn’t accumulate and only have very small values in the histogram with limited influence.

The other method is to use the first segmentation result to build the color histogram and use it for subseqent processing. This is useful if we know the foreground and background color distribution is not going to change much.

Once the color histogram is built, We can normalize them and read the values for the likelihoods P(Cp|F) and P(Cp|B).

1.2 Compute Priors P (F) and P(B)

The priors are computed based on the previous frame, in consideration of temporal correlations. The computation can be expressed as the formula below,


where a(t-1) is the previous segmentation mask, with 255 indicates the foreground and 0 for background. G3x3 and G7x7 are Gaussian filters. Resize are scaling transformation operations. The result Mt is a smoothed mask.

With Mt, the P (F) and P(B) are be calculated by,


With the priors and likehoods, Iprob can be calculated.


1. Real-Time Video Matting Based on Bilayer Segmentation

2. Live Video Segmentation in Dynamic Backgrounds Using Thermal Vision

Use OpenCV2.1.0 with Matlab 2010a

Side Note: First draft on Mar 27 2011.

OpenCV and Matlab are often “competitors”, as a lot of image/video processing techniques are implemented in both of them. However, there’re times you want to combine the strength/functions of the two. And luckily Matlab provides C/C++ interface.

Below are the step-by-step configurations for setting up Matlab environment for calling OpenCV functions.

1. Download and Install OpenCV.

2. Set up C/C++ compiler for Matlab.

  • Start up matlab 2010a, enter “mex –setup”. The screen capture below shows this step in detail,


3. Edit the mexopts.bat file.

  • The file location can be found at the second last line of the screen capture.
  • Add “set OPENCVDIR=C:OpenCV2.1” at the beginning of the file just below the line “set VCINSTALLDIR=%VSINSTALLDIR%VC”.
  • Add “%OPENCVDIR%includeopencv;” at the begnning of the the “set INCLUDE=” line
  • Add “%OPENCVDIR%lib;%OPENCVDIR%bin;” at the beginning of the “set LIB=” line
  • Add “cv210.lib highgui210.lib cvaux210.lib cxcore210.lib” at the line “set LINKFLAGS=” at the position before “/nologo”.

4. Edit PATH environment variable

Append “C:OpenCV2.1lib” at the end of the PATH.


5. Restart Matlab and you can start using OpenCV from Matlab by using the Matlab C++ interface.

Image Histogram Equalization

Side Note: First draft on Mar 26 2011.

1. Concepts

Image Histogram refers to a graphical representation of tonal distribution of a digital image. It plots the number of pixels at each tonal level. (according to wikipedia)

Image histogram equalization is an image processing method for image contrast adjustment. The nice property of image histogram equalization is it makes use of the input image property. It can be very effective for image that has low contrast. Below is an example,


2. Matlab Implementation

Matlab has histogram equalization function built-in, histeq. But one can easily implement it himself/herself.

A simple histogram equalization makes use of the Cumulative Distribution Function (CDF),


where Sk is the transformed tonal level for an input tonal level rk. k = 0, 1, …2^numOfBitsPerChannelOfAPixel-1 is the tonal level range. nj is the total number of pixels with jth tonal level. The equation says the total number of pixels have gray level no more than kth level divide by the total number of pixels will be the CDF of input tonal level k.

The CDF is able to map the input tonal level for every pixel to a range of [0, 1], once we scale this range to [0, 255], we can produce the output tonal level for every pixel.

The matlab implementation for Gray image and explanation is as below,

    %1. read the image file into a matrix
    %2. get the total number of pixel
    %3. get a one-dimensional matrix indicating number of 
    %pixels at each
    for i=1:1:256
    figure, imhist(I);  %for test
    %4. get the cumulative number of pixels at each grey-level
    %[note]can be improved by using the previous calculated
    % B for calculation of
    %current B
    for i=1:1:256
        for j=1:1:i
    %5. get the normalized cumulative distribution at each 
    % grey-level
    for i=1:1:256
    figure, plot(0:255, B);
    %6. for each pixel, compute it's output intensity
    O=zeros(nrows,ncols, 'uint8');
    for i=1:1:nrows
        for j=1:1:ncols
    AO = zeros(1, 256);
    for i=1:1:256
        AO(1, i) = numel(O(O==i-1));
    figure, imhist(O);
    %for comprison with built-in histogram equalization
    O1 = histeq(I);
    figure, imhist(O1);
    figure, imshow(O1);

The plots for the histogram are shown below,


The 3 plots are input image, the output image histogram plot for the implementation above and the matlab built-in implementation histogram.

3. Photoshop

Photoshop (not sure starts with which version, but CS5 definitely) has histogram equalization built-in at Image=>Adjustments=>Curves.

Pre-increment and Post-increment

Side Note: First draft on Mar 26 2011. Back in undergraduate, I was in a programming competition team. We were writing small programs that have to be extremely efficient as there’re time constraint for running these programs. One night, a teammate of mine told me that “you know, pre-increment is more efficient than post-increment”.

Well. The different of pre-increment and post-increment is not difficult to tell. People has basic programming background usually know pre-increment returns the value after increment (think it as increment first, then return), while post-increment returns the value before increment (think it as return the value first, then increment). For example (in C programming language),

#include <stdio.h>

int main(int argc, char **argv)
    int x = 5;
    printf("x=%dn", ++x);	//pre-increment
    printf("x=%dn", x++);	//post-increment
    printf("x=%dn", x);
    return 0;

The program will give you the following result,

x = 6
x = 6
x = 7

That’s the difference most people know about. If we don’t care about the value returned by the increments, instead using it for increasing the value only, is any difference between them?

The answer is YES. And this goes down to the implementation of the two operations. Here we just use C syntax to illustrate the difference, the actual implementation for different languages may vary, but you should get the idea that pre-increment is more efficient.

int pre_increment(int* _input) {
    *_input = *_input + 1;
    return *_input;

int post_increment(int* _input) {
    int temp = *_input;
    *_input = *_input + 1;
    return temp;

As you can tell from the code, post_increment needs to introduce a temporary variable to hold the value before the increment operation. That’s why it is less efficient.

So next time, you are writing a loop increment operation, you know pre-increment is a better choice. Smile

for (int i = 0; i < N; ++i) {
    //do some processing

MPEG4–Overview of Video Decoding

Side note: First draft on Mar 23 2011.

This article is based on MPEG4 Simple Profile.

MPEG4 Part2 (Visual) standard defines the bitstream syntax for MPGE4 Part2 compatible video, but let the manufactuers to customize and figure out their encoding process and implementation. On the other hand, the standard defines the decoding process in detail. This article gives an overview of the decoding process.

The entire process can be illustrated as the diagram below,


The decoding consists of 3 main procedures, shape decoding, texture decoding and motion decoding. For MPEG-4 SP video, arbitrary shape is not supported and therefore the shape decoding is not applicable.

The decoding starts with Entropy decoding, which is not shown in the above diagram. The decoded bits will go through different decoding procedures according to bitstream syntax.

The texture decoding will decode the run length symbols, then the inverse scan is carried out to recover the quantized DCT coefficients. Then inverse quantization and IDCT is done to recover the pixel values for DCT blocks. The texture decoding operates on the spatial domain.

Motion decoding is then carried out to get the motion vectors,. These motion vectors, with the decoded texture results and previous reconstructed VOP as reference, motion compensation is carried out to reconstruct the VOP. The motion decoding operates on the time domain.

The Evil of Trial-and-Error

Side note: First draft on Mar 23 2011. If something good becomes cheap, it may not be good at all.

Well, maybe the title is a bit overwhelming. Trial is a good thing in general. You have the chances to test out your ideas, collect feedbacks, and correct the errors. In traditional wisdom, it’s always good to try it out, especially when it takes a lot of courage.

But things turn upside down when trail is cheap. It costs almost nothing for you to try. Ya, that’s often true in programming. Actually, that’s an attractive thing about it. You have an idea, you always implement it and try it out. Computers give you feedback instantly in most cases. Isn’t that great?

Ya, I love it too. But it turns out we love trial-and-error so much that we become additive to it and simply give up thinking. Well, I believe many programmers have experience debugging a program without much thinking. Just modify something and try it out, see if it works. It takes no time anyway. Error? Modify something else, try it out. Thinking is too troublesome and sometimes we’re just lazy (busy typing is not really working hard by the way). In the end, hours have gone and we’re still stuck.

Actually, those bugs may just require a little bit of thinking, sometimes a little bit of reading. But we are too busy with trial-and-error. Ya, we work extra hours to do it.

That’s the evil of trial-and-error.


Side note: First Draft on Mar 22 2011.

This article covers discrete cosine transform and inverse discrete cosine transform used in MPEG4 Simple Profile.

MPEG-4 part 2(also called MPEG4 Visual) defines 3 different types of inverse DCT operations, standard 8×8 IDCT, SA-IDCT (Shape Adaptive DCT), and ΔDC-SA-DCT (SA-DCT with DC separation and ΔDC Correction). As MPEG-4 Simple Profile doesn’t support arbitary shape encoding, only standard DCT is applicable for it.

1. DCT

According to MPEG4 standard, the NxN 2-Dimentional DCT is defined as below,


where x, y = 0, 1, …N-1, they’re coordinates in the sample domain (spatial domain).

u, v = 0,1,…N-1, they’re the coordinates in the transform domain, and


The above equations can be expressed in matrix equations. For DCT,


where f is the matrix of samples, A is the transform matrix, and F is the transformed DCT coefficients.

The values of A can be derived from the previous equation,




The first matrix multiplication Af can be seen as 1-D DCT of each column of f, and the second matrix multiplication (Af)At can be seen as a 1-D DCT on each row of (Af). Note that the order of the two multiplications can be exchanged.


The inverse DCT is defined as,


Supposed a pixel is represented in n bits, then the input to the DCT and output from the IDCT are represented with (n+1) bits. The DCT coefficients are represented in (n+4) bits. The range for the coefficients is [-2^(n+3), +2^(n+3)-1].

Again, the equation can be expressed in matrix form,


The meanings and values follow DCT operation above.

A Simple Note on Engery Minimization for Markov Random Field

Side note: First Draft on Mar 22 2011. I don’t like Math, probably because I never find it is useful until my college ends. Then I realized it’s in fact very useful in lots of area, well, at least in multimedia area. To get rid of the afraid I had about math, everything math related is put under Machinery. It’s a cool name.

Markov Random Field is a new branch of probability theory that seems to be important in theory and applications of probability. It’s a branch, so definitely I am not going to/cannot cover it here. Please refer to Markov Random Fields and Their Applications for detailed introduction.

In image and video processing, a lot of problems can be expressed as a energy minization problem with the equation below,

E = Ed+λEs

where Ed is the data cost for every pixel and Es is the inter-pixel cost also called smoothness term. The solution to these problems can often be transformed to minimization of the energy E.

There is very good implementation of energy minization library online at Middlebury Computer Vision Lab. The implementation provides a simple to use API to minimize the energy function with various algorithms, including Iterated Conditional Modes (ICM), Graph Cuts (Swap-move and Expansion-move), Max-Product Loopy Belief Propagation (LBP) and Tree-Reweighted Message Passing (TRM).

With this library, all you have to do is to define the data cost function, smoothness function, pass in the graph elements, and call “optimze()” method. Details can be found at the README.txt of the library.

According to the paper from the lab, the expansion-move graph cuts seem to be a good algorithm to use in many cases, but other algorithms may also perform better under some cases.