Slice Resynchronization in MPEG4

MPEG4 Part 2 introduced three error resilience tools: Resynchronization, Data Partitioning and Reversible VLC. This post discusses Resynchronization only.

The Problem

The bitstream of an MPEG4 video frame (like that of many other video codecs) is encoded using VLC (Variable Length Coding). Because the number of bits for each coefficient varies and the length is implicit, a VLC bitstream is sensitive to errors. If an error causes the wrong number of bits to be decoded for a coefficient, the bits for the next coefficient will be affected, and so on. The decoder essentially loses synchronization with the encoder. In this way, the error propagates and the video quality suffers.

GOB (Group of Blocks) in H.261 & H.263

H.261 and H.263 organize the macroblocks into groups, called Group of Blocks. Each GOB contains one or more rows of macroblocks and a GOB header with a resynchronization marker and other information that can be used to resynchronize the decoder.

The GOB approach is based on spatially periodic resynchronization: a resynchronization marker and the other GOB header fields are inserted when a particular macroblock position is reached during encoding. This results in a different number of bits in each GOB, because the number of encoded bits varies per macroblock. In picture areas where more bits are used to encode the scene, the resynchronization markers are sparser, which makes it more difficult to conceal errors in those areas.

Slice in MPEG4 (Packet-Based Resynchronization)

MPEG4 adopts a video-packet-based resynchronization scheme. During encoding, a frame is divided into one or more video packets (also sometimes called slices). The length of each slice/packet is not based on a number of macroblocks. Instead, if the number of bits exceeds a predetermined threshold, the current slice is ended and a new slice is created at the start of the next macroblock.

The structure of a slice is as below,

Resync Marker MB_number quant_scale HEC MB data

A resync marker is used to indicate the start of a new slice. It's different from all possible VLC code words and from the VOP start code. In addition, the information necessary to restart the decoding process is provided, including:

macroblock_number: macroblock position of the first macroblock in the video packet, which facilitates spatial resynchronization.

quantization_scale: quantization parameters needed to decode the first macroblock, which facilitates resynchronization of differential decoding.

HEC: Header Extension Code. A single bit indicating whether additional information follows it. When set to 1, additional info is available in the video packet header: modulo_time_base, vop_time_increment, vop_coding_type, intra_dc_vlc_thr, vop_fcode_forward and vop_fcode_backward.

Note that when HEC is equal to 1, the slice header contains all the information necessary to decode the slice, so the slice can be decoded independently. If HEC is set to 0, the decoder still needs some information from elsewhere to decode the slice.
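As an illustration, here is a sketch (in Python) of reading a video packet header after the resync marker. The field widths are simplified: in a real bitstream the macroblock_number width depends on the frame size, and HEC=1 is followed by the full set of duplicated VOP fields, not just vop_coding_type. The BitReader helper is hypothetical.

```python
# Sketch of parsing an MPEG-4 video packet header after a resync marker.
# Field widths are simplified for illustration.

class BitReader:
    def __init__(self, data: bytes):
        self.bits = "".join(f"{b:08b}" for b in data)
        self.pos = 0

    def read(self, n: int) -> int:
        val = int(self.bits[self.pos:self.pos + n], 2)
        self.pos += n
        return val

def parse_packet_header(r: BitReader, mb_number_bits: int) -> dict:
    header = {
        "macroblock_number": r.read(mb_number_bits),
        "quant_scale": r.read(5),  # quantization parameter for the first MB
        "hec": r.read(1),          # Header Extension Code
    }
    if header["hec"]:
        # With HEC=1 the duplicated VOP-level fields follow
        # (only vop_coding_type is shown in this sketch).
        header["vop_coding_type"] = r.read(2)
    return header

# Example: macroblock_number=5 (9 bits), quant_scale=12, HEC=0
bits = f"{5:09b}" + f"{12:05b}" + "0"
data = int(bits + "0" * (8 - len(bits) % 8), 2).to_bytes((len(bits) + 7) // 8, "big")
hdr = parse_packet_header(BitReader(data), mb_number_bits=9)
print(hdr)  # → {'macroblock_number': 5, 'quant_scale': 12, 'hec': 0}
```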

When the slice resynchronization tool is used, some of the encoding tools are modified to remove dependencies between any two video packets. For example, predictive encoding must be confined within a video packet to prevent propagation of errors. In other words, a slice boundary is treated as a VOP boundary for AC/DC prediction and motion vector prediction.

Fixed-Interval Resynchronization

Packet-based resynchronization produces video packets of similar, but not exactly the same, length. If an error happens to produce a bit pattern identical to the resync marker, the decoder won't be able to tell. This is normally known as start code emulation.

To avoid this problem, MPEG4 also adopts a method called fixed-interval resynchronization. It requires that VOP start codes and video packet resynchronization markers appear only at legal fixed-interval positions in the bitstream. The fixed interval is achieved by stuffing the video packet with a leading ‘0’ and zero or more ‘1’s.

At decoding, the decoder only needs to search for VOP start code and resynchronization marker at the beginning of each fixed interval. Therefore, emulating a VOP start code or resynchronization marker in the middle of a fixed interval cannot confuse the decoder.


1. The MPEG-4 Book, by Fernando C.N. Pereira, Touradj Ebrahimi
2. MPEG-4 Standard, Part 2, Annex E.1 Error Resilience

JPEG Standard–A Tutorial Based on Analysis of Sample Picture–Part 2. JPEG File

This is a follow-up to the previous blog: JPEG Standard, A Tutorial Based on Analysis of Sample Picture – Part 1. Coding of an 8×8 Block.

The sample jpeg image used for analysis in this tutorial is below,


Figure 1. Sample JPEG Image for Analysis of JPEG File

The most common file extensions for a JPEG image are .jpg and .jpeg. A JPEG file consists of many segments, each beginning with a marker. A marker contains two or more bytes. The first byte is 0xFF; the second byte indicates which marker it is. The optional length bytes indicate the size of the marker’s payload data (including the length bytes, excluding the first two marker bytes). In case the marker payload data doesn’t align with a byte boundary, the leftover bits are set to 1.
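As a sketch of this structure, the following Python snippet walks the marker segments of a JPEG byte stream. It is simplified: it ignores restart markers and stops at SOS, where entropy-coded data begins.

```python
# A minimal JPEG segment scanner (a sketch, not a full parser). It walks
# the markers of a JPEG byte stream and reports each marker with its
# payload length. Standalone markers (SOI, EOI) carry no length bytes.

STANDALONE = {0xD8, 0xD9}  # SOI, EOI (real files also have 0x01, 0xD0-0xD7)

def scan_markers(data: bytes):
    segments = []
    i = 0
    while i < len(data) - 1:
        assert data[i] == 0xFF, "expected a marker"
        marker = data[i + 1]
        if marker in STANDALONE:
            segments.append((marker, 0))
            i += 2
        else:
            # big-endian length, which includes these two length bytes
            length = (data[i + 2] << 8) | data[i + 3]
            segments.append((marker, length))
            i += 2 + length
        if marker == 0xDA:  # SOS: entropy-coded data follows; stop here
            break
    return segments

# A tiny synthetic stream: SOI, APP0 with a 16-byte payload, EOI.
stream = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + bytes(14) + bytes([0xFF, 0xD9])
print(scan_markers(stream))  # → [(216, 0), (224, 16), (217, 0)]
```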


Given the sample image, the first two bytes are:  (You can save the sample image and view it using a hex editor.)

ff d8: it’s the SOI (Start Of Image) marker. As its name suggests, it indicates the start of jpeg image file. This marker has no payload data.


The next two bytes are:

ff e0: all markers of the form ff En (called APPn markers) indicate an application-specific section. It means some metadata follows.

The next two bytes indicate the length of the payload for the marker:

00 10: it means the payload is 16 bytes, including 00 10 itself. The remaining 14 bytes are:

4a 46 49 46 00 01 01 01 00 60 00 60 00 00


The next two bytes starts a new marker:

ff db: it’s the DQT (Define Quantization Table) marker. It is followed by one or more quantization tables.

In the sample image, the following bytes are:

00 43: it indicates the payload data is 67 bytes (including 00 43).

01: quantization table info. Bits 0..3: QT number. Bits 4..7: QT precision.

Then the quantization table:

02 02 02 03 03 03 06 03

03 06 0c 08 07 08 0c 0c

0c 0c 0c 0c 0c 0c 0c 0c

0c 0c 0c 0c 0c 0c 0c 0c

0c 0c 0c 0c 0c 0c 0c 0c

0c 0c 0c 0c 0c 0c 0c 0c

0c 0c 0c 0c 0c 0c 0c 0c

0c 0c 0c 0c 0c 0c 0c 0c



The next two bytes:

ff c0: SOF0 (Start of Frame, Baseline DCT). Indicates the image is a baseline DCT-based JPEG image.

The bytes followed,

00 11: 17 bytes of data. The remaining 15 bytes are:

08 01 20 01 ba 03 01 22 00 02 11 01 03 11 01

08: 8 bits per sample. JPEG also specifies 12 and 16 bits per sample, but most JPEG images use 8 bits per sample.

01 20: 288. The height of the image.

01 ba: 442. The width of the image.

03: number of components. A grayscale image has 1; an RGB or YCbCr image has 3.

For every component, there’ll be 3 bytes.

01 22 00: 01, component id; 22, sampling factors, bits 0..3 (2) for vertical, bits 4..7 (2) for horizontal; 00, quantization table number.

02 11 01: 02, component id; 11, sampling factors, 1 vertical, 1 horizontal; 01, quantization table number.

03 11 01: 03, component id; 11, sampling factors, 1 vertical, 1 horizontal; 01, quantization table number.

Note: with Y sampled 2×2 and Cb/Cr each sampled 1×1, the chroma planes are subsampled by 2 in both directions; this is the common 4:2:0 sampling.


The next marker:

ff c4: DHT (Define Huffman Table). It specifies one or more Huffman tables.

The bytes that follows,

00 1f: 31 bytes.

00: HT info. Bits 0..3: HT number. Bit 4: HT type, 0 for a DC table, 1 for an AC table. Bits 5..7: must be 0.

Then there’re 16 bytes:

00 01 05 01 01 01 01 01 01 00 00 00 00 00 00 00: each byte gives the number of Huffman codes of a particular length, from 1 to 16 bits. For example, the first 00 means there are no codes of length 1.

The counts sum to 12 (00 + 01 + 05 + 01 + … + 00 = 12).

The 12 symbol bytes that follow are:

00 01 02 03 04 05 06 07 08 09 0a 0b
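The counts and symbols above define a canonical Huffman table: codes of each length are assigned consecutively starting from 0, and the running code is shifted left by one bit whenever the length increases. A sketch of the expansion:

```python
# Expanding the DHT counts/symbols above into (symbol -> code) pairs,
# following the canonical Huffman construction JPEG uses.

def build_huffman(counts, symbols):
    codes = {}
    code = 0
    k = 0
    for length in range(1, 17):          # code lengths 1..16
        for _ in range(counts[length - 1]):
            codes[symbols[k]] = format(code, f"0{length}b")
            code += 1
            k += 1
        code <<= 1                        # next length: append a 0 bit
    return codes

counts = [0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
symbols = [0x00, 0x01, 0x02, 0x03, 0x04, 0x05,
           0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B]
table = build_huffman(counts, symbols)
print(table[0x00], table[0x01], table[0x0B])  # → 00 010 111111110
```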

The next 3 markers are all DHT, with lengths of 181 bytes, 31 bytes and 181 bytes respectively.


After these three DHT markers, the next marker is:

ff da: SOS (Start of Scan). It begins a top-to-bottom scan of the image. In baseline JPEG, there’s usually a single scan.

The bytes that follows,

00 0c: 12 bytes, with the rest of the 10 bytes as below,

03 01 00 02 11 03 11 00 3f 00

03: number of components in the scan. Normally 3 for a color image.

01 00: component id, 01; Huffman tables used: bits 0..3, AC table (here 0); bits 4..7, DC table (here 0).

02 11: component id, 02; Huffman tables used: bits 0..3, AC table (here 1); bits 4..7, DC table (here 1).

03 11: component id, 03; Huffman table: AC table 1, DC table 1.

00 3f 00: spectral selection start (0), spectral selection end (0x3f = 63) and successive approximation (0). These values are fixed in baseline JPEG and can be ignored here.

Entropy-encoded Data

The entropy-encoded JPEG image data (check out part 1 for more detail) follow. One thing to note: within the entropy data, a 0x00 byte follows any 0xff byte. This avoids confusion with marker bytes. The technique is called byte stuffing.
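A minimal sketch of removing this byte stuffing from the scan data:

```python
# Removing the byte stuffing from entropy-coded JPEG data: every
# 0xFF followed by 0x00 inside the scan is a stuffed pair, and the
# 0x00 is dropped. A 0xFF followed by anything else is a real marker.

def unstuff(scan: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(scan):
        out.append(scan[i])
        if scan[i] == 0xFF and i + 1 < len(scan) and scan[i + 1] == 0x00:
            i += 2   # skip the stuffed 0x00
        else:
            i += 1
    return bytes(out)

print(unstuff(bytes([0x12, 0xFF, 0x00, 0x34])).hex())  # → 12ff34
```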


The next marker is at the end of the jpeg file,

ff d9: EOI (End Of Image). It’s also the last two bytes of the jpeg file. As its name suggests, it indicates the end of the jpeg image.

There’re some other markers that are not used in the sample image; one can refer to the references for more information.


Reference: the Wikipedia article on JPEG.

JPEG Standard–A Tutorial Based on Analysis of Sample Picture–Part 1. Coding of a 8×8 Block

JPEG is short for Joint Photographic Experts Group, the standards committee behind the JPEG picture compression standard.

Image/video compression standards are quite complex, and JPEG is no exception, though it’s almost the simplest and most basic of all. The standard defines two compression methods: the DCT-based lossy method, and the prediction-based lossless method. Each method can operate in different modes (sequential coding, progressive coding, hierarchical coding, etc.).

This tutorial focuses on the Baseline of the JPEG standard, which is the collection of the most basic compression algorithms and also the most widely adopted technique (8-bit samples, Huffman coding, two AC tables and two DC tables).

The JPEG encoding and decoding process can be illustrated with the figure below,


Figure 1. The JPEG 8×8 Block Encoding Process


Figure 2. The JPEG 8X8 Block Decoding Process

The DCT/IDCT and Quantization/Dequantization operations on a 8X8 block are illustrated below,


Figure 3. DCT/IDCT, Quantization/Dequantization on a 8X8 Block

We’ll revisit this figure when the corresponding steps are covered below.


The first step of encoding is the FDCT, and correspondingly the last step of decoding is the IDCT. The 2-dimensional (8×8) Forward Discrete Cosine Transform converts the image data to a domain where the data representation is more compressible.

From a signal processing point of view, the FDCT operation takes a 64-point 2-dimensional discrete signal and decomposes it into 64 2-dimensional “spatial frequencies”. The amplitudes of the converted signal are called DCT coefficients.

Among all DCT coefficients, the one at the top left is called DC coefficient, while the rest are AC coefficients. After transformation, most of the image energy will be concentrated at lower spatial frequencies (the upper left region), and a lot of high spatial frequencies (towards lower right region) will have amplitude of 0.

For the mathematical formulas of the FDCT and IDCT, one can refer to the JPEG standard.

In theory, the IDCT and FDCT don’t introduce any loss to the image, but no practical implementation can compute them with 100% accuracy, so they do contribute to the loss of image quality.

In the sample above, (b) is the result of the FDCT operation on (a); and (f) is the result of the IDCT operation on (e), where (e) is the output of the dequantizer.

Quantization and Dequantization

The DCT coefficients are quantized using a 64-element Quantization Table (refer to figure 3 (c) as an example). The quantization step achieves further compression by discarding visually insignificant information. This step is lossy in nature.

The quantization operation is carried out by dividing each DCT coefficient by its quantizer step size in the Quantization Table and then rounding to the nearest integer.

The dequantization operation is carried out by multiplying the output from the entropy decoder by its quantizer step size.

In the sample above, the DCT coefficients in (b) are quantized using Quantization Table (c), and the result is (d), which then goes through entropy encoding.

At the decoder side, the output of the entropy decoder will be (d), and dequantization using Quantization Table (c) gives the result (e).
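The quantize/dequantize round trip can be sketched as below. The coefficient and table values here are made up for illustration, not taken from figure 3.

```python
# A round-trip sketch of JPEG quantization and dequantization for a few
# coefficients. Note that the rounding loss is not recoverable.

def quantize(coeffs, qtable):
    return [round(c / q) for c, q in zip(coeffs, qtable)]

def dequantize(qcoeffs, qtable):
    return [qc * q for qc, q in zip(qcoeffs, qtable)]

coeffs = [236, -23, -11, 5]     # hypothetical DCT coefficients
qtable = [16, 11, 10, 16]       # hypothetical quantizer step sizes

q = quantize(coeffs, qtable)
print(q)                        # → [15, -2, -1, 0]
print(dequantize(q, qtable))    # → [240, -22, -10, 0]
```

The coefficient 5 quantizes to 0 and dequantizes back to 0: that information is gone for good, which is exactly where the lossiness comes from.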

Entropy Encoding/Decoding

The JPEG standard specifies two entropy coding methods, Huffman coding and arithmetic coding. The Baseline uses Huffman coding.

Before encoding, the quantized DCT coefficients are scanned in Zig-Zag order as illustrated below,


Figure 4. Zig Zag Sequence
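The zig-zag order of figure 4 can also be generated programmatically; a sketch:

```python
# Generating the zig-zag scan order for an 8x8 block: walk the
# anti-diagonals, alternating the direction on each diagonal.

def zigzag_order(n=8):
    order = []
    for s in range(2 * n - 1):                 # anti-diagonal index
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diag.reverse()                     # even diagonals go up-right
        order.extend(diag)
    return order

order = zigzag_order()
print(order[:6])  # → [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```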

The entropy encoding contains two stages:

1. Represent the zig-zag sequence of quantized DCT coefficients as an intermediate sequence of symbols.

2. Convert the symbol sequence into a data stream.

For the first stage, the DC coefficient is treated differently from all the AC coefficients, as it stores a significant portion of the image energy. The symbol has the following form,

symbol-1(size) symbol-2(amplitude)

where size is the number of bits required to represent amplitude.

The DC coefficient is differentially encoded because the DC coefficients of nearby blocks are similar in many cases. Suppose, in the sample above, the previous 8×8 block’s DC coefficient is 12. Then 15 is represented as

(2)(3)

for size = 2 and amplitude = 3 (by referring to figure 5).


Figure 5. Baseline Entropy Coding – Amplitude and Size

For AC coefficients, the stage 1 symbols have the following format,

symbol-1 (runlength, size) symbol-2(amplitude)

where runlength is the number of consecutive zero-valued AC coefficients preceding the represented non-zero coefficient in the zig-zag sequence, and size is the number of bits used to encode amplitude.

Following the example above, in zig-zag order the coefficients are:

0, -2, -1, -1, -1, 0, 0, -1, 0, 0, …

The first stage will get the following symbol sequence (by referring to figure 5),

(1, 2)(-2),  (0,1)(-1),  (0,1)(-1),  (0,1)(-1),  (2,1)(-1),  (0,0)

where the last (0,0) represents the end of block (EOB).

The output for the first stage of entropy encoding would be,

(2)(3), (1, 2)(-2), (0,1)(-1), (0,1)(-1), (0,1)(-1), (2,1)(-1), (0,0)

The second stage codes the symbol sequence using Variable Length Coding (VLC, for symbol 1) and Variable Length Integer (VLI, for symbol 2) defined in the standard.

The related values for the above sequence are,

For differential-DC symbol 1, VLC:

(2) => 011

For AC symbol 1, VLC:

(0,0) => 1010

(0,1) => 00

(1, 2) => 11011

(2,1) => 11100

For both DC and AC symbol 2, VLI:

(3) => 11

(-2) => 01

(-1) => 0

The output bitstream for the above sequence would be,

(011)(11) (11011)(01) (00)(0) (00)(0) (00)(0) (11100)(0) (1010)

Without parentheses,

0111111011010000000001110001010
There’re 31 bits in total used to represent 64 coefficients.
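The whole second stage for this example can be reproduced with a short script. The VLC entries below are just the handful listed above, not the full tables from the standard.

```python
# Second entropy-coding stage for the example above: look up the VLC for
# each symbol-1 and the VLI for each symbol-2, then concatenate.

vlc_dc = {2: "011"}
vlc_ac = {(0, 0): "1010", (0, 1): "00", (1, 2): "11011", (2, 1): "11100"}

def vli(amplitude, size):
    # VLI: positive values as plain binary; negative values as the
    # size-bit binary of (amplitude + 2^size - 1).
    if size == 0:
        return ""
    if amplitude > 0:
        return format(amplitude, f"0{size}b")
    return format(amplitude + (1 << size) - 1, f"0{size}b")

bits = vlc_dc[2] + vli(3, 2)                       # DC: (2)(3)
for (run, size), amp in [((1, 2), -2), ((0, 1), -1),
                         ((0, 1), -1), ((0, 1), -1), ((2, 1), -1)]:
    bits += vlc_ac[(run, size)] + vli(amp, size)
bits += vlc_ac[(0, 0)]                             # EOB
print(bits, len(bits))  # → 0111111011010000000001110001010 31
```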

Next part will look into the JPG/JPEG file, see how the encoded data stream and coding tables are stored.


1. The JPEG Still Picture Compression Standard. Paper by Gregory K. Wallace


MPEG4–Overview of Video Decoding

Side note: First draft on Mar 23 2011.

This article is based on MPEG4 Simple Profile.

The MPEG4 Part 2 (Visual) standard defines the bitstream syntax for MPEG4 Part 2 compatible video, but lets the manufacturers customize and figure out their own encoding process and implementation. On the other hand, the standard defines the decoding process in detail. This article gives an overview of the decoding process.

The entire process can be illustrated as the diagram below,


The decoding consists of 3 main procedures: shape decoding, texture decoding and motion decoding. For MPEG-4 SP video, arbitrary shape is not supported and therefore shape decoding is not applicable.

The decoding starts with entropy decoding, which is not shown in the above diagram. The decoded bits then go through the different decoding procedures according to the bitstream syntax.

Texture decoding first decodes the run-length symbols, then carries out the inverse scan to recover the quantized DCT coefficients. Inverse quantization and the IDCT are then performed to recover the pixel values of the DCT blocks. Texture decoding operates in the spatial domain.

Motion decoding is then carried out to obtain the motion vectors. With these motion vectors, the decoded texture results and the previously reconstructed VOP as reference, motion compensation is carried out to reconstruct the VOP. Motion decoding operates in the time domain.


MPEG4–DCT and IDCT

Side note: First Draft on Mar 22 2011.

This article covers discrete cosine transform and inverse discrete cosine transform used in MPEG4 Simple Profile.

MPEG-4 Part 2 (also called MPEG4 Visual) defines 3 different types of inverse DCT operations: the standard 8×8 IDCT, the SA-IDCT (Shape Adaptive DCT), and ΔDC-SA-DCT (SA-DCT with DC separation and ΔDC correction). As MPEG-4 Simple Profile doesn’t support arbitrary shape encoding, only the standard DCT is applicable to it.

1. DCT

According to the MPEG4 standard, the NxN 2-dimensional DCT is defined as below,

F(u, v) = (2/N) C(u) C(v) Σ_{x=0..N-1} Σ_{y=0..N-1} f(x, y) cos((2x+1)uπ / 2N) cos((2y+1)vπ / 2N)
where x, y = 0, 1, …, N-1 are the coordinates in the sample domain (spatial domain),

u, v = 0, 1, …, N-1 are the coordinates in the transform domain, and

C(u), C(v) = 1/√2 for u, v = 0; C(u), C(v) = 1 otherwise.
The above equations can be expressed as matrix equations. For the DCT,

F = A f Aᵀ
where f is the matrix of samples, A is the transform matrix, and F is the transformed DCT coefficients.

The values of A can be derived from the previous equation,

A(u, x) = C(u) √(2/N) cos((2x+1)uπ / 2N)
The first matrix multiplication Af can be seen as a 1-D DCT of each column of f, and the second matrix multiplication (Af)Aᵀ can be seen as a 1-D DCT on each row of (Af). Note that the order of the two multiplications can be exchanged.


The inverse DCT is defined as,

f(x, y) = (2/N) Σ_{u=0..N-1} Σ_{v=0..N-1} C(u) C(v) F(u, v) cos((2x+1)uπ / 2N) cos((2y+1)vπ / 2N)
Suppose a pixel is represented in n bits; then the input to the DCT and the output from the IDCT are represented with (n+1) bits. The DCT coefficients are represented in (n+4) bits, and the range of the coefficients is [-2^(n+3), +2^(n+3)-1].

Again, the equation can be expressed in matrix form,

f = Aᵀ F A
The meanings and values follow DCT operation above.
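The matrix-form DCT (F = A f Aᵀ, with A built from the cosine basis) can be sketched in plain Python, without numpy. For a flat block, all the energy lands in the DC coefficient, which is a quick sanity check.

```python
# Matrix-form 2-D DCT: build the N x N transform matrix A and compute
# F = A f A^T with plain Python lists.

import math

def dct_matrix(n=8):
    a = []
    for u in range(n):
        # row normalization: sqrt(1/n) for u = 0, sqrt(2/n) otherwise
        c = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
        a.append([c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                  for x in range(n)])
    return a

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(row) for row in zip(*m)]

n = 8
A = dct_matrix(n)
f = [[128] * n for _ in range(n)]          # a flat (constant) block
F = matmul(matmul(A, f), transpose(A))
# All energy lands in the DC coefficient: F[0][0] = 8 * 128 = 1024,
# and every AC coefficient is (numerically) zero.
print(round(F[0][0]))  # → 1024
```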

MPEG4 – Quantization and Dequantization

Side Note: First draft on Mar 22 2011.

This article is for MPEG-4 Simple Profile. The MPEG4 standard defines two inverse quantization methods; SP only supports the second method.

MPEG4 Part 2 gives the bitstream syntax and decoding process, and the encoding process is left to manufacturers. As long as the output bitstream is compatible with the defined bitstream syntax, it’s an MPEG4 compliant encoder.

Therefore, this article covers the quantization principle, and gives the more detailed dequantization steps defined in the MPEG-4 standard (method 2 only).

Note that for this article, both ‘x’ and ‘*’ means multiplication.

1. Quantization

Quantization operates on the output of the DCT, and it is the fundamental source of the loss in this lossy compression method. For example, values that turn to 0 after quantization can never get back to their original values in dequantization. In contrast, the cosine transform is not lossy in nature; but since we cannot do a continuous transformation in a digital computer, its discrete implementation still introduces some loss.

Why do we want to do quantization? Studies show that most of the DCT block energy is clustered at the upper left of the block. In fact, with only a few DCT coefficients near that corner, a block similar to the original can be produced by the Inverse DCT. Therefore, we can use quantization to filter out the insignificant coefficients (reduce them to 0) and also reduce the range of the other coefficients, which helps entropy encoding.

Quantization is very simple. It can be expressed as

QO = round(X/QS)

where X is the input value and QS is the quantizer step size, which controls the range of the output (QO).

2. Dequantization

Dequantization (inverse quantization) can be expressed as,

Y = QO x QS

where QO is the quantized value, QS is the step size, and Y is the dequantized value.

3. MPEG-4 Second Quantization Method

3.1 Intra Block DC Coefficient

The inverse quantization can be expressed as,

F = dc_scaler x QF

where QF is the quantized coefficient and F is the dequantization output. The value of dc_scaler is decided by the following algorithm,

1. Check if short_video_header is 1. If so, dc_scaler is 8 for both luminance and chrominance blocks. If not, go to step 2.

2. dc_scaler depends on quantiser_scale, a quantization parameter available at dequantization. Based on whether the block is a luminance or a chrominance block, the relationship follows the table below,

quantiser_scale 1..4: luminance dc_scaler = 8; chrominance dc_scaler = 8

quantiser_scale 5..8: luminance dc_scaler = 2 x quantiser_scale; chrominance dc_scaler = (quantiser_scale + 13) / 2

quantiser_scale 9..24: luminance dc_scaler = quantiser_scale + 8; chrominance dc_scaler = (quantiser_scale + 13) / 2

quantiser_scale 25..31: luminance dc_scaler = 2 x quantiser_scale - 16; chrominance dc_scaler = quantiser_scale - 6
3.2 Other Coefficients (Inter Block DC and AC coefficients)

For the method illustrated here (the second method), quant_type should be set to 0. Again, the computation relies on quantiser_scale,

|X[v][u]| = 0, if QF[v][u] = 0

|X[v][u]| = (2 x |QF[v][u]| + 1) x quantiser_scale, if QF[v][u] ≠ 0 and quantiser_scale is odd

|X[v][u]| = (2 x |QF[v][u]| + 1) x quantiser_scale - 1, if QF[v][u] ≠ 0 and quantiser_scale is even
The sign of the output |X| is the same as that of the corresponding QF, which gives us X[v][u] = Sign(QF[v][u]) x |X[v][u]|.
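A sketch of this second inverse quantization method for non-intra-DC coefficients (magnitude (2|QF|+1) x quantiser_scale, minus 1 when quantiser_scale is even, with the sign of QF restored at the end):

```python
# MPEG-4 second inverse quantization method (non-intra-DC coefficients).

def dequant_method2(qf: int, quantiser_scale: int) -> int:
    if qf == 0:
        return 0
    mag = (2 * abs(qf) + 1) * quantiser_scale
    if quantiser_scale % 2 == 0:
        mag -= 1                    # even quantiser_scale: subtract 1
    return mag if qf > 0 else -mag  # restore the sign of QF

print(dequant_method2(3, 5))   # → 35   (odd scale: (2*3+1)*5)
print(dequant_method2(-3, 4))  # → -27  (even scale: (2*3+1)*4 - 1)
print(dequant_method2(0, 7))   # → 0
```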

MPEG4–Motion Estimation and Motion Compensation

Side Note: First draft on Mar 20 2011.

This article covers the motion estimation and motion compensation used in MPEG-4 Simple Profile.

Video compression essentially reduces the redundancy in the raw video sequence. Temporal redundancy is reduced by motion estimation and motion compensation; spatial redundancy is reduced by DCT and quantization; statistical redundancy is reduced by entropy coding.

1. Basic Concepts

Motion estimation refers to the process of finding the best-match reference block for a given input block being encoded. In a still video sequence, one would expect the reference block to be at the same position in the previous frame as the processed block in the current frame. However, object motion, camera motion and illumination changes make that assumption invalid. Therefore, the search for the best-match reference block is carried out over a certain search region, and the best-matched block (often judged by producing minimal residual energy) is selected as the reference block.

Once the reference block is found, it is subtracted from the current encoding block to produce a residual block, which is processed further. This process is called Motion Compensation.

The displacement between the current encoding block and the reference block (also called the predictor) is referred to as the Motion Vector. It is required to reconstruct the block on the decoder side.
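A toy sketch of full-search motion estimation, with SAD (sum of absolute differences) as the matching criterion. Real encoders search 2-D regions of 16x16 macroblocks; a 1-D example is enough to show the idea.

```python
# Full-search motion estimation on 1-D "frames": try every displacement
# in the search range and keep the one with the lowest SAD.

def sad(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def motion_search(cur_block, ref_frame, block_pos, search_range):
    best_mv, best_cost = 0, float("inf")
    for mv in range(-search_range, search_range + 1):
        pos = block_pos + mv
        if pos < 0 or pos + len(cur_block) > len(ref_frame):
            continue  # candidate falls outside the reference frame
        cost = sad(cur_block, ref_frame[pos:pos + len(cur_block)])
        if cost < best_cost:
            best_mv, best_cost = mv, cost
    return best_mv, best_cost

ref = [10, 10, 50, 60, 70, 10, 10, 10]
cur = [50, 60, 70]                 # the "object" moved 2 samples
mv, cost = motion_search(cur, ref, block_pos=4, search_range=3)
print(mv, cost)  # → -2 0
```

With a motion vector of -2 the residual is all zeros, which is exactly the situation motion compensation exploits.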

2. Macroblock Based Motion Estimation and Compensation

A macroblock is a region consisting of a 16 x 16 luminance component and 8 x 8 chrominance components.

Motion estimation and compensation aim to minimize the energy in the residual block. It turns out that the smaller the compensation block (which can differ from the macroblock size) is, the more efficient the compensation is; e.g. motion compensation using a 4 x 4 block achieves a better result than using an 8 x 8 block size.

MPEG4 SP supports up to 4 motion vectors per macroblock, which means the encoder can use 8 x 8 blocks for luminance and 4 x 4 for chrominance, instead of 16 x 16 and 8 x 8 with a single motion vector. However, as motion vectors are also encoded into the bitstream, it might not be optimal to use 4 motion vectors in video sequences/macroblocks without much motion.

Another feature of SP is Unrestricted Motion Vectors. There are cases where the best-matching reference macroblock goes outside the boundaries of the reference VOP (Video Object Plane, which is basically the video frame for SP). The part that extends outside is extrapolated (padded). This happens when objects move in or out of the scene.

3. Sub-pixel Motion Compensation

With interpolation, sub-pixel values can be used for motion compensation. As motion estimation and compensation go to finer resolutions, the results tend to be better. But again, there’s another side to it: it increases the computational complexity, and it requires more bits for motion vectors (as motion vectors are no longer integers).

MPEG4–Entropy Encoding

Side Note: First draft on 20 Mar 2011. Huffman invented Huffman coding when he was doing a Ph.D. at MIT, as work in a graduate course taught by Robert Fano. This work has become Huffman’s most influential contribution. Well, you never know when an assignment can make the world different.

The content of this article is based on MPEG-4 Simple Profile.

Entropy encoding converts a series of symbols representing a video sequence into a bitstream for storage and transmission.

1. Scan of DCT Coefficients

There’re three scan orders defined in MPEG-4 Part 2: Zigzag scan, Alternate-horizontal scan, and Alternate-vertical scan.

To understand why there’re different scan orders, one needs to be clear about the purpose of the scan. The scan reorders the DCT coefficients coming out of the quantizer so that zero coefficients can be grouped together, which results in more efficient entropy encoding.

Based on probability studies, non-zero DCT coefficients cluster around the top-left DC coefficient. For non-differential encoding (intra blocks), the zero coefficients are roughly symmetrical in the horizontal and vertical directions, so the Zigzag scan can be used to group non-zero coefficients together and zero coefficients together. For differential encoding (inter blocks), depending on the motion prediction direction, the optimal scan order can be Alternate-horizontal or Alternate-vertical.


Different Scan Orders: Alternate-Horizontal, Alternate-Vertical and Zigzag

2. Run-Level-Last Encoding for DCT Coefficients

The scanned DCT coefficients are encoded into RLL symbols: run is the number of zeros preceding a non-zero coefficient, level indicates the magnitude of the non-zero coefficient, and last is a flag marking whether it’s the last non-zero value.

For example,

Given the input symbols: 2, 0, 0, -3, 0, 0, 0, 0, 2, 0, 0, … (all zeros)

The RLL encoded output: (0,2,0), (2,-3,0), (4,2,1)
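A sketch of an RLL encoder reproducing this example:

```python
# Run-Level-Last encoding for a scanned coefficient sequence: each symbol
# is (run, level, last), where last marks the final non-zero coefficient.

def rll_encode(coeffs):
    symbols = []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            symbols.append([run, c, 0])
            run = 0
    if symbols:
        symbols[-1][2] = 1        # mark the last non-zero coefficient
    return [tuple(s) for s in symbols]

coeffs = [2, 0, 0, -3, 0, 0, 0, 0, 2] + [0] * 55
print(rll_encode(coeffs))  # → [(0, 2, 0), (2, -3, 0), (4, 2, 1)]
```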

3. Inputs to Entropy Encoder

DCT coefficients are the most important piece of the video frame information, but not all of it. The inputs to an MPEG-4 entropy encoder include video bitstream headers, motion vectors, synchronization markers, and more.

But from the entropy encoder’s point of view, it’s all a series of symbols.

4. Pre-Calculated Huffman Encoding

MPEG-4 SP uses Huffman coding for entropy encoding. Huffman codes can be generated dynamically given the probabilities of the symbols to be encoded. The basic idea is to use short codes for frequent symbols and long codes for less-common symbols; therefore it is a kind of VLC (Variable Length Code). Details of how to generate Huffman codes are not covered here.

MPEG-4 uses pre-calculated Huffman codes. Based on probability studies, the MPEG-4 standard defines different VLC tables for encoding/decoding different information of a video sequence, including intra blocks, inter blocks, motion vectors and more.

In addition, some information is encoded in VLC for common symbols only; the rare symbols are encoded as a combination of VLC + escape sequence + fixed-length encoding. Details can be found in the MPEG-4 standard.


5. Reversible VLC

VLC is sensitive to errors. A single bit error can cause the decoder to lose synchronization and fail to decode subsequent codes. This is called error propagation.

Reversible VLC is introduced to address this issue: it can be decoded in either a forward or a backward direction. The MPEG-4 standard defines RVLC tables for encoding some of the video information.