WAVE Audio File Format

WAV (Waveform Audio File Format, also known as WAVE) is a commonly used audio file format for storing uncompressed audio samples on Windows. It follows the generic RIFF (Resource Interchange File Format) container format.

0. The Format

A wave file is a RIFF file with a “WAVE” chunk. The WAVE chunk consists of two subchunks, namely “fmt ” and “data”. (Note that there’s a space in “fmt ”.) Below we list the meaning of each field in a WAVE file.





RIFF chunk descriptor (12 bytes):

    chunk id        4 bytes    “RIFF”
    chunk size      4 bytes    file size minus 8 (the chunk id and chunk size fields)
    format          4 bytes    “WAVE”

fmt subchunk (24 bytes) — format information about the audio data:

    subchunk id (fmt )    4 bytes    “fmt ”
    subchunk size         4 bytes    16 for PCM
    audio format          2 bytes    1 => PCM, other values => data compressed
    num of channels       2 bytes    1 => mono, 2 => stereo
    sample rate           4 bytes    8000, 16000, 22050, 44100 etc.
    byte rate             4 bytes    sample rate * num of channels * bits per sample / 8
    block align           2 bytes    num of channels * bits per sample / 8
    bits per sample       2 bytes    8 => 8 bits, 16 => 16 bits

data subchunk (8 bytes + audio data) — contains the raw audio data:

    subchunk id (data)    4 bytes    “data”
    subchunk size         4 bytes    num of samples * num of channels * bits per sample / 8
    audio data            the raw audio samples

Note that there are 44 bytes in total (12 + 24 + 8) before the actual audio data.
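The 44-byte layout above can be sketched with Python’s struct module. This is a minimal sketch, not the original post’s code; wav_header is a hypothetical helper name, and all multi-byte fields are little-endian as RIFF requires.

```python
import struct

def wav_header(num_samples, num_channels=1, sample_rate=44100, bits_per_sample=16):
    """Build the canonical 44-byte PCM WAV header described above."""
    byte_rate = sample_rate * num_channels * bits_per_sample // 8
    block_align = num_channels * bits_per_sample // 8
    data_size = num_samples * num_channels * bits_per_sample // 8
    # "<" = little-endian; I = 4-byte field, H = 2-byte field, 4s = 4-char tag
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size,   # RIFF chunk size = file size - 8
        b"WAVE",
        b"fmt ", 16,               # fmt subchunk size: 16 for PCM
        1,                         # audio format: 1 => PCM
        num_channels,
        sample_rate,
        byte_rate,
        block_align,
        bits_per_sample,
        b"data", data_size,        # raw samples follow this field
    )

header = wav_header(num_samples=158760)
assert len(header) == 44
```

Writing the header followed by the raw little-endian samples yields a playable file.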

1. Byte-by-Byte example

Below is a screenshot of a wave file shown in vim hex mode. We’ll go through the bytes one by one.


Figure 1. Bytes of a wave file recorded on Android 

5249 4646: “RIFF”.

74d8 0400: 0x0004d874 = 317556 bytes. I used “ls -l test.wav” to get the file size as 317564 bytes, which equals 317556 + 4 (size field) + 4 (“RIFF” field).

5741 5645: “WAVE”.

666d 7420: “fmt ” (note the trailing space).

1000 0000: 0x00000010 = 16 bytes, the fmt subchunk size.

0100: 0x0001, which corresponds to PCM; values other than 1 indicate the data is compressed.

0100: 0x0001, only one channel (mono).

44ac 0000: 0x0000ac44 = 44100, the sample rate is 44100 Hz.

8858 0100: 0x00015888 = 88200 = sample rate * num of channels * bits per sample / 8 = 44100 * 1 * 16 / 8 = 44100 * 2.

0200: 0x0002, block align = num of channels * bits per sample / 8 = 1 * 16 / 8 = 2.

1000: 0x0010 = 16, bits per sample.

6461 7461: “data”.

50d8 0400: 0x0004d850 = 317520, the data size. 317520 + 44 (total header bytes) = 317564, which matches the total file size obtained using “ls -l”.
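The walkthrough above can be checked mechanically. The sketch below (my own illustration, using Python’s standard struct module) rebuilds the 44 header bytes from the hex dump and re-derives the same numbers:

```python
import struct

# The 44 header bytes from the walkthrough above, written out as hex
header = bytes.fromhex(
    "52494646" "74d80400" "57415645"        # "RIFF", chunk size, "WAVE"
    "666d7420" "10000000"                   # "fmt ", subchunk size 16
    "0100" "0100" "44ac0000" "88580100"     # PCM, mono, 44100 Hz, 88200 B/s
    "0200" "1000"                           # block align 2, 16 bits/sample
    "64617461" "50d80400"                   # "data", data size
)

(riff, riff_size, wave, fmt, fmt_size,
 audio_format, channels, sample_rate,
 byte_rate, block_align, bits,
 data, data_size) = struct.unpack("<4sI4s4sIHHIIHH4sI", header)

assert riff_size + 8 == 317564                            # matches "ls -l"
assert byte_rate == sample_rate * channels * bits // 8    # 88200
assert data_size + 44 == 317564                           # data + header
```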


1. wikipedia page WAVE: http://en.wikipedia.org/wiki/WAV

2. WAVE PCM soundfile format: https://ccrma.stanford.edu/courses/422/projects/WaveFormat/

HTTP Video Delivery — HTTP Pseudo Streaming

HTTP Pseudo Streaming is the second method in HTTP Video Delivery. Like the first method, it is based on HTTP progressive download, but it makes use of HTTP’s partial download functionality to allow the user to seek to a part of the video that has not been downloaded yet.

HTTP Pseudo Streaming requires support from both the client side and the server side. On the server side, plug-ins are available for Apache, lighttpd etc. On the client side, custom players are required to resynchronize the video, read metadata etc. Two examples of players that support HTTP Pseudo Streaming are JWPlayer and FlowPlayer.

For an example of HTTP Pseudo Streaming, open up any YouTube video. YouTube actually uses lighttpd on the server side and its own customized Flash-based player on the client side. (A Flash player and a Flash-based media player are different things; please refer here.)

Below is a screenshot of the Wireshark network traffic capture when I was watching a Youtube video,


Figure 1. Wireshark Capture of HTTP Request for HTTP Pseudo Streaming

As there are lots of segmented packets, I selected a video packet, right-clicked it and selected “Follow TCP Stream” to get the screen above. It’s a single HTTP request followed by a response from the YouTube HTTP server.

I tried to seek to a part that was not downloaded yet, then did “Follow TCP Stream” again. I found that the HTTP headers contain the following strings for the initial request and the seek,

burst=40&sver=3&signature=B3B26708552F1C9FE68 7AAB13EFE6D73F294624F.0F9EEB822A5CF4AE5443CC5798B2F415C16B75E4&
redirect_counter=1 HTTP/1.1


The seek HTTP request contains the string “begin=1032070”, which should be used by the HTTP server to jump to the corresponding portion of the video.

As with the first method, HTTP Pseudo Streaming downloads the video clip to the browser cache. For Google Chrome, one can find the video clip at,

Ubuntu Linux: /home/roman10/.cache/google-chrome/Default/Cache/

Windows: C:\Users\roman10\AppData\Local\Google\Chrome\User Data\Default\Cache

where roman10 is my username on both operating systems.

For the benefits and drawbacks of this method, and its comparison with the other HTTP methods, please refer to HTTP Video Delivery. Note that as HTTP pseudo streaming is not real streaming, it doesn’t support live video streaming.

HTTP Video Delivery–HTTP Progressive Download and Play

Progressive Download and Play is the first of the three methods in HTTP Video Delivery. The basic idea of this method is to embed the video through a media player (e.g. JWPlayer, FlowPlayer etc.). When the user requests to play the video, an HTTP GET request is sent to the web server, and the video is downloaded through HTTP for playout.

This method is supported by web servers by default, as the server treats a video as a normal file, like an image or a CSS file. The playout, buffering and other playback specifics are handled by the media player, which usually consists of some JavaScript and Flash objects. More info about these players can be found here.

Below is an example of an HTTP Progressive Download and Play video, delivered using JWPlayer as the client.


Once you click the button to start playback, the video download is triggered. Below is a screenshot of the network traffic captured using Wireshark.


Figure 1. Wireshark Capture Screenshot

The video is actually downloaded to the browser cache. As I used Google Chrome on Ubuntu Linux to watch the video, the file was saved to the /home/roman10/.cache/google-chrome/Default/Cache/ folder. Below is a screenshot of the files under the directory,


Figure 2. List of Files under Google Chrome Cache

I can locate the video clip either by time or by file size, and play the video using VLC as shown below,


Figure 3. VLC Play Out of the Cached Video

If you’re using Windows, the video file can be found under C:\Users\roman10\AppData\Local\Google\Chrome\User Data\Default\Cache, where roman10 is the username.

For the benefits and drawbacks of this method, and its comparison with the other HTTP video delivery methods, you can refer here.

You may also want to check out,

HTTP Pseudo Streaming

Dynamic Adaptive Streaming over HTTP

UDP vs TCP–In the Context of Video Streaming

1. TCP Congestion Control and Window Size

TCP maintains a congestion window, which indicates the number of packets TCP can send before receiving an acknowledgement for the first packet sent.

The congestion window size defaults to 2 times the maximum segment size (MSS, the largest amount of data in bytes that can be sent in a single TCP segment, excluding the TCP and IP headers). TCP follows a slow start process that increases the congestion window by 1 MSS every time a packet acknowledgement is received, until the congestion window size exceeds a threshold called ssthresh (this effectively doubles the congestion window every RTT). Then TCP enters congestion avoidance.

In congestion avoidance, as long as non-duplicate ACKs are received, the congestion window is increased by 1 MSS every RTT. When duplicate ACKs are received, the chance that a packet has been lost or delayed is very high. Depending on the implementation, the actions taken by TCP vary,

  • Tahoe: Triple duplicate ACKs are treated the same as a timeout: fast retransmission is performed, the congestion window is reduced to 1 MSS, and slow start begins again.
  • Reno: If triple duplicate ACKs are received, TCP reduces the congestion window by half, performs a fast retransmit, and enters fast recovery. If an ACK times out, it also restarts from slow start.

In fast recovery, TCP retransmits the missing packets and waits for an acknowledgement of the entire transmit window before returning to congestion avoidance. If there’s no ACK, it’s treated as a timeout.

As the window size controls the number of unacknowledged packets TCP can send, a single packet loss can reduce the window size, and hence the throughput, significantly. This makes TCP throughput unpredictable for video streaming applications.
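The behavior described above can be illustrated with a toy simulation. This is a deliberately simplified model of Reno (my own sketch, not from the original post): one step per RTT, window measured in MSS units, with fast recovery and timeouts ignored.

```python
def reno_cwnd(rtts, loss_rtts, ssthresh=16):
    """Trace a toy TCP Reno congestion window over a number of RTTs."""
    cwnd, trace = 2.0, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_rtts:             # triple duplicate ACKs at this RTT
            ssthresh = max(cwnd / 2, 2)
            cwnd = ssthresh            # Reno: halve the window
        elif cwnd < ssthresh:
            cwnd *= 2                  # slow start: doubles every RTT
        else:
            cwnd += 1                  # congestion avoidance: +1 MSS per RTT
    return trace

trace = reno_cwnd(12, loss_rtts={8})
# window grows 2, 4, 8, 16, then +1 per RTT, then is halved at the loss
```

A single loss at RTT 8 cuts the window (and so the sending rate) in half, which is exactly the throughput swing that hurts streaming.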

2. TCP Reliability and Retransmission

TCP is a reliable service: when a packet is lost, TCP retransmits it. For live streaming applications, the retransmitted packets may arrive too late for playback and consume bandwidth unnecessarily.

3. Multicast over TCP is Problematic

Multicast can be effective in streaming applications (at least in theory). But TCP is connection oriented: it requires two parties to establish a connection. Multicast at the TCP layer is tough, as it requires every receiving player to send ACKs to the server that is streaming the content, which is hard to scale.

4. TCP and UDP Packet Header Size

TCP has a typical header size of 20 bytes, while UDP has a header of 8 bytes, so UDP occupies less bandwidth per packet.

JWPlayer and FlowPlayer vs. Adobe Flash Player

JWPlayer and FlowPlayer are video players that support Flash video. What is confusing is that they’re sometimes referred to as Flash video players. (e.g. FlowPlayer calls itself a “Flash Video Player for the Web”.) Why do I need another Flash video player if I already have Adobe Flash Player installed?

Adobe Flash Player is actually a virtual machine that runs Flash files, which end with the extension .swf. SWF files can contain animations, video/audio and web UI. Both JWPlayer and FlowPlayer include SWF files, which are downloaded to the browser’s cache and played by Adobe Flash Player. In other words, JWPlayer and FlowPlayer are “played” by Adobe Flash Player. Adobe Flash Player is like the JVM (Java Virtual Machine), and JWPlayer and FlowPlayer are like two Java programs.

JWPlayer actually supports more than just Flash; it also supports HTML5 video for iPhone and iPad devices. In essence, JWPlayer and FlowPlayer are just collections of JavaScript and SWF files that allow a website publisher to embed video into a web page, customize the look of the video display area, control the behavior of the video playout etc. Flash video is one (and probably the most important one) of the technologies/platforms they support.

YCbCr Color Space–An Intro and its Applications

Color space is a complicated topic. Colors don’t really exist physically the way dust does; we human beings use colors to describe what we see. The most common way to describe what we see in terms of color is a combination of red, green and blue, which is referred to as the RGB color space.

A color space is simply a model for representing what we see as tuples. YCbCr is one of the popular color spaces in computing. It represents colors in terms of one luminance/luma component (Y) and two chrominance/chroma components (Cb and Cr).


YUV, Y’UV, YCbCr, Y’CbCr, YPbPr… all these terms cause lots of confusion. YUV and Y’UV traditionally refer to the analog encoding of color information in television systems, which uses different conversion constants from YCbCr; YCbCr or Y’CbCr is used for the digital encoding of color information in computing systems; and YPbPr is the analog counterpart of YCbCr. Nowadays, YUV is also used in the digital video context, in which case it is almost equivalent to YCbCr.

Here we set aside the terminology and focus on understanding the color space and its usage in the computing world. From this point onwards, we’ll call it YCbCr.

Why YCbCr Color Space?

Studies show that human eyes are sensitive to luminance, but not so sensitive to chrominance. For example, given the image below,


Figure 1. A Color Image

One can use the matlab code below to display its Y, Cb, Cr component as color images or gray scale images.

%a program to display the image's ycbcr components
function [] = ycctest(imageName)
    rgb = imread(imageName);
    ycbcr = rgb2ycbcr(rgb);   %convert to ycbcr space
    %display the y component as a color image
    ycbcry = ycbcr;
    ycbcry(:,:,2) = 0;
    ycbcry(:,:,3) = 0;
    rgb1 = ycbcr2rgb(ycbcry);
    figure, imshow(rgb1);
    %display the cb component as a color image
    ycbcru = ycbcr;
    ycbcru(:,:,1) = 0;
    ycbcru(:,:,3) = 0;
    rgb2 = ycbcr2rgb(ycbcru);
    figure, imshow(rgb2);
    %display the cr component as a color image
    ycbcrv = ycbcr;
    ycbcrv(:,:,1) = 0;
    ycbcrv(:,:,2) = 0;
    rgb3 = ycbcr2rgb(ycbcrv);
    figure, imshow(rgb3);
    %display the y, cb, cr components as gray scale images
    figure, imshow(ycbcr(:,:,1));
    figure, imshow(ycbcr(:,:,2));
    figure, imshow(ycbcr(:,:,3));
end

Save the program as ycctest.m, and save the image as test.jpg. Then execute the program in MATLAB by typing the command,

ycctest('test.jpg')
You’ll get the 3 color images for Y, Cb and Cr component from left to right as,


Figure 2. Color Images formed by Y, Cb and Cr Components

You’ll also get another 3 gray-scale images for Y, Cb and Cr component from left to right,


Figure 3. Gray-scale Images formed by Y, Cb and Cr Components

It’s not difficult to tell that our eyes perceive more info from the leftmost images in Figure 2 and Figure 3, which are formed by the Y component of Figure 1.

The YCbCr color space makes use of this fact to achieve a more efficient representation of scenes/images. It does so by separating the luminance and chrominance components of a scene, and using fewer bits for chrominance than for luminance. The details of how fewer bits are used for chrominance are covered in the Sub-sampling section below.

How does the Conversion Work?

A YCbCr image can be converted to/from an RGB image. There are several standards defining the conversion in different contexts. The conversion below is the one used in JPEG image compression.

The conversion can be expressed as equations below.

From 8-bit RGB to 8-bit YCbCr:

Y = 0.299R + 0.587G + 0.114B

Cb = 128 - 0.168736R - 0.331264G + 0.5B

Cr = 128 + 0.5R - 0.418688G - 0.081312B

From 8-bit YCbCr to 8-bit RGB:

R = Y + 1.402(Cr - 128)

G = Y - 0.34414(Cb - 128) - 0.71414(Cr - 128)

B = Y + 1.772(Cb - 128)
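The two sets of equations can be written directly as a small Python sketch (my own illustration), which also confirms that they are, up to rounding, inverses of each other:

```python
def rgb_to_ycbcr(r, g, b):
    """JPEG full-range RGB -> YCbCr, from the equations above."""
    y  =       0.299    * r + 0.587    * g + 0.114    * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128 + 0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """JPEG full-range YCbCr -> RGB, from the equations above."""
    r = y + 1.402   * (cr - 128)
    g = y - 0.34414 * (cb - 128) - 0.71414 * (cr - 128)
    b = y + 1.772   * (cb - 128)
    return r, g, b

# Round trip: pure red survives the conversion almost exactly
y, cb, cr = rgb_to_ycbcr(255, 0, 0)
r, g, b = ycbcr_to_rgb(y, cb, cr)
assert abs(r - 255) < 1 and abs(g) < 1 and abs(b) < 1
```

A real encoder would additionally round and clamp each component to the 0–255 range.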

Color Sub-sampling

The YCbCr representation separates luminance and chrominance, so the computing system can encode the image in a way that allocates fewer bits to chrominance. This is done through color subsampling, which simply encodes the chrominance components at lower resolution.

Here we cover four commonly used subsampling schemes: 4:4:4, 4:2:2, 4:2:0, and 4:1:1.

These four schemes are illustrated by the figure below [1],


Figure 4. Color Subsampling

4:4:4 is full resolution in both horizontal and vertical directions; no subsampling is done. 4:2:2 keeps 1/2 of the chroma resolution in the horizontal direction; 4:1:1 keeps 1/4 of the chroma resolution in the horizontal direction; and 4:2:0 keeps 1/2 of the chroma resolution in both horizontal and vertical directions.

There are also 4:4:0 and 4:1:0 schemes; interested readers can refer to reference 2.

Note that the sub-sampling process is lossy when comparing the processed image with the original image. The fraction of data left can be calculated by summing up the scheme’s components and dividing by 12 (or 16, if an alpha/transparency component exists). For the four subsampling schemes mentioned, the amount of data left is 12/12 = 100%, 8/12 ≈ 67%, 6/12 = 50% and 6/12 = 50% respectively. In this way, the computing system can represent the image with fewer bits.

But how is the image reconstructed for display? Normally every pixel requires Y, Cb and Cr components. Reconstruction is done through interpolation, or by simply reusing the Cb and Cr values of a nearby pixel when a pixel doesn’t have its own. The details are not covered in this post.
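As a concrete illustration of 4:2:0, the sketch below (plain Python, my own example; it uses nearest-neighbour picking and reuse, whereas real codecs typically filter) subsamples one 4x4 chroma plane and rebuilds it:

```python
def subsample_420(plane):
    """Keep one chroma sample per 2x2 pixel block (4:2:0)."""
    return [row[::2] for row in plane[::2]]

def upsample_420(plane, height, width):
    """Reconstruct by reusing each stored sample for its whole 2x2 block."""
    return [[plane[y // 2][x // 2] for x in range(width)] for y in range(height)]

cb = [[10, 20, 30, 40],
      [11, 21, 31, 41],
      [50, 60, 70, 80],
      [51, 61, 71, 81]]

small = subsample_420(cb)           # 2x2 plane: [[10, 30], [50, 70]]
rebuilt = upsample_420(small, 4, 4)
# rebuilt[0][0..1] and rebuilt[1][0..1] all hold 10: the block's one sample
```

The plane shrinks from 16 samples to 4; applied to both Cb and Cr with Y untouched, this gives the 6/12 = 50% figure above.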


YCbCr is a commonly used color space in the digital video domain. Because the representation makes it easy to get rid of redundant color information, it is used in image and video compression standards like JPEG, MPEG-1, MPEG-2 and MPEG-4.


1. Basics of Video: http://lea.hamradio.si/~s51kq/V-BAS.HTM

2. Chrominance Subsampling: http://dougkerr.net/pumpkin/articles/Subsampling.pdf

3. JPEG File Interchange Format: http://www.w3.org/Graphics/JPEG/jfif3.pdf