Record WAVE Audio on Android

This post discusses how to record raw audio (PCM) and save it to a WAVE file on Android. If you’re not familiar with the WAVE audio file format, please refer to a previous post, WAVE Audio File Format.

This post is a follow-up to Record PCM Audio on Android. The code and working principle are similar, so it is strongly suggested you read that post first.

A WAVE file stores PCM data with a 44-byte header. Recording WAVE audio is therefore equivalent to recording PCM audio and prepending the 44-byte header.

We use a RandomAccessFile to write the data. We first write the 44-byte header. Because the two size fields are not known until the recording finishes, we simply write zeros for them. Note that RandomAccessFile writes multi-byte values in big-endian order while WAVE fields are little-endian, hence the reverseBytes calls. This is shown below.

randomAccessWriter = new RandomAccessFile(filePath, "rw");
randomAccessWriter.setLength(0); // Set file length to 0, to prevent unexpected behavior in case the file already existed
randomAccessWriter.writeBytes("RIFF");
randomAccessWriter.writeInt(0); // Final file size not known yet, write 0
randomAccessWriter.writeBytes("WAVE");
randomAccessWriter.writeBytes("fmt ");
randomAccessWriter.writeInt(Integer.reverseBytes(16)); // Sub-chunk size, 16 for PCM
randomAccessWriter.writeShort(Short.reverseBytes((short) 1)); // AudioFormat, 1 for PCM
randomAccessWriter.writeShort(Short.reverseBytes(nChannels)); // Number of channels, 1 for mono, 2 for stereo
randomAccessWriter.writeInt(Integer.reverseBytes(sRate)); // Sample rate
randomAccessWriter.writeInt(Integer.reverseBytes(sRate * nChannels * mBitsPersample / 8)); // Byte rate = SampleRate * NumberOfChannels * BitsPerSample / 8
randomAccessWriter.writeShort(Short.reverseBytes((short) (nChannels * mBitsPersample / 8))); // Block align = NumberOfChannels * BitsPerSample / 8
randomAccessWriter.writeShort(Short.reverseBytes(mBitsPersample)); // Bits per sample
randomAccessWriter.writeBytes("data");
randomAccessWriter.writeInt(0); // Data chunk size not known yet, write 0

We then write the PCM data. This is discussed in detail in the post Record PCM Audio on Android.

After the recording is done, we seek back into the header and fill in the size fields we skipped: the RIFF chunk size at byte offset 4, and the data sub-chunk size at byte offset 40. This is shown below.

try {
    randomAccessWriter.seek(4); // Write size to RIFF header
    randomAccessWriter.writeInt(Integer.reverseBytes(36 + payloadSize));
    randomAccessWriter.seek(40); // Write size to Subchunk2Size field
    randomAccessWriter.writeInt(Integer.reverseBytes(payloadSize));
    randomAccessWriter.close();
} catch (IOException e) {
    Log.e(WavAudioRecorder.class.getName(), "I/O exception occurred while closing output file");
    state = State.ERROR;
}

For the complete source code, one can refer to my GitHub Android tutorial project.

Record PCM Audio on Android

This post discusses how to record PCM audio on Android with the PcmAudioRecorder class. If you’re not familiar with PCM, please read a previous post, PCM Audio Format.

For the source code, please refer to AndroidPCMRecorder.

1. State Transition

The source code records PCM audio with PcmAudioRecorder, which uses the Android AudioRecord class internally. Similar to the Android MediaRecorder class, PcmAudioRecorder follows a simple state machine, as shown below.


Figure 1. State Transition of PcmAudioRecorder Class

As indicated in the diagram, we initialize a PcmAudioRecorder by calling either the getInstance static method or the constructor, which puts it in the INITIALIZING state. We then set the output file path and call prepare to reach the PREPARED state. We start recording by calling the start method, and finally call stop to stop the recording. From any state except ERROR, we can call reset to get back to the INITIALIZING state. When we’re done with recording, we call release to discard the PcmAudioRecorder object.

2. Filling the Buffer with Data and Write to File

One part of the code that requires a bit of attention is the updateListener. We register the listener with the AudioRecord object using the setRecordPositionUpdateListener method. The listener is an interface with two abstract methods, onMarkerReached and onPeriodicNotification. We implement onPeriodicNotification to pull the audio data from the AudioRecord object and save it to the output file.

For the listener to work, we need to call AudioRecord.setPositionNotificationPeriod(int) to specify how frequently the listener should be triggered to pull data. The method accepts a single argument: the update period in number of frames. This leads us to the next section.

3. Frame vs Sample

For PCM audio, a frame consists of the set of samples from all channels at a given point of time. In other words, the number of frames in a second is equal to sample rate.

However, when the audio is compressed (encoded further to MP3, AAC etc.), a frame consists of the compressed data for a whole series of samples, plus additional non-sample data. For such audio formats, sample rate and sample size refer to the data after decoding to PCM, and they are completely different from frame rate and frame size.

In our sample code, we set the update period for setPositionNotificationPeriod to the number of frames in 100 milliseconds, so the listener is triggered every 100 milliseconds, and we can pull data and update the recording file at that interval.
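As a quick sketch of that calculation (plain Java; the class and method names below are mine, not part of the recorder): for PCM, frames per second equals the sample rate, so the period in frames follows directly.

```java
public class NotificationPeriod {
    // Frames per notification interval: for PCM, frames per second == sample rate,
    // so a 100 ms interval corresponds to sampleRate / 10 frames.
    static int framesPerInterval(int sampleRate, int intervalMillis) {
        return sampleRate * intervalMillis / 1000;
    }

    public static void main(String[] args) {
        // At 44100 Hz, a 100 ms notification period is 4410 frames.
        System.out.println(framesPerInterval(44100, 100)); // 4410
    }
}
```

The value returned here is what would be passed to setPositionNotificationPeriod.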

Note that source code is modified based on

WAVE Audio File Format

WAV (Waveform Audio File Format, also known as WAVE) is a commonly used audio file format for storing raw audio samples on Windows. It follows the RIFF (Resource Interchange File Format) generic format.

0. The Format

A wave file is a RIFF file with a “WAVE” form type. It contains two sub-chunks, “fmt ” and “data” (note that there’s a trailing space in “fmt ”). Below we list the meaning of each field in a WAVE file.





RIFF chunk descriptor (12 bytes)

  • chunk tag: 4 bytes, “RIFF”
  • chunk size: 4 bytes, size of the rest of the file (total file size minus 8)
  • format: 4 bytes, “WAVE”

fmt subchunk (24 bytes): format information about the audio data

  • subchunk id: 4 bytes, “fmt ” (note the trailing space)
  • subchunk size: 4 bytes, 16 for PCM
  • audio format: 2 bytes, 1 => PCM, other values => data compressed
  • num of channels: 2 bytes, 1 => mono, 2 => stereo
  • sample rate: 4 bytes, 8000, 16000, 22050, 44100 etc.
  • byte rate: 4 bytes, sample rate * num of channels * bits per sample / 8
  • block align: 2 bytes, num of channels * bits per sample / 8
  • bits per sample: 2 bytes, 8 => 8 bits, 16 => 16 bits

data subchunk (8 bytes + audio data): contains the raw audio data

  • subchunk id: 4 bytes, “data”
  • subchunk size: 4 bytes, num of samples * num of channels * bits per sample / 8
  • audio data: the raw sound data

Note that there are 44 bytes in total (12 + 24 + 8) before the actual audio data.
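As a sanity check on the derived fields above, here is a small sketch (the helper names are mine) computing byte rate and block align for a mono, 16-bit, 44100 Hz file, the parameters of the example in the next section:

```java
public class WavHeaderFields {
    // Byte rate = sample rate * num of channels * bits per sample / 8.
    static int byteRate(int sampleRate, int numChannels, int bitsPerSample) {
        return sampleRate * numChannels * bitsPerSample / 8;
    }

    // Block align = num of channels * bits per sample / 8.
    static int blockAlign(int numChannels, int bitsPerSample) {
        return numChannels * bitsPerSample / 8;
    }

    public static void main(String[] args) {
        // Mono, 16-bit, 44100 Hz.
        System.out.println(byteRate(44100, 1, 16));  // 88200
        System.out.println(blockAlign(1, 16));       // 2
    }
}
```

These two values match the bytes decoded in the example below.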

1. Byte-by-Byte example

Below is a screenshot of a wave file shown in vim hex mode. We’ll go through the bytes one by one.


Figure 1. Bytes of a wave file recorded on Android 

5249 4646: RIFF

74d8 0400: 0004d874 = 317556 bytes. I used “ls -l test.wav” to get the file size as 317564 bytes, which is equal to 317556 + 4 (size field) + 4 (“RIFF” field).

5741 5645: WAVE

666d 7420: “fmt ” (the last byte, 0x20, is the trailing space)

1000 0000: 00000010 = 16 bytes.

0100: 0001, which corresponds to PCM, values other than 1 indicate data is compressed.

0100: 0001, only one channel.

44ac 0000: 0000 ac44 = 44100, the sample rate is 44100Hz.

8858 0100: 0001 5888 = 88200 = sample rate * number of channels * bits per sample / 8 = 44100 * 1 * 16 / 8 = 44100 * 2

0200: 0002, block align. 2 = number of channels * bits per sample / 8 = 1 * 16 / 8 = 2

1000: 0010 = 16, bits per sample

6461 7461: data

50d8 0400: 0004 d850 = 317520, data size. 317520 + 44 (total header bytes) = 317564, which matches with the total file size obtained using “ls -al”.


1. Wikipedia page: WAV

2. WAVE PCM soundfile format

PCM Audio Format

Pulse Code Modulation (PCM) is a method to represent sampled analog signals in digital form, which is the standard form for digital audio representation in computers. In order to convert an analog signal to PCM, two steps are required.

  • sampling: the magnitude of the analog signal is sampled regularly at uniform intervals.
  • quantization: the value of each sample is rounded to the nearest value expressible by the bits allowed for each sample.

Two Basic Properties

Two basic properties determine how well a PCM sequence can represent the original signal.

  • sampling rate: the number of samples taken in a second
  • bit depth: the number of bits used to represent each sample, which determines the number of values each sample can take (e.g. 8 bits => 2^8 = 256 values)
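As a worked example of how these two properties (plus the channel count) translate into data rates, here is a small sketch of mine, not from the original post:

```java
public class PcmBitrate {
    // Quantization levels for a given bit depth: 2^bitDepth.
    static long levels(int bitDepth) {
        return 1L << bitDepth;
    }

    // Uncompressed PCM bit rate in bits per second:
    // sample rate * bit depth * number of channels.
    static long bitRate(int sampleRate, int bitDepth, int channels) {
        return (long) sampleRate * bitDepth * channels;
    }

    public static void main(String[] args) {
        System.out.println(levels(8));              // 256
        System.out.println(bitRate(44100, 16, 2));  // 1411200 (stereo CD audio)
    }
}
```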

PCM Types

  • Linear PCM: The straightforward method of PCM. The samples are taken linearly and represented on a linear scale (as opposed to Logarithmic PCM etc.). It is an uncompressed format, which can be compressed by different audio codecs. When we talk about PCM, we’re generally referring to Linear PCM.
  • Logarithmic PCM: the amplitudes of samples are represented in logarithmic form. There are two major variants of log PCM, mu-law (u-law) and A-law.
  • Differential PCM (DPCM): each sample value is encoded as the difference from the previous sample value. This can reduce the number of bits required per audio sample.
  • Adaptive DPCM (ADPCM): the size of quantization step is varied so that the required bandwidth can be further reduced for a given signal-to-noise ratio.
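The DPCM idea can be illustrated with a minimal sketch (my own, ignoring quantization, so decoding here is lossless):

```java
import java.util.Arrays;

public class DpcmSketch {
    // Encode: replace each value by its difference from the previous value
    // (the first difference is taken against an initial predictor of 0).
    static int[] encode(int[] samples) {
        int[] diffs = new int[samples.length];
        int prev = 0;
        for (int i = 0; i < samples.length; i++) {
            diffs[i] = samples[i] - prev;
            prev = samples[i];
        }
        return diffs;
    }

    // Decode: a running sum of the differences restores the samples.
    static int[] decode(int[] diffs) {
        int[] samples = new int[diffs.length];
        int prev = 0;
        for (int i = 0; i < diffs.length; i++) {
            samples[i] = prev + diffs[i];
            prev = samples[i];
        }
        return samples;
    }

    public static void main(String[] args) {
        int[] samples = {100, 102, 105, 103, 99};
        int[] diffs = encode(samples);
        System.out.println(Arrays.toString(diffs));          // [100, 2, 3, -2, -4]
        System.out.println(Arrays.toString(decode(diffs)));  // [100, 102, 105, 103, 99]
    }
}
```

Note how the differences (2, 3, -2, -4) are much smaller than the samples themselves, which is why fewer bits suffice.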

Audio File Formats that Support LPCM

LPCM audio is usually stored in aiff (.aiff, .aif, .aifc), wav (.wav, .wave), au (.au, .snd), and raw (.raw, .pcm) audio files.

Slice Resynchronization in MPEG4

MPEG4 Part 2 introduced three error resilience tools: Resynchronization, Data Partitioning and Reversible VLC. This post discusses Resynchronization only.

The Problem

The bitstream of an MPEG4 video frame (like that of many other video codecs) is encoded using VLC (Variable Length Coding). Because the number of bits for each coefficient varies and the length is implicit, a VLC bitstream is sensitive to errors. If an error causes the wrong number of bits to be decoded for one coefficient, the bits for the next coefficient are affected, and so on: the decoder essentially loses synchronization with the encoder. In this way the error propagates and the video quality suffers.

GOB (Group of Blocks) in H.261 & H.263

H.261 and H.263 organize the macroblocks into groups, called Group of Blocks. Each GOB contains one or more rows of macroblocks and a GOB header with a resynchronization marker and other information that can be used to resynchronize the decoder.

The GOB approach is based on spatially periodic resynchronization: a resynchronization marker and the rest of the GOB header are inserted whenever a particular macroblock position is reached during encoding. This results in a different number of bits in each GOB, because the number of encoded bits varies per macroblock. In picture areas where more bits are used to encode the scene, the resynchronization markers are sparser, which makes it more difficult to conceal errors in those areas.

Slice in MPEG4 (Packet-Based Resynchronization)

MPEG4 adopts a video-packet-based resynchronization scheme. During encoding, a frame is divided into one or more video packets (also sometimes called slices). The length of each slice/packet is not based on the number of macroblocks. Instead, if the number of bits exceeds a predetermined threshold, the current slice is ended and a new slice is started at the next macroblock.

The structure of a slice is as below:

Resync Marker | MB_number | quant_scale | HEC | MB data

A resync marker is used to indicate the start of a new slice. It is distinguishable from all possible VLC code words and from the VOP start code. In addition, information necessary to restart the decoding process is provided, including:

macroblock_number: macroblock position of the first macroblock in the video packet, which facilitates spatial resynchronization.

quantization_scale: quantization parameters needed to decode the first macroblock, which facilitates resynchronization of differential decoding.

HEC: Header Extension Code. A single bit indicating whether additional information follows it. When set to 1, additional info is available in the video packet header: modulo_time_base, vop_time_increment, vop_coding_type, intra_dc_vlc_thr, vop_fcode_forward and vop_fcode_backward.

Note that when HEC is equal to 1, the slice header contains all necessary information to decode the slice, thus the slice can be decoded independently. If HEC is set to 0, the decoder still needs some information from somewhere else to decode the slice.

When the slice resynchronization tool is used, some encoding tools are modified to remove dependencies between any two video packets. For example, predictive encoding must be confined within a video packet to prevent propagation of errors. In other words, a slice boundary is treated as a VOP boundary in AC/DC prediction and motion vector prediction.

Fixed-Interval Resynchronization

Packet-based resynchronization produces video packets of similar, but not exactly equal, length. If an error happens to produce the same bit pattern as a resync marker, the decoder cannot tell the difference. This is normally known as start code emulation.

To avoid this problem, MPEG4 also adopts a method called fixed-interval resynchronization. It requires that VOP start codes and video packet resynchronization markers appear only at legal fixed-interval positions in the bitstream. The fixed interval is achieved by stuffing the video packet with a leading ‘0’ and zero or more ‘1’s.

At decoding, the decoder only needs to search for VOP start code and resynchronization marker at the beginning of each fixed interval. Therefore, emulating a VOP start code or resynchronization marker in the middle of a fixed interval cannot confuse the decoder.
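The stuffing pattern can be sketched as follows (my own illustration; the interval and position values are made up for the example). The idea is to pad with a ‘0’ followed by ‘1’s until the next fixed-interval boundary:

```java
public class ResyncStuffing {
    // Number of stuffing bits needed so that the next marker starts on a
    // fixed-interval boundary: a leading '0' followed by '1's.
    static String stuffingBits(int bitPosition, int interval) {
        int n = interval - (bitPosition % interval); // bits to the next boundary
        StringBuilder sb = new StringBuilder("0");
        for (int i = 1; i < n; i++) sb.append('1');
        return sb.toString();
    }

    public static void main(String[] args) {
        // A packet ending at bit 1021 with a 256-bit interval needs 3 stuffing
        // bits (1021 mod 256 = 253, 256 - 253 = 3).
        System.out.println(stuffingBits(1021, 256)); // 011
    }
}
```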


1. The MPEG-4 Book, by Fernando C.N. Pereira, Touradj Ebrahimi
2. MPEG-4 Standard, Part 2, Annex E.1 Error Resilience

A Quick Note on Error Recovery for Streaming Applications

There are generally three ways to recover from errors caused by Internet transmission: Retransmission, Redundant Data, and Error Concealment.


Retransmission is easy to understand: just resend the packet that was lost or corrupted. It has the following pros and cons.


Pros:

  • Retransmission resends the entire packet, so the lost or corrupted data is repaired accurately.
  • It has low overhead when there is enough bandwidth, as it requires little extra computation.

Cons:

  • Retransmission increases delay: the sender needs to wait for feedback from the receiver about which packets need retransmission, or for a timeout, and the retransmission itself takes time.
  • Retransmission takes bandwidth. If the network is congested, retransmission can make the congestion worse.

In the context of video streaming applications, other aspects matter as well, and retransmission can be selective based on a few factors:

  • Important/urgent packets are retransmitted first. For example, I-frame packets are more important than B- and P-frame packets.
  • Packets are only retransmitted when there is enough time. For example, if a packet has already missed its playback time, there is no need to retransmit it.

Redundant Data

Redundant Data is better known as Forward Error Correction (FEC), a technique that sends additional information which can be used to recover lost or corrupted packets at the receiver side.


Pros:

  • The bandwidth requirement doesn’t change when there is loss, so it won’t make congestion worse if there is any.
  • The receiver doesn’t need any action from the sender in order to recover.
  • Delay can be small if the computation is fast. This also depends on the structure of the FEC, as recovery may require waiting for several packets to arrive first.

Cons:

  • It has a constant overhead in terms of network bandwidth: even when there is no loss or error, the redundant data still has to be sent.
  • Against burst data loss, this method can fail.

To improve FEC against burst data loss, two methods can be used:

  1. Arrange packets in multiple dimensions.
  2. Interleaving, rearrange the packet order when doing FEC.
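Interleaving can be sketched as follows (a toy example of mine): packets are written into a grid row by row and sent column by column, so a burst of consecutive losses on the wire maps to widely spaced losses in the original order.

```java
import java.util.ArrayList;
import java.util.List;

public class InterleavingSketch {
    // Write packet indices row by row into a depth x width grid, then emit
    // them column by column. Assumes count is a multiple of depth.
    static List<Integer> interleave(int count, int depth) {
        List<Integer> order = new ArrayList<>();
        int width = count / depth;
        for (int col = 0; col < width; col++)
            for (int row = 0; row < depth; row++)
                order.add(row * width + col);
        return order;
    }

    public static void main(String[] args) {
        // 12 packets, depth 3: wire order is 0,4,8, 1,5,9, 2,6,10, 3,7,11.
        System.out.println(interleave(12, 3));
        // Losing three consecutive wire packets (say those carrying 1, 5, 9)
        // loses originals spaced 4 apart instead of 3 consecutive ones.
    }
}
```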

Interleaving itself has some pros and cons.

Pros:

  • It turns burst data loss into random data loss, as consecutive packets in the network transmission are not consecutive packets in the FEC arrangement.
  • It adds no bandwidth or computation overhead; only a rearrangement is done.

Cons:

  • It can increase delay, as recovery needs to wait for packets spaced further apart in the network transmission.

Error Concealment

Error concealment refers to techniques that try to conceal the error instead of recovering it accurately.


Pros:

  • There is no bandwidth overhead, whether or not there is congestion.
  • Delay can be small if the concealment computation is fast.

Cons:

  • It may not always give a good result.
  • The computation can be complicated if complex concealment algorithms are used.

Concealment techniques include splicing, noise substitution, repetition, interpolation, regeneration, etc.

Dynamic Adaptive Streaming over HTTP

This is the third method of HTTP video delivery. Unlike the first two methods, HTTP Progressive Download and Play and HTTP Pseudo-streaming, this is a real streaming technology, and it is applicable to both live video and video on demand (VOD).

The basic idea of Dynamic Adaptive Streaming over HTTP (DASH) is to divide the video clip/stream into small chunks (called streamlets). These chunks are downloaded to the browser’s cache and combined by the client (browser or browser plug-in) for playout.

This technology is relatively new and has been implemented by several companies in different ways under different names.

  • HTTP Live Streaming (Apple)
  • Smooth Streaming (Microsoft)
  • HTTP Dynamic Streaming (Adobe)
  • Adaptive Bitrate (Octoshape)

All methods are based on the basic idea mentioned above, but unfortunately they’re not compatible with each other. HTTP Dynamic Streaming is supported on the Adobe Flash Platform with Adobe Flash Media Server (server side) and Adobe Flash Player 10.1 and above (client side). Smooth Streaming is supported by Microsoft IIS Server (server side) and Silverlight (client side). HTTP Live Streaming is supported by a number of servers (e.g. Wowza Media Server) and clients (e.g. QuickTime Player); a list can be found here.

The reason there are more servers and clients available for Apple’s HTTP Live Streaming is that Apple’s iPhone and iPad devices support HTTP Live Streaming playout, and Apple has made its HTTP Live Streaming specification available as an RFC here.

Here we illustrate the basic idea using HTTP Dynamic Streaming from Adobe as an example. One can find demo videos here, or here.

When I started viewing the video, I checked my browser cache (I’m using Chrome on Ubuntu) at ~/.cache/google-chrome/Default/Cache/ and saw a list of files (using the ls -l command). While the video was playing, I kept refreshing the list (by typing ls -l repeatedly) and found that a new file was created about every 10 seconds.

Note that if you’re on Windows 7, the browser cache should be at

C:\Users\roman10\AppData\Local\Google\Chrome\User Data\Default\Cache

where roman10 is my username.

Then I paused the video, and no new files were created. When I resumed playback, new files started to appear again. These files are actually small video chunks with metadata, and the Flash player combines them dynamically into a video stream as the video plays. I tried to play a single chunk using VLC player, but it wouldn’t play.

For the advantages and disadvantages of DASH in comparison with the other two HTTP methods, please refer to HTTP Video Delivery.

Update: Scheduling Logic is in Client (Player)

One important fact about DASH is that the scheduling logic is implemented on the client/player side, because HTTP is stateless and the server doesn’t keep track of a video session. Therefore, the player needs to measure the network condition and dynamically switch between video chunks of different sizes and qualities.
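A toy version of that client-side logic (the class name, bitrate ladder and safety factor are all made up for illustration, not taken from any particular player): measure throughput, then pick the highest rendition the measured bandwidth can comfortably sustain.

```java
public class RenditionPicker {
    // Available bitrates in kbit/s, sorted ascending (illustrative values).
    static final int[] BITRATES = {300, 700, 1500, 3000};

    // Pick the highest bitrate at or below 80% of the measured throughput,
    // keeping the lowest rendition as a fallback.
    static int pick(int measuredKbps) {
        int chosen = BITRATES[0];
        for (int b : BITRATES) {
            if (b <= measuredKbps * 0.8) chosen = b;
        }
        return chosen;
    }

    public static void main(String[] args) {
        System.out.println(pick(2500)); // 1500: 2500 * 0.8 = 2000, so 3000 is too high
        System.out.println(pick(200));  // 300: falls back to the lowest rendition
    }
}
```

Real players add smoothing of the throughput estimate and buffer-level heuristics on top of this basic selection.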

HTTP Video Delivery — HTTP Pseudo Streaming

HTTP Pseudo Streaming is the second method in HTTP Video Delivery. The method is also based on HTTP progressive download, like the first method, and it makes use of HTTP’s partial download functionality to allow the user to seek to a part of the video that has not been downloaded yet.

HTTP Pseudo Streaming requires support from both the client side and the server side. For the server side, plug-ins are available for Apache, lighttpd, etc. On the client side, custom players are required to resynchronize the video, read metadata, etc. Two examples of players that support HTTP Pseudo Streaming are JWPlayer and FlowPlayer.

For an example of HTTP Pseudo Streaming, open any YouTube video. YouTube actually uses lighttpd on the server side and its own customized Flash-based player on the client side (a Flash player and a Flash-based media player are different things; please refer here).

Below is a screenshot of the Wireshark network traffic capture when I was watching a Youtube video,


Figure 1. Wireshark Capture of HTTP Request for HTTP Pseudo Streaming

As there are lots of segmented packets, I selected a video packet, right-clicked it and selected “Follow TCP Stream” to get the screen above. It’s a single HTTP request followed by a response from the YouTube HTTP server.

I then sought to a part that had not been downloaded yet and did “Follow TCP Stream” again. The HTTP headers contain the following strings for the initial request and the seek:

burst=40&sver=3&signature=B3B26708552F1C9FE687AAB13EFE6D73F294624F.0F9EEB822A5CF4AE5443CC5798B2F415C16B75E4&redirect_counter=1 HTTP/1.1


The seek HTTP request contains the string “begin=1032070”, which should be used by the HTTP server to jump to the corresponding portion of the video.

As with the first method, HTTP Pseudo Streaming downloads the video clip to the browser cache. For Google Chrome, one can find the video clip at:

Ubuntu Linux: /home/roman10/.cache/google-chrome/Default/Cache/

Windows: C:\Users\roman10\AppData\Local\Google\Chrome\User Data\Default\Cache

where roman10 is my username on both OSes.

For the benefits and drawbacks of this method, and its comparison with the other HTTP methods, please refer to HTTP Video Delivery. Note that since HTTP pseudo streaming is not real streaming, it doesn’t support live video streaming.

HTTP Video Delivery–HTTP Progressive Download and Play

Progressive Download and Play is the first of the three methods in HTTP Video Delivery. The basic idea of this method is to embed the video through a media player (e.g. JWPlayer, FlowPlayer etc.). When the user requests to play the video, an HTTP GET request is sent to the web server, and the video is downloaded through HTTP for playout.

This method is supported by web servers by default, as the server treats a video as a normal file, like an image or a CSS file. Playout, buffering and other video playback specifics are handled by the media player, which usually consists of some JavaScript and Flash objects. More info about these players can be found here.

Below is an example of an HTTP Progressive Download and Play video, delivered using JWPlayer as the client.


Once you click the button to start playback, the video download is triggered. Below is a screenshot of the network traffic captured using Wireshark.


Figure 1. Wireshark Capture Screenshot

The video is actually downloaded to the browser cache. As I used Google Chrome on Ubuntu Linux to watch the video, the file was saved to the /home/roman10/.cache/google-chrome/Default/Cache/ folder. Below is a screenshot of the files under that directory.


Figure 2. List of Files under Google Chrome Cache

I can locate the video clip either by time or by file size, and play it with VLC as shown below.


Figure 3. VLC Play Out of the Cached Video

If you’re using Windows, the video file can be found under,

C:\Users\roman10\AppData\Local\Google\Chrome\User Data\Default\Cache, where roman10 is the username.

For the benefits and drawbacks of this method, and its comparison with other HTTP video delivery method, you can refer here.

You may also want to check out,

HTTP Pseudo Streaming

Dynamic Adaptive Streaming over HTTP

UDP vs TCP–In the Context of Video Streaming

1. TCP Congestion Control and Window Size

TCP maintains a congestion window, which indicates the number of packets TCP can send before an acknowledgement of the first packet sent is received.

The congestion window size defaults to 2 times the maximum segment size (MSS, the largest amount of data, in bytes, that can be sent in a single TCP segment, excluding the TCP and IP headers). TCP follows a slow-start process, increasing the congestion window by 1 MSS every time a packet acknowledgement is received (which effectively doubles the congestion window every RTT), until the congestion window size exceeds a threshold called ssthresh. Then TCP enters congestion avoidance.

In congestion avoidance, as long as non-duplicate ACKs are received, the congestion window is increased by 1 MSS every RTT. When duplicate ACKs are received, the chance that a packet has been lost or delayed is high. Depending on the implementation, the actions taken by TCP vary:

  • Tahoe: Triple duplicate ACKs are treated the same as a timeout: fast retransmission is performed, the congestion window is reduced to 1 MSS, and slow start begins again.
  • Reno: If triple duplicate ACKs are received, TCP reduces the congestion window by half, performs a fast retransmit, and enters Fast Recovery. On an ACK timeout, it also restarts from slow start.

In Fast Recovery, TCP retransmits the missing packets and waits for an acknowledgement of the entire transmit window before returning to congestion avoidance. If no ACK arrives, it is treated as a timeout.
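To make these window dynamics concrete, here is a toy sketch (my own simplification, loosely Reno-flavored; the window is in units of MSS, and real implementations track much more state):

```java
public class CongestionWindowSketch {
    // Slow start: cwnd doubles every RTT until it reaches ssthresh.
    static double slowStart(double cwnd, double ssthresh) {
        while (cwnd < ssthresh) cwnd *= 2;
        return cwnd;
    }

    // Congestion avoidance: +1 MSS per RTT.
    static double avoid(double cwnd, int rtts) {
        return cwnd + rtts;
    }

    // Reno on triple duplicate ACKs: halve the window.
    static double onTripleDupAck(double cwnd) { return cwnd / 2; }

    // Timeout (or Tahoe on triple duplicate ACKs): back to 1 MSS.
    static double onTimeout() { return 1; }

    public static void main(String[] args) {
        double cwnd = slowStart(2, 16);   // 2 -> 4 -> 8 -> 16
        System.out.println(cwnd);         // 16.0
        cwnd = avoid(cwnd, 4);            // four RTTs of congestion avoidance
        System.out.println(cwnd);         // 20.0
        cwnd = onTripleDupAck(cwnd);      // Reno halves the window
        System.out.println(cwnd);         // 10.0
        System.out.println(onTimeout());  // 1.0 -- slow start restarts
    }
}
```

The jump from 20 down to 10 (or to 1 on timeout) is exactly the throughput swing the next paragraph refers to.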

As the window size controls the number of unacknowledged packets TCP can send, a single packet loss can reduce the window size, and hence the throughput, significantly. This makes TCP throughput unpredictable for video streaming applications.

2. TCP Reliability and Retransmission

TCP is a reliable service. When a packet is lost, TCP will retransmit the packet. For live streaming application, the retransmitted packets may already be late for playback and take over bandwidth unnecessarily.

3. Multicast + TCP has Problems

Multicast can be effective for streaming applications (at least in theory). But TCP is connection-oriented: it requires the two parties to establish a connection. Multicast at the TCP layer is tough, as it requires every end player to send ACKs to the server streaming the content, which is hard to scale.

4. TCP and UDP Packet Header Size

TCP has a typical header size of 20 bytes, while UDP has an 8-byte header, so UDP occupies less bandwidth per packet.