HTTP/HTTPS Load Testing with http_load – Part 2

This is the second part of HTTP/HTTPS load testing with http_load. Please read part 1 first.

2.3 Investigation of Weird Results

We add printf statements to print out the connect time and the read-response time. The original three lines of code are as below (they're at different places in the http_load.c file).

struct timeval now, nowt, nowtt;

handle_connect( cnum, &now, 1 );

handle_read( cnum, &now );

The modified code is as below.

struct timeval now, nowt, nowtt;

(void) gettimeofday( &nowt, (struct timezone*) 0 );

handle_connect( cnum, &now, 1 );

(void) gettimeofday( &nowtt, (struct timezone*) 0 );

long connect_t = delta_timeval(&nowt, &nowtt);

printf("handle_connect time: %ld\n", connect_t);

(void) gettimeofday( &nowt, (struct timezone*) 0 );

handle_read( cnum, &now );

(void) gettimeofday( &nowtt, (struct timezone*) 0 );

long read_t = delta_timeval(&nowt, &nowtt);

printf("handle_read time: %ld\n", read_t);

We run the test with surls.txt and surls2.txt again. Below are the results.

First, the result for surls.txt:

Then, the result for surls2.txt:

The connect time for github.com is more than 30 times the connect time to chrome.google.com. When there are multiple parallel connections, connecting takes so much time that reading the responses is delayed. This is a consequence of http_load's design, which we discuss next.

3. The Drawbacks of http_load

http_load has the following drawbacks.

3.1. Single thread multiple connections

http_load uses the Linux select call to handle multiple connections with a single thread. This design is good when both the connect time and the read-response time are short, which is the case for most HTTP requests, but not for most HTTPS requests. As shown above, it takes almost half a second to connect to github.com.

When the connect time is long and there are parallel connections, events are likely to queue up. In other words, when a response is received by the http_load host, http_load cannot read it immediately because it's busy establishing other connections or reading other responses. Therefore, the response time obtained is not accurate, as it includes the time a response waited before http_load read it.

A better design could be a thread-pool approach, which would reduce the wait time since multiple responses could be read by multiple threads concurrently.

With the current http_load, if we're testing HTTPS, it's better to set -parallel 1 and run multiple instances of http_load. Note that setting -rate may not work, as it can still open parallel connections.

3.2. Read all URLs into memory

http_load loads every URL into memory before it starts to send out HTTP requests. This makes it easy to pick a URL randomly for each request, but a big URL file may cause a large amount of memory allocation and a long initialization time. I encountered cases where a 10-second test took more than 10 minutes to finish when the input file contained about half a million URLs.

In my view, Apache JMeter is a better alternative for HTTPS testing.

References:
1. http_load website: http://www.acme.com/software/http_load/

HTTP/HTTPS Load Testing with http_load – Part 1

http_load is a simple open source tool that sends multiple HTTP requests in parallel. It runs a single thread, picking URLs randomly from an input file. Both HTTP and HTTPS are supported.

0. Build http_load

Building http_load is quite simple.

  • Download the source from the http_load website and extract it:

tar xvf http_load-12mar2006.tar.gz

  • Open a terminal, go to the extracted directory, and type make to build it.
  • Start http_load with the ./http_load command.

1. Use http_load for HTTP testing

1.1 Controlling Request Rate

http_load supports a few options. We can control the request rate with either -parallel or -rate. -parallel indicates the number of concurrent connections. For example, if we specify -parallel 5, http_load will try to establish and maintain 5 connections (creating a new connection once one is closed), and send one request through each connection.

-rate controls the request rate in a different manner: it is the number of requests sent per second. If we specify -rate 5, http_load will make 5 requests per second and determine internally how many connections to open. We can specify either -parallel or -rate, but not both.

1.2 Controlling Test Time

Another important parameter is -seconds, which specifies how long the test should run. Alternatively, -fetches specifies the number of requests to send for the entire test.

1.3 Specify Test URLs

The last parameter is always the url_file, a file containing a list of URLs to test. http_load reads all URLs at the beginning and randomly picks one for each request. There are a few other parameters; refer to the http_load help message for more info.

1.4 An Example

As an example, suppose we have a URL file named urls.txt containing the following URLs.

http://en.wikipedia.org/wiki/Hello_world_program
http://en.wikipedia.org/wiki/Computer_program
http://en.wikipedia.org/wiki/Computer_multitasking
http://en.wikipedia.org/wiki/Computer_process
http://en.wikipedia.org/wiki/Context_switch
http://en.wikipedia.org/wiki/Central_processing_unit
http://en.wikipedia.org/wiki/Integrated_circuit
http://en.wikipedia.org/wiki/Microchip_(disambiguation)
http://en.wikipedia.org/wiki/Microchip_implant_(animal)
http://en.wikipedia.org/wiki/Penguins

Then, if we want to simulate 5 users sending requests for 10 seconds, we can run the command below,

./http_load -parallel 5 -seconds 10 urls.txt

http_load will output the results once the test is completed.

From the results, we can read the min, max, and average response times, the throughput (number of fetches per second), etc.

2. Use http_load for HTTPS Testing

2.1 Compile http_load with HTTPS support

By default, HTTPS support is disabled. We can enable it by uncommenting the lines below in the Makefile.

#SSL_TREE =     /usr/local/ssl
#SSL_DEFS =     -DUSE_SSL
#SSL_INC =      -I$(SSL_TREE)/include
#SSL_LIBS =      -L$(SSL_TREE)/lib -lssl -lcrypto

Once the lines above are uncommented, simply type the following commands in a terminal to rebuild it.

make clean
make

2.2 An Example

HTTPS test works the same as HTTP test. Suppose we have the URLs below in a file named surls.txt.

https://github.com/

And another file named surls2.txt.

https://chrome.google.com/webstore/category/home
https://chrome.google.com/webstore/category/popular
https://chrome.google.com/webstore/category/from_your_circles
https://chrome.google.com/webstore/category/trending
https://chrome.google.com/webstore/category/collections
https://chrome.google.com/webstore/category/collection/perfect_date
https://chrome.google.com/webstore/category/collection/football
https://chrome.google.com/webstore/detail/gocaejggjgdmkhmbinicknpbhagkblop
https://chrome.google.com/webstore/category/collection/editors_picks
https://chrome.google.com/webstore/category/app/3-games

We first test with surls.txt, and below is the result.

Then surls2.txt, and below is the result.

The result for surls.txt seems weird. As we increase the number of parallel connections, the response time increases too. For a website like github.com, it's unlikely to be a server-side issue. Could it be a client-side issue? No, because the test with surls2.txt shows the client can handle 30 parallel connections just fine. So why? We investigate this issue in the next post.

Shall We Use Input Parameters as Loop Variables?

Several days ago, I had a discussion with another developer about whether we should use input parameters as loop variables in a C program.

To illustrate the problem, suppose we have two functions as below.

void testlocalb(int *a, int lcnt) {

    int i;

    for (i = 0; i < lcnt; ++i) {

    }

    *a = i;

}

 

void testlocal(int *a, int n) {

    for (*a = 0; *a < n; ++(*a)) {

    }

}

The first function uses a local variable as the loop counter, while the second one uses the input parameter as the loop counter. The question is: which one is more efficient?

In terms of space complexity, the second function doesn't use a local variable, so it saves some memory on the stack. But how about execution time?

1. The local variable will be put into a register, so the local-variable version is more efficient???

I wrote three different programs to verify whether this is true. The code is as below,

#include <stdio.h>

#include <stdlib.h>

#include <sys/time.h>

 

void testlocala(int *a, int n) {

    for (*a = 0; *a < n; ++(*a)) {

         //dummy: do nothing

    }

}

 

int main(int argc, char **argv) {

    int n = atoi(argv[1]);

    int rv;

    struct timeval stTime, edTime;

    gettimeofday(&stTime, NULL);

    testlocala(&rv, n);

    gettimeofday(&edTime, NULL);

    printf("%d\n", rv);

    printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

    return 0;

}

The program above uses the input parameter as the loop counter.

#include <stdio.h>

#include <stdlib.h>

#include <sys/time.h>

 

void testlocalb(int *a, int lcnt) {

    int i;

    for (i = 0; i < lcnt; ++i) {

    }

    *a = i;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n);

   gettimeofday(&edTime, NULL);

   printf("%d\n", rv);

   printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

The second program listed above uses a local variable as the loop counter.

#include <stdio.h>

#include <stdlib.h>

#include <sys/time.h>

 

void testlocalb(int *a, int lcnt) {

    register int i;

    for (i = 0; i < lcnt; ++i) {

    }

    *a = i;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n);

   gettimeofday(&edTime, NULL);

   printf("%d\n", rv);

   printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

The third program also uses the local variable, but declares it as register.

To check whether the loop counter is put into register, we compile the code into assembly code using the command below,

gcc -S local1.c

The loop essentially contains one read and one write access to the loop counter: the read happens at the comparison, and the write happens at the loop counter increment. The assembly code for the three functions is shown as follows.

Firstly, the loop counter increment, which is the write access.

First function: using input parameter as loop counter

movl    8(%ebp), %eax                           #1

movl    (%eax), %eax                             #2

leal    1(%eax), %edx                           #3

movl    8(%ebp), %eax                           #4

movl    %edx, (%eax)                             #5

Second function: using local variable

addl $1, -4(%ebp)

Third function: using local variable with register declaration

addl $1, %ebx

From the assembly code, we can tell that the local variable without the register declaration is not put into a register, while the loop counter declared with register in the third function is.

The first function is a bit more complicated: ++(*a) is actually two operations, the dereference (the first two lines) and the increment (lines 3 to 5). The increment is performed in a register by the third line and written back to memory by the fifth line.

Secondly, the comparison, which is the read access.

First function: using input parameter as loop counter

movl 8(%ebp), %eax
movl (%eax), %eax
cmpl 12(%ebp), %eax

Second function: using local variable

movl -4(%ebp), %eax
cmpl 12(%ebp), %eax

Third function: using local variable with register declaration

cmpl 12(%ebp), %ebx

Again, the first function requires two operations because we're accessing the counter through a pointer, and the value is loaded from memory into register eax. The second function also loads the value from stack memory into register eax. The third function just does the comparison, as the loop counter is already in a register.

Which function is fastest? It's not difficult to tell that function 3 will be the fastest, as both the read and the write are single instructions performed on a register. How about the first and the second?

Compile the code using the command below,

gcc -o local1 local1.c
gcc -o local2 local2.c
gcc -o local3 local3.c

Below is a sample execution result,

Figure 1. Running 3 testing programs with default compiler options

The third function runs fastest, as expected, followed by the first function, with the second function last.

One can add code blocks like the ones below inside the loop to further verify whether the performance difference is due to reads or writes.

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

 

++(*a);

--(*a);

++(*a);

--(*a);

++(*a);

--(*a);

++(*a);

--(*a);

++(*a);

--(*a);

On my machine, the input parameter (function 1) performs better for both the read block and the write block.

To conclude the first question: the local variable is not put into a register under the default compiler options. Declaring the local variable as "register" hints to the compiler that the variable should be kept in a register, which sometimes helps. Without "register", the input parameter performs better than the local variable for both read and write accesses.

2. What if the compiler optimization options are used?

With optimization options enabled, the loop in the code above is optimized away. I changed the code as below to prevent the loop from being removed.

First function: using input parameter as loop counter

#include <stdio.h>

#include <stdlib.h>

#include <limits.h>

#include <sys/time.h>

 

void testlocala(int *a, int n, int *b) {

    unsigned int x = 1, y = 1;

    for (*a = 0; *a < n; ++(*a)) {

        x += y;

        y = x - y;

        if (y > INT_MAX) {

            x = 1;

            y = 1;

        }

    }

    *b = y;

}

 

int main(int argc, char **argv) {

    int n = atoi(argv[1]);

    int rv, rv1;

    struct timeval stTime, edTime;

    gettimeofday(&stTime, NULL);

    testlocala(&rv, n, &rv1);

    gettimeofday(&edTime, NULL);

    printf("%d:%d\n", rv, rv1);

    printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

    return 0;

}

Second function: using local variable

#include <stdio.h>

#include <stdlib.h>

#include <limits.h>

#include <sys/time.h>

 

void testlocalb(int *a, int lcnt, int *b) {

    int i;

    unsigned int x = 1, y = 1;

    for (i = 0; i < lcnt; ++i) {

        x += y;

        y = x - y;

        if (y > INT_MAX) {

            x = 1; 

            y = 1;

        }

    }

    *a = i;

    *b = y;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv, rv1;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n, &rv1);

   gettimeofday(&edTime, NULL);

   printf("%d:%d\n", rv, rv1);

   printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

Third function: using local variable with register declaration

#include <stdio.h>

#include <stdlib.h>

#include <limits.h>

#include <sys/time.h>

 

void testlocalb(int *a, int lcnt, int *b) {

    register int i;

    unsigned int x = 1, y = 1;

    for (i = 0; i < lcnt; ++i) {

        x += y;

        y = x - y;

        if (y > INT_MAX) {

            x = 1;

            y = 1;

        }

    }

    *a = i;

    *b = y;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv, rv1;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n, &rv1);

   gettimeofday(&edTime, NULL);

   printf("%d:%d\n", rv, rv1);

   printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

We compile the 3 functions into assembly using the commands below,

gcc -O1 -S local11.c
gcc -O1 -S local22.c
gcc -O1 -S local33.c

The optimized assembly code for the loop counter increment and the loop condition test is the same for all three versions,

addl $1, %ecx
cmpl %ebx, %ecx

The code for loading the loop counter value into register ecx is not listed for simplicity. The execution times for the 3 programs are as below. (Note that a value like 4:4294877639 means the tv_usec difference was negative and wrapped around when printed as unsigned, i.e. the elapsed time is slightly under 4 seconds.)

function 1: input parameter,  3:172661
function 2: local variable, 4:4294877639
function 3: local variable with register declaration, 4:4294786759

We can also compile the code using O2 and O3 optimization options. The execution time are as below,

For O2,

function 1: input parameter,  3:669508
function 2: local variable, 3:572726
function 3: local variable with register declaration, 3:641155

For O3,

function 1: input parameter,  4:4294628948
function 2: local variable, 3:612397
function 3: local variable with register declaration, 3:688610

For O3 + -funroll-loops (gcc -O3 -funroll-loops local11.c -o local11),

function 1: input parameter,  4:4294060070
function 2: local variable, 3:69106
function 3: local variable with register declaration, 3:62729

In conclusion, with proper compiler flags the code is optimized so that the input-parameter and local-variable versions achieve similar run times.

Note that each experiment was carried out only once. For more accurate results, one should run each program many times and take the average.

References:
1. http://www.network-theory.co.uk/docs/gccintro/gccintro_50.html
2. http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Simple Profiling of MD5

MD5 (Message-Digest) is an algorithm commonly used as a cryptographic hash function. It takes data of arbitrary length and produces a 128-bit (16-byte) hash value. MD5 is no longer recommended for secure applications because attack messages can be constructed to produce collisions. However, without an intentional attack, MD5 is extremely unlikely to produce a collision, and the computation is fast (the later part provides test code). Therefore it is still useful as a hash function in situations where security is not a concern but collisions are undesired.

Reference 2 provides links to many implementations of MD5 in various programming languages. This post takes the code from L. Peter Deutsch's implementation and adds a test function to measure how fast MD5 computes.

The test uses a list of URL strings stored in a file. It reads the strings from the file and computes the MD5 for each URL. The test function is as below,

static void do_n_test(void) {

    FILE *testFile;

    int UNIQUE_URLS = 1000;

    char url[UNIQUE_URLS][4096];

    int numOfUrlsTested = 0;

    int i = 0, j, READ_FILE_TIMES = 1;   

    struct timeval stTime, edTime;

 

    testFile = fopen("./urllist.txt", "r");

    if (testFile == NULL) {

        printf("cannot open test file\n");

        return;

    }

    for (; i < UNIQUE_URLS; ++i) {

        fgets(url[i], 4096, testFile);

    }

    fclose(testFile);

    gettimeofday(&stTime, NULL);

    for (i = 0; i < READ_FILE_TIMES; ++i) {

        for (j = 0; j < UNIQUE_URLS; ++j) {

            md5_state_t state;

            md5_byte_t digest[16];

    

            char hex_output[16*2 + 1];

            int di;

            md5_init(&state);

            md5_append(&state, (const md5_byte_t *)url[j], strlen(url[j]));

            md5_finish(&state, digest);

            /*for (di = 0; di < 16; ++di)

                sprintf(hex_output + di * 2, "%02x", digest[di]);

            printf("%d: %s\n", numOfUrlsTested, hex_output);*/

            ++numOfUrlsTested;

        }

    }

    gettimeofday(&edTime, NULL);

    printf("%d: %u:%u\n", numOfUrlsTested, (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

The test file contains more than 3000 URLs, but loading all of them may exceed the available stack memory. You can set the UNIQUE_URLS variable to adjust the number of unique URLs to load. You can also adjust the READ_FILE_TIMES variable to set the number of passes to run. The total number of MD5 computations will be UNIQUE_URLS x READ_FILE_TIMES.

To compile the code, use the command,

gcc -o test md5.c md5main.c -lm

To run test,

./test -test

On my Ubuntu machine with an Intel i5 quad core at 1197 MHz and 4GB memory, one MD5 computation takes about 4 microseconds.

You can download the entire code from here. Note that the main MD5 computation code is from reference 2 by L. Peter Deutsch.

References:
1. MD5 Wikipedia Page: http://en.wikipedia.org/wiki/Md5
2. MD5 Homepage (unofficial): http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
3. Hash functions: An empirical comparison: http://www.strchr.com/hash_functions

Frequent Custom Status Bar Notification is Evil on Android

The Story Behind

In my recent app, video converter for Android, I added a notification progress report on the status bar, so users can view the video conversion progress at any time. I also made several other changes, and then started running the final test before release.

Then I found the app was much slower compared with the previous version, and I couldn't figure out why. I turned on the Power Tutor app and found that the system process was consuming a lot of CPU. The status bar is updated through NotificationManager, which is a system service. I guessed it was because I was updating the status bar progress every second.

In the end, I changed the status progress update interval and the app worked as normal. But just to demonstrate that frequent custom status bar notification really is evil, I wrote a simple testing app that does nothing but update a status bar notification.

Why I Say it’s Evil

The key part of the code is as below (you can get the entire code at the end of the post),

SystemClock.sleep(1000);

CharSequence title = "Freq noti is evil: " + mCount;

CharSequence content = "Freq notification update takes too much CPU";

if (CUSTOM_NOTI) {

    noti.contentView.setTextViewText(R.id.status_text, title);

    noti.contentView.setProgressBar(R.id.status_progress, 100, mCount%100, false);

} else {

    Intent notiIntent = new Intent(context, StatusBarNotificationActivity.class);

    PendingIntent pi = PendingIntent.getActivity(context, 0, notiIntent, 0);

    noti.setLatestEventInfo(context, title, content, pi);

}

//nm = (NotificationManager) getSystemService(Context.NOTIFICATION_SERVICE);

//nm = (NotificationManager) getApplicationContext().getSystemService(getApplicationContext().NOTIFICATION_SERVICE);

nm.notify(STATUS_BAR_NOTIFICATION, noti);

The testing app supports both custom and default status notifications. The custom status bar notification contains an icon, a text view and a progress bar. The app updates the notification once every second.

The Power Tutor CPU power consumption curve is as below for default status notification,

Figure 1. CPU Power Consumption with Frequent Default Status Bar Notification Update

The left one is before the testing app starts; the right one is while the testing app is running. There's almost no difference.

The Power Tutor CPU power consumption curve is as below for custom status notification,

 

Figure 2. CPU Power Consumption with Frequent Custom Status Bar Notification

The first one is before the testing app starts, the second is while the testing app is running, and the third is after the testing app has finished. Clearly, frequent updates of a custom notification consume a lot of CPU resources.

I also tried without updating the progress bar (so only the title text of the notification is updated); the consumption is less, but still much higher than with the default notification.

Figure 3. Custom Notification without Progress bar

You can download the testing app here or from my github. Note that you might need to run the testing app for a while to see the results. In some of my tests, the CPU consumption was low initially but rose after a while.

The testing was done on a Nexus One device with Android 2.3; the Android SDK versions used for development were 2.1 and 4.0.

Power Tutor – An Android App that Measures the Power Consumption of Apps

Mobile computing and smartphones could not be what they are today without powerful and energy-efficient batteries. Mobile application developers, however, are not used to thinking about building energy-efficient apps, and sometimes this affects an application a lot.

I recently found a tool, Power Tutor, that runs on Android devices to measure the power performance of the device. This post is about how Power Tutor works and how it can help Android app development. It doesn't cover every detail of the theory (refer to the research paper in reference 1 if interested), but briefly describes how it works in general.

Component-Based Measurement

Power Tutor measures energy consumption by component, including CPU, WiFi, 3G, display, etc. Each component has different states, and the power consumption rate at each state differs. Power Tutor includes a set of parameters for each state describing how fast power is consumed in that state. The parameters were obtained by the developers from offline experiments. Using these parameters, they construct a power model to measure the energy. Below are screenshots of a Power Tutor measurement for an Android phone.

Figure 1. Power Measurement for Android Phone

The parameters used in the current app were obtained from several HTC phones, so it works best for those models (refer to reference 1 for model details). The method the Power Tutor developers/researchers proposed can obtain the parameters for other phones, but the app doesn't provide such a function.

Power Consumption for Each App

Power Tutor has an “Application Viewer” function to view the power consumption of each app.

On Android, each app/process is treated as a separate user with its own UID. Under the /proc/uid_stat/<UID>/ directory, lots of information is available about the app/process, including data transmitted, memory usage, etc. Power Tutor maps each process ID to a UID, then uses the stats found under the <UID> folder to decide the state of each component. Based on the power model, Power Tutor then computes the energy consumption.

Power Tutor assumes the energy consumption of each app is independent. In other words, it assumes app A consumes the same amount of energy with or without app B running. In this way, based on the per-component statistics for each UID/app, Power Tutor obtains the power consumption for each UID/app, then sums them up to get the power consumption of the entire system.

If your app runs as a superuser program, the Android system assigns it a UID of 0. In this case, it breaks the Android rule of “each app is treated as a separate user”; Power Tutor shows UID 0 as the “kernel” process.

Below are a screenshot of the stats for all running apps and a screenshot of the video converter app power consumption,

Figure 2. App-Based Power Measurement

How to Use it to Help Development

First of all, it can be used to measure the power performance of your app. Because Power Tutor breaks the power consumption down by component, you can also get an idea of what kind of activity is consuming the battery in your app. Is it network communication through WiFi? Or too much CPU consumption?

Also, Power Tutor measures power based on statistics obtained from the /proc and /sys directories, which indirectly reflect the usage of hardware components. If your app constantly consumes a lot of CPU component power, it is probably CPU-bound. I give an example in another post here.

References:
1. Power Tutor website: http://powertutor.org/

Use memset to Avoid Loops in C

The Usage of memset

memset has the following definition,

void* memset(void* s, int c, size_t n);

The function sets each of the first n bytes of the memory pointed to by s to c and returns s. Note that although the second parameter has type int, it's converted to unsigned char before the memory operation is carried out. Therefore, c has to be in the range [0, 255].

memset works on bytes. If you use memset on an array of ints or doubles, in most cases it won't do what you want. memset works in the following situations (there may be more),

  • initialize char, int, or float arrays to all 0s.
  • initialize a char array to any arbitrary value.
  • initialize an int array to a value whose 4 bytes are all the same (e.g. 0x01010101 = 16843009).

But the last case is rarely encountered in practice.

memset and C Arrays

If the array is declared with a fixed size (e.g. a local array allocated on the stack), the memory is contiguous for both 1-D and multi-dimensional arrays.

If the array is dynamic, the memory is allocated on the heap. If you're allocating array memory using malloc, malloc can only guarantee that the memory allocated by each call is contiguous. So if the array is 1-D, the memory is contiguous; otherwise it's not guaranteed (since you're calling malloc multiple times).

memset works on a block of contiguous memory. For a static array, say int a[10][10], you can simply use memset(a, 0, sizeof(a[0][0])*10*10) to initialize all elements to 0. For a dynamic array, the example below gives the idea,

#include <stdio.h>

#include <string.h>

#include <stdlib.h>

int main() {

   int **a;

   int i, j;

   a = (int **)malloc(sizeof(int *)*10);

   for (i = 0; i < 10; ++i) {

       a[i] = (int*)malloc(sizeof(int)*10);

   }

   //memset to all 0s

   for (i = 0; i < 10; ++i) {

       memset(a[i], 0, sizeof(a[i][0])*10);

   }

   //print out the value

   for (i = 0; i < 10; ++i) {

       for (j = 0; j < 10; ++j) {

           printf("%d ", a[i][j]);

       }

       printf("\n");

   }

   return 0;

}

Essentially, one memset call is needed per malloc call.

Use memset instead of Loops

memset is usually optimized by compilers into more efficient binary code than an equivalent loop. Below is a test program comparing the performance of loop initialization and memset.

/*

a small test program comparing loop initialization and memset

*/

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <sys/time.h>

#define D1 50

#define D2 90

#define D3 90

unsigned char (*pData)[D1][D2][D3];

unsigned char data1[D1][D2][D3];

unsigned char data2[D1][D2][D3];

int main() {

   struct timeval stTime, edTime;

   int LOOP_COUNT = 100;

   int cnt, i, j, k;

   memset(data1, 0x01, D1*D2*D3);

   memset(data2, 0x00, D1*D2*D3);

   pData = &data1;

   //using memset 

   gettimeofday(&stTime, NULL);

   /*for (cnt = 0; cnt < LOOP_COUNT; ++cnt) {

       for (i = 0; i < D1; ++i) {

           for (j = 0; j < D2; ++j) {

       memset(&(*pData)[i][j][0], 1, sizeof(unsigned char)*D3);

           }

       }

   }*/

   for (cnt = 0; cnt < LOOP_COUNT; ++cnt) {

       memset(&(*pData)[0][0][0], 1, sizeof(unsigned char)*D1*D2*D3);

   }

   gettimeofday(&edTime, NULL);

   printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

   //using loop

   pData = &data2;

   gettimeofday(&stTime, NULL);

   for (cnt = 0; cnt < LOOP_COUNT; ++cnt) {

       for (i = 0; i < D1; ++i) {

           for (j = 0; j < D2; ++j) {

       for (k = 0; k < D3; ++k) {

           (*pData)[i][j][k] = 1;

       }

           }

       }

   }

   gettimeofday(&edTime, NULL);

   printf("%u:%u\n", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

   //(*pData)[_stFrame][_edH][_edW] = 88;    //uncomment see it fails the check below

   //to check two methods get same results

   for (i = 0; i < D1; ++i) {

       for (j = 0; j < D2; ++j) {

   for (k = 0; k < D3; ++k) {

               if (data1[i][j][k] != data2[i][j][k]) {

                    printf("it doesn't work\n");

                    return 1;

               }

           }

       }

   }

}

Compiling the code with “gcc -o test test.c” and running it with “./test” on a Lenovo laptop (dual core 2.0 GHz, 3GB memory, Ubuntu 10.04) gives the results below,

0:6470
0:121570

memset is almost 20 times as fast as the loop method. Note that the commented-out code contains a test that calls memset only in the innermost loop; it still beats the loop method. This means that even for a dynamically allocated array, where we can only use memset on the innermost dimension, it can still outperform the all-loops method.

References:

Arrays and Pointers: http://www.lysator.liu.se/c/c-faq/c-2.html

Using mmap for Random Access of Files

mmap allows a program to map a section of a file into the program’s memory space and access it through a pointer. This post illustrates using mmap for random access of files.

0. Description

The mmap system call is declared in sys/mman.h as below,

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);


The input parameters have the following meanings,

  • addr: the address we want to map to. Usually set to zero to let the OS decide the address for us.
  • len: the length of data we want to map.
  • prot: the access rights the process wants to have on the mapped memory. The values can be PROT_READ, PROT_WRITE and PROT_EXEC, which correspond to read, write and execute permissions respectively. Note that multiple permissions can be specified by ORing the values.
  • flags: set to MAP_SHARED to share the mapping with other processes. Set to MAP_PRIVATE to get your process a private copy of the mapped region, so your changes won’t be reflected in the original file. There are other flags; refer to the man page for details.
  • fildes: the file descriptor of the file you want to map. It can be obtained using the open system call as below. Note that the access mode should be the same as prot flags you set in mmap call.

    int fd = open("test", O_RDWR);

  • off: the offset in the file at which to start the mapping. Note that this must be a multiple of the virtual memory page size, which can be obtained by a call to getpagesize() or sysconf(_SC_PAGE_SIZE).

On success, mmap returns a pointer to the beginning of the mapped data region. Otherwise, it returns a value of MAP_FAILED and sets errno.

After mmapping the file, we get a pointer to the mapped data. We can use the pointer to jump to a specific byte and access the data. The example in section 1 below illustrates this by reading a specific byte you specify.

After we’re done with the mapping, we can unmap it with the munmap call,

int munmap(void *addr, size_t len);


  • addr: the address to unmap, which should be the address returned by the mmap call.
  • len: the length to unmap, which should be the same as the len parameter passed to the mmap call.

Note that a mapped file is unmapped automatically upon process termination.

1. Example

Below is an example illustrating mmap,

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main(int argc, char *argv[])
{
    int fd, offset;
    unsigned char *data;
    struct stat sbuf;
    FILE *f;
    int i;
    unsigned char writeBytes[10];

    if (argc != 2) {
        fprintf(stderr, "usage: mmapdemo offset\n");
        exit(1);
    }

    /*create the test file*/
    for (i = 0; i < 10; ++i) {
        writeBytes[i] = (unsigned char)i;
    }
    f = fopen("./test", "w");
    for (i = 0; i < 10; ++i) {
        fwrite(writeBytes, 1, 10, f);
    }
    fclose(f);

    if ((fd = open("./test", O_RDONLY)) == -1) {
        perror("open");
        exit(1);
    }
    if (stat("./test", &sbuf) == -1) {
        perror("stat");
        exit(1);
    }
    printf("file size: %ld\n", sbuf.st_size);
    offset = atoi(argv[1]);
    if (offset < 0 || offset > sbuf.st_size-1) {
        fprintf(stderr, "mmapdemo: offset must be in the range 0-%ld\n", (sbuf.st_size-1));
        exit(1);
    }

    data = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    printf("byte at offset %d is %d\n", offset, data[offset]);
    munmap(data, sbuf.st_size);

    return 0;
}

The code creates a test file first. It then opens the file to get a file descriptor, and stat is called to get the file size. mmap is then used to map the entire file into memory. After that, we can access a specific byte through the returned pointer and print its value. Note that we can declare the data pointer as a pointer to some other data type; if we declare it as int *, we can access the data as ints.

Copy, paste and save the code to mmaptest.c, then compile it using,

gcc -o mmaptest mmaptest.c

Below are some sample executions,

./mmaptest 99

file size: 100

byte at offset 99 is 9

./mmaptest 2

file size: 100

byte at offset 2 is 2

./mmaptest 50

file size: 100

byte at offset 50 is 0

mmap can perform better than fseek/fread in certain cases. mmap can also be used to share a file between different processes, which can reduce memory usage significantly as it avoids loading a separate copy of the file for each process.

Update 1: mmap can be faster when dealing with big files. Normal read/write goes through system calls, and frequent reads and writes cause lots of context switches between kernel space and user space. These context switches can be a significant overhead.

Update 2: mmap doesn’t load the entire file into memory. It adopts a lazy loading approach: pages are loaded into memory only when the data is needed. This enables another optimization for mmap-based file access, pre-loading the content before it is actually needed.