HTTP/HTTPS Load Testing with http_load–Part 2

This is the second part of HTTP/HTTPS load testing with http_load. Please read part 1 first.

2.3 Investigation of Wired Results

We add printf statements to print out the connect time and read response time. The original three lines of code are as below (they’re at different places of http_load.c file).

struct timeval now, nowt, nowtt;

handle_connect( cnum, &now, 1 );

handle_read( cnum, &now );

The modified code is as below.

struct timeval now, nowt, nowtt;

(void) gettimeofday( &nowt, (struct timezone*) 0 );

handle_connect( cnum, &now, 1 );

(void) gettimeofday( &nowtt, (struct timezone*) 0 );

long connect_t = delta_timeval(&nowt, &nowtt);

printf("handle_connect time: %ldn", connect_t);

(void) gettimeofday( &nowt, (struct timezone*) 0 );

handle_read( cnum, &now );

(void) gettimeofday( &nowtt, (struct timezone*) 0 );

long read_t = delta_timeval(&nowt, &nowtt);

printf("handle_read time: %ldn", read_t);

We run the test using surls.txt and surls2.txt again. And below are the results.

First it’s the surls.txt.


Then the surls2.txt
The connect time for github is > 30 times of the connect time to chrome.google.com. When we’re having multiple parallel connections, the connect takes so much time that reading the response is delayed. This has something to do with http_load’s design, which we’ll discuss next.

3. The Drawbacks of http_load

http_load has the following drawbacks.

3.1. Single thread multiple connections

http_load uses Linux select call to handle multiple connections with a single thread. This design is good when both connect and read response time are short, which is the case for most of HTTP request,  but not true for most of HTTPS case. As shown above, it takes almost half a second to connect to github.

When the connect time is long and there’re parallel connections, it’s likely the events are going to be queued up. In other words, when the response is received by http_load host, http_load cannot read it immediately because it’s busy with establishing other connections or reading other response. Therefore,  the response time obtained is not accurate as it includes the time that a response waited before http_load reads it.

A better design could a thread-pool approach. This could reduce the wait time as multiple responses can be read by multiple threads concurrently.

For current http_load, if we’re doing test with https, it’s better to set -parallel 1 and run multiple instances of http_load. Note that setting rate may not work as it can still introduce parallel connections.

3.2. Read all URLs into memory

http_load loads every URL into memory before it starts to send out HTTP request. This makes it easy to pick up a URL randomly for each request, but a big URL file may cause large amount of memory allocation and a long initialization time. I encountered cases where 10 seconds test takes more than 10 minutes to finish when the input file contains about half a million URLs.

Apache JMeter is a better alternative in terms of HTTPS testing, in my personal point of view.

References:
1. http_load website: http://www.acme.com/software/http_load/

HTTP/HTTPS Load Testing with http_load–Part 1

http_load is simple open source tool sends multiple HTTP requests in parallel. It runs a single thread picking up URLs randomly from an input file. Both HTTP and HTTPS are supported.

0. Build http_load

Build http_load is quite simple.

tar xvf http_load-12mar2006.tar.gz

  • Start terminal, go to the extracted directory, and type make to build it.
  • Start http_load by ./http_load command.

1. Use http_load for HTTP testing

1.1 Controlling Request Rate

http_load supports a few options. We can control the request rate by either -parallel or -rate. parallel indicates the number of concurrent connections. For example, if we specify -parallel 5, then http_load will try to establish and maintain (create new connection once a connection is closed) 5 connections, and send out one request through each connection.

Rate controls the request rate in a different manner. It refers to number of requests sent out per second. If we specify -rate 5, then http_load will make 5 requests per second. It will determine how many connections to open internally. We can specify either parallel or rate, but not both.

1.2 Controlling Test Time

Another important parameter is seconds, it allows us to specify how long the test should run. Or we can specify fetches, which indicates the number of requests to send for the entire test.

1.3 Specify Test URLs

The last parameter is always the url_file, a file which contains a list of URLs for test. http_load will read all URLs at the beginning and randomly pick up one for each request. There’re a few other parameters, one can refer to http_load help message for more info.

1.4 An Example

As an example, suppose we have a URL file named urls.txt contains the following URLs.

http://en.wikipedia.org/wiki/Hello_world_program
http://en.wikipedia.org/wiki/Computer_program
http://en.wikipedia.org/wiki/Computer_multitasking
http://en.wikipedia.org/wiki/Computer_process
http://en.wikipedia.org/wiki/Context_switch
http://en.wikipedia.org/wiki/Central_processing_unit
http://en.wikipedia.org/wiki/Integrated_circuit
http://en.wikipedia.org/wiki/Microchip_(disambiguation)
http://en.wikipedia.org/wiki/Microchip_implant_(animal)
http://en.wikipedia.org/wiki/Penguins

Then if we want to simulate 5 users sending requests for 10 seconds, we can specify the command as below,

./http_load -parallel 5 -seconds 10 urls.txt

http_load will output the results once the test is completed.

From the results, we can read the min, max and average response time, and throughput (No. of fetches per second) etc.

2. Use http_load for HTTPS Testing

2.1 Compile http_load with HTTPS support

By default, HTTPS support is disabled. We can enable the HTTPS by uncomment the lines below in the Makefile.

#SSL_TREE =     /usr/local/ssl
#SSL_DEFS =     -DUSE_SSL
#SSL_INC =      -I$(SSL_TREE)/include
#SSL_LIBS =     -L$(SSL_TREE)/lib -lssl –lcrypto

Once enabled the above lines, simply type the following commands in terminal to build it.

make clean
make

2.2 An Example

HTTPS test works the same as HTTP test. Suppose we have the URLs below in a file named surls.txt.

https://github.com/

And another file named surls2.txt.

https://chrome.google.com/webstore/category/home
https://chrome.google.com/webstore/category/popular
https://chrome.google.com/webstore/category/from_your_circles
https://chrome.google.com/webstore/category/trending
https://chrome.google.com/webstore/category/collections
https://chrome.google.com/webstore/category/collection/perfect_date
https://chrome.google.com/webstore/category/collection/football
https://chrome.google.com/webstore/detail/gocaejggjgdmkhmbinicknpbhagkblop
https://chrome.google.com/webstore/category/collection/editors_picks
https://chrome.google.com/webstore/category/app/3-games

We first test with surls.txt, and below is the result.

Then surls2.txt, and below is the result.

The result for surls.txt seems wired. As we’re increasing the number of parallel connections, the response time increases too. For website like github.com, it’s unlikely to be the server side issue. Could it be the client side issue? NO. Because test with surls2.txt shows the client can take care of 30 parallel connections just fine. So WHY? We investigate this issue in next post.

Shall We Use the Input Parameters as Loop Variable?

Several days ago, I’ve got a discussion with another developer. We talked about whether we should use the input parameters as loop variables in a C program.

To illustrate the problem, suppose we have two functions as below.

void testlocalb(int *a, int lcnt) {

    int i;

    for (i = 0; i < lcnt; ++i) {

    }

    *a = i;

}

 

void testlocal(int *a, int n) {

    for (*a = 0; *a < n; ++(*a)) {

    }

}

The first function use a local variable as the loop counter, while the second one use the input parameter as the loop counter. The questions is: which one is more efficient?

In terms of space complexity, the second function doesn’t use local variable, so it saves some memory space at the stack. But how about execution time?

1. The local variable will be put into registers, so the first function is more efficient???

I wrote three different programs to verify this is true or not. The code is as below,

#include <stdio.h>

#include <stdlib.h>

#include <sys/time.h>

 

void testlocala(int *a, int n) {

    for (*a = 0; *a < n; ++(*a)) {

         //dummy: do nothing

    }

}

 

int main(int argc, char **argv) {

    int n = atoi(argv[1]);

    int rv;

    struct timeval stTime, edTime;

    gettimeofday(&stTime, NULL);

    testlocala(&rv, n);

    gettimeofday(&edTime, NULL);

    printf("%dn", rv);    

    printf("%u:%un", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

    return 0;

}

The program above use the input parameter as the loop counter.

#include <stdio.h>

#include <stdlib.h>

 

void testlocalb(int *a, int lcnt) {

    int i;

    for (i = 0; i < lcnt; ++i) {

    }

    *a = i;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n);

   gettimeofday(&edTime, NULL);

   printf("%dn", rv);

   printf("%u:%un", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

The second program listed above use a local variable as the loop counter.

#include <stdio.h>

#include <stdlib.h>

 

void testlocalb(int *a, int lcnt) {

    register int i;

    for (i = 0; i < lcnt; ++i) {

    }

    *a = i;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n);

   gettimeofday(&edTime, NULL);

   printf("%dn", rv);

   printf("%u:%un", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

The third program also uses the local variable, but declares it as register.

To check whether the loop counter is put into register, we compile the code into assembly code using the command below,

gcc -S local1.c

The loop essentially contains a read and a write access to the loop counter. The read happens at the comparison and the write happens at the loop counter increment. We show the assembly code for the three functions above as following.

Firstly, the loop counter increment, which is the write access.

First function: using input parameter as loop counter

movl    8(%ebp), %eax                           #1

movl    (%eax), %eax                             #2

leal    1(%eax), %edx                           #3

movl    8(%ebp), %eax                           #4

movl    %edx, (%eax)                             #5

Second function: using local variable

addl $1, -4(%ebp)

Third function: using local variable with register declaration

addl $1, %ebx

From the assembly code, we can tell that the local variable without register declaration is not put into register, while the third function with register declaration is put into register.

The first function is a bit complicated, but the ++(*a) are actually two operations, including * (the first two lines) and ++(line 3 to 5). The increment is actually performed at register by the third line and then write back to memory at 5th line.

Secondly, the comparison, which is the read access.

First function: using input parameter as loop counter

movl 8(%ebp), %eax
movl (%eax), %eax
cmpl 12(%ebp), %eax

Second function: using local variable

movl -4(%ebp), %eax
cmpl 12(%ebp), %eax

Third function: using local variable with register declaration

cmpl 12(%ebp), %ebx

Again, the first function requires two operations because we’re accessing it using a pointer, and the value is loaded from memory to register eax. The second function also loads the value from stack memory to register eax. The third function just do the comparison as the loop counter is already in register.

Which function is fast? It’s not difficult to tell that function 3 will be the fastest as it’s a single instruction performed at register for both read and write. How about the first one and second one?

Compile the code using the command below,

gcc -o local1 local1.c
gcc -o local2 local2.c
gcc -o local3 local3.c

Below is a sample execution result,

Figure 1. Running 3 testing programs with default compiler options

The third function runs fastest, as expected, follows by the first function and the second function goes last.

One can add code blocks inside the loop as below to further verify the performance difference is due to read or write.

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

d = *a;

 

++(*a);

--(*a);

++(*a);

--(*a);

++(*a);

--(*a);

++(*a);

--(*a);

++(*a);

--(*a);

On my machine, the result is that input parameter (function 1) performs better for both read and write.

To conclude the first doubt, the local variable is not put into register under the default compiler options. Declare the local variable as “register” gives hint to the compiler that the variable should be put into register and it helps sometimes. Without “register”, input parameter performs better than local variable for both read and write accesses.

2. What if the compiler optimization options are used?

With the optimization options enabled, the loop in the code above will be taken away. I changed the code to below to avoid removal of loop.

First function: using input parameter as loop counter

#include <stdio.h>

#include <stdlib.h>

#include <limits.h>

 

void testlocala(int *a, int n, int *b) {

    unsigned int x = 1, y = 1;

    for (*a = 0; *a < n; ++(*a)) {

        x += y;

        y = x - y;

        if (y > INT_MAX) {

            x = 1;

            y = 1;

        }

    }

    *b = y;

}

 

int main(int argc, char **argv) {

    int n = atoi(argv[1]);

    int rv, rv1;

    struct timeval stTime, edTime;

    gettimeofday(&stTime, NULL);

    testlocala(&rv, n, &rv1);

    gettimeofday(&edTime, NULL);

    printf("%d:%dn", rv, rv1);    

    printf("%u:%un", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

    return 0;

}

Second function: using local variable

#include <stdio.h>

#include <stdlib.h>

#include <limits.h>

 

void testlocalb(int *a, int lcnt, int *b) {

    int i;

    unsigned int x = 1, y = 1;

    for (i = 0; i < lcnt; ++i) {

        x += y;

        y = x - y;

        if (y > INT_MAX) {

            x = 1; 

            y = 1;

        }

    }

    *a = i;

    *b = y;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv, rv1;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n, &rv1);

   gettimeofday(&edTime, NULL);

   printf("%d:%dn", rv, rv1);

   printf("%u:%un", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

Third function: using local variable with register declaration

#include <stdio.h>

#include <stdlib.h>

#include <limits.h>

 

void testlocalb(int *a, int lcnt, int *b) {

    register int i;

    unsigned int x = 1, y = 1;

    for (i = 0; i < lcnt; ++i) {

        x += y;

        y = x - y;

        if (y > INT_MAX) {

            x = 1;

            y = 1;

        }

    }

    *a = i;

    *b = y;

}

 

int main(int argc, char **argv) {

   int n = atoi(argv[1]);

   int rv, rv1;

   struct timeval stTime, edTime;

   gettimeofday(&stTime, NULL);

   testlocalb(&rv, n, &rv1);

   gettimeofday(&edTime, NULL);

   printf("%d:%dn", rv, rv1);

   printf("%u:%un", (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

We compile the 3 functions into assembly using the commands below,

gcc -O1 -S local11.c
gcc -O1 -S local22.c
gcc -O1 -S local33.c

The optimized assembly code for increment of loop counter and loop condition test for all three versions are the same,

addl $1, %ecx
cmpl %ebx, %ecx

The code for loading the loop counter values to register ecx are not listed for simplicity. The execution time for the 3 programs are as below,

function 1: input parameter,  3:172661
function 2: local variable, 4:4294877639
function 3: local variable with register declaration, 4:4294786759

We can also compile the code using O2 and O3 optimization options. The execution time are as below,

For O2,

function 1: input parameter,  3:669508
function 2: local variable, 3:572726
function 3: local variable with register declaration, 3:641155

For O3,

function 1: input parameter,  4:4294628948
function 2: local variable, 3:612397
function 3: local variable with register declaration, 3:688610

For O3 + funroll-loops, (gcc -O3 -funroll-loops local11.c -o local11)

function 1: input parameter,  4:4294060070
function 2: local variable, 3:69106
function 3: local variable with register declaration, 3:62729

In conclusion, the proper compiler flags can optimize the code, and the usage of input parameter and local variable can achieve similar run time.

Note that the experiment are carried once only. For more accurate results, one should execute the program many times and take the average.

References:
1. http://www.network-theory.co.uk/docs/gccintro/gccintro_50.html
2. http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Simple Profiling of MD5

MD5 (Message-Digest) is an algorithm commonly used as cryptographic hash function. It takes an arbitrary length of data and produces a 128-bit (16 bytes) hash value. MD5 is not recommended for secure applications any more because attack messages can be constructed to produce collision. However, without intentional attack, MD5 is extremely unlikely to produce a collision and the computation is fast (later part will provide test code). Therefore it is still useful as a hash function in situations where security is not a concern and collision is not desired.

Reference 2 provides links to lots of implementations of MD5 in various programming languages. This post takes the code from L. Peter Deutsch implementation added a test function to test how fast MD5 computes.

The test uses a a list of url strings stored in a file. It will read the string from file and compute the md5 for each URL. The test function is as below,

static int do_n_test(void) {

    FILE *testFile;

    int UNIQUE_URLS = 1000;

    unsigned char url[UNIQUE_URLS][4096];

    int numOfUrlsTested = 0;

    int i = 0, j, READ_FILE_TIMES = 1;   

    struct timeval stTime, edTime;

 

    testFile = fopen("./urllist.txt", "r");

    if (testFile == NULL) {

        printf("cannot open test filen");

        return;

    }

    for (; i < UNIQUE_URLS; ++i) {

        fgets(url[i], 4096, testFile);

    }

    fclose(testFile);

    gettimeofday(&stTime, NULL);

    for (i = 0; i < READ_FILE_TIMES; ++i) {

        for (j = 0; j < UNIQUE_URLS; ++j) {

            md5_state_t state;

            md5_byte_t digest[16];

    

            char hex_output[16*2 + 1];

            int di;

            md5_init(&state);

            md5_append(&state, (const md5_byte_t *)url[j], strlen(url[j]));

            md5_finish(&state, digest);

            /*for (di = 0; di < 16; ++di)

                sprintf(hex_output + di * 2, "%02x", digest[di]);

            printf("%d: %sn", numOfUrlsTested, hex_output);*/

            ++numOfUrlsTested;

        }

    }

    gettimeofday(&edTime, NULL);

    printf("%d: %u:%un", numOfUrlsTested, (unsigned int)(edTime.tv_sec - stTime.tv_sec), (unsigned int)(edTime.tv_usec - stTime.tv_usec));

}

The test file contains more than 3000 urls, but loading all URLs from the file may exceed the stack memory you can use. You can set the “UNIQUE_URLS” variable to adjust the number of unique URLs to load.  You can also adjust the “READ_FILE_TIMES” variable to set number of MD5 computations to run. The total number of MD5 computations will be “UNIQUE_URLS”x“READ_FILE_TIMES”.

To compile the code, use the command,

gcc -o test md5.c md5main.c –lm

To run test,

./test –test

On my Ubuntu machine with Intel i5 Quad Core, 1197 MHz, 4GB Memory, one MD5 computation takes about 4 macro seconds.

You can download the entire code from here. Note that the main MD5 computation code is from reference 2 by L. Peter Deutsch.

References:
1. MD5 Wikipedia Page: http://en.wikipedia.org/wiki/Md5
2. MD5 Homepage (unofficial): http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
3. Hash functions: An empirical comparison: http://www.strchr.com/hash_functions

Using mmap for Random Access of Files

mmap allows a program to map a section of the file into the program’s memory space, and access it with a pointer. This post illustrates mmap for random access of files.

0. Description

The mmap system call is defined in sys/mman.h as below,

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);

 

The input parameters have the following meanings,

  • addr: the address we want to map to. Usually set to zero to let the OS decide the address for us.
  • len: the length of data we want to map.
  • prot: the access rights the process want to have to the mapped memory. The values can be PROT_READ, PROT_WRITE and PROT_EXEC, which corresponds to read, write and execute permissions respectively. Note that one can specify multiple permissions by ORing the values.
  • flags: Set to MAP_SHARED if you want to share the file with other processes. Set to MAP_PRIVATE will get your process a copy of the mapped region, and your changes won’t be reflected in the original file. There’re other flags, you can refer to man page.
  • fildes: the file descriptor of the file you want to map. It can be obtained using the open system call as below. Note that the access mode should be the same as prot flags you set in mmap call.

    int fd = open(“test”, O_RDWR);

  • off: the offset of the file to start mmap. Note that this must be a multiple of virtual memory page size, which can be get by a call to getpagesize() or sysconf(_SC_PAGE_SIZE)

On success, mmap returns a pointer to the beginning of the mapped data region. Otherwise, it returns a value of MAP_FAILED and sets errno.

After mmap the file, we got a pointer to the mapped data. We can use the pointer to jump to a specific byte to access the data. The example given in part 1 illustrates this by reading a specific byte you specify.

After we’re done with mmap, we can unmap by mummap call,

int munmap(void *addr, size_t len);

 

  • addr: the address of unmap, which should be the address returned by mmap call.
  • len: the length to unmap, which should be the same as the len parameter passed in mmap call.

Note that a mapped file is unmapped automatically upon process termination.

1. Example

Below is an example illustrating mmap,

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main(int argc, char *argv[])
{
    int fd, offset;
    unsigned char *data;
    struct stat sbuf;
    FILE *f;
    int i;
    unsigned char writeBytes[10];

    if (argc != 2) {
        fprintf(stderr, "usage: mmapdemo offsetn");
        exit(1);
    }

    /*create the test file*/
    for (i = 0; i < 10; ++i) {
    writeBytes[i] = (unsigned char)i;
    }
    f = fopen("./test", "w");
    for (i = 0; i < 10; ++i) {
        fwrite(writeBytes, 1, 10, f);
    }
    fclose(f);

    if ((fd = open("./test", O_RDONLY)) == -1) {
        perror("open");
        exit(1);
    }
    if (stat("./test", &sbuf) == -1) {
        perror("stat");
        exit(1);
    }
    printf("file size: %ldn", sbuf.st_size);
    offset = atoi(argv[1]);
    if (offset < 0 || offset > sbuf.st_size-1) {
        fprintf(stderr, "mmapdemo: offset must be in the range 0-%ldn", (sbuf.st_size-1));
        exit(1);
    }

    data = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0)    ;
    if (data == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    printf("byte at offset %d is %dn", offset, data[offset]);
    munmap(data, sbuf.st_size);

    return 0;
}

The code creates a test file first. It opens the file to get a file descriptor. stat is called to get the file size. mmap is then applied to map the entire file into memory. After that, we can access a specific byte using the pointer returned by printing its value. Note that we can define the data pointer as pointer to some other data types. If we define as int, then we can access the data as ints.

Copy, paste and save the file to mmaptest.c, and then compile the code using,

gcc -o mmaptest mmaptest.c

Below are some sample execution,

./mmaptest 99

file size: 100

byte at offset 99 is 9

./mmaptest 2

file size: 100

byte at offset 2 is 2

./mmaptest 50

file size: 100

byte at offset 50 is 0

mmap can perform better than fseek, fread in certain cases. mmap can also be used to share file between different processes, which can reduce memory access significantly as it avoids loading the same file for each process.

Update 1: mmap can perform faster when dealing with big files. The normal read/write uses system call and frequent reads and writes cause lots of context switch between kernel space and user space. These context switches can be a significant overhead.

Update 2: mmap doesn’t load the entire file into memory. It adopts lazy access approach — when the data is needed, it is loaded into memory. This leads to another optimization for mmap based file accessing — pre-loading the content before the content is actually needed.