HTTP Session Management

HTTP is a stateless protocol, which means the server doesn’t keep any state for a client between requests. This makes HTTP more scalable. However, there are cases where keeping client state is desirable. For example, in an online shop, we add a few items to the shopping cart and then click “check out” to go to the payment page. Without stored client state, all the added items would disappear when we browse to the payment page.

Three solutions are commonly used to solve the problem: cookies, request parameters and session management.

Cookies

Cookie is a header field in the HTTP request that is sent to the HTTP server. Below is an example of an HTTP request header with a cookie,

GET / HTTP/1.1
Host: www.google.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie:PREF=ID=8e70839135a05b7d:TM=1322186390:LM=1322186390:S=x7G8VLPvu7T36ky3

Upon receiving the request, the HTTP server can read the Cookie header to get the state information it wants. Cookies can be modified by both the client and the server. They are usually stored as text files on the client’s machine. Some browsers (e.g. Chrome) store them in an SQLite database instead.

Because cookies are stored on the client side, the user can view them, modify them and send the modified values to the server. Therefore, they pose a security concern. In addition, cookies are sent along with every request, so if lots of information is stored in them, they consume bandwidth and affect server performance.

Request Parameters

There’re two methods generally used to pass information in request parameters: adding hidden form fields or rewriting the URL. If hidden fields are used, the URL shown in the browser does not change, but users can still see the hidden fields by viewing the HTML source of the page. If this approach is used to build a web application, the same hidden fields have to be repeated across multiple pages, which makes them difficult to maintain.

URL rewriting appends the additional information to the URL, and the HTTP server retrieves the state information from it. If there’s lots of information, this approach can consume lots of bandwidth and affect server performance.
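As a rough illustration of URL rewriting, the sketch below appends a session identifier to a URL as a query parameter. The function name append_session_id and the parameter name sid are made up for this example, and it assumes the identifier contains only URL-safe characters (no escaping is done).

```c
#include <stdio.h>
#include <string.h>

/* Write url plus a "sid" query parameter into out.
 * Uses '?' if the URL has no query string yet, '&' otherwise.
 * Returns 0 on success, -1 if out is too small. */
int append_session_id(const char *url, const char *session_id,
                      char *out, size_t out_len)
{
    char sep = strchr(url, '?') ? '&' : '?';
    int n = snprintf(out, out_len, "%s%csid=%s", url, sep, session_id);

    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

With this sketch, "/checkout" becomes "/checkout?sid=abc123", while "/cart?item=1" becomes "/cart?item=1&sid=abc123".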

HTTP Session Management

For HTTP session management, the server stores the state information for each client. Each session is associated with a unique session ID. The server can store the information in whatever way it wants: in memory, in files, or even in a database.

The first time a client accesses the server, the server creates a new session and allocates a session ID for the client. For every subsequent request, the client sends the session ID along with the request. The server uses the session ID to locate all state information related to the session.

Note that the state information is stored on the server side. The client only needs to remember the session ID. It uses one of the two methods mentioned above (cookies or request parameters) to store and send the session ID.
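To make the cookie-based variant concrete, here is a minimal sketch of how a server might pull a session ID out of the value of a Cookie header. The function name find_cookie_value and the cookie name SESSIONID used in the usage note are assumptions for illustration; real servers and frameworks pick their own names and parse more strictly.

```c
#include <string.h>

/* Copy the value of cookie `name` from a Cookie header value such as
 * "theme=dark; SESSIONID=abc123" into out.
 * Returns 1 if the cookie was found and fits into out, 0 otherwise. */
int find_cookie_value(const char *cookies, const char *name,
                      char *out, size_t out_len)
{
    size_t name_len = strlen(name);
    const char *p = cookies;

    while (*p != '\0') {
        while (*p == ' ' || *p == ';')      /* skip separators */
            p++;
        const char *end = strchr(p, ';');   /* end of this name=value pair */
        if (end == NULL)
            end = p + strlen(p);
        if ((size_t)(end - p) > name_len &&
            strncmp(p, name, name_len) == 0 && p[name_len] == '=') {
            size_t value_len = (size_t)(end - p) - name_len - 1;
            if (value_len >= out_len)
                return 0;                   /* value does not fit */
            memcpy(out, p + name_len + 1, value_len);
            out[value_len] = '\0';
            return 1;
        }
        p = end;
    }
    return 0;
}
```

A server would call find_cookie_value(header_value, "SESSIONID", sid, sizeof sid) and then use sid as the key to look up the stored session state.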

TCP TIME_WAIT State and Address Already in Use Error

0. The Problem
Recently I have been working on a project that involves TCP socket programming on Linux. I frequently encountered errno 98 (address already in use) and 99 (cannot assign requested address). I wrote a small test program to reproduce the issue. The test code is as below,

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/time.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netdb.h>
#include <arpa/inet.h>
#include <pthread.h>
#include <signal.h>

unsigned short PROXY_SERVER_PORT = 0x5B77;
#define PROXY_CLIENT_ST_PORT 0x6B77        //first local port to try (27511)
unsigned int bindError = 0;

int main(void) {
    int socketFd;
    int one = 1;
    int mapIdx = 0;
    int i, j;
    struct sockaddr_in localAddr, serv_addr;
    struct addrinfo hints, *res;
    int err;
    //char *hostname = "www.google.com";
    char *hostname = "74.125.235.18";

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_family = AF_INET;

    if ((err = getaddrinfo(hostname, NULL, &hints, &res)) != 0) {
        printf("error %d\n", err);
        return 1;
    }

    bzero((char *) &serv_addr, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = ((struct sockaddr_in *)(res->ai_addr))->sin_addr.s_addr;
    serv_addr.sin_port = htons(80);

    printf("ip address : %s\n", inet_ntoa(serv_addr.sin_addr));
    freeaddrinfo(res);

    printf("before loop...\n");
    for (i = 0; i < 3; ++i) {
        socketFd = socket(AF_INET, SOCK_STREAM, 0);
        if (socketFd < 0)  {
            printf("ERROR opening proxy tcp client socket %d: %d\n", mapIdx, errno);
            exit(1);
        }
        //if the line below is enabled, error occurs at connect
        setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
        memset(&localAddr, 0, sizeof(struct sockaddr_in));
        localAddr.sin_family = AF_INET;
        localAddr.sin_addr.s_addr = INADDR_ANY;
        localAddr.sin_port = htons((PROXY_CLIENT_ST_PORT + mapIdx + bindError)%65535);
        printf("binding...\n");
        //keep trying the next port until bind succeeds
        for (j = 1; ;++j) {
            if (bind(socketFd, (struct sockaddr *)&localAddr, sizeof(struct sockaddr_in)) < 0) {
                err = errno;    //save errno, printf may overwrite it
                printf("Error %d, tcp proxy client binding to %d\n", err, ntohs(localAddr.sin_port));
                printf("error binding tcp proxy client: %s\n", strerror(err));
                ++bindError;
                localAddr.sin_port = htons((PROXY_CLIENT_ST_PORT + mapIdx + bindError)%65535);
            } else {
                break;
            }
        }
        printf("bound socket %d to tcp client port %u\n", socketFd, ntohs(localAddr.sin_port));
        printf("connecting....\n");
        if (connect(socketFd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0) {
            err = errno;    //save errno, printf may overwrite it
            printf("error code: %d\n", err);
            printf("error connecting to ori server from proxy tcp client: %s\n", strerror(err));
            if (err == EADDRNOTAVAIL) {
                printf("The specified address %d is not available from the local machine.\n", ntohs(localAddr.sin_port));
                exit(1);
            }
        } else {
            printf("connected: %d\n", mapIdx);
        }
        close(socketFd);
        mapIdx++;
    }
    return 0;
}

Save the code to test.c and compile it using “gcc -o test test.c”. Running the program multiple times will likely give you one of the two errors. With the line “setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);” enabled, the error occurs at connect (errno 99); with the line disabled, the error occurs at bind (errno 98).

When the error occurs, run the command “netstat | grep 27511” (from the program output, I know the error occurs at port 27511). Below is the screenshot,

Figure 1. Cannot Assign Request Address

As shown in Figure 1, the TCP port is in TIME_WAIT state.

1. The TCP State Transition at Closure

To close an established TCP connection, both endpoints send FIN packets to indicate there’s no more data. Upon receiving the other party’s FIN packet, each endpoint needs to ACK it.

The FIN packets are sent when a program calls exit(), close() or shutdown(). The ACKs are handled by the kernel after close() returns. Therefore, it is possible that the program finishes before the kernel releases the associated network resources, and another process won’t be able to use them until the kernel has freed them.

Below is a figure of the detailed state transitions for an endpoint when a TCP connection closes. It follows different paths depending on which side initiated the closure.

Figure 2. TCP State Transition at Closure (diagram from reference 4)

Note that TIME_WAIT only occurs at the endpoint which initiated the closure.

2. Why TIME_WAIT

After the TCP connection is closed, there might still be live packets in the network. If a new connection is established with the exact same (client IP, client port, server IP, server port) tuple, packets from the previous connection could be mistaken for packets of the new connection.

To avoid this, the TIME_WAIT duration is generally set to twice the maximum segment lifetime (MSL), i.e. twice the maximum age of a packet in the network. This is long enough that all packets of the old connection will be dead when the time expires. Note that holding TIME_WAIT at one endpoint is enough to make sure no two connections with exactly the same (client IP, client port, server IP, server port) tuple overlap.

3. How to Avoid the Problem

TIME_WAIT only occurs at the side which initiates the TCP connection closure, so a natural solution is to avoid being the side that calls close() first. If you have control over both client and server, you may want to let the client close first, so the server won’t end up with lots of ports in TIME_WAIT.

As indicated in the test program, setsockopt() with SO_REUSEADDR allows you to bind a socket to a port which is in TIME_WAIT. If you use the socket on the client side and try connecting to the same (server address, server port) tuple, you’ll fail at the connect stage. However, connecting to another (server address, server port) is allowed.

If you use the socket as a server socket, you can also use SO_REUSEADDR. I’ve not tested what happens if the same (client address, client port) tries to connect, but I guess the connection request will be denied.
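The SO_REUSEADDR step can be isolated into a small helper, sketched below. The name bind_reusable is made up for this example; the point it tries to show is the order of calls (the option is set after socket() and before bind()) and that every call is checked.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Create a TCP socket with SO_REUSEADDR set and bind it to the given
 * local port on all interfaces. Port 0 lets the kernel pick a free port.
 * Returns the socket descriptor, or -1 on failure. */
int bind_reusable(unsigned short port)
{
    struct sockaddr_in addr;
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    /* set the option before bind(), otherwise it does not affect the bind */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one) < 0) {
        close(fd);
        return -1;
    }
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A client would then call connect() on the returned descriptor, and a server would call listen() and accept().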

It’s also possible to modify the TIME_WAIT values on some operating systems.

References:

1. Setting TIME_WAIT TCP, stackoverflow: http://stackoverflow.com/questions/337115/setting-time-wait-tcp

2. TIME_WAIT and its design implications for protocols and scalable client server systems: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html

3. The TIME_WAIT state in TCP and Its Effect on Busy Servers: http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/

4. Bind: Address Already in Use or How to Avoid this Error when Closing TCPConnections

How to Associate Process with Port Number using lsof — in Linux and Android

Sometimes it’s good to know port numbers used by certain processes. For developers, we can use this information to do application based packet filtering, network statistics detailed for each app, etc. For admins, we can use it to diagnose the network, detect malware etc.

Unfortunately, there’s no single API in Linux that gives this mapping directly. But Linux does provide a tool called lsof (list open files), which lists information about the files opened by processes. And in Linux’s philosophy, everything is a file. Network sockets are no exception to this rule.

So yes, we can use lsof to associate a process with its port numbers.

For example, suppose I want to know which ports my Chrome browser is using. First I use ps -ax to get the process ID (let’s say it’s 6903). Then I use the following command to get the UDP ports opened by it,

lsof -p 6903 | grep UDP

Below is a screenshot showing the results,

The command gets the files opened by process 6903 and then filters the results with “UDP”, so only the lines that contain “UDP” are displayed.

For TCP traffic,

lsof -p 6903 | grep TCP

And the screenshot,
As shown in the figures, both the source and destination port numbers are displayed.
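For readers who want to do something similar from their own program, the sketch below parses one line of /proc/net/tcp, the Linux text file that lists TCP sockets together with their ports, state and socket inode. This is only an illustration of the kind of data tools like lsof work from, not a description of how lsof itself is implemented; the function name parse_proc_net_tcp_line is made up. The inode can be matched against the socket:[inode] links under /proc/<pid>/fd/ to find the owning process.

```c
#include <stdio.h>

/* Parse one data line of /proc/net/tcp. Ports and state are written in
 * hexadecimal in that file. Returns 1 on success, 0 if the line does not
 * look like a socket entry (for example the header line). */
int parse_proc_net_tcp_line(const char *line, unsigned int *local_port,
                            unsigned int *remote_port, unsigned int *state,
                            unsigned long *inode)
{
    /* fields: sl local:port remote:port st tx:rx tr:when retrnsmt uid timeout inode */
    int matched = sscanf(line,
        " %*d: %*[0-9A-Fa-f]:%x %*[0-9A-Fa-f]:%x %x"
        " %*x:%*x %*x:%*x %*x %*u %*u %lu",
        local_port, remote_port, state, inode);

    return matched == 4;
}
```

In the kernel’s numbering, state 0x0A is LISTEN and 0x06 is TIME_WAIT, so the same parser can also be used to spot ports stuck in TIME_WAIT.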

Lsof on Android

My phone is rooted and has BusyBox installed, so I don’t know whether lsof comes with Android by default or was installed by BusyBox. Either way, the one on my phone doesn’t work properly. Below is a screenshot,

Lots of things cannot be displayed properly and are replaced by question marks. Fortunately I found a working binary from reference 2. You can also download the binary here.
Use the adb command to put the binary on your phone and give it executable permission. A sample execution gives the screenshot below,

Note that the “2>/dev/null” redirection is to discard some error messages printed by lsof.

References:

1. Inside Geinimi Android Trojan. Chapter Two: How to check remotely the presence of the trojan. http://www.mobile-malware.com/2011/03/inside-geinimi-android-trojan-chapter_10.html
2. Linux man page for lsof: http://linux.die.net/man/8/lsof

Using telnet to Help Debug TCP-based Applications

Telnet can refer to either the Telnet network protocol used to provide a bidirectional interactive text-oriented communication facility or the telnet program that implements the client side of the Telnet protocol. This post focuses on the telnet program.

Telnet doesn’t encrypt data sent (including password) over the connection, so it’s not secure. SSH has been widely used to replace telnet for remote login/control.

Telnet data is transmitted over TCP. Apart from its handling of some special characters, telnet effectively establishes a raw TCP connection. Because of this, telnet can be used to debug TCP-based applications, such as HTTP servers, FTP servers, etc. You can also use it to test TCP servers you have developed yourself.

0. An Example of Using Telnet to Debug HTTP

We can issue an HTTP request to an HTTP server and examine the response. Below is an example of using telnet to connect to a Google web server.

telnet www.google.com 80
Trying 74.125.235.51...
Connected to www.l.google.com.
Escape character is '^]'.
GET /index.html HTTP/1.1

HTTP/1.1 302 Found
Location: http://www.google.com.sg/index.html
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=0cf4ad941da7097d:FF=0:TM=1323450677:LM=1323450677:S=4CGr4KBGwGHxlgql; expires=Sun, 08-Dec-2013 17:11:17 GMT; path=/; domain=.google.com
Date: Fri, 09 Dec 2011 17:11:17 GMT
Server: gws
Content-Length: 232
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.com.sg/index.html">here</A>.
</BODY></HTML>
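The request typed into telnet above is the bare minimum. HTTP/1.1 also expects a Host header, and the request must end with an empty line. The sketch below builds such a request as a string that could be typed into telnet or written to a socket; the function name build_http_get is made up for this example, and the Connection: close header is an assumption added so the server closes the connection after one response.

```c
#include <stdio.h>

/* Format a minimal HTTP/1.1 GET request for path on host into out.
 * Returns the request length, or -1 if out is too small. */
int build_http_get(const char *host, const char *path,
                   char *out, size_t out_len)
{
    int n = snprintf(out, out_len,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Connection: close\r\n"
                     "\r\n",              /* empty line ends the header */
                     path, host);

    return (n >= 0 && (size_t)n < out_len) ? n : -1;
}
```

For the example above, build_http_get("www.google.com", "/index.html", buf, sizeof buf) produces the three header lines followed by the terminating empty line.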

1. An Example of Using Telnet to Debug a Simple TCP Server

If we build a TCP server application, we can use telnet to test it. We use the TCP server program here for illustration. Below is the output of running the server on port 12345 and using telnet to connect to it.

At server:

./server 12345
connection accepted
received: hello world
sent: hello world
received: hello roman10
sent: hello roman10
received: telnet sample
sent: telnet sample
received: for simple tcp server
sent: for simple tcp server
received: we’re going to finish
sent: we’re going to finish

At another terminal running telnet:

telnet 127.0.0.1 12345
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
hello world
hello world
hello roman10
hello roman10
telnet sample
telnet sample
for simple tcp server
for simple tcp server
we’re going to finish
we’re going to finish
Connection closed by foreign host.

Alternatives to Telnet

Telnet can be used as a raw TCP client in many cases, but there’re some special characters that telnet does not pass through the way a raw TCP client would. Fortunately, there’re true raw TCP clients that can be used as alternatives to telnet, including netcat and socat on Linux/UNIX, and PuTTY on Windows. We can debug TCP-based applications with those programs in a similar way.

Reference:
Telnet on Wikipedia: http://en.wikipedia.org/wiki/Telnet

Simple TCP Socket Client and Server Communication in C Under Linux

This post doesn’t provide details about how Linux sockets work, their design, etc. It mainly provides the source code of a simple TCP socket client and server in C. I’m writing this because I find myself needing a simple TCP client and server for testing from time to time. 🙂
1. The TCP Server Code

You can refer to TCP Server code below (save it as tcpserver.c),

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

//set the O_NONBLOCK flag on a socket
void nonblock(int sockfd)
{
    int opts;
    opts = fcntl(sockfd, F_GETFL);
    if(opts < 0)
    {
        fprintf(stderr, "fcntl(F_GETFL) failed\n");
        return;
    }
    opts = (opts | O_NONBLOCK);
    if(fcntl(sockfd, F_SETFL, opts) < 0)
    {
        fprintf(stderr, "fcntl(F_SETFL) failed\n");
    }
}

int main(int argc, char *argv[])
{
     int BUFLEN = 2000;
     int sockfd, newsockfd, portno;
     socklen_t clilen;
     char buffer[BUFLEN];
     struct sockaddr_in serv_addr, cli_addr;
     int n, i;
     int one = 1;

     if (argc < 2) {
         fprintf(stderr,"please specify a port number\n");
         exit(1);
     }
     sockfd = socket(AF_INET, SOCK_STREAM, 0);
     if (sockfd < 0) {
        perror("ERROR create socket");
        exit(1);
     }
     setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);    //allow reuse of port
     //bind to a local address
     bzero((char *) &serv_addr, sizeof(serv_addr));
     portno = atoi(argv[1]);
     serv_addr.sin_family = AF_INET;
     serv_addr.sin_addr.s_addr = INADDR_ANY;
     serv_addr.sin_port = htons(portno);
     if (bind(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {
        perror("ERROR on bind");
        exit(1);
     }
     //listen marks the socket as passive socket listening to incoming connections,
     //it allows max 5 backlog connections: backlog connections are pending in queue
     //if pending connections are more than 5, later request may be ignored
     listen(sockfd,5);
     //accept incoming connections
     clilen = sizeof(cli_addr);
     newsockfd = accept(sockfd, (struct sockaddr *) &cli_addr, &clilen);
     if (newsockfd < 0) {
        perror("ERROR on accept");
        exit(1);
     }
     //nonblock(newsockfd);        //if we want to set the socket as nonblock, we can uncomment this
     printf("connection accepted\n");
     for (i = 0; i < 5; ++i) {
         bzero(buffer,BUFLEN);
         //read at most BUFLEN-1 bytes so the buffer stays null-terminated
         n = read(newsockfd,buffer,BUFLEN-1);
         if (n < 0) {
            perror("ERROR read from socket");
            break;
         }
         if (n == 0) {    //the client closed the connection
            break;
         }
         printf("received: %s",buffer);
         n = write(newsockfd, buffer, n);
         if (n < 0) {
            perror("ERROR write to socket");
            break;
         }
         printf("sent: %s", buffer);
     }
     close(newsockfd);
     close(sockfd);
     return 0;
}

The code first creates a socket of SOCK_STREAM type in the AF_INET domain. SOCK_STREAM corresponds to TCP, and AF_INET refers to IPv4.

It calls setsockopt to make the socket address reusable. For example, suppose your server opens a socket on a port and then exits. Without this line of code, you may sometimes be unable to bind a new socket to the same port right away.

The program then binds the socket to a local IP address with a specific port number. listen(sockfd, 5) marks the socket as a passive socket listening for incoming connections. The 5 is the maximum backlog, i.e. the number of pending connection requests that can wait in the queue. If more than 5 requests are pending, later requests may be ignored.

The code then calls accept to accept an incoming connection. This is a blocking call, and it returns when a new connection request is received or an error occurs.

In the for loop, the TCP server reads incoming messages and echoes them back. Note that nonblock(newsockfd) has been commented out. You can uncomment it to enable non-blocking read and write, but you will need to modify the read and write part to make it work properly.

2. The TCP Client Code
The TCP client code is as below (save it as tcpclient.c),

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

//set the O_NONBLOCK flag on a socket
void nonblock(int sockfd)
{
    int opts;
    opts = fcntl(sockfd, F_GETFL);
    if(opts < 0)
    {
        fprintf(stderr, "fcntl(F_GETFL) failed\n");
        return;
    }
    opts = (opts | O_NONBLOCK);
    if(fcntl(sockfd, F_SETFL, opts) < 0)
    {
        fprintf(stderr, "fcntl(F_SETFL) failed\n");
    }
}

int main(int argc, char *argv[])
{
    int BUFLEN = 2000;
    int sockfd, portno, n;
    struct sockaddr_in serv_addr;
    struct hostent *server;
    int i;

    char buffer[BUFLEN];
    if (argc < 3) {
       fprintf(stderr,"usage: %s hostname_or_ip port\n", argv[0]);
       exit(1);
    }
    portno = atoi(argv[2]);
    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("ERROR creating socket");
        exit(1);
    }
    //get the address info by either host name or IP address
    server = gethostbyname(argv[1]);
    if (server == NULL) {
        fprintf(stderr,"ERROR, no such host\n");
        exit(1);
    }
    bzero((char *) &serv_addr, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    bcopy((char *)server->h_addr, (char *)&serv_addr.sin_addr.s_addr, server->h_length);
    serv_addr.sin_port = htons(portno);
    if (connect(sockfd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0)  {
        perror("ERROR connecting");
        exit(1);
    }
    printf("connection established\n");
    //nonblock(sockfd);    //uncomment this line if we want to make the socket non-block
    for (i = 0; i < 5; ++i) {
        printf("Please enter the message: ");
        bzero(buffer,BUFLEN);
        if (fgets(buffer,BUFLEN,stdin) == NULL) {    //end of input
            break;
        }
        n = write(sockfd,buffer,strlen(buffer));
        if (n < 0) {
             perror("ERROR writing to socket");
             break;
        }
        printf("sent: %s", buffer);
        bzero(buffer,BUFLEN);
        //read at most BUFLEN-1 bytes so the buffer stays null-terminated
        n = read(sockfd,buffer,BUFLEN-1);
        if (n < 0) {
             perror("ERROR reading from socket");
             break;
        }
        if (n == 0) {    //the server closed the connection
             break;
        }
        printf("received: %s",buffer);
    }
    close(sockfd);
    return 0;
}

The code creates a TCP socket. It then tries to connect to the server at the specified address and port number. connect is a blocking call; it returns when the connection is established, which means the client has received the SYN+ACK from the TCP server (the second packet of the TCP 3-way handshake: SYN, SYN+ACK, ACK) and has replied with the final ACK.

In the loop the TCP client sends and receives messages. As in the server, the nonblock call has been commented out.
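One detail the simple loop glosses over is that write() on a socket may accept fewer bytes than requested. A small helper along the lines below keeps writing until the whole buffer has been handed to the kernel. The name write_all is made up for this sketch, and it assumes a blocking descriptor.

```c
#include <errno.h>
#include <unistd.h>

/* Write exactly len bytes from buf to fd, retrying after partial writes
 * and after interruption by a signal. Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing, retry */
            return -1;
        }
        p += n;                     /* skip what has been written so far */
        len -= (size_t)n;
    }
    return 0;
}
```

In the client, the write call could then become write_all(sockfd, buffer, strlen(buffer)).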

3. Compile and Run
To compile the code, use the commands below,

gcc -o client tcpclient.c
gcc -o server tcpserver.c

A sample run could be,

./server 12345 (at one terminal)
./client localhost 12345 (at another terminal)

How to Configure, Install and Use libnetfilter_queue on Linux

According to the libnetfilter_queue home page, libnetfilter_queue is a userspace library that allows one to retrieve and manipulate the packets that have been queued by the kernel packet filter. It is supposed to replace the old ip_queue/libipq mechanism.
0. Dependencies
libnetfilter_queue requires a kernel that includes the nfnetlink_queue subsystem. If your Linux kernel is 2.6.14 or later, the subsystem is normally enabled.

You can confirm this by looking into your kernel configuration file. The configuration file is normally located in your system’s /boot/ directory, with a name like config-<your kernel version>. Open the file, and look for CONFIG_NETFILTER_NETLINK_QUEUE and CONFIG_NETFILTER_ADVANCED.  Make sure the two lines are not commented out.

In addition, the libnetfilter_queue library depends on libnfnetlink, a lower-level library for netfilter related kernel/userspace communication. And since this library depends on the nfnetlink kernel subsystem, you’ll need to ensure CONFIG_NETFILTER_NETLINK is not commented out in your kernel configuration file.

In summary, you’ll need to check CONFIG_NETFILTER_ADVANCED, CONFIG_NETFILTER_NETLINK and CONFIG_NETFILTER_NETLINK_QUEUE in your kernel configuration file, and install the libnfnetlink and libnetfilter_queue user space libraries.
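If you would rather check the options from a program than by eye, the sketch below scans an already opened kernel configuration file for an option that is set to y (built in) or m (module). The function name kernel_option_enabled is made up for this example. Disabled options normally appear in the file as a commented line such as "# CONFIG_X is not set", which this check simply does not match.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if option appears as OPTION=y or OPTION=m in the config
 * stream, 0 otherwise. The stream is read from its current position. */
int kernel_option_enabled(FILE *cfg, const char *option)
{
    char line[512];
    size_t opt_len = strlen(option);

    while (fgets(line, sizeof line, cfg) != NULL) {
        /* the '=' check keeps CONFIG_A from matching CONFIG_A_B=y */
        if (strncmp(line, option, opt_len) == 0 && line[opt_len] == '=' &&
            (line[opt_len + 1] == 'y' || line[opt_len + 1] == 'm'))
            return 1;
    }
    return 0;
}
```

The caller would fopen the config file under /boot/, call the function once per option (with rewind in between), and fclose the file afterwards.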
2. Installation
This is simple. First, you need to install the libnfnetlink library. Download the tar file here.
Then go to the directory where the file was downloaded and run the commands below,

tar -xvf libnfnetlink-1.0.0.tar.bz2

cd libnfnetlink-1.0.0/

./configure

make

sudo make install

Next, you need to install the libnetfilter_queue library. Download the tar file here.  Then follow the same procedure as above to build and install the library.

After installation, run sudo ldconfig to create the necessary links and cache for the newly installed libraries.
3. Understand the Sample Code
There’re not many tutorials and examples around, but libnetfilter_queue provides a simple example and some documentation. You can find the sample code at utils/nfqnl_test.c in the libnetfilter_queue folder you downloaded.

The basic idea of the code is to set up the libnetfilter_queue library and bind the program to a queue. You can refer to the documentation here and here to help you understand the code.
To compile the sample code, use the command below,

gcc -Wall -o test nfqnl_test.c -lnfnetlink -lnetfilter_queue

To run the code, use the command below,

sudo ./test

Note that you’ll need to set up a queue in the kernel packet filter table in order to see the program working. Suppose we want to queue all TCP packets sent out from our local machine; we’ll need to enter the command below,

sudo iptables -A OUTPUT -p tcp -j NFQUEUE --queue-num 0

Now you can see the test program outputting some information about the packets,

…..
hw_protocol=0x0000 hook=3 id=422 outdev=2 payload_len=52

entering callback

pkt received

hw_protocol=0x0000 hook=3 id=423 outdev=2 payload_len=52

entering callback

To stop the program, kill test and then issue the command

sudo iptables --flush

4. Additional Notes
libnetfilter_queue can be quite powerful when combined with iptables rules. It not only allows you to receive packets, but also lets you modify a packet and inject the modified packet back into the kernel. With these APIs, you can implement user space NAT, packet sniffing/capturing programs, etc.

References:
1. libnetfilter_queue home page: http://www.netfilter.org/projects/libnetfilter_queue/

How to Calculate IP/TCP/UDP Checksum–Part 3 Usage Example and Validation

This is a follow-up to the previous posts on IP/TCP/UDP checksum calculation: part 1 (theory) and part 2 (implementation).

This post gives an example that uses the libnetfilter_queue library and the checksum code we implemented in part 2, to illustrate how to use the checksum code and to verify that it computes correctly.

Note that this post is not about the libnetfilter_queue library itself, so the usage of the library is not covered here. I’ve written separate posts for libnetfilter_queue.

The code (let’s call it test.c) is as below,

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

#include "checksum.h"

static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg, struct nfq_data *nfa, void *data) {
    struct nfqnl_msg_packet_hdr *ph;
    int payload_len;
    unsigned char *payloadData;
    struct iphdr *ipHeader;
    struct tcphdr *tcpHeader;
    struct udphdr *udpHeader;
    unsigned short ipCheck, udpCheck, tcpCheck;
    u_int32_t id = 0;

    printf("entering callback\n");
    ph = nfq_get_msg_packet_hdr(nfa);
    if (ph) {
        id = ntohl(ph->packet_id);
    }
    payload_len = nfq_get_payload(nfa, &payloadData);
    if (payload_len < 0) {
        //no payload available, accept the packet unchanged
        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
    }
    printf("ip datagram len = %d\n", payload_len);
    ipHeader = (struct iphdr *)payloadData;
    ipCheck = ipHeader->check;
    printf("ip checksum: %04x\n", ipHeader->check);
    //calculate ip checksum, and see if the calculation matches
    compute_ip_checksum(ipHeader);
    printf("calculated ip checksum: %04x\n", ipHeader->check);
    if (ipCheck != ipHeader->check) {
        printf("-------------ip checksum calculation is wrong-----------\n");
    }
    if (ipHeader->protocol == IPPROTO_TCP) {
        tcpHeader = (struct tcphdr *)(payloadData + (ipHeader->ihl<<2));
        tcpCheck = tcpHeader->check;
        printf("tcp checksum: %04x\n", tcpHeader->check);
        //calculate tcp checksum, and see if the calculation matches with original tcp checksum
        compute_tcp_checksum(ipHeader, (unsigned short*)tcpHeader);
        printf("calculated tcp checksum: %04x\n", tcpHeader->check);
        if (tcpHeader->check != tcpCheck) {
            printf("-----------calculation is wrong-------\n");
        }
    } else if (ipHeader->protocol == IPPROTO_UDP) {
        udpHeader = (struct udphdr *)(payloadData + (ipHeader->ihl<<2));
        udpCheck = udpHeader->check;
        printf("udp checksum: %04x\n", udpHeader->check);
        //calculate udp checksum, and see if the calculation matches with original udp checksum
        compute_udp_checksum(ipHeader, (unsigned short*)udpHeader);
        printf("calculated udp checksum: %04x\n", udpHeader->check);
        if (udpHeader->check != udpCheck) {
            printf("-----------calculation is wrong-------\n");
        }
    }
    //issue a verdict on a packet
    //qh: netfilter queue handle; id: ID assigned to packet by netfilter; verdict: verdict to return to netfilter, data_len: number
    //of bytes of data pointed by buf, buf: the buffer that contains the packet data (payload)
    // return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
    return nfq_set_verdict(qh, id, NF_ACCEPT, payload_len, payloadData);
}

int main(int argc, char **argv) {
    struct nfq_handle *h;
    struct nfq_q_handle *qh;
    //struct nfnl_handle *nh;
    int fd;
    int rv;
    char buf[4096] __attribute__((aligned));

    /*netfilter_queue library set up step 1: call nfq_open to open a NFQUEUE handler*/
    printf("opening library handle\n");
    h = nfq_open();
    if (!h) {
        fprintf(stderr, "error during nfq_open()\n");
        exit(1);
    }
    /*set up step 2: tell the kernel that userspace queuing is handled by NFQUEUE for the selected protocol. This is
     made by calling nfq_unbind_pf and nfq_bind_pf with protocol info. The idea behind this is to enable simultaneously loaded modules to be used for queuing*/
    printf("unbinding existing nf_queue handler for AF_INET (if any)\n");
    if (nfq_unbind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_unbind_pf()\n");
        exit(1);
    }
    printf("binding nfnetlink_queue as nf_queue handler for AF_INET\n");
    if (nfq_bind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_bind_pf()\n");
        exit(1);
    }
    /*after the above two steps, we can set up and use a queue*/
    /*bind the program to a specific queue, cb is callback function to call for each queued packet
     *h: netfilter queue handle, 0: the number of the queue to bind to, cb: callback function for each queued packet, data: custom data to pass to callback function*/
    printf("binding this socket to queue '0'\n");
    qh = nfq_create_queue(h, 0, &cb, NULL);
    if (!qh) {
        fprintf(stderr, "error during nfq_create_queue()\n");
        exit(1);
    }
    /*set mode: NFQNL_COPY_PACKET: copy entire packet, it defines the part of data that nfqueue copies to userspace
     0xffff: size of the packet that we want to get*/
    printf("setting copy_packet mode\n");
    if (nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff) < 0) {
        fprintf(stderr, "cannot set packet_copy mode\n");
        exit(1);
    }
    /*handle the incoming packets*/
    fd = nfq_fd(h);
    while ((rv = recv(fd, buf, sizeof(buf), 0)) && rv >= 0) {
        printf("pkt received\n");
        nfq_handle_packet(h, buf, rv);
    }
    printf("unbinding from queue 0\n");
    nfq_destroy_queue(qh);
    /*the program has finished with libnetfilter_queue, it can call nfq_close to free all associated resources*/
    printf("closing library handle\n");
    nfq_close(h);
    exit(0);
}

The code uses libnetfilter_queue to get the IP datagrams of TCP/UDP packets and reads their IP and TCP/UDP checksums. It then uses the checksum code we implemented in part 2 to recalculate the checksums. If a calculated checksum matches the checksum retrieved from the original packet, then we can be confident the computation is done correctly.
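The same compare-and-recompute idea can be tried offline on a fixed header, without libnetfilter_queue. The sketch below is a separate, byte-oriented implementation written for this illustration (the names internet_checksum and ip_header_checksum_ok are made up; it is not the part 2 code). It sums the header as big-endian 16-bit words, so it gives the same result on any host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One's complement of the one's complement sum over len bytes, treating
 * the data as big-endian 16-bit words (network byte order). */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                        /* odd trailing byte, zero padded */
        sum += (uint32_t)data[i] << 8;
    while (sum >> 16)                   /* fold carries into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Return 1 if the checksum stored in bytes 10..11 of an IPv4 header
 * matches the value recomputed with that field zeroed, 0 otherwise. */
int ip_header_checksum_ok(const uint8_t *header, size_t header_len)
{
    uint8_t copy[60];                   /* an IPv4 header is at most 60 bytes */
    uint16_t stored;

    if (header_len < 20 || header_len > sizeof copy)
        return 0;
    memcpy(copy, header, header_len);
    stored = (uint16_t)((copy[10] << 8) | copy[11]);
    copy[10] = copy[11] = 0;            /* the field counts as zero while summing */
    return internet_checksum(copy, header_len) == stored;
}
```

For the 20-byte header 45 00 00 73 00 00 40 00 40 11 b8 61 c0 a8 00 01 c0 a8 00 c7, the recomputed value is 0xb861, which matches the stored checksum field.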

Compile and Run the Code

To compile the code, you can run the command below,

gcc -Wall -o testsum checksum.c test.c -lnfnetlink -lnetfilter_queue

Note that you’ll need the libnfnetlink and libnetfilter_queue development libraries installed to compile the code.

To run the code, save the following commands to run.sh, make the script executable, and then run sudo ./run.sh,

#!/bin/sh
sudo iptables -t mangle --flush
sudo iptables -t mangle -A OUTPUT -p tcp -j NFQUEUE --queue-num 0
sudo iptables -t mangle -A OUTPUT -p udp -j NFQUEUE --queue-num 0
sudo ./testsum

Note that after you kill the program, you won’t be able to access the Internet: the iptables rules still send outgoing TCP and UDP packets to queue 0, but no program is reading the queue any more. Run the following command to remove the rules and restore connectivity,

sudo iptables -t mangle --flush

Note that the above procedure sets up Linux iptables rules that queue outgoing TCP and UDP packets, so that the test.c program can receive them in userspace.

Download

You can download the source code and scripts from here.

How to Calculate IP/TCP/UDP Checksum–Part 2 Implementation

This is a follow-up to the previous post, How to Calculate IP/TCP/UDP Checksum Part 1 Theory.

IP Header Checksum Calculation Implementation
To calculate the IP checksum, one can use the code below,

#include <arpa/inet.h>   /* htons */
#include <netinet/ip.h>  /* struct iphdr */

/* Compute checksum for count bytes starting at addr, using one's complement of one's complement sum.
   Defined before compute_ip_checksum so that the static function is declared before its first use. */
static unsigned short compute_checksum(unsigned short *addr, unsigned int count) {
  register unsigned long sum = 0;
  while (count > 1) {
    sum += *addr++;
    count -= 2;
  }
  //if one byte is left, pad it with a zero byte and add it
  //(the mask keeps only the first byte in memory of the last 16-bit word)
  if (count > 0) {
    sum += ((*addr) & htons(0xFF00));
  }
  //fold sum to 16 bits: add the carry to the result
  while (sum >> 16) {
    sum = (sum & 0xffff) + (sum >> 16);
  }
  //one's complement
  sum = ~sum;
  return ((unsigned short)sum);
}

/* set ip checksum of a given ip header */
void compute_ip_checksum(struct iphdr *iphdrp) {
  iphdrp->check = 0;
  iphdrp->check = compute_checksum((unsigned short *)iphdrp, iphdrp->ihl << 2);
}

The function compute_ip_checksum sets the checksum field of the IP header to zero and then calls compute_checksum. The function compute_checksum takes the data to checksum and its length in bytes as its two input parameters. It sums up all 16-bit words; if there is an odd number of bytes, it pads the last byte with a zero byte before adding it. After summing all words, it folds the sum to 16 bits by adding the carry back into the result. Finally, it takes the one’s complement of the sum and casts it to a 16-bit unsigned short. You can refer to part 1 for a more detailed description of the algorithm.

Note that iphdr, tcphdr and udphdr are Linux data structures representing the IP header, TCP header and UDP header respectively. On Linux they are declared in <netinet/ip.h>, <netinet/tcp.h> and <netinet/udp.h>; reading those headers will help in understanding the code.

TCP Header Checksum Calculation Implementation
To calculate the TCP checksum, you can use the code below,

/* set tcp checksum: given IP header and tcp segment */
void compute_tcp_checksum(struct iphdr *pIph, unsigned short *ipPayload) {
    register unsigned long sum = 0;
    unsigned short tcpLen = ntohs(pIph->tot_len) - (pIph->ihl<<2);
    struct tcphdr *tcphdrp = (struct tcphdr*)(ipPayload);
    //add the pseudo header
    //the source ip
    sum += (pIph->saddr>>16)&0xFFFF;
    sum += (pIph->saddr)&0xFFFF;
    //the dest ip
    sum += (pIph->daddr>>16)&0xFFFF;
    sum += (pIph->daddr)&0xFFFF;
    //protocol and reserved: 6
    sum += htons(IPPROTO_TCP);
    //the length
    sum += htons(tcpLen);

    //add the IP payload
    //initialize checksum to 0
    tcphdrp->check = 0;
    while (tcpLen > 1) {
        sum += *ipPayload++;
        tcpLen -= 2;
    }
    //if one byte is left, pad it with a zero byte and add it
    if (tcpLen > 0) {
        sum += ((*ipPayload)&htons(0xFF00));
    }
    //fold 32-bit sum to 16 bits: add the carry to the result
    while (sum>>16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    sum = ~sum;
    //set computation result
    tcphdrp->check = (unsigned short)sum;
}

The function sums the TCP pseudo header first, then the IP payload, which is the TCP segment. It also pads the last byte if there is an odd number of bytes. For a detailed description of the algorithm, please refer to the comments in the code and to part 1.

UDP Header Checksum Calculation Implementation
To calculate the UDP checksum, one can follow the code below,

/* set udp checksum: given IP header and UDP datagram */
void compute_udp_checksum(struct iphdr *pIph, unsigned short *ipPayload) {
    register unsigned long sum = 0;
    struct udphdr *udphdrp = (struct udphdr*)(ipPayload);
    //the length field is in network byte order; convert it to host order for the loop below
    unsigned short udpLen = ntohs(udphdrp->len);
    //add the pseudo header
    //the source ip
    sum += (pIph->saddr>>16)&0xFFFF;
    sum += (pIph->saddr)&0xFFFF;
    //the dest ip
    sum += (pIph->daddr>>16)&0xFFFF;
    sum += (pIph->daddr)&0xFFFF;
    //protocol and reserved: 17
    sum += htons(IPPROTO_UDP);
    //the length (already in network byte order)
    sum += udphdrp->len;

    //add the IP payload
    //initialize checksum to 0
    udphdrp->check = 0;
    while (udpLen > 1) {
        sum += *ipPayload++;
        udpLen -= 2;
    }
    //if one byte is left, pad it with a zero byte and add it
    if (udpLen > 0) {
        sum += ((*ipPayload)&htons(0xFF00));
    }
    //fold sum to 16 bits: add the carry to the result
    while (sum>>16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    //one's complement
    sum = ~sum;
    //set computation result; 0x0000 is reserved for "no checksum", so send 0xFFFF instead
    udphdrp->check = ((unsigned short)sum == 0x0000)?0xFFFF:(unsigned short)sum;
}

The code is similar to the TCP checksum computation, except that when the checksum is computed as all 0s, we set it to all 1s, because 0x0000 is reserved to indicate that the checksum was not computed. Please refer to part 1 for a detailed description of the algorithm.

All the code can be downloaded at part 3.

How to Calculate IP/TCP/UDP Checksum–Part 1 Theory

This is the first part of How to Calculate IP/TCP/UDP Checksum. The two parts that follow are Part 2 Implementation and Part 3 Usage Example and Validation.

This part focuses on the algorithms. We’ll go through IP/TCP/UDP one by one.

IP Header Checksum Calculation

The IP checksum is a 16-bit field in the IP header used to detect errors in the IP header. It equals the one’s complement of the one’s complement sum of all 16-bit words in the IP header. The checksum field is set to all zeros before the computation.

The one’s complement sum is calculated by summing all the numbers and adding the carry (or carries) back into the result. The one’s complement of a number is obtained by inverting every bit (0s become 1s and 1s become 0s) in its binary representation.

For example, suppose an IP header is 0x4500003044224000800600008c7c19acae241e2b.
We start by calculating the one’s complement sum. First, divide the header into 16-bit words and sum them up (all values are hexadecimal),

4500 + 0030 + 4422 + 4000 + 8006 + 0000 + 8c7c + 19ac + ae24 + 1e2b = 2BBCF

Next fold the result into 16 bits by adding the carry to the result,

2 +  BBCF  = BBD1

The final step is to compute the one’s complement of the one’s complement sum,

BBD1 = 1011101111010001

IP checksum = one’s complement(1011101111010001) = 0100010000101110 = 442E

Note that the IP header needs to be parsed at each hop, because the IP addresses are needed to route the packet. To detect errors in the IP header, the checksum is validated at every hop.

The validation uses the same algorithm, but this time the checksum field holds 442E instead of zeros.

2BBCF + 442E = 2FFFD, then 2 + FFFD = FFFF

Taking the one’s complement of FFFF gives 0.

At validation, the checksum computation should evaluate to 0 if the IP header is correct.
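The worked example above can be checked with a short C sketch. The names ones_complement_sum, internet_checksum and example_header are illustrative and are not taken from the code in part 2; the sketch reads bytes explicitly in network order, so it should behave the same on little-endian and big-endian hosts.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the data as big-endian 16-bit words and fold the carries back in.
 * An odd trailing byte is padded on the right with a zero byte. */
uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8;
    while (sum >> 16)                        /* end-around carry */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Checksum = one's complement of the one's complement sum. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    return (uint16_t)~ones_complement_sum(data, len);
}

/* The 20-byte example header with the checksum field (bytes 10-11) zeroed. */
const uint8_t example_header[20] = {
    0x45, 0x00, 0x00, 0x30, 0x44, 0x22, 0x40, 0x00, 0x80, 0x06,
    0x00, 0x00, 0x8c, 0x7c, 0x19, 0xac, 0xae, 0x24, 0x1e, 0x2b
};
```

Calling internet_checksum(example_header, 20) reproduces the 442E computed by hand; after writing 44 2E into bytes 10 and 11 of a copy of the header, the same call evaluates to 0, which is the validation result described above.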

TCP Checksum Calculation

The TCP checksum is a 16-bit field in the TCP header used for error detection. It is computed over the TCP segment (plus a padding byte if needed) and a 12-byte TCP pseudo header created on the fly. Like the IP checksum, the TCP checksum is the one’s complement of the one’s complement sum of all 16-bit words in the computation data.

Below is a figure that illustrates the data used to calculate TCP checksum,

Figure 1. TCP Checksum Computation Data

As shown in the figure, the pseudo header consists of 5 fields,

  • source address: 32 bits/4 bytes, taken from IP header
  • destination address: 32 bits/4 bytes, taken from IP header
  • reserved: 8 bits/1 byte, all zeros
  • protocol: 8 bits/1 byte, taken from IP header. In case of TCP, this should always be 6, which is the assigned protocol number for TCP.
  • TCP Length: 16 bits/2 bytes, the length of the TCP segment, including TCP header and TCP data. Note that this field does not exist in the TCP header, so it is computed on the fly.

Note that TCP pseudo header does not really exist, and it’s not transmitted over the network. It’s constructed on the fly to compute the checksum.

If a TCP segment contains an odd number of octets to be checksummed, the last octet is padded on the right with zeros to form a 16-bit word. The padding is not part of the TCP segment and is therefore not transmitted.

Also note that the checksum field of the TCP header needs to be set to zeros before the checksum calculation, and is set to the computed value afterwards.

When a TCP packet is received at the destination, the receiving TCP code performs the same checksum calculation and checks for a mismatch. If there is one, the packet contains an error and is discarded. The same validation logic used for the IP header checksum can be used here.
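As a rough illustration of how the pseudo header enters the computation, here is a minimal C sketch. The function names sum_words and tcp_checksum and the byte-array interface are assumptions made for this example (part 2 works on struct iphdr instead); addresses and the segment are passed as raw bytes in network order.

```c
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running sum as big-endian 16-bit words.
 * An odd trailing byte is padded on the right with a zero byte. */
uint32_t sum_words(uint32_t sum, const uint8_t *data, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8;
    return sum;
}

/* src/dst: IPv4 addresses as 4 bytes each; segment: TCP header plus data.
 * When computing, the checksum field (bytes 16-17 of the segment) must be 0.
 * When validating, leave the received value in place and expect a result of 0. */
uint16_t tcp_checksum(const uint8_t src[4], const uint8_t dst[4],
                      const uint8_t *segment, uint16_t tcp_len)
{
    uint8_t pseudo[12];
    uint32_t sum;
    int i;

    for (i = 0; i < 4; i++) {
        pseudo[i] = src[i];          /* source address */
        pseudo[4 + i] = dst[i];      /* destination address */
    }
    pseudo[8] = 0;                   /* reserved */
    pseudo[9] = 6;                   /* protocol number for TCP */
    pseudo[10] = (uint8_t)(tcp_len >> 8);    /* TCP length, computed on the fly */
    pseudo[11] = (uint8_t)(tcp_len & 0xFF);

    sum = sum_words(0, pseudo, sizeof pseudo);
    sum = sum_words(sum, segment, tcp_len);
    while (sum >> 16)                /* fold carries */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The pseudo header is built in a local buffer and never becomes part of the segment, which mirrors the point that it is not transmitted.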

UDP Checksum Calculation

UDP checksum calculation is similar to TCP checksum calculation. The checksum is also a 16-bit field holding the one’s complement of the one’s complement sum of a pseudo UDP header plus the UDP datagram.
The pseudo UDP header also consists of 5 fields,

  • source address: 32 bits/4 bytes, taken from IP header
  • destination address: 32 bits/4 bytes, taken from IP header
  • reserved: 8 bits/1 byte, set to all 0s.
  • protocol: 8 bits/1 byte, taken from IP header
  • length: 16 bits/2 bytes. Because the UDP header has a length field that indicates the length of the entire datagram, including UDP header and data, the value from the UDP header is used. Note that this differs from the TCP pseudo header, whose length is computed on the fly. Both indicate the header + payload length.

Note that the UDP checksum is optional. If it’s not computed, the field is set to all 0s. This could cause ambiguity, because a computed checksum can also come out as all 0s. To avoid confusion, if the checksum is computed as all 0s, it’s set to all 1s (which is equivalent in one’s complement arithmetic).
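The zero rule, and the reason all 1s is an equivalent value, can be sketched in a few lines of C. The helper names oc_add and udp_checksum_on_wire are made up for this illustration.

```c
#include <stdint.h>

/* One's complement addition of two 16-bit values (end-around carry). */
uint16_t oc_add(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;

    sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Map a computed UDP checksum to the value placed in the header:
 * 0x0000 means "no checksum", so a computed zero is sent as 0xFFFF. */
uint16_t udp_checksum_on_wire(uint16_t computed)
{
    return computed == 0x0000 ? 0xFFFF : computed;
}
```

Adding 0xFFFF with oc_add leaves any non-zero 16-bit value unchanged, in the same way that adding 0x0000 does. This is why a receiver that sums the datagram with 0xFFFF in the checksum field still validates it correctly.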

References

1. IP/TCP/UDP Headers: http://www.imengineering.com/Training/CISCO/TCPIPUDPheaders.pdf
2. TCP Checksum Calculation: http://www.tcpipguide.com/free/t_TCPChecksumCalculationandtheTCPPseudoHeader-2.htm
3. IP Checksum: http://www.netfor2.com/checksum.html
4. One’s complement: http://en.wikipedia.org/wiki/Ones'_complement
5. UDP datagram RFC: http://www.ietf.org/rfc/rfc768.txt

TCP Tahoe, Reno, NewReno, and SACK–a Brief Comparison

1. TCP and its Algorithms (Slow-Start, Congestion Avoidance, Fast Retransmit and Fast Recovery)

TCP is a complex transport layer protocol containing four intertwined algorithms: slow-start, congestion avoidance, fast retransmit and fast recovery.

In the slow-start phase, TCP increases the congestion window each time an acknowledgement is received, by the number of packets acknowledged. This strategy effectively doubles the congestion window every round-trip time (RTT).

When the congestion window exceeds a threshold named ssthresh, TCP enters the congestion avoidance phase, in which the congestion window is increased by one packet per RTT until a loss event occurs.
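The two growth rules can be sketched as a single ACK handler. This is a simplified illustration in the spirit of RFC 5681, working in bytes; the struct tcp_cc, the function on_new_ack and the SMSS value of 1460 bytes are assumptions made for the example.

```c
#include <stdint.h>

#define SMSS 1460u   /* assumed sender maximum segment size, in bytes */

struct tcp_cc {
    uint32_t cwnd;       /* congestion window, in bytes */
    uint32_t ssthresh;   /* slow-start threshold, in bytes */
};

/* Called for each ACK that acknowledges new data. */
void on_new_ack(struct tcp_cc *cc, uint32_t bytes_acked)
{
    if (cc->cwnd < cc->ssthresh) {
        /* Slow start: grow by the amount acknowledged (at most one segment
         * per ACK), which roughly doubles cwnd every RTT. */
        cc->cwnd += bytes_acked < SMSS ? bytes_acked : SMSS;
    } else {
        /* Congestion avoidance: grow by about SMSS*SMSS/cwnd per ACK,
         * which adds up to roughly one segment per RTT. */
        uint32_t inc = SMSS * SMSS / cc->cwnd;
        cc->cwnd += inc ? inc : 1;
    }
}
```

With a window of ten segments in congestion avoidance, each ACK adds about a tenth of a segment, so a full window of ACKs grows cwnd by roughly one segment.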

TCP starts a timer after sending a packet; if no acknowledgement is received before the timer expires, the packet is considered lost. However, waiting for the timeout may take too long for TCP to realize a packet is lost and take action. The fast retransmit algorithm makes use of duplicate ACKs to detect packet loss sooner. In fast retransmit, when an acknowledgement with the same acknowledgement number is received a specified number of times (normally 3), the TCP sender is reasonably confident that the packet is lost and retransmits it.
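The duplicate ACK counting can be sketched as follows. This is a deliberately simplified illustration: a real TCP applies additional conditions before treating an ACK as a duplicate, and the names ack_tracker and on_ack are hypothetical.

```c
#include <stdint.h>

#define DUPACK_THRESHOLD 3

struct ack_tracker {
    uint32_t last_ack;   /* highest acknowledgement number seen so far */
    int dup_count;       /* duplicates of last_ack received so far */
};

/* Process one incoming ACK. Returns 1 exactly when the duplicate count
 * reaches the threshold, meaning the segment starting at ack_no should
 * be retransmitted without waiting for the timer. */
int on_ack(struct ack_tracker *t, uint32_t ack_no)
{
    if (ack_no == t->last_ack) {
        t->dup_count++;
        return t->dup_count == DUPACK_THRESHOLD;
    }
    /* New data acknowledged: remember it and reset the counter. */
    t->last_ack = ack_no;
    t->dup_count = 0;
    return 0;
}
```

Returning 1 only on the third duplicate avoids retransmitting the same segment again when a fourth or fifth duplicate arrives.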

Fast recovery is closely related to fast retransmit. When a loss event is detected by the TCP sender, a fast retransmit is performed. If fast recovery is used, the TCP sender does not enter the slow-start phase. Instead, it halves the congestion window and “inflates” it by computing the usable window as min(awin, cwnd+ndup), where awin is the receiver’s advertised window, cwnd is the congestion window, and ndup is the number of duplicate ACKs received. When an acknowledgement of new data (called a recovery ACK) is received, the sender returns to the congestion avoidance phase.
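The usable window formula translates directly into code. In this small sketch all quantities are counted in packets, and the function name usable_window is chosen for illustration only.

```c
#include <stdint.h>

/* Usable window during fast recovery: min(awin, cwnd + ndup).
 * Each duplicate ACK signals that one packet has left the network,
 * so the window is inflated by ndup to let new packets be sent. */
uint32_t usable_window(uint32_t awin, uint32_t cwnd, uint32_t ndup)
{
    uint32_t inflated = cwnd + ndup;

    return awin < inflated ? awin : inflated;
}
```

For instance, with cwnd halved to 10 packets and 3 duplicate ACKs received, the usable window is 13 packets unless the receiver’s window is smaller.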

2. Tahoe, Reno, NewReno, and SACK

TCP Tahoe is the simplest of the four variants. It doesn’t have fast recovery. In the congestion avoidance phase, it treats triple duplicate ACKs the same as a timeout. When a timeout occurs or triple duplicate ACKs are received, it performs a fast retransmit, reduces the congestion window to 1, and enters the slow-start phase.

TCP Reno differs from TCP Tahoe in the congestion avoidance phase. When triple duplicate ACKs are received, it halves the congestion window, performs a fast retransmit, and enters fast recovery. If a timeout occurs, it enters slow-start, the same as TCP Tahoe. TCP Reno recovers effectively from a single packet loss, but it still suffers from performance problems when multiple packets are dropped from one window of data.
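The difference between the two variants comes down to how they react to a loss event, which the following sketch tries to capture. Window sizes are in packets; the enum and struct names are hypothetical, and setting ssthresh to half the window with a floor of 2 follows common practice rather than anything stated in this post.

```c
enum tcp_variant { TCP_TAHOE, TCP_RENO };
enum loss_event  { LOSS_TIMEOUT, LOSS_TRIPLE_DUPACK };
enum tcp_phase   { SLOW_START, CONGESTION_AVOIDANCE, FAST_RECOVERY };

struct tcp_state {
    unsigned cwnd;        /* congestion window, in packets */
    unsigned ssthresh;    /* slow-start threshold, in packets */
    enum tcp_phase phase;
};

/* React to a loss event according to the chosen variant. */
void on_loss(struct tcp_state *s, enum tcp_variant v, enum loss_event e)
{
    unsigned half = s->cwnd / 2;

    s->ssthresh = half < 2 ? 2 : half;
    if (v == TCP_RENO && e == LOSS_TRIPLE_DUPACK) {
        /* Reno: halve the window and enter fast recovery. */
        s->cwnd = s->ssthresh;
        s->phase = FAST_RECOVERY;
    } else {
        /* Tahoe (any loss) and Reno (timeout): restart from slow start. */
        s->cwnd = 1;
        s->phase = SLOW_START;
    }
}
```

Starting from a window of 16 packets, triple duplicate ACKs leave Reno with a window of 8 in fast recovery, while Tahoe drops to 1 and begins slow-start again.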

TCP NewReno tries to improve TCP Reno’s performance when a burst of packets is lost by modifying the fast recovery algorithm. In TCP NewReno, an ACK for new data is not enough to take TCP out of fast recovery and back to congestion avoidance. Instead, it requires that all the packets outstanding at the start of the fast recovery period be acknowledged.
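The exit condition can be sketched by remembering the highest sequence number sent when fast recovery began. The struct newreno and the function newreno_on_new_ack are illustrative names, and sequence number wraparound is ignored to keep the sketch short.

```c
#include <stdint.h>

struct newreno {
    uint32_t recover;       /* highest sequence number sent when fast recovery began */
    int in_fast_recovery;   /* non-zero while in fast recovery */
};

/* Process an ACK for new data. Returns 1 if the ACK is partial, in which
 * case the sender stays in fast recovery and should retransmit the packet
 * starting at ack_no. Returns 0 otherwise. */
int newreno_on_new_ack(struct newreno *s, uint32_t ack_no)
{
    if (!s->in_fast_recovery)
        return 0;
    if (ack_no > s->recover) {
        /* Full ACK: everything outstanding at the start is acknowledged. */
        s->in_fast_recovery = 0;
        return 0;
    }
    /* Partial ACK: some packets outstanding at the start are still missing. */
    return 1;
}
```

A Reno sender would leave fast recovery on the first of these ACKs; here only an ACK beyond recover ends the recovery period.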

TCP NewReno works by assuming that the packet immediately following the partial ACK received during fast recovery is lost, and retransmits that packet. However, this assumption might not hold, and when it does not, it hurts TCP performance. SACK TCP adds a number of SACK blocks to the TCP packet, where each SACK block acknowledges that a non-contiguous set of data has been received. The main difference between the SACK TCP and Reno TCP implementations is in their behavior when multiple packets are dropped from one window of data. The SACK sender keeps track of which packets are missing at the receiver and retransmits only those packets. When all the packets outstanding at the start of fast recovery are acknowledged, SACK exits fast recovery and enters congestion avoidance.

Note that the four variants of TCP differ only when there is a packet loss. If all packets reach the destination successfully, the four variants behave the same.
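To make the idea of tracking missing packets concrete, here is a small sketch that derives the holes from a set of SACK blocks. It assumes fixed-size segments and ignores sequence number wraparound; struct sack_block, is_sacked and find_holes are names invented for this example and do not describe any particular TCP stack.

```c
#include <stddef.h>
#include <stdint.h>

/* One SACK block: bytes in [left, right) have been received. */
struct sack_block {
    uint32_t left;
    uint32_t right;
};

/* Return 1 if the segment [seq, seq + len) lies entirely inside a block. */
int is_sacked(const struct sack_block *blocks, size_t n,
              uint32_t seq, uint32_t len)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (seq >= blocks[i].left && seq + len <= blocks[i].right)
            return 1;
    return 0;
}

/* Collect the starting sequence numbers of segments between the cumulative
 * ACK and the highest SACKed byte that no block covers. These are the
 * segments the sender would retransmit. Returns the number of holes found. */
size_t find_holes(const struct sack_block *blocks, size_t n,
                  uint32_t cum_ack, uint32_t mss,
                  uint32_t *holes, size_t max_holes)
{
    uint32_t highest = cum_ack;
    uint32_t seq;
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (blocks[i].right > highest)
            highest = blocks[i].right;
    for (seq = cum_ack; seq < highest && count < max_holes; seq += mss)
        if (!is_sacked(blocks, n, seq, mss))
            holes[count++] = seq;
    return count;
}
```

With a cumulative ACK of 1000, a segment size of 1000 and SACK blocks [2000, 3000) and [4000, 6000), the holes are the segments starting at 1000 and 3000, so only those two are retransmitted.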

References

1. Allman, M., Paxson, V. and Blanton, E. 2009. TCP Congestion Control. RFC 5681.

2. Fall, K. and Floyd, S. 1996. Simulation-based Comparisons of Tahoe, Reno and SACK TCP. ACM SIGCOMM Computer Communication Review, 26(3):5-21.