TCP TIME_WAIT State and Address Already in Use Error

0. The Problem
Recently I am working on a project consists of TCP socket programming on Linux. I encountered errno 98 (address already in use) and 99 (cannot assign requested address) frequently. I wrote a small test program to reproduce the issue. The test code is as below,

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <errno.h>

#include <sys/types.h>

#include <sys/socket.h>

#include <sys/select.h>

#include <sys/time.h>

#include <fcntl.h>

#include <netinet/in.h>

#include <netdb.h> 

#include <arpa/inet.h>

#include <pthread.h>

#include <signal.h>

#include <sys/time.h>

 

unsigned short PROXY_SERVER_PORT = 0x5B77;        

#define PROXY_CLIENT_ST_PORT 0x6B77                

unsigned int bindError = 0;

 

int main(void) {

    int socketFd;

    int one = 1;

    int mapIdx = 0;

    int i, j; 

    int conTry = 0;

    struct sockaddr_in localAddr, serv_addr;

    struct addrinfo hints, *res;

    int err;

    //char *hostname = "www.google.com";

    char *hostname = "74.125.235.18";

    memset(&hints, 0, sizeof(hints));

    hints.ai_socktype = SOCK_STREAM;

    hints.ai_family = AF_INET;

 

    if ((err = getaddrinfo(hostname, NULL, &hints, &res)) != 0) {

        printf("error %dn", err);

        return 1;

    }

 

    bzero((char *) &serv_addr, sizeof(serv_addr));

    serv_addr.sin_family = AF_INET;

    serv_addr.sin_addr.s_addr = ((struct sockaddr_in *)(res->ai_addr))->sin_addr.s_addr;    

    serv_addr.sin_port = htons(80);

 

    printf("ip address : %sn", inet_ntoa(serv_addr.sin_addr));

    freeaddrinfo(res);

 

    printf("before loop...n");

    for (i = 0; i < 3; ++i) {

        socketFd = socket(AF_INET, SOCK_STREAM, 0);

        //if the line below is enabled, error occurs at connect

        setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

        if (socketFd < 0)  {

            printf("ERROR opening proxy tcp client socket %d: %d", mapIdx, errno);

            exit(1);

        }

        memset(&localAddr, 0, sizeof(struct sockaddr_in));

        localAddr.sin_family = AF_INET;

        localAddr.sin_addr.s_addr = INADDR_ANY;

        localAddr.sin_port = htons((PROXY_CLIENT_ST_PORT + mapIdx + bindError)%65535);

        printf("binding...n");

        for (j = 1; ;++j) {

            if (bind(socketFd, (struct sockaddr *)&localAddr, sizeof(struct sockaddr_in)) < 0) {

                printf("Error %d, tcp proxy client binding to %dn", errno, ntohs(localAddr.sin_port));

                perror("error binding tcp proxy client: ");

                ++bindError;

                localAddr.sin_port = htons((PROXY_CLIENT_ST_PORT + mapIdx + bindError)%65535);

            } else {    

                break;

            }

        }

        printf("bound socket %d to tcp client port %un", socketFd, ntohs(localAddr.sin_port));

        printf("connecting....n");

        if (connect(socketFd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0) {

            printf("error code: %dn", errno);

            perror("error connecting to ori server from proxy tcp client: ");

            if (errno == EADDRNOTAVAIL) {

                printf("The specified address %d is not available from the local machine.n", ntohs(localAddr.sin_port));

                exit(1);

            }

        }

        printf("connected: %dn", mapIdx);

        close(socketFd);

        mapIdx++;

    }

    return 0;

}

Save the code to test.c, then you can compile it using “gcc -o test test.c”. Running the program multiple times will likely to give you one error, enable/disable line “setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);“ will give you the other error.

When the error occurs, run netstat command “netstat | grep 27511” (from the program output, I know the error occurs at port 27511). Below is the screenshot,

Figure 1. Cannot Assign Request Address

As shown in figure 1, the tcp port is in TIME_WAIT state.

1. The TCP State Transition at Closure

To close a established TCP connection, both endpoints send FIN packets to indicate there’s no more data. Upon receiving the other party’s FIN packet, both endpoints need to ACK it.

The FIN packets are sent when a program calls exit(), close() or shutdown(). The ACKs are handled by the kernel after close() is completed. Therefore, it is possible that the program finishes before the kernel releases the associated network resource. And another process won’t be able to use it until kernel has freed it.

Below is a figure of detailed state transitions for an endpoint when TCP connection closes. It follows different paths depending on which side initiated the closure.

Figure 2. TCP State Transition at Closure (diagram from reference 4)

Note that TIME_WAIT only occurs at the endpoint which initiated the closure.

2. Why TIME_WAIT

After the TCP connection is closed, there might still be live packets in the network. If a new connection is established with the exact same (client IP, client port, server IP, server port) tuple, the packets from the previous connection will be treated for the new connection.

To avoid this, TIME_WAIT time is generally set to twice the packets maximum age. The value is long enough that the packets for the old connection will be dead after the time expires. Note that setting TIME_WAIT at one endpoint would be enough to make sure no two exactly same (client IP, client port, server IP, server port) tuples appear.

3. How to Avoid the Problem

TIME_WAIT only occurs at the side which initiates the TCP connection closure, so a natural solution would be avoid calling close(). If you have control over both client and server, you may want to let the client close first, so the server won’t ends of lots of TIME_WAIT ports.

As indicated in the testing program, setsockopt() with SO_REUSEADDR allows you to bind the a socket to a port which in TIME_WAIT. If you use the socket as a client side, and try connecting to the same (server address, server port) tuple, you’ll fail at connect stage. However, connecting to other (server address, server port) is allowed.

If you use the socket as a server socket, you can also use SO_REUSEADDR. I’ve not tested if the same (client address, client port) tries to connect, what will happen. But I guess the connection request will be denied.

It’s also possible to modify the TIME_WAIT values on some operating systems.

References:

1. Setting TIME_WAIT TCP, stackoverflow: http://stackoverflow.com/questions/337115/setting-time-wait-tcp

2. TIME_WAIT and its design implications for protocols and scalable client server systems: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html

3. The TIME_WAIT state in TCP and Its Effect on Busy Servers: http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/

4. Bind: Address Already in Use or How to Avoid this Error when Closing TCPConnections

Simple TCP Socket Client and Server Communication in C Under Linux

This post doesn’t provide details about how Linux socket works, its design etc. It mainly for providing source code of simple TCP socket client and server in C. I’m writing this because I found myself need simple TCP client and server for testing from time to time. 🙂
1. The TCP Server Code

You can refer to TCP Server code below (save it as tcpserver.c),

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <unistd.h>

#include <fcntl.h>

#include <sys/types.h> 

#include <sys/socket.h>

#include <netinet/in.h>

 

void nonblock(int sockfd)

{

    int opts;

    opts = fcntl(sockfd, F_GETFL);

    if(opts < 0)

    {

        fprintf(stderr, "fcntl(F_GETFL) failedn");

    }

    opts = (opts | O_NONBLOCK);

    if(fcntl(sockfd, F_SETFL, opts) < 0) 

    {

        fprintf(stderr, "fcntl(F_SETFL) failedn");

    }

}

 

int main(int argc, char *argv[])

{

     int BUFLEN = 2000;

     int sockfd, newsockfd, portno;

     socklen_t clilen;

     char buffer[BUFLEN];

     struct sockaddr_in serv_addr, cli_addr;

     int n, i;

     int one = 1;

 

     if (argc < 2) {

         fprintf(stderr,"please specify a port numbern");

         exit(1);

     }

     sockfd = socket(AF_INET, SOCK_STREAM, 0);

     if (sockfd < 0) {

        perror("ERROR create socket");

        exit(1);

     }

     setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);    //allow reuse of port

     //bind to a local address

     bzero((char *) &serv_addr, sizeof(serv_addr));

     portno = atoi(argv[1]);

     serv_addr.sin_family = AF_INET;

     serv_addr.sin_addr.s_addr = INADDR_ANY;

     serv_addr.sin_port = htons(portno);

     if (bind(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {

        perror("ERROR on bind");

        exit(1);

     }

     //listen marks the socket as passive socket listening to incoming connections, 

     //it allows max 5 backlog connections: backlog connections are pending in queue

     //if pending connections are more than 5, later request may be ignored

     listen(sockfd,5);

     //accept incoming connections

     clilen = sizeof(cli_addr);

     newsockfd = accept(sockfd, (struct sockaddr *) &cli_addr, &clilen);

     //nonblock(newsockfd);        //if we want to set the socket as nonblock, we can uncomment this

     if (newsockfd < 0) {

        perror("ERROR on accept");

        exit(1);

     }

     printf("connection acceptedn");

     for (i = 0; i < 5; ++i) {

         bzero(buffer,BUFLEN);

         n = read(newsockfd,buffer,BUFLEN);

         if (n < 0) {

            perror("ERROR read from socket");

         }

         printf("received: %s",buffer); 

         n = write(newsockfd, buffer, n);

         printf("sent: %s", buffer);

         if (n < 0) {

            perror("ERROR write to socket");

         }

     }

     close(newsockfd);

     close(sockfd);

     return 0; 

}

The code first create a socket of SOCK_STREAM type in AF_INET domain. SOCK_STREAM corresponds to TCP, and AF_INET refers to IPv4.

It calls setsockopt to make the socket resuable. For example, your server open up a socket at a port number, and then it exits. Without this line of code, sometimes you may not bind the socket to same port.

The program then binds the socket to a local IP address with a specific port number. listen(sockfd, 5) will set the socket as passive socket listening to incoming connections. 5 means the maximum number of backlog is 5. Backlog connections are pending connect request in queue. If pending request are more than 5, later request may be ignored.

The code then calls accept method to accept incoming connections. This is a block call, and it returns when a new connect request is received, or an error occurred.

In the for loop, the tcp server listens to incoming packets and echo the message back. Note that the nonblock(newsockfd) has been commented out. You can uncomment it to enable nonblock read and write, but you may want to modify the read and write part to make it work properly.

2. The TCP Client Code
The TCP client code is as below (save it as tcpclient.c),

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <fcntl.h>

#include <sys/types.h>

#include <sys/socket.h>

#include <netinet/in.h>

#include <netdb.h> 

 

void nonblock(int sockfd)

{

    int opts;

    opts = fcntl(sockfd, F_GETFL);

    if(opts < 0)

    {

        fprintf(stderr, "fcntl(F_GETFL) failedn");

    }

    opts = (opts | O_NONBLOCK);

    if(fcntl(sockfd, F_SETFL, opts) < 0) 

    {

        fprintf(stderr, "fcntl(F_SETFL) failedn");

    }

}

 

int main(int argc, char *argv[])

{

    int BUFLEN = 2000;

    int sockfd, portno, n;

    struct sockaddr_in serv_addr;

    struct hostent *server;

    int i;

 

    char buffer[BUFLEN];

    if (argc < 3) {

       fprintf(stderr,"usage: %s hostname_or_ip portn", argv[0]);

       exit(0);

    }

    portno = atoi(argv[2]);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);

    if (sockfd < 0) {

        perror("ERROR creating socket");

        exit(1);

    }

    //get the address info by either host name or IP address

    server = gethostbyname(argv[1]);

    if (server == NULL) {

        fprintf(stderr,"ERROR, no such hostn");

        exit(1);

    }

    bzero((char *) &serv_addr, sizeof(serv_addr));

    serv_addr.sin_family = AF_INET;

    bcopy((char *)server->h_addr, (char *)&serv_addr.sin_addr.s_addr, server->h_length);

    serv_addr.sin_port = htons(portno);

    if (connect(sockfd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0)  {

        perror("ERROR connecting");

        exit(1);

    }

    printf("connection establishedn");

    //nonblock(sockfd);    //uncomment this line if we want to make the socket non-block

    for (i = 0; i < 5; ++i) {

        printf("Please enter the message: ");

        bzero(buffer,BUFLEN);

        fgets(buffer,BUFLEN,stdin);

        n = write(sockfd,buffer,strlen(buffer));

        printf("sent: %s", buffer);

        if (n < 0) {

             perror("ERROR writing to socket");

        }

        bzero(buffer,BUFLEN);

        n = read(sockfd,buffer,BUFLEN);

        if (n < 0) {

             perror("ERROR reading from socket");

        }

        printf("received: %s",buffer);

    }

    close(sockfd);

    return 0;

}

The code creates a TCP socket. It then tries to connect to a server address with specified IP and port number. The connect is a block call, it returns when the connection is established, which means it receives the ACK from tcp server, which is third packet of TCP 3-way handshake (SYN, ACK+SYN, ACK).

In the loop the tcp client sends and receives messages. Also the nonblock code has been disabled.

3. Compile and Run
To compile the code, use the commands below,

gcc -o client tcpclient.c
gcc -o server tcpserver.c

A sample run could be,

./server 12345 (at one terminal)
./client localhost 12345 (at another terminal)

How to Calculate IP/TCP/UDP Checksum–Part 3 Usage Example and Validation

This is a follow up of the previous post IP/TCP/UDP Checksum Calculation part 1 theory, and part 2 implementation.

This post gives an example using libnetfiler_queue library and the checksum code we implemented in part 2 to illustrate how to use the checksum code and verify that our code is actually computing correctly.

Note this post is not a post for libnetfilter_queue library. So the usage of this library is not covered here. I’ve written separate posts for libnetfilter_queue.

The code (let’s call it test.c) is as below,

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

#include "checksum.h"

static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg, struct nfq_data *nfa, void *data) {
    printf("entering callbackn");
    struct nfqnl_msg_packet_hdr *ph;
    int payload_len;
    unsigned char *payloadData;
    struct iphdr *ipHeader;
    struct tcphdr *tcpHeader;
    struct udphdr *udpHeader;
    unsigned short ipCheck, udpCheck, tcpCheck;
    ph = nfq_get_msg_packet_hdr(nfa);
    u_int32_t id = ntohl(ph->packet_id);
    payload_len = nfq_get_payload(nfa, &payloadData);
    printf("ip datagram len = %dn", payload_len);
    ipHeader = (struct iphdr *)payloadData;
    ipCheck = ipHeader->check;
    printf("ip checksum: %04xn", ipHeader->check);
    //calculate ip checksum, and see if the calculation matches
    compute_ip_checksum(ipHeader);
    printf("calculated ip checksum: %04xn", ipHeader->check);
    if (ipCheck != ipHeader->check) {
    printf("-------------ip checksum calculation is wrong-----------n");
    }
    if (ipHeader->protocol == IPPROTO_TCP) {
        tcpHeader = (struct tcphdr *)(payloadData + (ipHeader->ihl<<2));
    tcpCheck = tcpHeader->check;
    printf("tcp checksum: %04xn", tcpHeader->check);
    //calculate tcp checksum, and see if the calculation matches with original tcp checksum
        compute_tcp_checksum(ipHeader, (unsigned short*)tcpHeader);
    printf("calculated tcp checksum: %04xn", tcpHeader->check);
    if (tcpHeader->check != tcpCheck) {
        printf("-----------calculation is wrong-------n");
    }
    } else if (ipHeader->protocol == IPPROTO_UDP) {
    udpHeader = (struct udphdr *)(payloadData + (ipHeader->ihl<<2));
    udpCheck = udpHeader->check;
    printf("udp checksum: %04xn", udpHeader->check);
    //calculate udp checksum, and see if the calculation matches with original udp checksum
    compute_udp_checksum(ipHeader, (unsigned short*)udpHeader);
    printf("calculated udp checksum: %04xn", udpHeader->check);
        if (udpHeader->check != udpCheck) {
        printf("-----------calculation is wrong-------n");
    }
    }
    //issue a verdict on a packet
    //qh: netfilter queue handle; id: ID assigned to packet by netfilter; verdict: verdict to return to netfilter, data_len: number
    //of bytes of data pointed by buf, buf: the buffer that contains the packet data (payload)
// return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
    return nfq_set_verdict(qh, id, NF_ACCEPT, payload_len, payloadData);
}

int main(int argc, char **argv) {
    struct nfq_handle *h;
    struct nfq_q_handle *qh;
    //struct nfnl_handle *nh;
    int fd;
    int rv;
    char buf[4096] __attribute__((aligned));

    /*netfilter_queue library set up step 1: call nfq_open to open a NFQUEUE handler*/
    printf("opening library handlen");
    h = nfq_open();
    if (!h) {
        fprintf(stderr, "error during nfq_open()n");
        exit(1);
    }
    /*set up step 2: tell the kernel that userspace queuing is handled by NFQUEUE for the selected protocol. This is 
 made by calling nfq_unbind_pf and nfq_bind_pf with protocol info. The idea behind this is to enable simulataneously loaded modules to be used for queuing*/
    printf("unbinding existing nf_queue handler for AF_INET (if any)n");
    if (nfq_unbind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_unbind_pf()n");
        exit(1);
    }
    printf("binding nfnetlink_queue as nf_queue handler for AF_INETn");
    if (nfq_bind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_bind_pf()n");
        exit(1);
    }
    /*after the above two steps, we can set up and use a queue*/
    /*bind the program to a specific queue, cb is callback function to call for each queued packet
 *h: netfilter queue handle, 0: the number of the queue to bind to, cb: callback function for each queued packet, data: custom data to pass to callback function*/
    printf("binding this socket to queue '0'n");
    qh = nfq_create_queue(h, 0, &cb, NULL);
    if (!qh) {
        fprintf(stderr, "error during nfq_create_queue()n");
        exit(1);
    }
    /*set mode: NFQNL_COPY_PACKET: copy entire packet, it defines the part of data that nfqueue copies to userspace
 0xffff: size of the packet that we want to get*/
    printf("setting copy_packet moden");
    if (nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff) < 0) {
        fprintf(stderr, "cannot set packet_copy moden");
        exit(1);
    }
    /*handle the incoming packets*/
    fd = nfq_fd(h);
    while ((rv = recv(fd, buf, sizeof(buf), 0)) && rv >= 0) {
    printf("pkt receivedn");
    nfq_handle_packet(h, buf, rv);
    }
    printf("unbinding from queue 0n");
    nfq_destroy_queue(qh);
    /*the program has finished with libnetfilter_queue, it can call nfq_close to free all associated resources*/
    printf("closing library handlen");
    nfq_close(h);
    exit(0);
}

The code basically use libnetfilter_queue to get the IP datagram of TCP/UDP packets, and retrieve their IP, TCP/UDP checksum. It then use the checksum code we implemented in part 2 to recalculate the checksum. If the calculated the checksum matches with the checksum we retrieved from the original packet, then we’re sure the computation is done correctly.

Compile and Run the Code

To compile the code, you can run the command below,

gcc -Wall -o testsum checksum.c test.c -lnfnetlink -lnetfilter_queue

Note that you’ll need libnfnetlink and libnetfilter_queue to compile the code. Please google for more info.

To run the code, save the following command to run.sh, and then use sudo ./run.sh,

#!/bin/sh
sudo iptables -t mangle –flush
sudo iptables -t mangle -A OUTPUT -p tcp  -j NFQUEUE –queue-num 0
sudo iptables -t mangle -A OUTPUT -p udp -j NFQUEUE –queue-num 0
sudo ./testsum

Note that after you finish running the code and killed the program, you won’t be able to access Internet, you’ll need to run the following command to bring your Internet back,

sudo iptables -t mangle –flush

Note that the above procedure basically set up Linux iptable rules to queue TCP and UDP packets, so the test.c program code can receive them at userspace.

Download

You can download the source code and scripts from here.

How to Calculate IP/TCP/UDP Checksum–Part 2 Implementation

This is a follow up of the previous post, how to calculate IP/TCP/UDP checksum part 1 — theory.

IP Header Checksum Calculation Implementation
To calculate the IP checksum, one can use the code below,

/* set ip checksum of a given ip header*/

void compute_ip_checksum(struct iphdr* iphdrp){

  iphdrp->check = 0;

  iphdrp->check = compute_checksum((unsigned short*)iphdrp, iphdrp->ihl<<2);

}

/* Compute checksum for count bytes starting at addr, using one's complement of one's complement sum*/

static unsigned short compute_checksum(unsigned short *addr, unsigned int count) {

  register unsigned long sum = 0;

  while (count > 1) {

    sum += * addr++;

    count -= 2;

  }

  //if any bytes left, pad the bytes and add

  if(count > 0) {

    sum += ((*addr)&htons(0xFF00));

  }

  //Fold sum to 16 bits: add carrier to result

  while (sum>>16) {

      sum = (sum & 0xffff) + (sum >> 16);

  }

  //one's complement

  sum = ~sum;

  return ((unsigned short)sum);

}

The method compute_ip_checksum initialize the checksum field of IP header to zeros. Then calls a method compute_checksum. The mothod compute_checksum accepts the computation data and computation length as two input parameters. It sum up all 16-bit words, if there’s odd number of bytes, it adds a padding byte. After summing up all words, it folds the sum to 16 bits by adding the carrier to the results. At last, it takes the one’s complement of sum and cast it to 16-bit unsigned short type. You can refer to part 1 for more detailed description of the algorithm.

Note that the data structure iphdr and tcphdr and udphdr are Linux data structures representing IP header, TCP header and UDP header respectively. You may want to Google for more information in order to understand the code.

TCP Header Checksum Calculation Implementation
To calculate the TCP checksum, you can use the code below,

/* set tcp checksum: given IP header and tcp segment */

void compute_tcp_checksum(struct iphdr *pIph, unsigned short *ipPayload) {

    register unsigned long sum = 0;

    unsigned short tcpLen = ntohs(pIph->tot_len) - (pIph->ihl<<2);

    struct tcphdr *tcphdrp = (struct tcphdr*)(ipPayload);

    //add the pseudo header 

    //the source ip

    sum += (pIph->saddr>>16)&0xFFFF;

    sum += (pIph->saddr)&0xFFFF;

    //the dest ip

    sum += (pIph->daddr>>16)&0xFFFF;

    sum += (pIph->daddr)&0xFFFF;

    //protocol and reserved: 6

    sum += htons(IPPROTO_TCP);

    //the length

    sum += htons(tcpLen);

 

    //add the IP payload

    //initialize checksum to 0

    tcphdrp->check = 0;

    while (tcpLen > 1) {

        sum += * ipPayload++;

        tcpLen -= 2;

    }

    //if any bytes left, pad the bytes and add

    if(tcpLen > 0) {

        //printf("+++++++++++padding, %dn", tcpLen);

        sum += ((*ipPayload)&htons(0xFF00));

    }

      //Fold 32-bit sum to 16 bits: add carrier to result

      while (sum>>16) {

          sum = (sum & 0xffff) + (sum >> 16);

      }

      sum = ~sum;

    //set computation result

    tcphdrp->check = (unsigned short)sum;

}

The mothod sums the pseudo TCP header first, then the IP payload, which is the TCP segment. It also pads the last byte if there’re odd number of bytes. For detailed description of the algorithm, please refer to comments in the code and part 1.

UDP Header Checksum Calculation Implementation
To calculate the UDP checksum, one can follow the code below,

/* set tcp checksum: given IP header and UDP datagram */

void compute_udp_checksum(struct iphdr *pIph, unsigned short *ipPayload) {

    register unsigned long sum = 0;

    struct udphdr *udphdrp = (struct udphdr*)(ipPayload);

    unsigned short udpLen = htons(udphdrp->len);

    //printf("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~udp len=%dn", udpLen);

    //add the pseudo header 

    //printf("add pseudo headern");

    //the source ip

    sum += (pIph->saddr>>16)&0xFFFF;

    sum += (pIph->saddr)&0xFFFF;

    //the dest ip

    sum += (pIph->daddr>>16)&0xFFFF;

    sum += (pIph->daddr)&0xFFFF;

    //protocol and reserved: 17

    sum += htons(IPPROTO_UDP);

    //the length

    sum += udphdrp->len;

 

    //add the IP payload

    //printf("add ip payloadn");

    //initialize checksum to 0

    udphdrp->check = 0;

    while (udpLen > 1) {

        sum += * ipPayload++;

        udpLen -= 2;

    }

    //if any bytes left, pad the bytes and add

    if(udpLen > 0) {

        //printf("+++++++++++++++padding: %dn", udpLen);

        sum += ((*ipPayload)&htons(0xFF00));

    }

      //Fold sum to 16 bits: add carrier to result

    //printf("add carriern");

      while (sum>>16) {

          sum = (sum & 0xffff) + (sum >> 16);

      }

    //printf("one's complementn");

      sum = ~sum;

    //set computation result

    udphdrp->check = ((unsigned short)sum == 0x0000)?0xFFFF:(unsigned short)sum;

The code is similar to TCP checksum computation. Except that when the checksum is compted as all 0s, we set them to 1s. As 0x0000 is already reserved for indicating that the checksum is not computed. Also please refer to part 1 for detailed description of the algorithm.

All the code can be downloaded at part 3.

How to Calculate IP/TCP/UDP Checksum–Part 1 Theory

This is the first part of How to Calculate IP/TCP/UDP Checksum. The 2 parts followed are Part 2 Implementation and Part 3 Usage Example and Validation.

This part focuses on the algorithms. We’ll go through IP/TCP/UDP one by one.

IP Header Checksum Calculation

IP checksum is a 16-bit field in IP header used for error detection for IP header. It equals to the one’s complement of the one’s complement sum of all 16 bit words in the IP header. The checksum field is initialized to all zeros at computation.

One’s complement sum is calculated by summing all numbers and adding the carry (or carries) to the result. And one’s complement is defined by inverting all 0s and 1s in the number’s bit representation.

For example, if an IP header is 0x4500003044224000800600008c7c19acae241e2b.
We start by calculating the one’s complement sum. First, divide the header hex into 16 bits each and sum them up,

4500 + 0030 + 4422 + 4000 + 8006 + 0000 + 8c7c + 19ac + ae24 + 1e2b = 2BBCF

Next fold the result into 16 bits by adding the carry to the result,

2 +  BBCF  = BBD1

The final step is to compute the one’s complement of the one’s complement’s sum,

BBD1 = 1011101111010001

IP checksum = one’s complement(1011101111010001) = 0100010000101110 = 442E

Note that IP header needs to be parsed at each hop, because IP addresses are needed to route the packet. To detect the errors at IP header, the checksum is validated at every hop.

The validation is done using the same algorithm. But this time the initialized checksum value is 442E.

2BBCF + 442E = 2FFFD, then 2 + FFFD = FFFF

Take the one’s complement of FFFF = 0.

At validation, the checksum computation should evaluate to 0 if the IP header is correct.

TCP Checksum Calculation

TCP Checksum is a 16-bit field in TCP header used for error detection. It is computed over the TCP segment (might plus some padding) and a 12-byte TCP pseudo header created on the fly. Same as IP checksum, TCP checksum is also one’s complement of the one’s complement sum of all 16 bit words in the computation data.

Below is a figure that illustrates the data used to calculate TCP checksum,

Figure 1. TCP Checksum Computation Data

As shown in the figure, the pseudo header consists of 5 fields,

  • source address: 32 bits/4 bytes, taken from IP header
  • destination address: 32bits/4 bytes, taken from IP header
  • resevered: 8 bits/1 byte, all zeros
  • protocol: 8 bits/1 byte, taken from IP header. In case of TCP, this should always be 6, which is the assigned protocol number for TCP.
  • TCP Length: The length of the TCP segment, including TCP header and TCP data. Note that this field is not available in TCP header, therefore is computed on the fly.

Note that TCP pseudo header does not really exist, and it’s not transmitted over the network. It’s constructed on the fly to compute the checksum.

If a TCP segment contains an odd number of octets to be checksummed, the last octect is padded on the right with zeros to form a 16-bit word. But the padding is not part of the TCP segment and therefore not transmitted.

Also note the checksum field of the TCP header needs to be initialized to zeros before checksum calculation. And it’s set to the computed value after the computation.

When TCP packet is received at the destination, the receiving TCP code also performs the TCP calculation and see if there’s a mismatch. If there is, it means there’s error in the packet and it will be discarded. The same validation logic used for IP header checksum validation can be used.

UDP Checksum Calcuation

UDP Checksum calculation is similar to TCP Checksum computation. It’s also a 16-bit field of one’s complement of one’s complement sum of a pseudo UDP header + UDP datagram.
The Pseudo UDP header also consists of 5 fields,

  • source address: 32 bits/4 bytes, taken from IP header
  • destination address: 32 bits/4 bytes, taken from IP header
  • reserved: 8 bits/1 byte, set to all 0s.
  • protocol: 8 bits/1 byte, taken from IP header
  • length: Because UDP header has a length field that indicates the length of the entire datagram, including UDP header and data, the value from UDP header is used. Note that this is different from TCP pseudo header, which is computed on the fly. But they both indicates the header+payload length.

Note that UDP checksum is optional. If it’s not computed, it’s set to all 0s. This could cause issue as sometimes the checksum can be computed as all 0s. To avoid confusion, if the checksum is computed as all 0s, it’s set to all 1s (which is equivalent in one’s complement arithmetic).

References

1. IP/TCP/UDP Headers: http://www.imengineering.com/Training/CISCO/TCPIPUDPheaders.pdf
2. TCP Checksum Calculation:
http://www.tcpipguide.com/free/t_TCPChecksumCalculationandtheTCPPseudoHeader-2.htm
3. IP Checksum:
http://www.netfor2.com/checksum.html
4. One’s complement: http://en.wikipedia.org/wiki/Ones’_complement
5. UDP datagram RFC: http://www.ietf.org/rfc/rfc768.txt

TCP Tahoe, Reno, NewReno, and SACK–a Brief Comparison

1. TCP and its Algorithms (Slow-Start, Congestion Avoidance, Fast Retransmit and Fast Recovery)

TCP is a complex transport layer protocol containing four interwined algorithms: Slow-start, congestion avoidance, fast retransmit and fast recovery.

In Slow-start phase, TCP increases the congestion window each time an acknowledgement is received, by number of packets acknowledged. This strategy effectively doubles the TCP congestion window for every round trip time (RTT).

When the congestion window exceeds a threshold named ssthresh, it enters congestion avoidance phase. TCP congestion window is increased by 1 for each RTT until a loss event occurs.

TCP maintains a timer after sending out a packet, if no acknowledgement is received after the timer is expired, the packet is considered as lost. However, this might take too long for TCP to realize a packet is lost and take action. A fast retransmit algorithm is proposed to make use of duplicate ACKs to detect packet loss. In fast retransmit, when an acknowledgement packet with the same sequence number is received a specified number of times (normally set to 3), TCP sender is reasonably confident that the TCP packet is lost and will retransmit the packet.

Fast recovery is closely related to fast retransmit. When a loss event is detected by TCP sender, a fast retransmit is performed. If fast recovery is used, TCP sender will not enter slow-start phase, instead it will reduce the congestion window by half, and “inflates” the congestion window by calculating usable window using min(awin, cwnd+ndup), where awin is the receiver’s window, cwnd is the congestion window, and ndup is number of dup ACK received. When an acknowledgement of new data (called recovery ACK) is received, it returns to congestion avoidance phase.

2. Tahoe, Reno, NewReno, and SACK

TCP Tahoe is the simplest one out of the four variants. It doesn’t have fast recovery. At congestion avoidance phase, it treats the triple duplicate ACKs same as timeout. When timeout or triple duplicate ACKs is received, it will perform fast retransmit, reduce congestion window to 1, and enters slow-start phase.

TCP Reno differs from TCP Tahoe at congestion avoidance. When triple duplicate ACKs are received, it will halve the congestion window, perform a fast retransmit, and enters fast recovery. If a timeout event occurs, it will enter slow-start, same as TCP Tahoe. TCP Reno is effective to recover from a single packet loss, but it still suffers from performance problems when multiple packets are dropped from a window of data.

TCP NewReno tries to improve the TCP Reno’s performance when a burst of packets are lost by modifying the fast recovery algorithm. In TCP NewReno, a new data ACK is not enough to take TCP out of fast recovery to congestion avoidance. Instead it requires all the packets outstanding at the start of the fast recovery period are acknowledged.

TCP NewReno works by assuming that the packet that immediately follows the partial ACK received at fast recovery is lost, and retransmit the packet. However, this might not be true and it affects the performance of TCP. SACK TCP adds a number of SACK blocks in TCP packet, where each SACK block acknowledges a non-contiguous set of data has been received. The main difference between SACK TCP and Reno TCP implementations is in the behavior when multiple packets are dropped from one window of data. SACK sender maintains the information which packets is missed at receiver and only retransmits these packets. When all the outstanding packets at the start of fast recovery are acknowledged, SACK exits fast recovery and enters congestion avoidance.

Note that the four variants of TCP only differs when there’s a packet loss. If all packets reach the destination successfully, the four variants behave the same.

References

1. Allman, M., Paxson, V. and Blanton, E. 2009. TCP Congestion Control. RFC 5681.

2. Fall, K. and Floyd, S. 1996. Simulation-based Comparisons of Tahoe, Reno and SACK TCP. ACM SIGCOMM Computer Communication Review, 26(3):5-21.