Linux Signal–Part 2. Signal Handler

The previous post covered the basics of Linux signals. This post shows how to install a Linux signal handler, with examples.

Linux provides two ways to install a signal handler: signal and sigaction. sigaction is preferred because the behavior of signal varies across UNIX and UNIX-like systems, which makes it less portable. This post covers sigaction only.

Before turning to the sigaction system call, we go through several functions that we will need later. You can also skip to the example near the end of the post to read the source code first and refer back to the explanation afterwards.

sigprocmask System Call

A signal can be blocked. In this case, it is not delivered to the process until it is unblocked. After the signal is generated and before it is delivered, the signal is in the pending state. The sigprocmask system call is used to query and change the set of blocked signals. It has the following prototype,

#include <signal.h>

int sigprocmask(int how, const sigset_t *set, sigset_t *oldset);

how: specifies how the set of blocked signals is changed. It can be one of three values:

  • SIG_BLOCK: the new set of blocked signals is the union of the current set and the set argument
  • SIG_UNBLOCK: the signals in set are removed from the set of blocked signals
  • SIG_SETMASK: the set of blocked signals is replaced by set

set: if not NULL, its value is used as described for how. If NULL, the set of blocked signals is unchanged and how has no meaning.

oldset: if not null, the previous value of the signal mask is returned through oldset.

Signal Sets System Calls

sigprocmask takes and returns signal sets of type sigset_t. Several library functions are defined to manipulate a signal set.

#include <signal.h>

int sigemptyset(sigset_t *set);

int sigfillset(sigset_t *set);

int sigaddset(sigset_t *set, int signum);

int sigdelset(sigset_t *set, int signum);

int sigismember(const sigset_t *set, int signum);

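The names say what the functions do: sigemptyset and sigfillset initialize a set to contain no signals or all signals, sigaddset and sigdelset add or remove one signal, and sigismember tests whether a signal is in the set. A sigset_t must be initialized with sigemptyset or sigfillset before it is passed to any of the other functions.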
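As a minimal sketch of how the signal set functions combine with sigprocmask, the two helpers below block or unblock a single signal and query whether a signal is currently blocked. The names change_one and is_blocked are hypothetical and chosen only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: block (how = SIG_BLOCK) or unblock (how = SIG_UNBLOCK)
 * one signal. Returns 0 on success, -1 on error. */
int change_one(int how, int signum) {
    sigset_t set;

    sigemptyset(&set);                  /* initialize the set before use */
    if (sigaddset(&set, signum) == -1)
        return -1;                      /* invalid signal number */
    return sigprocmask(how, &set, NULL);
}

/* Hypothetical helper: returns 1 if signum is blocked, 0 if not, -1 on error. */
int is_blocked(int signum) {
    sigset_t current;

    /* With set == NULL the mask is only queried and how is ignored. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == -1)
        return -1;
    return sigismember(&current, signum);
}
```

Calling change_one(SIG_BLOCK, SIGUSR1) and then is_blocked(SIGUSR1) should report that the signal is blocked until it is unblocked again.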

pause and sigsuspend System Calls

Two system calls are defined to suspend the execution of a process to wait for a signal to be caught, pause() and sigsuspend(). Their prototypes are as below,

#include <unistd.h>

int pause(void);

 

#include <signal.h>

int sigsuspend(const sigset_t *mask);

pause: Suspends execution until any signal is caught. It only returns when a signal was caught and the signal catching function returned. In this case, it returns -1 and errno is set to EINTR.

sigsuspend: Temporarily changes the signal mask and suspends execution until one of the unmasked signals is caught. The calling process’s signal mask is temporarily replaced by the value given in the mask parameter until sigsuspend returns or the process is terminated.

Note that sigsuspend always returns -1, normally with the error EINTR.
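A common use of sigsuspend is to wait for a signal without a race between checking a flag and going to sleep. The sketch below assumes a single-threaded process and uses raise() as a stand-in for an external sender. The names wait_for_usr1, on_usr1 and got_usr1 are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_usr1 = 0;

static void on_usr1(int signum) {
    (void)signum;
    got_usr1 = 1;                       /* only set a flag in the handler */
}

/* Hypothetical helper: returns 1 once SIGUSR1 has been handled, -1 on error. */
int wait_for_usr1(void) {
    struct sigaction sa;
    sigset_t block, old, wait_mask;

    got_usr1 = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Block SIGUSR1 so it cannot arrive between the flag check and the wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    raise(SIGUSR1);                     /* stays pending because it is blocked */

    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);     /* SIGUSR1 is unblocked only while waiting */
    while (!got_usr1)
        sigsuspend(&wait_mask);         /* returns -1 with errno set to EINTR */

    /* sigsuspend restored the blocking mask; put the original mask back. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) == -1)
        return -1;
    return 1;
}
```

With pause() in place of sigsuspend, the signal would have to be unblocked before the wait starts, and a signal arriving in that window would be handled before pause() is called, leaving the process asleep.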

Now we explain the sigaction system call.

sigaction System Call

sigaction is a system call to query and change the actions taken when a process receives a particular signal. The sigaction system call has the following prototype,

#include <signal.h>

int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);

signum: specify the signal
act: if not NULL, the action specified by act is installed.
oldact: if not NULL, the previous action is returned.

And the data structure sigaction is defined as below,

struct sigaction {

      void     (*sa_handler)(int);

      void     (*sa_sigaction)(int, siginfo_t *, void *);

      sigset_t   sa_mask;

      int        sa_flags;

      void     (*sa_restorer)(void);

};

The fields have the following meanings,

sa_handler: specifies the action taken for the signal: SIG_DFL (default action), SIG_IGN (ignore the signal) or a pointer to a user-defined signal handler. The handler should accept the signal number as its only argument.

sa_sigaction: if the SA_SIGINFO flag is set in sa_flags, the handler pointed to by sa_sigaction is used instead of sa_handler.

This handler should accept three arguments: the signal number, a pointer to a siginfo_t structure (which contains information about the signal) and a void pointer that can be cast to a ucontext_t pointer (which describes the context of the receiving process that was interrupted by the signal).

The siginfo_t data structure has the following elements,

siginfo_t {

   int      si_signo;    /* Signal number */

   int      si_errno;    /* An errno value */

   int      si_code;     /* Signal code */

   int      si_trapno;   /* Trap number that caused

                            hardware-generated signal

                            (unused on most architectures) */

   pid_t    si_pid;      /* Sending process ID */

   uid_t    si_uid;      /* Real user ID of sending process */

   int      si_status;   /* Exit value or signal */

   clock_t  si_utime;    /* User time consumed */

   clock_t  si_stime;    /* System time consumed */

   sigval_t si_value;    /* Signal value */

   int      si_int;      /* POSIX.1b signal */

   void    *si_ptr;      /* POSIX.1b signal */

   int      si_overrun;  /* Timer overrun count; POSIX.1b timers */

   int      si_timerid;  /* Timer ID; POSIX.1b timers */

   void    *si_addr;     /* Memory location which caused fault */

   int      si_band;     /* Band event */

   int      si_fd;       /* File descriptor */

}

sa_mask: specifies the signals that should be blocked while the handler is executing. In other words, the signals in sa_mask are added to the signal mask of the process before the signal handler is called, and when the handler returns, the signal mask is reset to its previous value. In this way, we can block certain signals while the signal handler is running. In addition, the signal that caused the handler to execute is also blocked, unless the SA_NODEFER flag is set.

sa_flags: specifies various options for handling the signal. Multiple flags can be combined with the bitwise OR operation. The details of the flags can be found in the Linux sigaction man page.

sa_restorer: This is obsolete and should not be used.
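The effect of sa_mask can be observed from inside a handler. The sketch below installs a handler for SIGUSR1 with SIGUSR2 in sa_mask and records which of the two signals are blocked while the handler runs. It assumes a single-threaded process in which SIGUSR1 is not already blocked. The names run_sa_mask_demo and on_usr1 are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_blocked = -1;
static volatile sig_atomic_t usr2_blocked = -1;

static void on_usr1(int signum) {
    sigset_t current;

    (void)signum;
    /* sigprocmask and sigismember are on the POSIX async-signal-safe list. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == 0) {
        usr1_blocked = sigismember(&current, SIGUSR1);
        usr2_blocked = sigismember(&current, SIGUSR2);
    }
}

/* Hypothetical helper: returns -1 on error, otherwise a bit mask in which
 * bit 0 means SIGUSR1 and bit 1 means SIGUSR2 was blocked in the handler. */
int run_sa_mask_demo(void) {
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGUSR2);    /* also block SIGUSR2 in the handler */
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* raise() returns only after the handler has returned. */
    if (raise(SIGUSR1) != 0 || usr1_blocked < 0 || usr2_blocked < 0)
        return -1;
    return (int)usr1_blocked | ((int)usr2_blocked << 1);
}
```

SIGUSR1 shows up as blocked because SA_NODEFER is not set, and SIGUSR2 because it was added to sa_mask.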

An Example

Below is an example of installing a naive signal handler for SIGINT.

#include <stdio.h>

#include <signal.h>

#include <unistd.h>

 

void check_block(int signum) {

    sigset_t set;

    if (sigprocmask(SIG_BLOCK, NULL, &set) == -1) {

        perror("error sigprocmask");

    } else {

        if (sigismember(&set, signum)) {

            printf("%d is blocked\n", signum);

        } else {

            printf("%d is not blocked\n", signum);

        }

    }

}

 

void my_handler(int signum) {

    check_block(signum);

    printf("inside signal handler: %d\n", signum);

}

 

void my_handler2(int signum, siginfo_t *info, void *context) {

    check_block(signum);

    printf("inside signal handler: %d\n", info->si_signo);

}

 

int main() {

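    struct sigaction action = {0};   // zero every field before use

    sigemptyset(&action.sa_mask);    // block no extra signals in the handler

//    action.sa_handler = my_handler;

//    action.sa_flags = SA_RESTART;

    action.sa_sigaction = my_handler2;

    action.sa_flags = SA_RESTART | SA_SIGINFO;

    if (sigaction(SIGINT, &action, NULL) == -1) {

        perror("error sigaction");

        return 1;

    }

    printf("press Ctrl + C\n");

    pause();

    printf("after signal handler:\n");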

    check_block(SIGINT);

    return 0;

}

The code first installs the signal handler for SIGINT with the sigaction system call. Inside the signal handler, it checks whether SIGINT is blocked. As described for sa_mask above, SIGINT should be blocked inside the signal handler.

After installing the signal handler, the program calls pause() to wait for a signal to occur. After execution returns from the signal handler, it checks again whether SIGINT is blocked. This time, SIGINT should not be blocked.

Compile the code with command below,

gcc -o sigaction sigaction.c

And a sample execution is as below,


Figure 1. Sample Execution of Sigaction Program

SA_RESTART and Interrupted System Call

You may have noticed that we set SA_RESTART in sa_flags in the example above. This flag affects how interrupted system calls behave.

When a signal is caught while a system call is blocked (e.g. waiting for IO), one of two things can happen after the signal handler returns:

  • the call is restarted automatically after the signal handler returns
  • the call fails with errno set to EINTR

With the SA_RESTART flag set, some system calls are restarted automatically after the signal handler returns, including open, wait, and waitpid. These system calls fail with EINTR if SA_RESTART is not set.

Some system calls fail regardless of whether SA_RESTART is set. A special case is the sleep function, which is never restarted but returns the number of seconds remaining instead of a failure.

The detailed list of system calls that are restarted can be found with the “man 7 signal” command.
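When SA_RESTART is not set, or for calls that are never restarted, a common pattern is to retry the call by hand when it fails with EINTR. The wrapper below is a minimal sketch of that pattern for read. The name read_retry is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: like read(), but retries when the call was
 * interrupted by a signal handler before any data was transferred. */
ssize_t read_retry(int fd, void *buf, size_t count) {
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

Any other error, as well as a successful or partial read, is returned to the caller unchanged.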

The Async-signal-safe Functions

When a signal is caught by a signal handler, normal execution is temporarily interrupted by the handler. If the handler returns instead of terminating the process, execution continues where it was interrupted. Special attention needs to be paid to the functions called inside a signal handler, because we don’t want the handler code to interfere with the code it interrupted.

A bad example is a signal handler that changes a global or static variable that is also used by the interrupted code.

The POSIX standard defines a list of async-signal-safe functions that can be called inside a signal handler. The detailed list can be obtained with “man 7 signal” (newer systems list it under “man 7 signal-safety”).

Note that in the example above we called printf() inside the signal handler, which is actually not safe. If the signal is caught while the main function is inside printf, the results may be unexpected.
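A safer variant of the handler restricts itself to async-signal-safe operations: it sets a flag of type volatile sig_atomic_t and reports with write() instead of printf(). This is a sketch under those assumptions, and the names safe_handler, install_safe_handler and sigint_was_seen are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t sigint_seen = 0;

static void safe_handler(int signum) {
    static const char msg[] = "caught SIGINT\n";
    int saved_errno = errno;            /* write() may change errno */
    ssize_t ignored;

    (void)signum;
    sigint_seen = 1;
    /* write() is async-signal-safe; printf() is not. */
    ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    errno = saved_errno;                /* leave the interrupted code's errno intact */
}

/* Hypothetical helper: installs the handler, returns 0 on success, -1 on error. */
int install_safe_handler(void) {
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = safe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGINT, &sa, NULL);
}

/* Hypothetical helper: lets the main code poll the flag set by the handler. */
int sigint_was_seen(void) {
    return sigint_seen;
}
```

The main code can then call sigint_was_seen() at a convenient point and do the unsafe work, such as formatted output, outside the handler.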

References:

1. The Linux Signals Handling Model: http://www.linuxjournal.com/article/3985

2. Linux Signals for the Application Programmer: http://www.linuxjournal.com/article/6483

3. Linux sigaction man page.

4. Linux signal man page (man 7 signal).

5. Advanced Programming in the UNIX Environment

Linux Signal–Part 1. The Basics

Linux Signals Overview

Linux supports both POSIX reliable/standard signals and real-time signals. The first 31 signals are standard signals. Real-time signals range from 34 to 64. The Linux command “kill -l” lists all signal numbers and names.

This post discusses reliable signals only.

A signal can be synchronous or asynchronous to the process, depending on what caused it. Synchronous signals are also referred to as traps, because they cause a trap into a kernel trap handler. An example is the signal caused by an illegal instruction. Asynchronous signals are also referred to as interrupts. They are external to the current execution context and are often used to send asynchronous events to a process.

If a process is in the interruptible sleep state (for example, waiting for terminal IO), the kernel can deliver the signal to the process and wake it up to handle it. If a process is in uninterruptible sleep (for instance, waiting for disk IO), the kernel holds the signal until the process wakes up.

The Linux kernel maintains signal information for each process, including an array of signal handlers and related information. Once a signal is generated, the kernel sets the bit corresponding to that signal. Since it is a single bit, multiple occurrences and a single occurrence are equivalent.
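The single pending bit can be observed directly. The sketch below blocks SIGUSR1, raises it three times, unblocks it, and counts how often the handler runs. It assumes a single-threaded process on Linux, and the names count_coalesced_deliveries and count_usr1 are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t deliveries = 0;

static void count_usr1(int signum) {
    (void)signum;
    deliveries++;
}

/* Hypothetical helper: returns how many times the handler ran, -1 on error. */
int count_coalesced_deliveries(void) {
    struct sigaction sa;
    sigset_t block, old;

    deliveries = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = count_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    /* Three occurrences while blocked set the same pending bit. */
    raise(SIGUSR1);
    raise(SIGUSR1);
    raise(SIGUSR1);

    /* Unblocking delivers the pending signal before sigprocmask returns. */
    if (sigprocmask(SIG_UNBLOCK, &block, NULL) == -1)
        return -1;
    if (sigprocmask(SIG_SETMASK, &old, NULL) == -1)
        return -1;
    return deliveries;
}
```

For a standard signal such as SIGUSR1 the handler is expected to run once, not three times. Real-time signals, which this post does not cover, are queued instead.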

Signal names always start with SIG. They are defined as positive integer constants, called signal numbers, in the header file signal.h, which is normally found in the /usr/include/ directory on Linux. The actual signal numbers are usually defined in /usr/include/bits/signum.h.

Signals can happen at random times, and the process can tell the kernel to do one of three things when a signal occurs,

1. ignore the signal. All signals except two (SIGSTOP and SIGKILL) can be ignored. SIGKILL always terminates a process, while SIGSTOP always clears any pending/queued SIGCONT signals and stops the process (a stopped process may be resumed later, unlike a terminated process). In addition, ignoring some hardware-generated signals (e.g. SIGSEGV) may leave the process behavior undefined.

2. catch the signal. All signals except two (SIGSTOP and SIGKILL) can be caught. We tell the kernel to execute a handler function whenever the signal occurs.

3. apply default action. Every signal has a default action. There are four possible actions if the signal is not ignored or caught.

  • ignore: nothing happens
  • terminate: the process is terminated
  • terminate + coredump: create a core dump file and then terminate the process
  • stop: stop all threads in the process. The process goes to TASK_STOPPED state.
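The rule that SIGKILL and SIGSTOP can be neither ignored nor caught is enforced when the disposition is set. The sketch below tries to ignore a signal with sigaction and reports the error. The name try_ignore is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Hypothetical helper: asks the kernel to ignore signum.
 * Returns 0 on success, or the errno value on failure. */
int try_ignore(int signum) {
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signum, &sa, NULL) == -1)
        return errno;
    return 0;
}
```

For an ordinary signal such as SIGUSR1 the call succeeds, while for SIGKILL and SIGSTOP sigaction fails with EINVAL.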

Linux Signals One by One

SIGHUP(1): Hangup. The signal is sent to the controlling process when a disconnect is detected by its controlling terminal interface. By default, the process is terminated.

SIGINT(2): Interrupt. The signal is sent to all foreground processes when the interrupt key (often DELETE or Control-C) is pressed. By default, the foreground processes are terminated.

SIGQUIT(3): Quit. The signal is sent to all foreground processes when the quit key (often Control-\) is pressed. In addition to terminating the foreground processes, it creates a core dump file.

SIGILL(4): Illegal instruction. It is generated when a process has executed an illegal (malformed, privileged, or unknown) hardware instruction. By default, the process is terminated and a core dump is created.

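Below is a program that generates SIGILL by jumping to a buffer of invalid instruction bytes. On systems that map constant data as non-executable, the jump may raise SIGSEGV instead,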

typedef void(*FUNC)(void);

int main(void) {

   const static unsigned char insn[4] = {0xff, 0xff, 0xff, 0xff};

   FUNC function = (FUNC) insn;

   function();

   return 0;

}

SIGTRAP(5): Trace trap. It is used as a mechanism to notify a debugger when the process execution hits a breakpoint. By default, the process is killed and a core dump is created.

SIGABRT(6): Abort. It is generated by calling the abort function. The process terminates abnormally and a core dump file will be created by default.

SIGBUS(7): Bus error, BUS refers to the address bus in the context of a bus error. It indicates implementation defined hardware fault. It is usually caused by improper memory handling. By default, the process is terminated and a core dump is created.

SIGFPE(8): Floating point exception. It is sent to a process when it performs an illegal arithmetic operation. Despite the name, which is kept mainly for backward compatibility, it also covers integer arithmetic errors. Some common cases are dividing by 0 or floating point overflow. By default, the process is terminated and a core dump is created. The program below triggers it with an integer division by zero,

int main(void)

{

      /* "volatile" needed to eliminate compile-time optimizations */

      volatile int x = 42; 

      volatile int y = 0;

      x=x/y;

      return 0; /* Never reached */

}

SIGKILL(9): Kill. This signal always causes the process to terminate. It cannot be ignored or caught. It provides a sure way to kill a process for users with the proper privilege.

SIGUSR1(10): User defined signal 1. It is sent to a process to indicate user-defined conditions. By default, the process is terminated.

SIGSEGV(11): Segmentation fault. Invalid memory segment access. By default, the process is terminated and a core dump is created. Some common cases of SIGSEGV are buffer overflow, using uninitialized pointers, dereferencing NULL pointers, exceeding the allowable stack size, attempting to access the memory the program doesn’t own, etc.

#include <stdio.h>

#include <stdlib.h>

int main(void) {

   int* a;

   a[1] = 0;

   return 0;

}

SIGUSR2(12): User defined signal 2. It is sent to a process to indicate user-defined conditions. By default, the process is terminated.

SIGPIPE(13): Broken pipe. By default, the process is terminated. It is sent to a process when it tries writing to a pipe without a process connecting to the other end.
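Programs that prefer an error code over being terminated often ignore SIGPIPE. The write then fails with EPIPE instead. The sketch below shows this with a pipe whose read end has been closed. The name write_to_broken_pipe is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno of a write to a pipe that has no
 * reader, 0 if the write unexpectedly succeeds, or -1 if the setup fails. */
int write_to_broken_pipe(void) {
    struct sigaction sa;
    int fds[2];
    int err = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;            /* ignore SIGPIPE instead of dying */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPIPE, &sa, NULL) == -1)
        return -1;

    if (pipe(fds) == -1)
        return -1;
    close(fds[0]);                      /* no process has the read end open */

    if (write(fds[1], "x", 1) == -1)
        err = errno;
    close(fds[1]);
    return err;
}
```

With the default disposition the same write would terminate the process, so the error path in the caller would never run.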

SIGALRM(14): Alarm clock. By default, the process is terminated. It is sent to a process when a time limit has elapsed. Programs usually use SIGALRM to make a long running operation time out, or to perform an action at regular intervals.

SIGTERM(15): Termination signal. It is the default signal sent by kill and killall command. By default, the process is terminated.

SIGSTKFLT(16): Stack fault. By default, the process is terminated.

SIGCHLD/SIGCLD(17): Child status has changed (the child process has stopped or exited). By default, the signal is ignored by the process. A process can create a child process using fork. The signal is sent to the parent process when the child process terminates.

SIGCONT(18): Continue executing, if stopped. It resumes the process from TASK_STOPPED state and also clears any pending/queued stop signals. This happens regardless of whether the process blocks, catches, or ignores the SIGCONT signal. By default, a stopped process is continued; if the process is not stopped, the signal is ignored.

SIGSTOP(19): Stop executing. The signal cannot be caught or ignored. It clears any pending/queued SIGCONT signals and stops the process.

SIGTSTP(20): Terminal stop signal. It clears any pending/queued SIGCONT signals regardless of whether the signal is blocked, ignored or caught, even though it may or may not stop the process later on. By default, it stops the process. It is the signal sent to the process by its controlling terminal when the user presses the SUSP key combination (normally Ctrl + Z).

SIGTTIN(21): Background process trying to read from the TTY. It clears any pending/queued SIGCONT signals regardless of whether the signal is blocked, ignored or caught, even though it may or may not stop the process later on. By default, it stops the process.

SIGTTOU(22): Background process trying to write to the TTY. It clears any pending/queued SIGCONT signals regardless of whether the signal is blocked, ignored or caught, even though it may or may not stop the process later on. By default, it stops the process.

SIGURG(23): Urgent condition on socket. By default, the signal is ignored by the process. It is sent to a process with async IO configured by fcntl system call on Linux when out-of-band data is available on a file descriptor connected to a socket.

SIGXCPU(24): CPU limit exceeded. By default, the process is terminated and a core dump is created. It is sent to a process when it has used the CPU for a duration that exceeds a certain predetermined user set value.

SIGXFSZ(25): File size limit exceeded. By default, the process is terminated and a core dump is created. It is sent to a process when it has created a file that exceeded the maximum allowed size.

SIGVTALRM(26): Virtual alarm clock. By default, the process is terminated. It is sent to a process when a time limit has been reached. It counts only the time spent executing the process itself.

SIGPROF(27): Profiling alarm clock. By default, the process is terminated. It is sent to a process when a time limit has been reached. It counts the time spent by the process and by the system executing on behalf of the process.

SIGWINCH(28): Window size change. By default, the signal is ignored. It is sent to a process when the size of its controlling terminal changes. It gives the process an opportunity to adjust its display.

SIGIO/SIGPOLL(29): I/O now possible. By default, the process is terminated. It is sent to a process when an async IO event occurs.

SIGPWR(30): Power failure restart. By default, the process is terminated. It is sent to a process when the system experiences power failure. This gives the process an opportunity to save its state.

SIGSYS/SIGUNUSED(31): Bad system call. By default, the process is terminated and a core dump is created. It is sent to a process when a bad argument is passed to a system call.

References:
1. Linux signal overview man page: http://linux.die.net/man/7/signal
2. Advanced Programming in the UNIX Environment, 2nd edition.
3. Linux signal.h and some other header source files.
4. SIGILL wikipedia page: http://en.wikipedia.org/wiki/SIGILL
5. SIGFPE wikipedia page: http://en.wikipedia.org/wiki/SIGFPE
6. SIGBUS wikipedia page: http://en.wikipedia.org/wiki/SIGBUS

Programming FIFO/Named PIPE in Linux

The previous post covered the pipe, an IPC mechanism for processes that have a common parent process. We refer to such processes as related processes. For unrelated processes, a pipe cannot be used, because one process has no way of referring to a pipe that has been created by another process.

When two unrelated processes share some information, an identifier must be associated with the shared information, so that one process can create the IPC object and other processes can refer to it by the identifier. Linux provides the named pipe (also called a FIFO) to let two unrelated processes communicate through a pipe.

A FIFO is created by mkfifo function,

#include <sys/types.h>

#include <sys/stat.h>

int mkfifo(const char *pathname, mode_t mode);

mkfifo creates a special file with name pathname. mode specifies the special FIFO file’s permissions. It is modified by the process’s umask: the created file will have the permission (mode & ~umask).

Once a FIFO is created, it can be operated on like a regular file, with the exception that both ends of the FIFO need to be open before reading and writing can begin. In other words, opening a FIFO for reading blocks until another process opens it for writing, and vice versa.

Below are two programs that use named pipes to communicate with each other. They are modified from the example in the pipe post.

Firstly, the two programs need to agree on the pipe names. These are defined in a header file included by both of them.

#ifndef TEST_FIFO_H

#define TEST_FIFO_H

 

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <fcntl.h>

#include <sys/types.h>

#include <sys/stat.h>

#include <unistd.h>

 

#define CS_FIFO_NAME "/tmp/cs"

#define SC_FIFO_NAME "/tmp/sc"

 

//user read, user write, group read and other read

#define FIFO_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

 

#endif

Next, the server program creates two pipes for communication and opens one for read and one for write,

#include "fifo.h"

 

void createfifo() {

    int rv;

    if ((rv = mkfifo(CS_FIFO_NAME, FIFO_MODE)) == -1) {

        perror("error making cs fifo: ");

        return;

    }

    if ((rv = mkfifo(SC_FIFO_NAME, FIFO_MODE)) == -1) {

        perror("error making sc fifo: ");

        return;

    }

}

 

void server(int readfd, int writefd) {

    char msg[100];

    int n;

    if ((n = read(readfd, msg, sizeof(msg) - 1)) < 0) { //leave room for the terminator

        perror("error reading...");

        return;

    }

    msg[n] = '\0';

    printf("%d server received from client: %s\n", getpid(), msg);

    printf("server enter something: ");

    fgets(msg, 100, stdin);

    write(writefd, msg, strlen(msg)-1); //-1: not send the newline

}

 

int main(int argc, char **argv) {

    int readfd, writefd;

    createfifo();

    readfd = open(CS_FIFO_NAME, O_RDONLY, 0);

    writefd = open(SC_FIFO_NAME, O_WRONLY, 0);

    server(readfd, writefd);

    close(readfd);

    close(writefd);

    return 0; 

}

The client program also opens the two fifos for write and read.

#include "fifo.h"

 

void client(int readfd, int writefd) {

    char msg[100];

    int n;

    char eof = EOF;

    printf("%d client enter something: ", getpid());

    fgets(msg, 100, stdin);

    if (write(writefd, msg, strlen(msg)-1) == -1) { //-1: not send the newline

        perror("error writing...");

        exit(0);

    }

    if ((n = read(readfd, msg, sizeof(msg) - 1)) < 0) { //leave room for the terminator

        perror("error reading...");

        exit(1);

    }

    msg[n] = '\0';

    printf("client received from server: %s\n", msg);

}

 

void removefifo() {

    unlink(SC_FIFO_NAME);

    unlink(CS_FIFO_NAME);

}

 

int main(int argc, char **argv) {

    int readfd, writefd;

 

    writefd = open(CS_FIFO_NAME, O_WRONLY, 0);

    readfd = open(SC_FIFO_NAME, O_RDONLY, 0);

    client(readfd, writefd);

    close(readfd);

    close(writefd);

    removefifo();

    return 0;

}

Note that if we swap the order of the following two lines in the client source code, a deadlock will occur.

writefd = open(CS_FIFO_NAME, O_WRONLY, 0);

readfd = open(SC_FIFO_NAME, O_RDONLY, 0);

This is because the server blocks while opening CS_FIFO_NAME for reading and waits for the client to open it for writing. If the client opens SC_FIFO_NAME for reading first, it also blocks and waits for the server to open it for writing. Both programs are then stuck forever.

Note that similar to pipes, it is also possible to create NONBLOCK FIFOs. Also the PIPE_BUF limits apply to FIFOs.

For a thorough explanation on these subjects, one can refer to reference 2.
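As a small illustration of the O_NONBLOCK behavior, the sketch below opens both ends of a FIFO inside a single process. Opening the read end with O_NONBLOCK returns at once even though no writer exists yet, after which the blocking open of the write end also returns at once. The name fifo_roundtrip and any path passed to it are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates a FIFO at path, sends 4 bytes through it
 * and returns the number of bytes read back, or -1 on error. */
int fifo_roundtrip(const char *path) {
    char buf[16];
    int readfd, writefd;
    ssize_t n = -1;

    unlink(path);                       /* remove a stale FIFO, if any */
    if (mkfifo(path, S_IRUSR | S_IWUSR) == -1)
        return -1;

    /* Returns immediately even though nobody has opened the write end. */
    readfd = open(path, O_RDONLY | O_NONBLOCK);
    if (readfd == -1) {
        unlink(path);
        return -1;
    }

    /* Does not block either, because a reader now exists. */
    writefd = open(path, O_WRONLY);
    if (writefd != -1) {
        if (write(writefd, "ping", 4) == 4)
            n = read(readfd, buf, sizeof(buf));
        close(writefd);
    }
    close(readfd);
    unlink(path);
    return (int)n;
}
```

Without O_NONBLOCK on the first open, the process would block forever, since no other process opens the FIFO for writing.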

References:

1. mkfifo man page: http://linux.die.net/man/3/mkfifo

2. Unix Network Programming, Volume 2.

Programming Pipe in Linux

Pipe is one of the message passing IPC techniques. A pipe is a one way communication channel. It is created by the pipe function,

#include <unistd.h>
int pipe(int fd[2]);

The array fd[2] returns two file descriptors referring to two ends of the created pipe: fd[0] for reading and fd[1] for writing. Data written to the pipe is buffered by the kernel until it is read by the read end.

Below is a program that uses pipes to create a two-way communication channel between two processes.

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <sys/wait.h>

 

void client(int readfd, int writefd) {

    char msg[100];

    int n;

    char eof = EOF;

    printf("%d client enter something: ", getpid());

    fgets(msg, 100, stdin);

    if (write(writefd, msg, strlen(msg)-1) == -1) { //-1: not send the newline

        perror("error writing...");

        exit(0);

    }

    if ((n = read(readfd, msg, sizeof(msg) - 1)) < 0) { //leave room for the terminator

        perror("error reading...");

        exit(1);

    }

    msg[n] = '\0';

    printf("client received from server: %s\n", msg);

}

 

void server(int readfd, int writefd) {

    char msg[100];

    int n;

    if ((n = read(readfd, msg, sizeof(msg) - 1)) < 0) { //leave room for the terminator

        perror("error reading...");

        return;

    }

    msg[n] = '\0';

    printf("%d server received from client: %s\n", getpid(), msg);

    printf("server enter something: ");

    fgets(msg, 100, stdin);

    write(writefd, msg, strlen(msg)-1); //-1: not send the newline

}

 

int main(int argc, char **argv) {

    int pipe1fd[2], pipe2fd[2];

    int pid;

    if (pipe(pipe1fd) == -1) {

        perror("pipe:");

        return 0;

    }

    if (pipe(pipe2fd) == -1) {

        perror("pipe:");

        return 0;

    }

    pid = fork();

    if (pid == 0) {

        //child process

        close(pipe1fd[1]);  //close the write end

        close(pipe2fd[0]);  //close the read end

        client(pipe1fd[0], pipe2fd[1]);

        exit(0);

    } else {

        //parent process

        close(pipe1fd[0]);

        close(pipe2fd[1]);

        server(pipe2fd[0], pipe1fd[1]);

        wait(NULL);       //wait for child process to finish

        return 0;

    }

}

Compile the code using the command,

gcc -o pipe pipe.c

And below is a screenshot of a sample run,

Figure 1. Execution of Pipe Sample Program

The program above illustrates the steps of creating two pipes and using them for two-way communication.

1. create two pipes, pipe1 (pipe1fd[2]) and pipe2 (pipe2fd[2])
2. call fork to create a child process
3. the child process closes the write end of pipe1 (pipe1fd[1]) and the read end of pipe2 (pipe2fd[0])
4. the parent process closes the read end of pipe1 (pipe1fd[0]) and the write end of pipe2 (pipe2fd[1])

POSIX defines “half-duplex” pipes, where data flows one way through a pipe. System V Release 4 (SVR4) Unix implements “full-duplex” pipes, where both file descriptors can be used for reading and writing.

GNU/Linux implements “half-duplex” pipes, and one does not have to close the unused ends in order to use the others, as was done in the program above. Closing them is still good practice, because a reader sees end-of-file only after every copy of the write end has been closed. Commenting out the close statements gives the program below,

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <sys/wait.h>

 

void client(int readfd, int writefd) {

    char msg[100];

    int n;

    char eof = EOF;

    printf("client enter something: ");

    fgets(msg, 100, stdin);

    if (write(writefd, msg, strlen(msg)-1) == -1) { //-1: not send the newline

        perror("error writing...");

        exit(0);

    }

    if ((n = read(readfd, msg, sizeof(msg) - 1)) < 0) { //leave room for the terminator

        perror("error reading...");

        exit(1);

    }

    msg[n] = '\0';

    printf("client received from server: %s\n", msg);

}

 

void server(int readfd, int writefd) {

    char msg[100];

    int n;

    if ((n = read(readfd, msg, sizeof(msg) - 1)) < 0) { //leave room for the terminator

        perror("error reading...");

        return;

    }

    msg[n] = '\0';

    printf("server received from client: %s\n", msg);

    printf("server enter something: ");

    fgets(msg, 100, stdin);

    write(writefd, msg, strlen(msg)-1); //-1: not send the newline

}

 

int main(int argc, char **argv) {

    int pipe1fd[2], pipe2fd[2];

    int pid;

    if (pipe(pipe1fd) == -1) {

        perror("pipe:");

        return 0;

    }

    if (pipe(pipe2fd) == -1) {

        perror("pipe:");

        return 0;

    }

    pid = fork();

    if (pid == 0) {

        //child process

//        close(pipe1fd[1]);  //close the write end

//        close(pipe2fd[0]);  //close the read end

        client(pipe1fd[0], pipe2fd[1]);

        exit(0);

    } else {

        //parent process

//        close(pipe1fd[0]);

//        close(pipe2fd[1]);

        server(pipe2fd[0], pipe1fd[1]);

        wait(NULL);       //wait for child process to finish

        return 0;

    }

}

Compile the code using,

gcc -o pipe2 pipe2.c

And below is a screenshot of a run,

Figure 2. Execution of Modified Pipe Sample Program

popen and pclose

Linux provides two functions, popen and pclose, to create a pipe to or from a process. They avoid the explicit use of pipe, fork, close, and wait.

#include <stdio.h>
FILE *popen(const char *command, const char *type);
int pclose(FILE *stream);

popen function starts a process by creating a pipe, forking and invoking a shell. The command argument accepts a shell command line, and the command is passed to /bin/sh using -c flag. The type argument expects either “r” or “w”. Note that since a pipe is unidirectional, we cannot pass both “r” and “w”, and the returned pipe stream is either read-only or write-only.

If the returned pipe stream is read-only, the calling process reads from the standard output of the command.

If the returned pipe stream is write-only, the calling process writes to the standard input of the command.

pclose function waits for the associated process to terminate and returns the exit status of the command passed to popen.

Note that with popen and pclose, it is not convenient to create two pipes for two way communication. Below is a sample program using popen and pclose,

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

 

void createTest() {

    FILE *f;

    f = fopen("test.txt", "w");

    fprintf(f, "hello world\n");

    fclose(f);

}

 

int main(int argc, char **argv) {

    char buf[100], command[100];

    FILE *pf;

    createTest();

    sprintf(command, "cat test.txt");

    pf = popen(command, "r");   //read from cat

    while (fgets(buf, 100, pf) != NULL) {

        printf("%s", buf);

    }

    pclose(pf);

    return 0;

}

Compile the code using the command below,

gcc -o pipe3 pipe3.c

And then run the program,

./pipe3

The program should print “hello world”. The code creates a new process for the “cat test.txt” command and reads from the pipe stream between the cat command and itself.
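The value returned by pclose is a wait status, so the exit code of the command has to be extracted with the macros from sys/wait.h. The sketch below runs a command, discards its output and returns its exit code. The name command_exit_status is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: runs command through popen and returns its exit
 * code, or -1 if it could not be started or did not exit normally. */
int command_exit_status(const char *command) {
    char buf[128];
    FILE *pf;
    int status;

    pf = popen(command, "r");
    if (pf == NULL)
        return -1;

    /* Drain the output so the command is never stuck on a full pipe. */
    while (fgets(buf, sizeof(buf), pf) != NULL)
        ;

    status = pclose(pf);                /* waits for the command to finish */
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, command_exit_status("exit 3") is expected to return 3, because the shell started by popen exits with that code.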

PIPE_BUF

Writing PIPE_BUF bytes or fewer is atomic. Writing more than PIPE_BUF bytes may not be atomic: the kernel may interleave the data with data written by other processes. For example, suppose two processes write “aaaaaa” and “bbbbb” respectively to the same pipe. If the writes are atomic, the content is either “aaaaaabbbbb” or “bbbbbaaaaaa”. If they are not atomic, the content can be something like “aaabbaabbba”.

On Linux, PIPE_BUF is defined in limits.h, and the program below prints it.

#include <stdio.h>

#include <limits.h>

 

int main(void) {

    printf("%d\n", PIPE_BUF);

    return 0;

}

On my Ubuntu Linux, the value is 4096.
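The limit can also be queried at run time for a specific pipe. The sketch below uses fpathconf with _PC_PIPE_BUF; the helper name pipe_buf_at_runtime is hypothetical, and the value it returns depends on the system.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations such as fpathconf */
#include <unistd.h>

/* Hypothetical helper: ask the system for the atomic-write limit of a fresh pipe.
 * Returns the limit in bytes, or -1 if it cannot be determined. */
long pipe_buf_at_runtime(void) {
    int fds[2];
    long limit;

    if (pipe(fds) < 0) {
        return -1;
    }
    limit = fpathconf(fds[0], _PC_PIPE_BUF);   /* per-descriptor PIPE_BUF value */
    close(fds[0]);
    close(fds[1]);
    return limit;
}
```

POSIX requires the value to be at least 512 bytes, so code that relies on atomic writes should keep each message within the value reported here.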

Non Block Pipe

By default, a pipe blocks: a read from an empty pipe and a write to a full pipe both wait. A pipe end can be made non-blocking by setting the O_NONBLOCK flag with fcntl. The code block below illustrates how to do this,

#include <stdio.h>

#include <fcntl.h>

 

void nonblock(int fd) {

    int flags;

    if ((flags = fcntl(fd, F_GETFL, 0)) < 0) {

        perror("F_GETFL error:");

        return;

    }

    flags |= O_NONBLOCK;

    if (fcntl(fd, F_SETFL, flags) < 0) {

        perror("F_SETFL error:");

    }

}
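To see the effect of the flag, here is a small sketch that reads from an empty pipe whose read end is non-blocking. Instead of waiting, read fails immediately and sets errno to EAGAIN. The function name is hypothetical and exists only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo: returns the errno produced by reading an empty
 * non-blocking pipe, 0 if the read unexpectedly succeeded, -1 on setup failure. */
int errno_of_empty_nonblocking_read(void) {
    int fds[2];
    int flags;
    int result = -1;
    char c;

    if (pipe(fds) < 0) {
        return -1;
    }
    flags = fcntl(fds[0], F_GETFL, 0);
    if (flags >= 0 && fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) >= 0) {
        ssize_t n = read(fds[0], &c, 1);   /* nothing has been written yet */
        result = (n < 0) ? errno : 0;      /* capture errno before close() */
    }
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

A caller of a non-blocking descriptor therefore has to treat EAGAIN as “try again later” rather than as a real error.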

More details about pipes can be found in reference 4.

References:
1. POSIX wikipedia page: http://en.wikipedia.org/wiki/POSIX
2. Understanding Linux Kernel Inter-process Communication: Pipes, FIFO & IPC: http://linux.omnipotent.net/article.php?article_id=12504&page=2
3. popen Linux man page.
4. Unix Network Programming, Volume 2.
5. http://manpages.courier-mta.org/htmlman7/pipe.7.html

TCP TIME_WAIT State and Address Already in Use Error

0. The Problem
Recently I worked on a project involving TCP socket programming on Linux, and I frequently encountered errno 98 (address already in use) and errno 99 (cannot assign requested address). I wrote a small test program to reproduce the issue. The test code is as below,

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <errno.h>

#include <sys/types.h>

#include <sys/socket.h>

#include <sys/select.h>

#include <sys/time.h>

#include <fcntl.h>

#include <netinet/in.h>

#include <netdb.h> 

#include <arpa/inet.h>

#include <pthread.h>

#include <signal.h>

#include <sys/time.h>

 

unsigned short PROXY_SERVER_PORT = 0x5B77;        

#define PROXY_CLIENT_ST_PORT 0x6B77                

unsigned int bindError = 0;

 

int main(void) {

    int socketFd;

    int one = 1;

    int mapIdx = 0;

    int i, j; 

    int conTry = 0;

    struct sockaddr_in localAddr, serv_addr;

    struct addrinfo hints, *res;

    int err;

    //char *hostname = "www.google.com";

    char *hostname = "74.125.235.18";

    memset(&hints, 0, sizeof(hints));

    hints.ai_socktype = SOCK_STREAM;

    hints.ai_family = AF_INET;

 

    if ((err = getaddrinfo(hostname, NULL, &hints, &res)) != 0) {

        printf("error %d\n", err);

        return 1;

    }

 

    bzero((char *) &serv_addr, sizeof(serv_addr));

    serv_addr.sin_family = AF_INET;

    serv_addr.sin_addr.s_addr = ((struct sockaddr_in *)(res->ai_addr))->sin_addr.s_addr;    

    serv_addr.sin_port = htons(80);

 

    printf("ip address : %s\n", inet_ntoa(serv_addr.sin_addr));

    freeaddrinfo(res);

 

    printf("before loop...\n");

    for (i = 0; i < 3; ++i) {

        socketFd = socket(AF_INET, SOCK_STREAM, 0);

        if (socketFd < 0)  {

            printf("ERROR opening proxy tcp client socket %d: %d\n", mapIdx, errno);

            exit(1);

        }

        //if the line below is enabled, error occurs at connect

        setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

        memset(&localAddr, 0, sizeof(struct sockaddr_in));

        localAddr.sin_family = AF_INET;

        localAddr.sin_addr.s_addr = INADDR_ANY;

        localAddr.sin_port = htons((PROXY_CLIENT_ST_PORT + mapIdx + bindError)%65535);

        printf("binding...\n");

        for (j = 1; ;++j) {

            if (bind(socketFd, (struct sockaddr *)&localAddr, sizeof(struct sockaddr_in)) < 0) {

                printf("Error %d, tcp proxy client binding to %d\n", errno, ntohs(localAddr.sin_port));

                perror("error binding tcp proxy client: ");

                ++bindError;

                localAddr.sin_port = htons((PROXY_CLIENT_ST_PORT + mapIdx + bindError)%65535);

            } else {    

                break;

            }

        }

        printf("bound socket %d to tcp client port %u\n", socketFd, ntohs(localAddr.sin_port));

        printf("connecting....\n");

        if (connect(socketFd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0) {

            printf("error code: %d\n", errno);

            perror("error connecting to ori server from proxy tcp client: ");

            if (errno == EADDRNOTAVAIL) {

                printf("The specified address %d is not available from the local machine.\n", ntohs(localAddr.sin_port));

                exit(1);

            }

        }

        printf("connected: %d\n", mapIdx);

        close(socketFd);

        mapIdx++;

    }

    return 0;

}

Save the code to test.c and compile it with “gcc -o test test.c”. Running the program several times in a row will likely produce one of the two errors. Enabling or disabling the line “setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);” produces the other one: with the line enabled, the error occurs at connect.

When the error occurs, run netstat command “netstat | grep 27511” (from the program output, I know the error occurs at port 27511). Below is the screenshot,

Figure 1. Cannot Assign Requested Address

As shown in figure 1, the tcp port is in TIME_WAIT state.

1. The TCP State Transition at Closure

To close an established TCP connection, both endpoints send FIN packets to indicate that there is no more data to send. Each endpoint must ACK the FIN packet it receives from the other party.

The FIN packets are sent when a program calls exit(), close() or shutdown(). The ACKs are handled by the kernel after close() returns. Therefore, the program may finish before the kernel has released the associated network resources, and another process cannot use them until the kernel has freed them.

Below is a figure of detailed state transitions for an endpoint when TCP connection closes. It follows different paths depending on which side initiated the closure.

Figure 2. TCP State Transition at Closure (diagram from reference 4)

Note that TIME_WAIT only occurs at the endpoint which initiated the closure.

2. Why TIME_WAIT

After a TCP connection is closed, packets belonging to it might still be alive in the network. If a new connection is established with the exact same (client IP, client port, server IP, server port) tuple, those stale packets could be mistaken for packets of the new connection.

To avoid this, the TIME_WAIT duration is generally set to twice the maximum packet lifetime, which is long enough for the packets of the old connection to expire. Note that keeping only one endpoint in TIME_WAIT is enough to ensure that an identical (client IP, client port, server IP, server port) tuple is not reused too early.

3. How to Avoid the Problem

TIME_WAIT only occurs on the side that initiates the TCP connection closure, so a natural solution is to avoid being the side that calls close() first. If you control both the client and the server, you may want to let the client close first, so that the server does not end up with lots of ports in TIME_WAIT.

As the test program shows, setsockopt() with SO_REUSEADDR allows you to bind a socket to a port that is in TIME_WAIT. If you use the socket as a client and try to connect to the same (server address, server port) tuple, connect will fail. Connecting to a different (server address, server port) tuple is allowed.

If you use the socket as a server socket, you can also use SO_REUSEADDR. I have not tested what happens if the same (client address, client port) tries to connect again, but I guess the connection request will be denied.
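For the server case, a minimal sketch could look like the following. The option has to be set before bind for it to take effect. The helper name open_reusable_listener is hypothetical, and binding to the loopback address is an assumption made to keep the example self-contained.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: open a listening TCP socket on the loopback address
 * with SO_REUSEADDR set. Passing port 0 lets the kernel pick a free port.
 * Returns the socket descriptor, or -1 on failure. */
int open_reusable_listener(unsigned short port) {
    int one = 1;
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;
    }
    /* must be called before bind() */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one) < 0) {
        close(fd);
        return -1;
    }
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 || listen(fd, 5) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With this in place, restarting a server shortly after it exited should no longer fail at bind just because old connections on the port are still in TIME_WAIT.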

It’s also possible to modify the TIME_WAIT values on some operating systems.

References:

1. Setting TIME_WAIT TCP, stackoverflow: http://stackoverflow.com/questions/337115/setting-time-wait-tcp

2. TIME_WAIT and its design implications for protocols and scalable client server systems: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html

3. The TIME_WAIT state in TCP and Its Effect on Busy Servers: http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/

4. Bind: Address Already in Use or How to Avoid this Error when Closing TCPConnections

Using mmap for Random Access of Files

mmap allows a program to map a section of the file into the program’s memory space, and access it with a pointer. This post illustrates mmap for random access of files.

0. Description

The mmap system call is defined in sys/mman.h as below,

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);

 

The input parameters have the following meanings,

  • addr: the address we want to map to. Usually set to zero (NULL) to let the OS choose the address for us.
  • len: the length of the data we want to map.
  • prot: the access rights the process wants for the mapped memory. The values can be PROT_READ, PROT_WRITE and PROT_EXEC, which correspond to read, write and execute permissions respectively. Multiple permissions can be combined by ORing the values.
  • flags: set to MAP_SHARED if you want to share the mapping with other processes and have changes written back to the file. MAP_PRIVATE gives your process a private copy of the mapped region, and your changes won’t be reflected in the original file. There are other flags; refer to the man page.
  • fildes: the file descriptor of the file you want to map. It can be obtained using the open system call as below. Note that the access mode must be compatible with the prot flags you pass to mmap.

    int fd = open("test", O_RDWR);

  • off: the offset in the file at which the mapping starts. It must be a multiple of the virtual memory page size, which can be obtained by calling getpagesize() or sysconf(_SC_PAGE_SIZE).

On success, mmap returns a pointer to the beginning of the mapped data region. Otherwise, it returns a value of MAP_FAILED and sets errno.
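Because the offset must be page-aligned, mapping a region that starts in the middle of a page takes a little arithmetic. The sketch below rounds the wanted position down to a page boundary and maps only up to the wanted byte. The helper name byte_at is hypothetical; it uses sysconf(_SC_PAGESIZE), which is the POSIX spelling of the page size query.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Hypothetical helper: return the byte at position pos of the file at path,
 * mapping only the page that contains it. Returns 0..255, or -1 on failure. */
int byte_at(const char *path, off_t pos) {
    struct stat sb;
    int result = -1;
    long page = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        return -1;
    }
    if (page > 0 && fstat(fd, &sb) == 0 && pos >= 0 && pos < sb.st_size) {
        off_t start = pos - (pos % page);          /* round down to a page boundary */
        size_t len = (size_t)(pos - start) + 1;    /* map up to the wanted byte only */
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, start);
        if (p != MAP_FAILED) {
            result = p[pos - start];
            munmap(p, len);
        }
    }
    close(fd);
    return result;
}
```

The same idea applies to larger windows: align the start downwards, then add the distance between the aligned start and the wanted position to the length.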

After mapping the file, we have a pointer to the mapped data and can use it to jump to any byte. The example in section 1 below illustrates this by reading the byte at an offset you specify.

After we’re done with the mapping, we can unmap it with the munmap call,

int munmap(void *addr, size_t len);

 

  • addr: the address of unmap, which should be the address returned by mmap call.
  • len: the length to unmap, which should be the same as the len parameter passed in mmap call.

Note that a mapped file is unmapped automatically upon process termination.

1. Example

Below is an example illustrating mmap,

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main(int argc, char *argv[])
{
    int fd, offset;
    unsigned char *data;
    struct stat sbuf;
    FILE *f;
    int i;
    unsigned char writeBytes[10];

    if (argc != 2) {
        fprintf(stderr, "usage: mmapdemo offset\n");
        exit(1);
    }

    /*create the test file: the bytes 0..9 repeated 10 times*/
    for (i = 0; i < 10; ++i) {
        writeBytes[i] = (unsigned char)i;
    }
    f = fopen("./test", "w");
    if (f == NULL) {
        perror("fopen");
        exit(1);
    }
    for (i = 0; i < 10; ++i) {
        fwrite(writeBytes, 1, 10, f);
    }
    fclose(f);

    if ((fd = open("./test", O_RDONLY)) == -1) {
        perror("open");
        exit(1);
    }
    if (stat("./test", &sbuf) == -1) {
        perror("stat");
        exit(1);
    }
    printf("file size: %ld\n", (long)sbuf.st_size);
    offset = atoi(argv[1]);
    if (offset < 0 || offset > sbuf.st_size-1) {
        fprintf(stderr, "mmapdemo: offset must be in the range 0-%ld\n", (long)(sbuf.st_size-1));
        exit(1);
    }

    data = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    printf("byte at offset %d is %d\n", offset, data[offset]);
    munmap(data, sbuf.st_size);
    close(fd);

    return 0;
}

The code creates a test file first, then opens it to get a file descriptor. stat is called to get the file size, and mmap maps the entire file into memory. After that, we can access any byte through the returned pointer; here we print its value. Note that the data pointer can be declared as a pointer to another data type. For example, if we declare it as int *, we can access the data as ints.

Copy, paste and save the file to mmaptest.c, and then compile the code using,

gcc -o mmaptest mmaptest.c

Below are some sample executions,

./mmaptest 99

file size: 100

byte at offset 99 is 9

./mmaptest 2

file size: 100

byte at offset 2 is 2

./mmaptest 50

file size: 100

byte at offset 50 is 0

mmap can perform better than fseek and fread in certain cases. mmap can also be used to share a file between processes, which can reduce memory usage significantly because the file is not loaded separately for each process.

Update 1: mmap can be faster when dealing with big files. Normal read/write goes through system calls, and frequent reads and writes cause many switches between kernel space and user space. These switches can be a significant overhead.

Update 2: mmap doesn’t load the entire file into memory. It loads data lazily: a page is brought into memory when it is first accessed. This leads to another optimization for mmap-based file access: pre-loading content before it is actually needed.
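One possible way to ask for such pre-loading is posix_madvise with POSIX_MADV_WILLNEED. The sketch below is only an illustration of the idea, not a measured optimization: the hint is advisory, and the kernel is free to ignore it. The helper name sum_file_bytes is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations such as posix_madvise */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Hypothetical helper: map a whole file, hint that it will be needed soon,
 * then add up its bytes. Returns 0 on success, -1 on failure. */
int sum_file_bytes(const char *path, unsigned long *sum) {
    struct stat sb;
    unsigned char *data;
    size_t len, i;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &sb) < 0 || sb.st_size == 0) {
        close(fd);
        return -1;
    }
    len = (size_t)sb.st_size;
    data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    if (data == MAP_FAILED) {
        return -1;
    }
    /* advisory hint: the pages will be needed soon; a failure here is not fatal */
    (void)posix_madvise(data, len, POSIX_MADV_WILLNEED);

    *sum = 0;
    for (i = 0; i < len; ++i) {
        *sum += data[i];
    }
    munmap(data, len);
    return 0;
}
```

Whether the hint helps depends on the access pattern and the system, so it is worth measuring before relying on it.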

 

Simple TCP Socket Client and Server Communication in C Under Linux

This post doesn’t cover how Linux sockets work or how they are designed. It mainly provides the source code of a simple TCP socket client and server in C. I’m writing this because I find myself needing a simple TCP client and server for testing from time to time. 🙂

1. The TCP Server Code

You can refer to TCP Server code below (save it as tcpserver.c),

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <unistd.h>

#include <fcntl.h>

#include <sys/types.h> 

#include <sys/socket.h>

#include <netinet/in.h>

 

void nonblock(int sockfd)

{

    int opts;

    opts = fcntl(sockfd, F_GETFL);

    if(opts < 0)

    {

        fprintf(stderr, "fcntl(F_GETFL) failed\n");

        return;

    }

    opts = (opts | O_NONBLOCK);

    if(fcntl(sockfd, F_SETFL, opts) < 0) 

    {

        fprintf(stderr, "fcntl(F_SETFL) failed\n");

    }

}

 

int main(int argc, char *argv[])

{

     int BUFLEN = 2000;

     int sockfd, newsockfd, portno;

     socklen_t clilen;

     char buffer[BUFLEN];

     struct sockaddr_in serv_addr, cli_addr;

     int n, i;

     int one = 1;

 

     if (argc < 2) {

         fprintf(stderr,"please specify a port number\n");

         exit(1);

     }

     sockfd = socket(AF_INET, SOCK_STREAM, 0);

     if (sockfd < 0) {

        perror("ERROR create socket");

        exit(1);

     }

     setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);    //allow reuse of port

     //bind to a local address

     bzero((char *) &serv_addr, sizeof(serv_addr));

     portno = atoi(argv[1]);

     serv_addr.sin_family = AF_INET;

     serv_addr.sin_addr.s_addr = INADDR_ANY;

     serv_addr.sin_port = htons(portno);

     if (bind(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {

        perror("ERROR on bind");

        exit(1);

     }

     //listen marks the socket as passive socket listening to incoming connections, 

     //it allows max 5 backlog connections: backlog connections are pending in queue

     //if pending connections are more than 5, later request may be ignored

     listen(sockfd,5);

     //accept incoming connections

     clilen = sizeof(cli_addr);

     newsockfd = accept(sockfd, (struct sockaddr *) &cli_addr, &clilen);

     //nonblock(newsockfd);        //if we want to set the socket as nonblock, we can uncomment this

     if (newsockfd < 0) {

        perror("ERROR on accept");

        exit(1);

     }

     printf("connection accepted\n");

     for (i = 0; i < 5; ++i) {

         bzero(buffer,BUFLEN);

         n = read(newsockfd,buffer,BUFLEN-1);    //leave room for the terminating NUL

         if (n < 0) {

            perror("ERROR read from socket");

            break;

         }

         if (n == 0) {    //the client closed the connection

            break;

         }

         printf("received: %s",buffer); 

         n = write(newsockfd, buffer, n);

         if (n < 0) {

            perror("ERROR write to socket");

            break;

         }

         printf("sent: %s", buffer);

     }

     close(newsockfd);

     close(sockfd);

     return 0; 

}

The code first creates a socket of type SOCK_STREAM in the AF_INET domain. SOCK_STREAM corresponds to TCP, and AF_INET refers to IPv4.

It calls setsockopt with SO_REUSEADDR to make the port reusable. For example, if your server opens a socket on a port and then exits, without this line of code you may sometimes fail to bind a socket to the same port again.

The program then binds the socket to a local IP address and a specific port number. listen(sockfd, 5) marks the socket as a passive socket that listens for incoming connections. The value 5 is the maximum backlog, that is, the number of pending connection requests kept in the queue. If more than 5 requests are pending, later requests may be ignored.

The code then calls accept to accept an incoming connection. This is a blocking call: it returns when a new connection request is received or an error occurs.

In the for loop, the TCP server reads incoming messages and echoes them back. Note that nonblock(newsockfd) has been commented out. You can uncomment it to enable non-blocking reads and writes, but you will need to modify the read and write code to make it work properly.
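One way the read part could be adapted for a non-blocking socket is to wait with poll whenever read reports that no data is available yet. This is a sketch under assumptions, not the only approach: the helper names set_nonblocking and read_when_ready are hypothetical, and a timeout is reported by setting errno to ETIMEDOUT.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: same idea as nonblock() above, but reports failure. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: read from a non-blocking descriptor, waiting up to
 * timeout_ms (per wait) for data to arrive. Returns the number of bytes read,
 * 0 on end of stream, or -1 on error or timeout (errno is ETIMEDOUT then). */
ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms) {
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0) {
            return n;
        }
        if (errno == EINTR) {
            continue;                      /* interrupted by a signal: retry */
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            return -1;                     /* a real error */
        }
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int ready = poll(&pfd, 1, timeout_ms);   /* sleep until readable */
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (ready < 0 && errno != EINTR) {
            return -1;
        }
    }
}
```

In the server above, the read call inside the loop could then be replaced with read_when_ready(newsockfd, buffer, BUFLEN-1, 1000), for example.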

2. The TCP Client Code
The TCP client code is as below (save it as tcpclient.c),

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <fcntl.h>

#include <sys/types.h>

#include <sys/socket.h>

#include <netinet/in.h>

#include <netdb.h> 

 

void nonblock(int sockfd)

{

    int opts;

    opts = fcntl(sockfd, F_GETFL);

    if(opts < 0)

    {

        fprintf(stderr, "fcntl(F_GETFL) failed\n");

        return;

    }

    opts = (opts | O_NONBLOCK);

    if(fcntl(sockfd, F_SETFL, opts) < 0) 

    {

        fprintf(stderr, "fcntl(F_SETFL) failed\n");

    }

}

 

int main(int argc, char *argv[])

{

    int BUFLEN = 2000;

    int sockfd, portno, n;

    struct sockaddr_in serv_addr;

    struct hostent *server;

    int i;

 

    char buffer[BUFLEN];

    if (argc < 3) {

       fprintf(stderr,"usage: %s hostname_or_ip port\n", argv[0]);

       exit(1);

    }

    portno = atoi(argv[2]);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);

    if (sockfd < 0) {

        perror("ERROR creating socket");

        exit(1);

    }

    //get the address info by either host name or IP address

    server = gethostbyname(argv[1]);

    if (server == NULL) {

        fprintf(stderr,"ERROR, no such host\n");

        exit(1);

    }

    bzero((char *) &serv_addr, sizeof(serv_addr));

    serv_addr.sin_family = AF_INET;

    bcopy((char *)server->h_addr, (char *)&serv_addr.sin_addr.s_addr, server->h_length);

    serv_addr.sin_port = htons(portno);

    if (connect(sockfd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0)  {

        perror("ERROR connecting");

        exit(1);

    }

    printf("connection established\n");

    //nonblock(sockfd);    //uncomment this line if we want to make the socket non-block

    for (i = 0; i < 5; ++i) {

        printf("Please enter the message: ");

        bzero(buffer,BUFLEN);

        fgets(buffer,BUFLEN,stdin);

        n = write(sockfd,buffer,strlen(buffer));

        printf("sent: %s", buffer);

        if (n < 0) {

             perror("ERROR writing to socket");

        }

        bzero(buffer,BUFLEN);

        n = read(sockfd,buffer,BUFLEN-1);    //leave room for the terminating NUL

        if (n < 0) {

             perror("ERROR reading from socket");

        }

        printf("received: %s",buffer);

    }

    close(sockfd);

    return 0;

}

The code creates a TCP socket and then tries to connect to the server at the specified address and port number. connect is a blocking call: it returns once the connection is established, which means the client has received the server’s SYN+ACK, the second packet of the TCP 3-way handshake (SYN, SYN+ACK, ACK), and has replied with the final ACK.

In the loop, the TCP client sends and receives messages. As in the server, the nonblock call is commented out.
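The client and server above call write once per message and assume the whole message is sent. Since TCP is a byte stream, write may accept fewer bytes than requested, so a more careful program would loop until everything is written. Below is a minimal sketch; the helper name write_all is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations */
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write until all len bytes are sent.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len) {
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n < 0) {
            if (errno == EINTR) {
                continue;          /* interrupted by a signal: retry */
            }
            return -1;
        }
        sent += (size_t)n;         /* a short write just advances the position */
    }
    return 0;
}
```

In the client, the call write(sockfd,buffer,strlen(buffer)) could be replaced with write_all(sockfd, buffer, strlen(buffer)).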

3. Compile and Run
To compile the code, use the commands below,

gcc -o client tcpclient.c
gcc -o server tcpserver.c

A sample run could be,

./server 12345 (at one terminal)
./client localhost 12345 (at another terminal)

How to Calculate IP/TCP/UDP Checksum–Part 3 Usage Example and Validation

This is a follow up of the previous post IP/TCP/UDP Checksum Calculation part 1 theory, and part 2 implementation.

This post gives an example that uses the libnetfilter_queue library together with the checksum code we implemented in part 2, to illustrate how to use the checksum code and to verify that it computes the checksums correctly.

Note this post is not a post for libnetfilter_queue library. So the usage of this library is not covered here. I’ve written separate posts for libnetfilter_queue.

The code (let’s call it test.c) is as below,

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

#include "checksum.h"

static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg, struct nfq_data *nfa, void *data) {
    printf("entering callback\n");
    struct nfqnl_msg_packet_hdr *ph;
    int payload_len;
    unsigned char *payloadData;
    struct iphdr *ipHeader;
    struct tcphdr *tcpHeader;
    struct udphdr *udpHeader;
    unsigned short ipCheck, udpCheck, tcpCheck;
    ph = nfq_get_msg_packet_hdr(nfa);
    u_int32_t id = ntohl(ph->packet_id);
    payload_len = nfq_get_payload(nfa, &payloadData);
    printf("ip datagram len = %d\n", payload_len);
    ipHeader = (struct iphdr *)payloadData;
    ipCheck = ipHeader->check;
    printf("ip checksum: %04x\n", ipHeader->check);
    //calculate ip checksum, and see if the calculation matches
    compute_ip_checksum(ipHeader);
    printf("calculated ip checksum: %04x\n", ipHeader->check);
    if (ipCheck != ipHeader->check) {
        printf("-------------ip checksum calculation is wrong-----------\n");
    }
    if (ipHeader->protocol == IPPROTO_TCP) {
        tcpHeader = (struct tcphdr *)(payloadData + (ipHeader->ihl<<2));
        tcpCheck = tcpHeader->check;
        printf("tcp checksum: %04x\n", tcpHeader->check);
        //calculate tcp checksum, and see if the calculation matches with original tcp checksum
        compute_tcp_checksum(ipHeader, (unsigned short*)tcpHeader);
        printf("calculated tcp checksum: %04x\n", tcpHeader->check);
        if (tcpHeader->check != tcpCheck) {
            printf("-----------calculation is wrong-------\n");
        }
    } else if (ipHeader->protocol == IPPROTO_UDP) {
        udpHeader = (struct udphdr *)(payloadData + (ipHeader->ihl<<2));
        udpCheck = udpHeader->check;
        printf("udp checksum: %04x\n", udpHeader->check);
        //calculate udp checksum, and see if the calculation matches with original udp checksum
        compute_udp_checksum(ipHeader, (unsigned short*)udpHeader);
        printf("calculated udp checksum: %04x\n", udpHeader->check);
        if (udpHeader->check != udpCheck) {
            printf("-----------calculation is wrong-------\n");
        }
    }
    //issue a verdict on a packet
    //qh: netfilter queue handle; id: ID assigned to packet by netfilter; verdict: verdict to return to netfilter, data_len: number
    //of bytes of data pointed by buf, buf: the buffer that contains the packet data (payload)
    // return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
    return nfq_set_verdict(qh, id, NF_ACCEPT, payload_len, payloadData);
}

int main(int argc, char **argv) {
    struct nfq_handle *h;
    struct nfq_q_handle *qh;
    //struct nfnl_handle *nh;
    int fd;
    int rv;
    char buf[4096] __attribute__((aligned));

    /*netfilter_queue library set up step 1: call nfq_open to open a NFQUEUE handler*/
    printf("opening library handle\n");
    h = nfq_open();
    if (!h) {
        fprintf(stderr, "error during nfq_open()\n");
        exit(1);
    }
    /*set up step 2: tell the kernel that userspace queuing is handled by NFQUEUE for the selected protocol.
     *This is done by calling nfq_unbind_pf and nfq_bind_pf with protocol info. The idea behind this is to
     *enable simultaneously loaded modules to be used for queuing*/
    printf("unbinding existing nf_queue handler for AF_INET (if any)\n");
    if (nfq_unbind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_unbind_pf()\n");
        exit(1);
    }
    printf("binding nfnetlink_queue as nf_queue handler for AF_INET\n");
    if (nfq_bind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_bind_pf()\n");
        exit(1);
    }
    /*after the above two steps, we can set up and use a queue*/
    /*bind the program to a specific queue, cb is callback function to call for each queued packet
     *h: netfilter queue handle, 0: the number of the queue to bind to, cb: callback function for each
     *queued packet, data: custom data to pass to callback function*/
    printf("binding this socket to queue '0'\n");
    qh = nfq_create_queue(h, 0, &cb, NULL);
    if (!qh) {
        fprintf(stderr, "error during nfq_create_queue()\n");
        exit(1);
    }
    /*set mode: NFQNL_COPY_PACKET: copy entire packet, it defines the part of data that nfqueue copies to userspace
     *0xffff: size of the packet that we want to get*/
    printf("setting copy_packet mode\n");
    if (nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff) < 0) {
        fprintf(stderr, "cannot set packet_copy mode\n");
        exit(1);
    }
    /*handle the incoming packets*/
    fd = nfq_fd(h);
    while ((rv = recv(fd, buf, sizeof(buf), 0)) && rv >= 0) {
        printf("pkt received\n");
        nfq_handle_packet(h, buf, rv);
    }
    printf("unbinding from queue 0\n");
    nfq_destroy_queue(qh);
    /*the program has finished with libnetfilter_queue, it can call nfq_close to free all associated resources*/
    printf("closing library handle\n");
    nfq_close(h);
    exit(0);
}

The code uses libnetfilter_queue to get the IP datagrams of TCP/UDP packets and reads their IP and TCP/UDP checksums. It then uses the checksum code we implemented in part 2 to recalculate the checksums. If a calculated checksum matches the checksum retrieved from the original packet, we can be sure the computation is correct.

Compile and Run the Code

To compile the code, you can run the command below,

gcc -Wall -o testsum checksum.c test.c -lnfnetlink -lnetfilter_queue

Note that you’ll need libnfnetlink and libnetfilter_queue to compile the code. Please google for more info.

To run the code, save the following command to run.sh, and then use sudo ./run.sh,

#!/bin/sh
sudo iptables -t mangle --flush
sudo iptables -t mangle -A OUTPUT -p tcp -j NFQUEUE --queue-num 0
sudo iptables -t mangle -A OUTPUT -p udp -j NFQUEUE --queue-num 0
sudo ./testsum

Note that after you finish testing and kill the program, you won’t be able to access the Internet, because the iptables rules still queue outgoing TCP and UDP packets and no program is handling the queue any more. Run the following command to bring your Internet connection back,

sudo iptables -t mangle --flush

The above procedure sets up Linux iptables rules that queue TCP and UDP packets, so that the test.c program can receive them in userspace.

Download

You can download the source code and scripts from here.

How to Calculate IP/TCP/UDP Checksum–Part 2 Implementation

This is a follow up of the previous post, how to calculate IP/TCP/UDP checksum part 1 — theory.

IP Header Checksum Calculation Implementation
To calculate the IP checksum, one can use the code below,

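/* forward declaration: compute_checksum is defined below */

static unsigned short compute_checksum(unsigned short *addr, unsigned int count);

/* set ip checksum of a given ip header*/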

void compute_ip_checksum(struct iphdr* iphdrp){

  iphdrp->check = 0;

  iphdrp->check = compute_checksum((unsigned short*)iphdrp, iphdrp->ihl<<2);

}

/* Compute checksum for count bytes starting at addr, using one's complement of one's complement sum*/

static unsigned short compute_checksum(unsigned short *addr, unsigned int count) {

  register unsigned long sum = 0;

  while (count > 1) {

    sum += * addr++;

    count -= 2;

  }

  //if any bytes left, pad the bytes and add

  if(count > 0) {

    sum += ((*addr)&htons(0xFF00));

  }

  //Fold sum to 16 bits: add carry to result

  while (sum>>16) {

      sum = (sum & 0xffff) + (sum >> 16);

  }

  //one's complement

  sum = ~sum;

  return ((unsigned short)sum);

}

The function compute_ip_checksum sets the checksum field of the IP header to zero and then calls compute_checksum. The function compute_checksum takes the data and its length in bytes as input parameters. It sums up all 16-bit words; if there is an odd number of bytes, it pads the last byte with a zero byte and adds it. After summing all the words, it folds the sum to 16 bits by adding the carry to the result. Finally, it takes the one’s complement of the sum and casts it to a 16-bit unsigned short. You can refer to part 1 for a more detailed description of the algorithm.

Note that iphdr, tcphdr and udphdr are Linux data structures representing the IP header, TCP header and UDP header respectively. You may want to Google for more information in order to understand the code.

TCP Header Checksum Calculation Implementation
To calculate the TCP checksum, you can use the code below,

/* set tcp checksum: given IP header and tcp segment */

void compute_tcp_checksum(struct iphdr *pIph, unsigned short *ipPayload) {

    register unsigned long sum = 0;

    unsigned short tcpLen = ntohs(pIph->tot_len) - (pIph->ihl<<2);

    struct tcphdr *tcphdrp = (struct tcphdr*)(ipPayload);

    //add the pseudo header 

    //the source ip

    sum += (pIph->saddr>>16)&0xFFFF;

    sum += (pIph->saddr)&0xFFFF;

    //the dest ip

    sum += (pIph->daddr>>16)&0xFFFF;

    sum += (pIph->daddr)&0xFFFF;

    //protocol and reserved: 6

    sum += htons(IPPROTO_TCP);

    //the length

    sum += htons(tcpLen);

 

    //add the IP payload

    //initialize checksum to 0

    tcphdrp->check = 0;

    while (tcpLen > 1) {

        sum += * ipPayload++;

        tcpLen -= 2;

    }

    //if any bytes left, pad the bytes and add

    if(tcpLen > 0) {

        //printf("+++++++++++padding, %d\n", tcpLen);

        sum += ((*ipPayload)&htons(0xFF00));

    }

    //Fold 32-bit sum to 16 bits: add carry to result

    while (sum>>16) {

        sum = (sum & 0xffff) + (sum >> 16);

    }

    sum = ~sum;

    //set computation result

    tcphdrp->check = (unsigned short)sum;

}

The function sums the TCP pseudo header first, then the IP payload, which is the TCP segment. It also pads the last byte if there is an odd number of bytes. For a detailed description of the algorithm, please refer to the comments in the code and to part 1.
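To make the pseudo header layout easier to follow, here is a stand-alone sketch that works on plain byte arrays instead of the Linux header structures. It is a variant for illustration, not the code used in this series: the names add_words and transport_checksum are hypothetical, bytes are combined as big-endian 16-bit words so the result does not depend on the byte order of the machine, and the checksum field inside the segment is assumed to be zero when computing.

```c
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running sum as big-endian 16-bit words.
 * An odd trailing byte is padded with a zero byte. */
static uint32_t add_words(uint32_t sum, const uint8_t *data, size_t len) {
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    }
    if (i < len) {
        sum += (uint32_t)(data[i] << 8);
    }
    return sum;
}

/* Hypothetical helper: checksum over the IPv4 pseudo header plus a TCP or
 * UDP segment of at most 65535 bytes. Returns the checksum as a number;
 * store it in the segment high byte first. */
uint16_t transport_checksum(const uint8_t src[4], const uint8_t dst[4],
                            uint8_t protocol, const uint8_t *segment, size_t len) {
    uint32_t sum = 0;
    sum = add_words(sum, src, 4);          /* source IP */
    sum = add_words(sum, dst, 4);          /* destination IP */
    sum += protocol;                       /* zero byte followed by the protocol number */
    sum += (uint32_t)len;                  /* TCP or UDP length */
    sum = add_words(sum, segment, len);    /* the segment itself */
    while (sum >> 16) {                    /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)~sum;                 /* one's complement */
}
```

A useful property for testing: once the returned value is written into the checksum field, running the same function over the segment again yields 0.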

UDP Header Checksum Calculation Implementation
To calculate the UDP checksum, one can follow the code below,

/* set udp checksum: given IP header and UDP datagram */

void compute_udp_checksum(struct iphdr *pIph, unsigned short *ipPayload) {

    register unsigned long sum = 0;

    struct udphdr *udphdrp = (struct udphdr*)(ipPayload);

    unsigned short udpLen = ntohs(udphdrp->len);    //the length field is in network byte order

    //printf("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~udp len=%d\n", udpLen);

    //add the pseudo header 

    //printf("add pseudo header\n");

    //the source ip

    sum += (pIph->saddr>>16)&0xFFFF;

    sum += (pIph->saddr)&0xFFFF;

    //the dest ip

    sum += (pIph->daddr>>16)&0xFFFF;

    sum += (pIph->daddr)&0xFFFF;

    //protocol and reserved: 17

    sum += htons(IPPROTO_UDP);

    //the length

    sum += udphdrp->len;

 

    //add the IP payload

    //printf("add ip payload\n");

    //initialize checksum to 0

    udphdrp->check = 0;

    while (udpLen > 1) {

        sum += * ipPayload++;

        udpLen -= 2;

    }

    //if any bytes left, pad the bytes and add

    if(udpLen > 0) {

        //printf("+++++++++++++++padding: %d\n", udpLen);

        sum += ((*ipPayload)&htons(0xFF00));

    }

      //Fold sum to 16 bits: add carrier to result

    //printf("add carriern");

      while (sum>>16) {

          sum = (sum & 0xffff) + (sum >> 16);

      }

    //printf("one's complementn");

      sum = ~sum;

    //set computation result

    //0x0000 means "checksum not computed", so a computed zero is sent as 0xFFFF

    udphdrp->check = ((unsigned short)sum == 0x0000)?0xFFFF:(unsigned short)sum;

}

The code is similar to the TCP checksum computation, except that when the checksum is computed as all 0s, it is set to all 1s, because 0x0000 is reserved to indicate that the checksum was not computed. Refer to Part 1 for a detailed description of the algorithm.

All the code can be downloaded in Part 3.

How to Calculate IP/TCP/UDP Checksum–Part 1 Theory

This is the first part of How to Calculate IP/TCP/UDP Checksum. The two parts that follow are Part 2 Implementation and Part 3 Usage Example and Validation.

This part focuses on the algorithms. We’ll go through IP/TCP/UDP one by one.

IP Header Checksum Calculation

The IP checksum is a 16-bit field in the IP header used to detect errors in the IP header. It equals the one’s complement of the one’s complement sum of all 16-bit words in the IP header. The checksum field is set to all zeros while the checksum is computed.

The one’s complement sum is calculated by summing all the numbers and adding the carry (or carries) back into the result. The one’s complement of a number is obtained by inverting every bit in its binary representation, so 0s become 1s and 1s become 0s.

For example, suppose an IP header is 0x4500003044224000800600008c7c19acae241e2b, where the checksum field (the sixth 16-bit word) is already set to 0000.
We start by calculating the one’s complement sum. First, divide the header into 16-bit words and sum them up,

4500 + 0030 + 4422 + 4000 + 8006 + 0000 + 8c7c + 19ac + ae24 + 1e2b = 2BBCF

Next fold the result into 16 bits by adding the carry to the result,

2 +  BBCF  = BBD1

The final step is to compute the one’s complement of the one’s complement sum,

BBD1 = 1011101111010001

IP checksum = one’s complement(1011101111010001) = 0100010000101110 = 442E

Note that the IP header needs to be parsed at each hop, because the IP addresses are needed to route the packet. To detect errors in the IP header, the checksum is validated at every hop.

The validation uses the same algorithm, but this time the checksum field holds the transmitted value 442E instead of zeros.

2BBCF + 442E = 2FFFD, then 2 + FFFD = FFFF

The one’s complement of FFFF is 0.

At validation, the checksum computation should evaluate to 0 if the IP header is correct.
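The worked example above can be reproduced with a short C sketch. It treats the header as a byte array and reads 16-bit words in network (big-endian) order, so it does not depend on host byte order. The function name inet_checksum and the array example_header are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement of the one's complement sum over len bytes.
   An odd trailing byte is padded on the right with zeros.
   The 32-bit accumulator is ample for header-sized inputs. */
uint16_t inet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);

    /* Sum all complete 16-bit words. */
    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    /* Pad a leftover byte with zeros on the right. */
    if (i < len) {
        sum += (uint32_t)data[i] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}

/* The 20-byte example header from the text, checksum bytes 10-11 zeroed. */
static const uint8_t example_header[20] = {
    0x45, 0x00, 0x00, 0x30, 0x44, 0x22, 0x40, 0x00, 0x80, 0x06,
    0x00, 0x00, 0x8c, 0x7c, 0x19, 0xac, 0xae, 0x24, 0x1e, 0x2b
};
```

Calling inet_checksum(example_header, 20) yields 0x442E. Writing 0x44 and 0x2E into bytes 10 and 11 and calling the function again yields 0, which is the validation result described above.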

TCP Checksum Calculation

The TCP checksum is a 16-bit field in the TCP header used for error detection. It is computed over the TCP segment (plus a padding byte if needed) and a 12-byte TCP pseudo header created on the fly. Like the IP checksum, the TCP checksum is the one’s complement of the one’s complement sum of all 16-bit words in the data being checksummed.

Below is a figure that illustrates the data used to calculate TCP checksum,

Figure 1. TCP Checksum Computation Data

As shown in the figure, the pseudo header consists of 5 fields,

  • source address: 32 bits/4 bytes, taken from IP header
  • destination address: 32 bits/4 bytes, taken from IP header
  • reserved: 8 bits/1 byte, all zeros
  • protocol: 8 bits/1 byte, taken from IP header. In case of TCP, this should always be 6, which is the assigned protocol number for TCP.
  • TCP length: 16 bits/2 bytes, the length of the TCP segment, including TCP header and TCP data. Note that this field is not available in the TCP header, so it is computed on the fly.

Note that TCP pseudo header does not really exist, and it’s not transmitted over the network. It’s constructed on the fly to compute the checksum.

If a TCP segment contains an odd number of octets to be checksummed, the last octet is padded on the right with zeros to form a 16-bit word. The padding is not part of the TCP segment and is therefore not transmitted.

Also note that the checksum field of the TCP header must be set to zeros before the checksum is calculated, and it is set to the computed value afterwards.

When a TCP packet is received at the destination, the receiving TCP code performs the same checksum calculation and checks for a mismatch. If there is one, the packet contains an error and is discarded. The same validation logic used for the IP header checksum can be used.
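As a minimal sketch of the steps above, the following C code builds the 12-byte pseudo header, sums it together with the TCP segment (padding an odd last octet), folds the carries and complements the result. It works on raw byte arrays in network order; the names sum_words and tcp_checksum_sketch are hypothetical, and the caller is assumed to have zeroed the checksum field (bytes 16 and 17 of the segment) before computing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running sum as big-endian 16-bit words.
   An odd trailing byte is padded on the right with zeros. */
static uint32_t sum_words(uint32_t sum, const uint8_t *data, size_t len) {
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    if (i < len) {
        sum += (uint32_t)data[i] << 8;
    }
    return sum;
}

/* Checksum over pseudo header + TCP segment (header and data). */
uint16_t tcp_checksum_sketch(const uint8_t src[4], const uint8_t dst[4],
                             const uint8_t *segment, uint16_t seg_len) {
    uint8_t pseudo[12];
    uint32_t sum;
    int i;

    assert(segment != NULL);

    for (i = 0; i < 4; i++) {
        pseudo[i] = src[i];         /* source address */
        pseudo[4 + i] = dst[i];     /* destination address */
    }
    pseudo[8] = 0;                               /* reserved */
    pseudo[9] = 6;                               /* protocol: TCP */
    pseudo[10] = (uint8_t)(seg_len >> 8);        /* TCP length, high byte */
    pseudo[11] = (uint8_t)(seg_len & 0xFF);      /* TCP length, low byte */

    sum = sum_words(0, pseudo, sizeof pseudo);
    sum = sum_words(sum, segment, seg_len);

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

For validation, the receiver can call the same function on the segment as received, with the checksum field left in place. A result of 0 indicates that no error was detected.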

UDP Checksum Calculation

The UDP checksum calculation is similar to the TCP checksum calculation. The checksum is also a 16-bit field holding the one’s complement of the one’s complement sum of a pseudo UDP header plus the UDP datagram.
The pseudo UDP header also consists of 5 fields,

  • source address: 32 bits/4 bytes, taken from IP header
  • destination address: 32 bits/4 bytes, taken from IP header
  • reserved: 8 bits/1 byte, set to all 0s.
  • protocol: 8 bits/1 byte, taken from IP header
  • length: 16 bits/2 bytes. Because the UDP header has a length field that gives the length of the entire datagram, including UDP header and data, the value from the UDP header is used. Note that this differs from the TCP pseudo header, where the length is computed on the fly. Both values indicate the header plus payload length.

Note that the UDP checksum is optional. If it is not computed, the field is set to all 0s. This could cause confusion, because a computed checksum can also come out as all 0s. To avoid the ambiguity, if the checksum is computed as all 0s, it is set to all 1s (which is equivalent in one’s complement arithmetic).
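The substitution can be illustrated with a small C sketch. It shows that replacing a computed 0x0000 with 0xFFFF still passes the receiver's validation, because both values represent zero in one's complement arithmetic. The function names are hypothetical, and data_sum stands for the already accumulated sum of the pseudo header and datagram with the checksum field zeroed.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit running sum to 16 bits and take the one's complement. */
uint16_t fold_and_complement(uint32_t sum) {
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}

/* Map a computed checksum to the value placed in the UDP header:
   0x0000 is reserved for "not computed", so send 0xFFFF instead. */
uint16_t udp_checksum_on_wire(uint16_t computed) {
    uint16_t wire = (computed == 0x0000) ? 0xFFFF : computed;
    assert(wire != 0x0000);  /* the reserved value is never emitted */
    return wire;
}

/* Receiver side: returns 1 if the data sum plus the transmitted
   checksum folds and complements to zero, 0 otherwise. */
int udp_receiver_accepts(uint32_t data_sum, uint16_t wire_checksum) {
    return fold_and_complement(data_sum + wire_checksum) == 0x0000;
}
```

For instance, a data sum of 0xFFFF produces a computed checksum of 0x0000, which is sent as 0xFFFF. The receiver then adds 0xFFFF + 0xFFFF = 0x1FFFE, folds it to 0xFFFF, and the complement is 0, so the datagram is accepted.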

References

1. IP/TCP/UDP Headers: http://www.imengineering.com/Training/CISCO/TCPIPUDPheaders.pdf
2. TCP Checksum Calculation: http://www.tcpipguide.com/free/t_TCPChecksumCalculationandtheTCPPseudoHeader-2.htm
3. IP Checksum: http://www.netfor2.com/checksum.html
4. One’s complement: http://en.wikipedia.org/wiki/Ones’_complement
5. UDP datagram RFC: http://www.ietf.org/rfc/rfc768.txt