COP4610: Operating Systems & Concurrent Programming

Debugging Techniques

 

Assertions

Experienced programmers put self-checking code into their software, so that if a violation of a coding assumption occurs at runtime it will be caught early and enough information will be printed out to permit the source of the problem to be identified quickly. An example of the simplest kind of assertion is the built-in checks for array bounds and invalid pointers that are provided by the compilers for some programming languages. When one is programming in a language that does not have such checks built in, it is a good idea to put in a few by hand, in what appear to be the most critical places. The most logical places to perform such tests are at the interfaces between modules, both because it is convenient to do the checking there and because that is where errors are likely to occur (a person updating one module may not realize that what appears to be a local change could break code in a different module).

For example, if a function has a pointer as a parameter and is designed to rely on the parameter being non-null, it may be a good idea to put in a check for a null value even though you know from the larger context that there will be no calls to that subprogram with a null value. Global properties, such as "this program is never called with a null parameter value", are hard to verify by code examination, and they are even harder to preserve as programs are modified.

The C language has a standard mechanism for assertions. If you include the header <assert.h> you can use the macro

assert( expression )
to code assertions.

If expression evaluates to zero when the assertion is executed, the assert macro prints a message identifying the failure, with source file name and line number, to the stream stderr, and then terminates execution by calling abort().
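As a minimal sketch of the null-parameter check discussed above, the function below asserts its interface contract on entry. The name checked_length is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of a NUL-terminated string.
   The interface contract is that s is never null; the assertion
   checks that contract at the module boundary. */
size_t checked_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);   /* reports file and line, then aborts, if violated */
    while (s[n] != '\0')
        n++;
    return n;
}
```

A call such as checked_length("abc") passes the check silently; a call with a null pointer stops the program at the assertion rather than at some later, harder-to-explain point.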

You can also code your own specialized assertion macro, e.g.:

#define ASSERT(COND) \
  if (!(COND)) ...whatever code you want...

The advantage of such macro-based assertion implementations is that you can turn them off easily when you are not testing, if you are concerned about the performance penalty of executing the assertions. For example, the built-in C language assertions can be disabled by defining the macro NDEBUG before <assert.h> is included.
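One possible way to write such a macro so that it can be switched off is sketched below. The switch name ASSERT_DISABLED and the function safe_divide are assumptions made for this example; they play the role that NDEBUG plays for the standard assert. The do/while wrapper makes the macro behave like a single statement, and the parentheses around COND keep the negation from binding to only part of the expression.

```c
#include <stdio.h>
#include <stdlib.h>

/* ASSERT_DISABLED is a hypothetical switch, analogous to NDEBUG. */
#ifdef ASSERT_DISABLED
#define ASSERT(COND) ((void) 0)
#else
#define ASSERT(COND) \
    do { \
        if (!(COND)) { \
            fprintf(stderr, "%s:%d: assertion failed: %s\n", \
                    __FILE__, __LINE__, #COND); \
            abort(); \
        } \
    } while (0)
#endif

/* Example use: integer division that relies on a nonzero divisor. */
int safe_divide(int a, int b)
{
    ASSERT(b != 0);
    return a / b;
}
```

Compiling with -DASSERT_DISABLED removes the checks without any change to the source text.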

It is sometimes worth the trouble to write fairly complex assertion-checking subprograms. For example, if you have a program that relies on a particular list being kept in sorted order, you might write a function that runs down the list and verifies that the data is in order. Calls to this function can be placed in assertions at key points to help to catch errors.
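The sorted-list check just described might look something like the following sketch. The node layout and the names list_is_sorted and sorted_insert are hypothetical; a real program would use its own list type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked list node; the names are illustrative. */
struct node {
    int key;
    struct node *next;
};

/* Returns 1 if the keys are in nondecreasing order, 0 otherwise.
   An empty list counts as sorted. */
int list_is_sorted(const struct node *head)
{
    for (; head != NULL && head->next != NULL; head = head->next)
        if (head->key > head->next->key)
            return 0;
    return 1;
}

/* Example of an insertion routine that checks the invariant on
   entry and again on exit. */
void sorted_insert(struct node **head, struct node *item)
{
    struct node **link = head;

    assert(list_is_sorted(*head));
    while (*link != NULL && (*link)->key < item->key)
        link = &(*link)->next;
    item->next = *link;
    *link = item;
    assert(list_is_sorted(*head));
}
```

Because the check walks the whole list, it makes each insertion linear in cost even where the insertion itself is not, which is one reason to be able to compile such assertions out.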

Bisection

When a problem occurs, how does one isolate the place in the code where it is happening? Suppose your program dies with a "Segmentation fault" error message from the operating system. You guess that it is a problem with a pointer, and you find out where the program was when it crashed, either by examining the "core" dump file or by using a debugger.

On closer examination, it seems that the problem is none of the obvious things. You don't immediately see where you indexed off the end of an array, or tried to dereference a null or uninitialized pointer. You do see that a pointer that was initialized to a good value now seems to contain some garbage, and you wonder where and when it was corrupted.

This kind of problem can be solved by applying the bisection ("divide and conquer") technique. You devise some form of test or experiment that narrows down where the problem can be. You then repeat this process until you have narrowed the search enough that the source of the problem is obvious.

Bisection can be applied in many ways. One way is to divide up the timeline. You can look at execution traces (see below), and find the first place a problem shows up. Then add more detailed execution tracing to try to locate the problem more exactly. Another way is to divide up the text of the software. You can deactivate (e.g., by textual deletion, conditional compilation, or comments) certain portions of your code until the problem goes away, reactivate portions of the code until the problem reappears, and then repeat until you have narrowed down the scope enough to recognize the problem.
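For the corrupted-pointer scenario above, dividing up the timeline could be done with checkpoints along the following lines. This is only a sketch: the names watch_init, watch_check, and CHECKPOINT are invented for this example, and it assumes a single suspect pointer whose good value is known.

```c
#include <assert.h>
#include <stdio.h>

static const void *watch_expected;  /* value known to be good */
static int watch_first_bad_line;    /* 0 until a checkpoint fails */

/* Records the good value of the suspect pointer. */
void watch_init(const void *good_value)
{
    assert(good_value != NULL);
    watch_expected = good_value;
    watch_first_bad_line = 0;
}

/* Returns 1 if the pointer still has its good value, 0 otherwise.
   Only the first failing checkpoint is reported, since that is the
   one that bounds the search. */
int watch_check(const void *current, const char *file, int line)
{
    if (current == watch_expected)
        return 1;
    if (watch_first_bad_line == 0) {
        watch_first_bad_line = line;
        fprintf(stderr, "%s:%d: watched pointer changed from %p to %p\n",
                file, line, (void *) watch_expected, (void *) current);
    }
    return 0;
}

#define CHECKPOINT(PTR) watch_check((PTR), __FILE__, __LINE__)
```

Start with a CHECKPOINT roughly halfway through the suspect code. If it passes, the corruption happens later; if it fails, earlier. Moving or adding checkpoints in the remaining half repeats the bisection.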

Execution Tracing and Logging

To see whether your software is working, or to localize an error when one occurs, it is helpful to be able to see a trace of the program execution. There are some tracing tools (see below) that will allow you to trace certain generic aspects of a program's execution. For example, ctrace can trace every line executed, and strace can trace every system call made. In practice, such general tools tend to produce too much output, while still missing important details (such as the values of key variables) when you get down to really trying to isolate a bug. For this reason you will find yourself adding diagnostic output to your code.

When you do this, keep the following in mind:

  1. Don't skimp on quality. Putting some thought and care into coding diagnostic output will pay off by saving you time that otherwise will be spent in guessing what the output really means.

  2. Design data structures with support for tracing. No matter what the data structure, sooner or later you will run into a situation where you need diagnostic output to see what the data structure contains. Recognize this, and provide a method to print out a summary of the contents. It is generally a good idea for this output to be concise if it is part of an execution trace. For example, if you implement a stack structure, you might provide a subprogram to print out a single line of output using few characters for each item in the stack.

  3. Design the diagnostic output to be left in the code after it is in production. This means you need to have some way to enable and disable the output of the diagnostics, without making changes to your source code.
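Points 2 and 3 can be illustrated together with a small sketch. Everything named here is an assumption made for the example: the environment variable TRACE_LEVEL, the functions trace_set_level, trace_enabled, trace, and stack_trace, and the fixed-size stack layout.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static int trace_level = -1;    /* -1 means "not yet initialized" */

/* Overrides the level, e.g. from a command-line option. */
void trace_set_level(int level)
{
    trace_level = (level < 0) ? 0 : level;
}

/* Returns nonzero if messages at this level should be printed.
   On first use the level is read from the hypothetical environment
   variable TRACE_LEVEL; if it is unset, tracing is off. */
int trace_enabled(int level)
{
    if (trace_level < 0) {
        const char *s = getenv("TRACE_LEVEL");
        trace_set_level(s != NULL ? atoi(s) : 0);
    }
    return level <= trace_level;
}

/* printf-style trace output to stderr. Returns the result of
   vfprintf, or 0 if the message was suppressed. */
int trace(int level, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (!trace_enabled(level))
        return 0;
    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);
    va_end(ap);
    return n;
}

#define STACK_MAX 16
struct stack {
    int items[STACK_MAX];
    int top;                    /* number of items in use */
};

/* Traces the stack contents on one line, bottom item first. */
void stack_trace(int level, const struct stack *s)
{
    int i;

    assert(s != NULL && s->top >= 0 && s->top <= STACK_MAX);
    if (!trace_enabled(level))
        return;
    trace(level, "stack:");
    for (i = 0; i < s->top; i++)
        trace(level, " %d", s->items[i]);
    trace(level, "\n");
}
```

With this arrangement the trace calls stay in the production code, and running the program with TRACE_LEVEL set in the environment turns them on without recompiling.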

There are at least three good coding techniques for controlling the amount of diagnostic output:

Debugging Tools

There are many useful debugging tools. One kind is the interactive debugger. There are several such debuggers available in the Unix environment, including adb, dbx, and gdb. The one that was developed to work with the GNU C compiler is gdb. (See the specific notes on using gdb for more detail.)

Examples:

Execution tracing tools usually work by inserting code into your program (e.g., ctrace), but sometimes may be able to take advantage of operating system traps (e.g., strace) to trace some events without source-code modifications.

Examples:

While such tools are sometimes useful, they tend to produce so much output that it may be hard to find the particular thing you are looking for, and they will not print out the values of specific variables that may be needed to figure out what went wrong. Therefore, the existence of such tools does not eliminate the value of building specific trace capabilities into your software.

Memory allocation debugging tools can help you to detect and diagnose common dynamic memory management problems, such as memory leakage, allocation of wrong-sized objects, dereferencing uninitialized pointers, and heap corruption due to dangling references to freed memory.

Examples:

For more detail and more memory debugging tools, see the discussion of Dynamic Memory Mismanagement & Other Memory Usage Problems below.

Using Compiler Warnings

Compiler warning messages are not exactly a debugging technique, but you can avoid a lot of time spent debugging later if you always ask the compiler to give you all its optional warnings (with gcc this is "-pedantic" and "-Wall") and you then pay attention to correcting all the code that produced the warnings. By correcting I don't mean just suppressing the warnings, but really correcting the underlying problem. For example, if you get a warning about a pointer being used as an integer, you should not just routinely add a type cast "(int)" to get rid of the warning; you should ask yourself whether you really meant to convert a pointer to an integer, and whether there is another way to do what you want without such a violation of typing.
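As one illustration of finding another way, suppose the cast was being used to compute the position of an element within an array. The sketch below, in which the function name element_index is hypothetical, gets the same result by subtracting the pointers directly, which needs no conversion to an integer type.

```c
#include <assert.h>
#include <stddef.h>

/* Instead of casting pointers to int to compute a position, e.g.
       ((int) p - (int) base) / sizeof(int)
   subtract the pointers directly; the result has type ptrdiff_t
   and is already measured in elements. The caller is assumed to
   pass a pointer into the array that starts at base. */
ptrdiff_t element_index(const int *base, size_t count, const int *p)
{
    ptrdiff_t i = p - base;

    assert(i >= 0 && (size_t) i < count);
    return i;
}
```

The version with casts can silently truncate addresses on a system where pointers are wider than int; the pointer subtraction has no such problem and produces no warning.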

Defensive Programming Practices

There are many good defensive programming practices, which can reduce the need for debugging, or make debugging easier when you are forced to do it. There are so many good defensive techniques that we do not have time to do more than mention a few examples here. One example is to use executable assertions, already mentioned above. Another example is to take advantage of the type-checking capabilities of the programming language and compiler. In the case of the C language the required type checking is not very strong, but by turning on the optional compile-time checks one can get a bit more help from the compiler.

A valuable general defensive coding technique is incremental coding. You code by increments, and contrive ways to test what you have written before you continue to the next increment. In that way, you can usually limit the search for a bug to the portions of the code that have been added or changed recently. This technique works best if you have also developed an automated set of regression tests, so that it is easy to repeat all previous tests every time a change is made to the code base.

Linkage Problems

Compiling and linking a program echocli.c using the command

gcc -Wall -ansi -pedantic -o echocli echolib.o -lnsl echocli.c

under SunOS 5.6 resulted in the following error messages:

echocli.c: In function `main':
echocli.c:18: warning: implicit declaration of function `__xnet_socket'
echocli.c:27: warning: implicit declaration of function `__xnet_connect'
Undefined                       first referenced
 symbol                             in file
__xnet_socket                       /var/tmp/ccv_aGES1.o
__xnet_connect                      /var/tmp/ccv_aGES1.o
ld: fatal: Symbol referencing errors. No output written to echocli

The error messages are from the linker, which is named ld. They indicate that it could not find definitions for the subprograms named __xnet_socket and __xnet_connect.

The solution is to add the parameter -lsocket to the compilation command, from which it will be passed on to ld. The effect is for the linker to search for a library named libsocket.a, which has the required function definitions.

How would you know that this library is needed? In general, how would you find out which library has a definition for a given function?

This is likely to be system dependent, and is generally not well documented. You may sometimes find that the man-page for a function you are using specifies the library that should be used. For example, the Linux man-page for pthread_create() mentions that you need -lpthread, but the man-page for Solaris does not mention it.

If you do not find the information in the man pages, one way to discover it is to use the command nm, which lists the names of symbols that are defined in a library or object file. The following shell script will search two library directories, /lib and /usr/lib, for definitions of the symbol that is given as its parameter.

#!/bin/sh
# list all occurrences of global name $1 in the libraries
# /lib/*.a /usr/lib/*.a

NMOPTS="-g -p"
for NAME in /lib/*.a /usr/lib/*.a; do
   if [ -r "$NAME" ]; then
      # keep only text-segment (T) definitions whose names match $1
      nm $NMOPTS "$NAME" | grep ' T ' | grep "$1" > tmpfile
      if [ -s tmpfile ]; then
         echo "---$NAME---"
         cat tmpfile
      fi
   fi
done
rm -f tmpfile

Running this script with parameter "__xnet_connect" under SunOS 5.6 results in the following output:

xi>findname __xnet_connect
---/lib/libsocket.a---
0000000296 T __xnet_connect
---/usr/lib/libsocket.a---
0000000296 T __xnet_connect

This indicates that there are definitions of __xnet_connect in two different directories, which both happen to be on the default search path for the linker.

If the directory containing the library is not on the default linker search path, that directory can be added to the path by using the compiler/linker parameter -Ldirectoryname.

It is a fairly common problem that a system will have more than one version of a function with the same name. In the case above it happens we are safe, since comparing the two libraries /lib/libsocket.a and /usr/lib/libsocket.a shows they are bit-for-bit the same. If this did not turn out to be true, we would have to deal with the following questions:

Dynamic Memory Mismanagement & Other Memory Usage Problems

There are many errors one can make using pointers and dynamic memory management. Some of them will cause a failure at a point in execution that is far past the point of the error. This makes locating the error very difficult.

An example of such an error is when a program stores a value outside the bounds of a dynamically allocated object. For example, consider the program pointer0.c, in directory examples/pointers/. It contains the following code, with an intentionally obvious error in pointer usage:

int * p = (int *) malloc (100);
p[-1] = 0;

When we compiled this program, even with the "-ansi -pedantic -Wall" options of the gcc compiler, there were no warnings, but when we ran it (using the Red Hat 7.3 Linux distribution) the error caused a segmentation fault. When we used the gdb debugger to find the location of the fault, we found that it occurred after return from the main program, in finalization code of the C runtime library:

Program received signal SIGSEGV, Segmentation fault.
0x4207ae76 in chunk_free () from /lib/i686/libc.so.6
(gdb) where
#0  0x4207ae76 in chunk_free () from /lib/i686/libc.so.6
#1  0x4207ac24 in free () from /lib/i686/libc.so.6
#2  0x4211581d in __deregister_frame () from /lib/i686/libc.so.6
#3  0x4211665a in _fini () from /lib/i686/libc.so.6
#4  0x4000bbd2 in _dl_fini () from /lib/ld-linux.so.2
#5  0x4202bb6b in exit () from /lib/i686/libc.so.6
#6  0x420174a2 in __libc_start_main () from /lib/i686/libc.so.6

In this kind of situation, the debugger is not of much direct help. It gave us only a hint as to the nature of the error: we overwrote something that is used by free(). In general, it would be very hard to locate such an error in our code. In this case, because our code is so short and because we intentionally committed the error, we know what happened: we overwrote the header of a memory region allocated by malloc(). If this were a real program and we were following an incremental coding and testing plan, we might be able to limit the search for the error to the code we had modified most recently. If we were lucky, inspecting that code would reveal the error. If we were unlucky, though, the error might be one that slipped by initial testing and is now hiding in a very large body of code. Is there any help for this situation?
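Part of the answer is prevention, in the spirit of the assertions discussed earlier. The following sketch is not taken from pointer0.c; the type int_array and its functions are hypothetical. It routes every access through a checked accessor, so that an index such as -1 is caught at the point of the error rather than in free().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper that keeps the element count next to the
   storage so that every access can be checked. */
struct int_array {
    int *data;
    size_t len;
};

/* Allocates room for len ints (not len bytes). Returns 0 on
   success and -1 if malloc fails. */
int int_array_init(struct int_array *a, size_t len)
{
    a->data = malloc(len * sizeof *a->data);
    a->len = (a->data != NULL) ? len : 0;
    return (a->data != NULL) ? 0 : -1;
}

/* Returns a pointer to element i after checking the bounds.
   Because i is unsigned, a negative index such as -1 converts to
   a very large value and fails the same check. */
int *int_array_at(struct int_array *a, size_t i)
{
    assert(a != NULL && a->data != NULL);
    assert(i < a->len);
    return &a->data[i];
}

/* Releases the storage and leaves the wrapper in an empty state. */
void int_array_free(struct int_array *a)
{
    free(a->data);
    a->data = NULL;
    a->len = 0;
}
```

Such checks only cover accesses that go through the accessor, so they complement rather than replace the tools described next.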

There are tools that can instrument a program to catch such an error at the time it occurs, so that one can locate it and fix it. Probably the best-known such tool is Purify(TM), a licensed commercial product. There are also some open source software tools that provide similar checks, including the following:

For example, valgrind 1.0.4 found the error in the example program pointer0.c above (valgrind -v pointer0) and reported it as follows:

==18441== valgrind-1.0.4, a memory error detector for x86 GNU/Linux.
==18441== Copyright (C) 2000-2002, and GNU GPL'd, by Julian Seward.
==18441== Estimated CPU clock rate is 932 MHz
==18441== For more details, rerun with: -v
==18441== 
==18441== Invalid write of size 4
==18441==    at 0x804841B: main (pointer0.c:15)
==18441==    by 0x40262177: __libc_start_main (../sysdeps/generic/libc-start.c:129)
==18441==    by 0x8048321: __libc_start_main@@GLIBC_2.0 (in /home/courses/cop4610/examples/pointers/pointer0)
==18441==    Address 0x40C98020 is 4 bytes before a block of size 100 alloc'd
==18441==    at 0x40048434: malloc (vg_clientfuncs.c:100)
==18441==    by 0x8048410: main (pointer0.c:14)
==18441==    by 0x40262177: __libc_start_main (../sysdeps/generic/libc-start.c:129)
==18441==    by 0x8048321: __libc_start_main@@GLIBC_2.0 (in /home/courses/cop4610/examples/pointers/pointer0)
==18441== 
==18441== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==18441== malloc/free: in use at exit: 100 bytes in 1 blocks.
==18441== malloc/free: 1 allocs, 0 frees, 100 bytes allocated.
==18441== For a detailed leak analysis,  rerun with: --leak-check=yes
==18441== For counts of detected errors, rerun with: -v
T. P. Baker.