COP4610: Operating Systems & Concurrent Programming

Introduction to the UNIX API

 

Two Example Programs

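We will look at two example programs, and use them to discuss several different subjects, in an interleaved fashion. The subjects include the C language, the Unix API, processes, and concurrent execution, along with some common programming pitfalls.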

A Simple Program Calling fork()

simple_fork0.c:

#define _XOPEN_SOURCE 500
#include <unistd.h>
#include <stdio.h>
int main(int argc, char *argv[])
{ int i = 0;
  while (i < 3) {
    fprintf (stdout, "i = %d in %d\n", i, getpid ());
    fork ();
    i++; }
  return 0; }

This simple program can serve as an introduction to several topics. First, look at it as an example of the C language.

iostream versus stdio

#include <iostream>
...
std::cout << "i = " << i << " in " << getpid () << std::endl;

#include <stdio.h>
...
fprintf (stdout, "i = %d in %d\n", i, getpid ());

Students coming into this course are expected to be proficient in C++ programming. If you have also taken a course in C programming, such as FSU's CGS 3408, you should also know how to program in C. Since C++ is an extension of the C language, that might mean you are also proficient in C programming. However, C++ is a large and complex language. A person can be a proficient programmer using only a subset. The subset of C++ taught in most courses does not include some of the C language features, because C++ has other features that provide the same functionality and are viewed by some people as "better".

One example is output. The first output method taught in C++ courses is the output insertion operator "<<" of the ostream class, declared in the header iostream (iostream.h in pre-standard C++). You can't use that method in the C language, since it does not have classes. There are several standard ways to do output in C, one of which is the fprintf function declared in stdio.h. You can use fprintf in C++, too, and some people prefer to use it over the stream-oriented output.

If you have not used fprintf before, take a look at the man-page for it. The concept is simple.

The first parameter is a pointer to a C stream object (type FILE *). The header stdio.h defines three standard streams: stdin (normally keyboard input), stdout (normally console output), and stderr (used for error messages).

The second parameter is a format string. The format string is printed out, and as it is printed out certain portions are treated as specifying where output of other objects should be inserted in the format string. In the example, the two substrings "%d" specify that the decimal (d) representation of an int value should be inserted. The substring "\n" specifies that a new output line should be started at that point.

The remaining parameters specify the values that should be inserted into the format string, where it calls for values to be inserted. In the example, the two values provided are given by the variable "i" and the function call "getpid ()".
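
To see how the format string and the remaining arguments fit together, here is a minimal sketch that builds the same line in memory with snprintf(), a sibling of fprintf() that writes into a character array instead of a stream. The function name format_line and the sample values are illustrative only and are not part of the course examples.

```c
#include <stdio.h>

/* Build the line that simple_fork0.c prints, but in a caller-supplied
   buffer.  Returns the number of characters the complete line needs,
   not counting the terminating '\0' (this is what snprintf returns). */
int format_line (char *buf, size_t size, int i, int pid)
{
  /* the first %d is replaced by i, the second by pid; \n ends the line */
  return snprintf (buf, size, "i = %d in %d\n", i, pid);
}
```

For example, format_line (buf, sizeof buf, 2, 1234) leaves the 14 characters of "i = 2 in 1234\n" in buf.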

As Example of Unix API

#define _XOPEN_SOURCE 500
#include <unistd.h>
...
    fprintf (stdout, "i = %d in %d\n", i, getpid ());
    fork ();
...

The definition of the macro _XOPEN_SOURCE is needed to ensure that any following headers that are included are interpreted according to the Unix application program interface standard defined by X/Open, an industry standards group that has since been merged into The Open Group. It also limits the program to the standard operating system API: the headers expose only the declarations that belong to the standard, so attempts to go beyond the API defined by X/Open are caught at compile time, typically as undeclared identifiers.

This feature is important for source code portability. If you leave this out, you will get whatever default dialect of the Unix API is provided by the local operating system. If your code does not match that dialect, you will get a flood of syntax error messages from the compiler.

In practice, these standards are helpful, but not foolproof. The implementors have not yet caught up with them completely. For example, Sun Microsystems generally keeps in very close conformance to the standards, but Linux falls far behind. For this reason, it may be a good idea to use the version of the standard one older than the latest. You will also sometimes find that you must use extensions. In that case, you can define _XOPEN_SOURCE_EXTENDED, and the compiler will give you the X/Open headers, but allow you to include other API headers as well.

The value of _XOPEN_SOURCE indicates which version of the X/Open standard should be applied. This standard is continually being "improved". The value 500 indicates Issue 5 of the specification (XPG5, also known as the Single UNIX Specification Version 2, or UNIX 98). At the time of this writing, that was one version behind the latest X/Open standard.

There are other Unix API standards, which can be invoked by defining other symbols, such as _POSIX_C_SOURCE. For some applications, one may be preferable to another, depending on which OS API features you want to use.
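
A feature test macro only has an effect if it is defined before the first header is included. The following sketch shows that placement and reports the X/Open version the headers claim to implement, using the _XOPEN_VERSION macro that unistd.h defines on conforming systems. The function name xopen_version_of_headers is made up for this illustration, and the value reported depends on the system.

```c
/* The feature test macro must precede every #include. */
#define _XOPEN_SOURCE 500
#include <unistd.h>

/* Return the X/Open version reported by the headers,
   or -1 if the headers do not define _XOPEN_VERSION. */
long xopen_version_of_headers (void)
{
#ifdef _XOPEN_VERSION
  return (long) _XOPEN_VERSION;
#else
  return -1L;
#endif
}
```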

The header unistd.h defines the core portion of the standard application program interface to the Unix operating system. In this case only two elements of that are used, the functions fork() and getpid().

The effect of getpid() is to return an integer value that uniquely identifies the currently executing process.

The effect of fork() is described below.

As Example of Poor C Programming

non-portable:

fprintf (stdout, "i = %d in %d\n", i, getpid ());

portable:

fprintf (stdout, "i = %d in %d\n", i, (int) getpid ());

The example uses the "%d" format to print out the value returned by getpid(). That function is defined as returning a value of type pid_t. On many systems pid_t and int happen to be the same type. However, if this code is compiled on a system where that is not true, the argument no longer matches the format: the compiler may warn about it, or the program may print a wrong value, and the code will need to be modified. A good C programmer would at least cast the pid_t value to int.
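
A variation on the cast, sketched here as an assumption rather than as the course's recommended form, is to cast to long and print with "%ld". This keeps the argument and the format in agreement on any system where pid_t is no wider than long. The function name format_pid is hypothetical.

```c
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Format a process id without assuming that pid_t is the same type
   as int.  The cast makes the argument a long, which is exactly what
   the %ld conversion expects. */
int format_pid (char *buf, size_t size, pid_t pid)
{
  return snprintf (buf, size, "%ld", (long) pid);
}
```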

As Introduction to Processes

A process is a program, executing within its own address space. Processes execute logically in parallel. In reality, if there is only one CPU, the operating system arranges to interleave the execution using the single CPU. How it does this will be covered later in the course.

The effect of fork() is to create a new process, which is a duplicate of the currently executing process. This simple example throws away the value returned by fork(). A realistic application would not. The value returned by fork() is different for the original (called "parent") process and for the new (called "child") process. The parent process receives the process id of the child. The child receives the value zero. If fork() fails, no child is created and the caller receives -1.
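
The three possible return values can be told apart as in the following sketch. The child sends its own process id to the parent through a pipe, and the parent compares it with the value fork() returned. The function name fork_returns_child_pid is hypothetical, and the sketch relies on pipe() and waitpid(), which are not otherwise needed by simple_fork0.c.

```c
#define _XOPEN_SOURCE 500
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork once.  The child reports its own pid through a pipe, and the
   parent checks it against the value that fork() returned.
   Returns 1 if they match, 0 if they do not, -1 on error. */
int fork_returns_child_pid (void)
{
  int fd[2];
  pid_t child, reported = 0;
  ssize_t n;

  if (pipe (fd) == -1) return -1;
  child = fork ();
  if (child == -1) {               /* failure: no child was created */
    close (fd[0]);
    close (fd[1]);
    return -1;
  }
  if (child == 0) {                /* child: fork() returned zero */
    pid_t me = getpid ();
    close (fd[0]);
    if (write (fd[1], &me, sizeof me) != (ssize_t) sizeof me) _exit (1);
    _exit (0);
  }
  /* parent: fork() returned the process id of the child */
  close (fd[1]);
  n = read (fd[0], &reported, sizeof reported);
  close (fd[0]);
  waitpid (child, NULL, 0);        /* wait for the child to terminate */
  if (n != (ssize_t) sizeof reported) return -1;
  return reported == child;
}
```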

As Example of Concurrent Execution

[Figure: forking diagram]

Observe how the fork operation works, and how this determines the output.

The actual order of the output may vary, according to how the execution of the processes is interleaved.

The interleaving of the output in this case turns out to be by whole lines. Generally, a programmer would need to take more care than in this example, to ensure that the interleaving does not split up lines, rendering the output unreadable.
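
One way to take that care, offered here as a sketch and not as the method used by the course examples, is to format each line completely in memory and hand it to the operating system with a single write() call. POSIX guarantees that such a write to a pipe is not interleaved with other writes if it is no longer than PIPE_BUF bytes; for other kinds of files the single call makes split lines unlikely but is not a guarantee. The function name write_whole_line is hypothetical.

```c
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <unistd.h>

/* Format one complete line in memory, then pass it to the kernel with
   a single write() call, so that the line is not assembled piecemeal
   on the shared output.  Returns 0 on success, -1 on error. */
int write_whole_line (int fd, int i, long pid)
{
  char line[64];
  int len = snprintf (line, sizeof line, "i = %d in %ld\n", i, pid);

  if (len < 0 || (size_t) len >= sizeof line) return -1;  /* did not fit */
  return write (fd, line, (size_t) len) == (ssize_t) len ? 0 : -1;
}
```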

Using a File for Output

simple_fork1.c:

...
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

int main(int argc, char *argv[])
{ char pathname [32];
  int i = 0;
  int outfildes;
  FILE *ofstream;
  if (argc < 2) {
    fprintf (stdout, "Please enter a filename or pathname: ");
    if (!gets (pathname)) {
      fprintf (stderr, "you must provide a filename\n");
      exit (-1);
    }
  } else strcpy (pathname, argv[1]);
  if ((outfildes =
       open (pathname, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR)) == -1) {
    fprintf (stderr,
      "could not open file \"%s\" for writing: %s\n",
      pathname, strerror (errno));
    exit (-2);  }
  if (!(ofstream = fdopen (outfildes, "w"))) {
    perror ("could not open stream for writing");
    exit (-3);  }
  while (i < 3) {
    fprintf (stdout, "i = %d in %d\n", i, getpid ());
    fprintf (ofstream, "i = %d in %d\n", i, getpid ());
    fork ();
    i++;  }
  fclose (ofstream);  /* also closes the underlying descriptor outfildes */
  return 0;
}

This program extends the previous example, to read the name of a file from the argument list or the standard input stream, and to write output to that file as well as the standard output stream. This introduces the Unix filesystem API, as well as some subtle interactions of fork() and I/O. It also introduces some more C language features that are not typically used in C++, along with some C programming pitfalls. Some of the new features introduced are indicated with emphasis font. We will look at each part of this program in more detail, below.

As Example of C Code

#include <string.h>
...
int main(int argc, char *argv[]) {
  char pathname [32];
...
  if (argc < 2) {
...
  } else strcpy (pathname, argv[1]);
...
  return 0;
}

Command line arguments are conventionally passed to a main program via the parameters argc (argument count) and argv (array of pointers to argument strings). The first "argument" (argv[0]) is conventionally the name of the program. Thus, argc >= 2 if there is at least one other argument on the command line.
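
The test on argc can be isolated in a small helper, sketched below. The name first_argument is hypothetical; simple_fork1.c performs the same test inline.

```c
#include <stddef.h>

/* Return the first real command-line argument, or NULL if the program
   was started without one.  argv[0] is the program name, so the first
   argument, when present, is argv[1]. */
const char *first_argument (int argc, char *argv[])
{
  return (argc >= 2) ? argv[1] : NULL;
}
```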

Like a function subprogram, a main program returns a value. The value is conventionally used to indicate whether the program execution succeeded. Some applications may use different values to indicate different outcomes of the execution, or different kinds of failure. The value 0 conventionally means success, and any nonzero value means failure. Only the low-order 8 bits of the value reach the parent process, so the exit (-1) in the example is seen as status 255. Unix system calls follow a related convention: most of them return -1 to indicate failure, but there are a few exceptions.
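
How a termination status looks from the parent's side can be checked with a short experiment. The sketch below uses fork() and waitpid(), which are discussed elsewhere in these notes; the function name status_seen_by_parent is hypothetical.

```c
#define _XOPEN_SOURCE 500
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a child that terminates with the given status, and return the
   status as the parent sees it (or -1 if something went wrong). */
int status_seen_by_parent (int code)
{
  int status;
  pid_t child = fork ();

  if (child == -1) return -1;
  if (child == 0) _exit (code);          /* child: terminate at once */
  if (waitpid (child, &status, 0) == -1) return -1;
  /* WEXITSTATUS extracts the low-order 8 bits of the child's status */
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```

With this helper, a child that terminates with status 3 is reported as 3, while a child that terminates with status -1 is reported as 255.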

The program uses C strings, which are arrays of characters. Some C++ courses avoid talking about C strings, and even try to avoid talking about C arrays until late in the course, because they are awkward and prone to programming errors. Instead, the course may introduce a string class. FSU's introductory C++ courses do cover strings and arrays, but they may not cover the string manipulation functions defined in string.h. The function strcpy() acts like an assignment statement for strings. It copies the string specified by the second argument to the string specified by the first argument. It is very convenient, but very prone to misuse.

strcpy dangers

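The use of strcpy() is "deprecated". It copies characters until it reaches the end of the source string, without checking the size of the destination array. In the example, pathname has room for only 32 characters, including the terminating null character, so an argv[1] longer than 31 characters is written past the end of the array.

This is called an unchecked buffer overflow.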
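
A checked copy can be sketched as follows: the length of the source is compared with the size of the destination before anything is copied. The name copy_if_fits is hypothetical and is not a standard library function.

```c
#include <string.h>

/* Copy src into dst only if it fits, including the terminating '\0'.
   Returns 0 on success, or -1 (leaving dst untouched) if src is too
   long for a destination of dstsize characters. */
int copy_if_fits (char *dst, size_t dstsize, const char *src)
{
  if (strlen (src) >= dstsize) return -1;   /* no room for the '\0' */
  strcpy (dst, src);                        /* safe: length was checked */
  return 0;
}
```

In simple_fork1.c the call would take the form copy_if_fits (pathname, sizeof pathname, argv[1]), with an error message if the result is -1.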

As Example of C Code

#include <stdlib.h>
...
if (!gets (pathname)) {
  fprintf (stderr, "you must provide a filename\n");
  exit (-1);
}

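The function gets() reads one line from the standard input stream into the array it is given, and returns a null pointer if it reaches end of file before reading any characters, or if a read error occurs. The stream stderr is the standard stream for error messages, so the message normally still reaches the user when stdout has been redirected. The function exit() terminates the process, after flushing and closing its open streams, and passes its argument back to the parent process as the termination status.

The man pages are divided into numbered sections: section 2 covers system calls and section 3 covers library functions. On Linux, "man exit" shows the first page named exit that is found, while "man -S 2 exit" and "man -S 3 exit" restrict the search to section 2 and section 3, respectively. Solaris spells the section option -s rather than -S.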

gets dangers

The use of gets() is "deprecated", and the function was later removed from the C standard altogether (in C11).

The use of gets() in the example program is another example of an unchecked buffer overflow.

While it is possible to use strcpy() safely, by first calling strlen() to find out the length of the source string, there is no way to use gets() safely. Therefore, the gcc compiler gives a warning.

Digression on Buffer Overflows

The buffer overflow problems in our example come from the uses of the functions gets() and strcpy(). The function strcpy() must be used with great care, and the function gets() should be avoided entirely because it creates an unavoidable risk of undetected array bounds overflow. They are just two of many ways a buffer overflow can occur. Programmers frequently create their own unchecked buffer overflows, by not taking care when they write a loop that iterates over the elements of an array.

Unchecked buffer overflows are one of the two top C programming errors. The other is the use of uninitialized pointers. Both can cause behaviors that can be both very destructive and very difficult to detect in routine testing.

Please see the notes on buffer overflows and stack crashing for more explanation of the nature of this problem, and how it is both a source of unreliability and a potentially exploitable hole in system security.

strcpy versus strncpy

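/* destination is an array; maxchars is its size, sizeof (destination) */
strncpy (destination, source, maxchars - 1);
destination [maxchars - 1] = '\0';  /* strncpy adds no '\0' if source is too long */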

The function strncpy() may be used to replace strcpy(), to avoid the need for calling strlen(). It has an argument that allows specification of the length of the array argument, and so if the caller provides the correct value for this argument, buffer overflow can be detected and prevented. However, care must be taken to avoid creating a non-terminated string, which could cause other problems later.
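
The termination problem can be handled once, inside a small wrapper. The following is a sketch; the name copy_truncating is hypothetical, and the choice to truncate silently and report it through the return value is an assumption about what the caller wants.

```c
#include <string.h>

/* Copy at most dstsize-1 characters of src and always terminate the
   result.  Returns 1 if src had to be truncated, 0 otherwise.
   Assumes dstsize is at least 1. */
int copy_truncating (char *dst, size_t dstsize, const char *src)
{
  strncpy (dst, src, dstsize - 1);   /* may leave dst without a '\0' ... */
  dst[dstsize - 1] = '\0';           /* ... so one is added unconditionally */
  return strlen (src) >= dstsize;
}
```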

The analogous replacement for gets() is fgets(). It should always be used in place of gets().
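
A replacement for the gets() call in the example might look like the following sketch. Unlike gets(), fgets() stores the newline character in the array, so the sketch removes it. The name read_line is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Read one line from the stream in into buf (at most size-1
   characters) and strip the trailing newline, if there is one.
   Returns buf, or NULL at end of file or on error. */
char *read_line (FILE *in, char *buf, size_t size)
{
  if (!fgets (buf, (int) size, in)) return NULL;
  buf[strcspn (buf, "\n")] = '\0';   /* fgets keeps the '\n'; gets did not */
  return buf;
}
```

In simple_fork1.c the call would be read_line (stdin, pathname, sizeof pathname). If the input line is longer than the array, the excess characters are left in the stream for the next read.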

Read the man pages for both of these.

As Example of Concurrent Programming Pitfalls

Terminal output, when simple_fork1.c was executed on a Linux machine with two CPUs:

[cop4610@websrv forkexec]$ ./simple_fork1 abc
i = 0 in 27398
i = 1 in 27398
i = 1 in 27399
i = 2 in 27398
i = 2 in 27399
i = 2 in 27400
[cop4610@websrv forkexec]$ i = 2 in 27401

Why is there output after the second prompt?

The parent processes do not wait for the child processes to terminate. In this case, the original process (27398) returned to the shell before another process (27401) had done its output. That process did its output after the shell printed the next prompt.

To solve this problem, the parent process should use the waitpid() function to wait for a child process to terminate. With default values of the various options (as shown in the example below), this function causes the calling process to block until a child process terminates, and then returns to the parent the "termination status" of the child process.

Compare the behaviors of the example programs simple_fork3.c and simple_fork2.c, which differ only by the following code:

if (fork ()) { /* only parent gets non-zero return value */
  while (waitpid (-1, NULL, 0) != -1);
}

The above form of waitpid(), with -1 as the first parameter, waits for any child process to terminate. This is useful if we have more than one child process and want to catch the status of the next one that terminates.
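
The same loop can be exercised in isolation. The sketch below creates several children that exit immediately and counts how many terminations the parent collects. The function name spawn_and_reap is hypothetical, and the sketch assumes the calling process has no other children.

```c
#define _XOPEN_SOURCE 500
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create n children that exit immediately, then wait until every one
   of them has terminated.  Returns the number of children waited for,
   or -1 if a fork() failed. */
int spawn_and_reap (int n)
{
  int i, reaped = 0;

  for (i = 0; i < n; i++) {
    pid_t child = fork ();
    if (child == -1) return -1;
    if (child == 0) _exit (0);       /* child: nothing to do */
  }
  /* waitpid (-1, ...) returns the pid of any terminated child, and -1
     once there are no children left to wait for. */
  while (waitpid (-1, NULL, 0) != -1)
    reaped++;
  return reaped;
}
```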

More Concurrent Programming Pitfalls

Sample contents of the file "abc", for a run of the same program:

i = 0 in 27561
i = 1 in 27562
i = 2 in 27562
i = 0 in 27561
i = 1 in 27562
i = 2 in 27564
i = 0 in 27561
i = 1 in 27562
i = 2 in 27564
i = 0 in 27561
i = 1 in 27562
i = 2 in 27562
i = 0 in 27561
i = 1 in 27561
i = 2 in 27563
i = 0 in 27561
i = 1 in 27561
i = 2 in 27563
i = 0 in 27561
i = 1 in 27561
i = 2 in 27561
i = 0 in 27561
i = 1 in 27561
i = 2 in 27561

Why are there more lines of output written to the file "abc" than to the terminal?

Stream output is buffered by the C runtime system and operating system. (The C runtime system is a collection of code that is provided by the C libraries, to support the execution of C programs. It includes support for I/O, including the support for the stream abstraction.) This means that the output operations do not wait for the output to be transferred all the way to its final destination (in this case, to the console). Execution continues, and the C runtime system and the operating system take responsibility for eventually getting the output to its destination. At the first stage of waiting to be put out, stream output sits in an internal data structure (buffer) of the C runtime system, within the address space of the process. The fork operation duplicates this buffer, along with its contents.

A program can wait for the contents of a stream buffer to be flushed out, by calling fflush(). Flushing is done automatically for each character on the stream stderr, and for each line on the stream stdout when it is connected to a terminal. That is why the output written to stdout is not duplicated. The buffer was flushed when fprintf() processed the "\n" in the format. For other files, the program must do the call to fflush() explicitly.
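
The duplication can be reproduced with one line of output and one fork(). In the following sketch, a line is written to a temporary file through a buffered stream, the process forks, and both processes flush their copy of the buffer. The function name lines_after_fork and the use of tmpfile() are choices made for this illustration; the child calls _exit() so that only the stream under study is flushed.

```c
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write one line to a buffered file stream, fork, and let both
   processes flush.  Returns the number of lines that end up in the
   file, or -1 on error.  If flush_first is nonzero, the buffer is
   emptied before fork(), so the child inherits nothing to write. */
int lines_after_fork (int flush_first)
{
  FILE *fp = tmpfile ();                  /* a file, not a terminal */
  pid_t child;
  int c, lines = 0;

  if (!fp) return -1;
  fprintf (fp, "written before fork\n");  /* stays in the stream buffer */
  if (flush_first) fflush (fp);

  child = fork ();
  if (child == -1) { fclose (fp); return -1; }
  if (child == 0) {                       /* child: flush its copy */
    fflush (fp);
    _exit (0);
  }
  waitpid (child, NULL, 0);
  fflush (fp);                            /* parent: flush its own copy */
  rewind (fp);
  while ((c = getc (fp)) != EOF)
    if (c == '\n') lines++;
  fclose (fp);
  return lines;
}
```

Without the early flush the file receives the line twice, once from each process; with it, the line appears once.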

Compare the behaviors of the example programs simple_fork2.c and simple_fork1.c, which differ only by the call "fflush (ofstream);".

As Example of Unix API

int outfildes;
FILE *ofstream;
...
if ((outfildes =
   open (pathname, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR)) == -1) {
  fprintf (stderr,
    "could not open file \"%s\" for writing: %s\n",
    pathname, strerror (errno));
  exit (-2);  }
if (!(ofstream = fdopen (outfildes, "w"))) {
  perror ("could not open stream for writing");
  exit (-3);  }
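
The two steps above, open() to obtain a file descriptor and fdopen() to wrap it in a C stream, can be combined in one helper. This is a sketch under the same flags and permissions as the example; the name open_output_stream is hypothetical, and error reporting is left to the caller.

```c
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open pathname for writing, creating it if necessary with read and
   write permission for the owner only, and wrap the descriptor in a
   C stream.  Returns the stream, or NULL on failure. */
FILE *open_output_stream (const char *pathname)
{
  FILE *stream;
  int fd = open (pathname, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);

  if (fd == -1) return NULL;
  stream = fdopen (fd, "w");
  if (!stream) close (fd);        /* do not leak the descriptor */
  return stream;                  /* fclose (stream) also closes fd */
}
```

Because O_TRUNC is not among the flags, an existing file is overwritten from the beginning but not shortened.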

T. P. Baker.