COP4610: Operating Systems & Concurrent Programming

Threads

 

These notes started with an outline of the contents of Chapter 4 in Stallings' OS text. I fleshed them out with an example and some more details about Solaris and Linux threads.

Topics

What is the Essence of a Process?

A process combines two independent aspects: ownership of resources, and execution of a thread of control. Because these two aspects are independent, they can be separated:

     Process = Resources + Thread

Multithreading: More than one Thread per Process


Summary

A process has: the resources, including a virtual address space, open files, and other allocated system resources. A thread has: an execution state, including the program counter, registers, a stack, and a scheduling state; it shares the resources of its process.

An Analogy

Why not make each thread a separate process?

Review

  1. Why do we need multiple threads at all, whether in a single process or several?
  2. Why not put all threads into a single process?

Having answered the question of why we might want more than one thread per process, make certain you remember why we need threads, and why we need processes.

Uses of Threads in a Single-User Multiprocessing System

Since one main reason multithreading was introduced is to make better use of hardware parallelism, this is a good point to discuss multiprocessor operating systems. However, that is not the only reason threads may be useful.

Taxonomy of Parallelism in Computer Systems

Of these types of parallel architectures, the MIMD seems to have been most widely adopted. That is probably because people found it more natural to write software for this model. In particular, the SMP architecture is now quite common, especially in the form of systems with dual processors.

SISD/SIMD/MISD/MIMD Taxonomy Diagram


Symmetric Multiprocessing (SMP)


Multiprocessor Operating System Design Considerations


The existence of true parallelism complicates virtually all aspects of operating system design:

  1. Optimal CPU scheduling turns out to be much harder. Algorithms that are optimal for a single-CPU system can be very bad for an SMP system, and optimal multiprocessor scheduling is NP-hard.
  2. Synchronization is harder for SMP systems, because the interrupt-masking technique no longer works when there is truly parallel execution (see the spinlock sketch below).
  3. Memory management may be complicated by considerations of CPU-local versus shared memory.
  4. Fault tolerance becomes critical, because with more CPUs there is a much higher probability of a single-CPU failure, and because users may expect the system to continue operating as long as one CPU is working.
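As a concrete illustration of the synchronization point above, here is a minimal sketch (not from the course materials) of a spinlock built on an atomic test-and-set operation, written with C11 <stdatomic.h>. On an SMP machine, masking interrupts on one CPU does not stop threads running on the other CPUs, so mutual exclusion has to rest on atomic memory operations like this.

#include <stdatomic.h>

typedef struct {
    atomic_flag locked;                     /* clear = unlocked, set = locked */
} spinlock_t;

#define SPINLOCK_INITIALIZER { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)
{
    /* Atomically set the flag; if it was already set, some other CPU holds
       the lock, so keep retrying ("spinning") until it is released. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                                   /* busy-wait */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

Masking interrupts may still be needed on each individual CPU to guard against local interrupt handlers, but it is the atomic operation that excludes the other CPUs.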

Thread Scheduling Considerations

Operations that Change Thread Scheduling State

I would rather avoid using the term "spawn", as in the textbook, for thread creation. The term "spawn" has traditionally been used in Unix for process creation. For example, the following is taken from the combined POSIX/Unix 98 specifications:

The posix_spawn() and posix_spawnp() functions shall create a new process (child process) from the specified process image. The new process image shall be constructed from a regular executable file called the new process image file.
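For comparison with thread creation, here is a minimal sketch (assumed, not from the course materials) of creating a process with posix_spawn(); the /bin/echo path and the sparse error handling are only illustrative.

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    char *argv[] = { "echo", "hello from the child process", NULL };

    /* Create a new process running /bin/echo from the current process. */
    int err = posix_spawn(&pid, "/bin/echo", NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return 1;
    }
    waitpid(pid, NULL, 0);     /* wait for the spawned child to terminate */
    return 0;
}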

Synchronous vs. Asynchronous Events

All OS actions are triggered by some event. There are two kinds of such triggering events:

Synchronous
caused by normal execution of some thread
Asynchronous
caused by a hardware interrupt

Events that cause the thread to change state can be characterized similarly:

Synchronous
caused by execution of the same thread
Asynchronous
caused by a hardware interrupt or the execution of another thread

In general, writing software that can deal with asynchronous events is harder than writing software that only has to deal with synchronous events. The main problem with asynchronous event handling is that the event handler cannot assume much about what else is going on at the time it executes.

To see how hard this is, suppose the grader for this course did not tell you when your programming assignment would be collected. Suppose the grader had permission to access your files, and simply picked an arbitrary time to fetch a copy of your program from your home directory. How could you be sure that the version graded is one that is working? Suppose you kept two versions in different directories, and only copied from your working version to the directory where the grader would pick things up when you thought you had a good, working version. It could still be that the grader collects the program while you are copying, and so gets part of one version and part of the next version. It might not even be compilable!

For this reason, system designers generally keep asynchronous event handlers very simple. They usually just capture the minimum information necessary to record the event, in a simple data structure in memory, and let the rest of the processing be done synchronously, by a thread or process that is responsible for reading from that data structure and processing the event.
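A minimal sketch (assumed, not from the course code) of this pattern, using a POSIX signal handler: the asynchronous handler only records that the event happened, and the event is actually processed later, synchronously, at a point the program chooses.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_event = 0;

static void handler(int sig)
{
    (void) sig;
    got_event = 1;              /* record the event -- nothing more */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    for (;;) {
        pause();                /* wait for some event to arrive */
        if (got_event) {        /* synchronous processing of the recorded event */
            got_event = 0;
            printf("handling the event at a safe point\n");
        }
    }
}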

Level of Thread Blocking

There are two approaches to thread implementation with respect to blocking: a blocking system call may block the entire process, or it may block only the single thread that made the call.

Application Example: Remote Procedure Call (RPC) Using Threads


This demonstrates the usefulness of single-thread blocking.
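A rough sketch (assumed; sleep() merely stands in for a blocking remote call) of the idea: each RPC blocks only the thread that issued it, so two calls made from different threads of the same process proceed concurrently.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *do_rpc(void *arg)
{
    const char *server = arg;
    sleep(1);                           /* stands in for a blocking remote call */
    printf("reply received from %s\n", server);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, do_rpc, "server A");
    pthread_create(&t2, NULL, do_rpc, "server B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Total elapsed time is about 1 second, not 2, because blocking one
       thread did not block the whole process. */
    return 0;
}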

See also the explanation in the text of how Adobe Pagemaker uses threads to improve user responsiveness by decoupling asynchronous activities.

Level of Thread Scheduling

There are also two approaches to thread scheduling: scheduling can be done by a user-level library within the process, or by the operating system kernel.

Level of Thread Implementation (User vs. Kernel)


To see one of the advantages of a combined implementation strategy, look at the benchmark results comparing Solaris (combined) and Linux (kernel) threads, reported further below in these notes.

However, after the publication of those results, Solaris updates dropped the two-level model in favor of a single-level model, with one kernel thread per user thread. Industry practice seems to be converging toward this single implementation model, namely a one-to-one (1:1) mapping of user threads onto kernel-schedulable threads.

It is too bad that the players could not have agreed upon this before the POSIX thread standard was approved, since it would have resulted in a cleaner and more portable API.

VAX Performance Comparison

Operation     User-Level Threads   Kernel-Level Threads   Processes
null fork     34 us                948 us                 11,300 us
signal-wait   37 us                441 us                 1,840 us

The null fork operation consists of creating and destroying a new thread to execute a procedure that does nothing but return. This is intended to measure the overhead of thread creation.

The signal-wait operation consists of signaling another blocked thread to go, and then blocking in turn. This is intended to measure the overhead of thread synchronization.
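A rough sketch (assumed; the original study predates the POSIX thread API) of the signal-wait pattern being timed, using a pair of POSIX semaphores: each thread signals the other and then blocks waiting to be signaled back.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define ROUNDS 1000

static sem_t ping, pong;

static void *partner(void *arg)
{
    (void) arg;
    for (int i = 0; i < ROUNDS; i++) {
        sem_wait(&ping);        /* wait to be signaled ... */
        sem_post(&pong);        /* ... then signal back */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    sem_init(&ping, 0, 0);
    sem_init(&pong, 0, 0);
    pthread_create(&t, NULL, partner, NULL);
    for (int i = 0; i < ROUNDS; i++) {
        sem_post(&ping);        /* signal the other thread to go ... */
        sem_wait(&pong);        /* ... and block in turn */
    }
    pthread_join(t, NULL);
    printf("%d signal-wait round trips completed\n", ROUNDS);
    return 0;
}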

POSIX Threads

FSU Threads

Threads in Solaris (Sun Microsystems' Unix)

Support the POSIX API and semantics, with extensions

Solaris Threads Diagram


This combined implementation model gives the maximum flexibility to users, to configure their applications to take advantage of both logical and physical parallelism. However, making good use of this flexibility requires the programmer to have quite a bit of knowledge of both the application and the hardware it will run on, and to exercise quite a bit of judgement.

Implementation: Solaris vs. Traditional Unix Process Structures


Today, Unix means more than one thing. There was one original Unix operating system, but it evolved, and the name was then applied to many different operating system implementations. Unix is a registered trademark, originally owned by AT&T but currently managed by The Open Group. The name is applied to several generations of operating system specifications, and it can be applied to any operating system product for which The Open Group has licensed use of the name. Since Solaris and Linux both currently attempt to implement the same Unix specifications, they can both be called Unix operating systems in this sense. However, some books use the name Unix in a more restrictive sense, to mean a specific historical operating system, apparently Unix System V Release 4, which was a milestone in the development of AT&T Bell Laboratories' original line of Unix systems.

Thread Scheduling is Replicated At Library and Kernel Levels


Several Kinds of User-Level Events may Require Thread Switching

When one of these events occurs, the LWP that was assigned to the thread executes a user-level thread dispatcher routine, which looks for another runnable ULT.

This is in contrast to situations where the system blocks the LWP itself, in a kernel call. In that case, the LWP is tied up, and cannot switch to another ULT. For this reason, we need multiple LWP's, to prevent whole-process blocking.

Linux Threads

Support the POSIX threads API, but not the semantics.

Linux threads are implemented using the clone() system call, which creates a new task that has a copy of the attributes of the current process.

The child (clone) task shares with the parent (cloned) task:

It also ends up sharing some other attributes that POSIX and the SUS say should be replicated for each thread.
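A rough sketch (assumed, not from the course materials) of how the glibc clone() wrapper can be used to create such a thread-like task; the CLONE_* flags ask the kernel to share the address space, open files, filesystem context, and signal handlers with the parent rather than copy them.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    printf("clone child running: %s\n", (char *) arg);
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 1;

    /* The stack grows downward on most architectures, so pass its top.
       SIGCHLD lets the parent reap the child with waitpid(). */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
    pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, "shared with parent");
    if (pid == -1)
        return 1;

    waitpid(pid, NULL, 0);      /* reap the cloned task */
    free(stack);
    return 0;
}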

There is a new Native POSIX Thread Library implementation, called "NPTL", which is supposed to be much closer to the POSIX model.

Linux "Process" Attributes

The following abstract classes of information are attributes of a Linux process that are contained in the task_struct data structure. Each Linux thread has its own copy of this structure.

Unlike Solaris/SunOS, Linux was not designed from the ground up to support threads. The clone() call duplicates most of the process state, which includes the attributes listed here.

Some of these attributes are shared by all the threads in the process. Some are specific to the thread. Can you guess which are which?

Linux "Process" State

The process state tells whether the task is running (or ready to run), sleeping (interruptibly or uninterruptibly), stopped, or a zombie.

These states are defined in sched.h as:

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define TASK_ZOMBIE             4
#define TASK_STOPPED            8

Linux Thread State Diagram


Example Comparing Threads & Processes

See file null_fork.c. It is an example of a program similar to that used to derive the performance comparison results mentioned above.

Besides illustrating the obvious performance difference between process creation and thread creation, this example can serve as a review of the Unix process API, an introduction to the Unix thread API, and a side-by-side comparison of the two. Therefore, it is worth looking at the example in detail.

You should read through this code, and make sure you understand it. The system calls for process creation and termination should already be familiar, from the examples we covered previously.

New System Calls for Threads in null_fork.c

pthread.h
must be included to provide declarations of types and functions for working with threads
pthread_attr_init()
initializes a structure that is used to specify thread creation attributes
pthread_create()
creates a new thread to execute a given function
pthread_join()
waits for termination of a given thread
pthread_attr_destroy()
releases any system resources associated with a thread creation attribute
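The following minimal sketch (not null_fork.c itself, just an assumed outline) ties these calls together: it creates a thread to run a function that does nothing but return, waits for it, and cleans up the attribute object.

#include <pthread.h>
#include <stdio.h>

static void *null_function(void *arg)
{
    (void) arg;
    return NULL;                /* does nothing but return */
}

int main(void)
{
    pthread_t      thread;
    pthread_attr_t attr;

    pthread_attr_init(&attr);   /* default thread creation attributes */
    if (pthread_create(&thread, &attr, null_function, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(thread, NULL);        /* wait for the thread to terminate */
    pthread_attr_destroy(&attr);       /* release attribute resources */
    return 0;
}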

There are separate notes on the POSIX threads API that provide more information on these and other POSIX thread system service calls.

Clock call Introduced in null_fork.c

clock()
Fetches a count of clock ticks (interpreted here as microseconds) since the program started

There are several different Unix clock functions, of varying precision. We use this simple one to estimate the amount of time used by the program for certain segments of its execution. We really would like to know the amount of CPU time, but there is not yet a well standardized way of getting that information from the system, so we approximate what we want by taking the elapsed real time. If the time interval is short enough that there is no process switch, this will be essentially the same as the CPU time used by the process.
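A minimal sketch (assumed; null_fork.c itself may differ in detail) of the timing pattern: read clock() before and after a loop of iterations, then convert ticks to microseconds using CLOCKS_PER_SEC, which is 1,000,000 in the runs shown below.

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000

int main(void)
{
    printf("system clock ticks per second = %ld\n", (long) CLOCKS_PER_SEC);

    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        /* operation being measured goes here */
    }
    clock_t elapsed = clock() - start;

    double us = elapsed * (1000000.0 / CLOCKS_PER_SEC);
    printf("elapsed time = %ld ticks\n", (long) elapsed);
    printf("avg time per iteration = %.1f us\n", us / ITERATIONS);
    return 0;
}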

Output of null_fork on Solaris Threads

------------------------------
system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 6060000 ticks
elapsed time = 6060000 us
avg time per iteration = 6060 us
-----------threads------------
number of iterations = 1000
elapsed time = 110000 ticks
elapsed time = 110000 us
avg time per iteration = 110 us
------------------------------
performance ratio = 55
------------------------------

system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 6390000 ticks
elapsed time = 6390000 us
avg time per iteration = 6390 us
-----------threads------------
number of iterations = 1000
elapsed time = 110000 ticks
elapsed time = 110000 us
avg time per iteration = 110 us
------------------------------
performance ratio = 58
------------------------------

system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 6620000 ticks
elapsed time = 6620000 us
avg time per iteration = 6620 us
-----------threads------------
number of iterations = 1000
elapsed time = 100000 ticks
elapsed time = 100000 us
avg time per iteration = 100 us
------------------------------
performance ratio = 66

This output is from three runs of null_fork on a machine with four 450 MHz UltraSPARC II processors running the Solaris 8 operating system.

Here is some newer data, from the same machine running Solaris 9.

------------------------------
system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
number of iterations = 1000
elapsed time = 2610000 ticks
elapsed time = 2610000 us
avg time per iteration = 2610 us
-----------threads------------
number of iterations = 1000
elapsed time = 120000 ticks
elapsed time = 120000 us
avg time per iteration = 120 us
------------------------------
performance ratio = 21

The ratio is not quite as good. That is probably due to Solaris phasing out the multi-level thread implementation in favor of a 1:1 mapping of user threads to kernel-schedulable threads (LWPs).

Output of null_fork on Linux Threads

------------------------------
system clock ticks per second = 1000000
nanoseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 220000 ticks
elapsed time = 220000 us
avg time per iteration = 220 us
-----------threads------------
number of iterations = 1000
elapsed time = 50000 ticks
elapsed time = 50000 us
avg time per iteration = 50 us
------------------------------
performance ratio = 4
------------------------------
system clock ticks per second = 1000000
nanoseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 180000 ticks
elapsed time = 180000 us
avg time per iteration = 180 us
-----------threads------------
number of iterations = 1000
elapsed time = 60000 ticks
elapsed time = 60000 us
avg time per iteration = 60 us
------------------------------
performance ratio = 3
------------------------------
system clock ticks per second = 1000000
nanoseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 190000 ticks
elapsed time = 190000 us
avg time per iteration = 190 us
-----------threads------------
number of iterations = 1000
elapsed time = 40000 ticks
elapsed time = 40000 us
avg time per iteration = 40 us
------------------------------
performance ratio = 4

This output is from three runs of null_fork on a machine with two 550 MHz Pentium III processors running the Red Hat 7.2 GNU/Linux operating system.

Solaris versus Linux Performance Summary

OS/Threads     Processor Speed   Processes   Threads   Ratio
Linux 2.4.7    550 MHz           220 us      50 us     4:1
Linux 2.4.7    550 MHz           180 us      60 us     3:1
Linux 2.4.7    550 MHz           190 us      40 us     4:1
Solaris 8      450 MHz           6060 us     110 us    55:1
Solaris 8      450 MHz           6390 us     110 us    58:1
Solaris 8      450 MHz           6620 us     100 us    66:1

The table shows, in microseconds, some average process and thread creation/termination times reported by null_fork when run on two different operating systems. Since the results vary from one execution to the next, three runs are shown for each system.

Solaris versus Linux Performance Questions

Observe that the ratio of process creation time to thread creation time (rightmost column) is about 4:1 for Linux threads and about 60:1 for Solaris threads. Taking the cost of creating a process as a norm, this says the cost of creating a Solaris thread is about 1/15 that of creating a Linux thread. That difference must be mainly due to Linux threads being kernel threads: creation of a new Linux thread is a kernel operation, and it involves much of the same work as creation of a new process. In contrast, Solaris threads are implemented via a combination of kernel-level threads (LWPs) and user-level threads. The benchmark that generated these numbers only required the creation of a user-level thread, so we are seeing the difference in cost between user-level and kernel-level thread creation. This difference is one reason Solaris makes use of two levels: where user-level threads will do, they clearly are more efficient.

Windows 2000 Process Object & Thread Object


Windows 2000 Thread States

Microkernels

Comparison to Conventional Layered OS Model


Benefits of a Microkernel Organization

Benefits of Microkernel Organization

Microkernel Design

T. P. Baker.