COP4610: Operating Systems & Concurrent Programming

Threads

 

These notes started with an outline of the contents of Chapter 4 in Stallings' OS text. I fleshed them out with an example and some more details about Solaris and Linux threads.

Topics

What is the Essence of a Process?

A process combines two independent aspects: ownership of resources, and execution of a thread of control. Because these two aspects are independent, they can be separated:

     Process = Resources + Thread

Multithreading: More than one Thread per Process


Summary

A process has: the resources, including a virtual address space, open files, and other allocated system resources. A thread has: an execution state, including the program counter, registers, a stack, and a scheduling state; it shares the resources of its process.

An Analogy

Why not make each thread a separate process?

Review

  1. Why do we need multiple threads at all, whether in a single process or several?
  2. Why not put all threads into a single process?

Having answered the question of why we might want more than one thread per process, make certain you remember why we need threads, and why we need processes.

Uses of Threads in a Single-User Multiprocessing System

Since one main reason multithreading was introduced is to make better use of hardware parallelism, this is a good point to discuss multiprocessor operating systems. However, that is not the only reason threads may be useful.

Taxonomy of Parallelism in Computer Systems

Of these types of parallel architectures, the MIMD seems to have been most widely adopted. That is probably because people found it more natural to write software for this model. In particular, the SMP architecture is now quite common, especially in the form of systems with dual processors.

SISD/SIMD/MISD/MIMD Taxonomy Diagram


Symmetric Multiprocessing (SMP)


Multiprocessor Operating System Design Considerations


The existence of true parallelism complicates virtually all aspects of operating system design:

  1. Optimal CPU scheduling turns out to be much harder. Algorithms that are optimal for a single-CPU system can be very bad for an SMP system, and optimal multiprocessor scheduling is NP-hard.
  2. Synchronization is harder for SMP systems, because the interrupt-masking technique no longer works when there is truly parallel execution (see the spinlock sketch below).
  3. Memory management may be complicated by considerations of CPU-local versus shared memory.
  4. Fault tolerance becomes critical, because with more CPUs there is a much higher probability of a single-CPU failure, and because users may expect the system to continue operating as long as one CPU is working.
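As a concrete illustration of the synchronization point above, here is a minimal sketch (not from the course materials) of a spinlock built on an atomic test-and-set operation, written with C11 <stdatomic.h>. On an SMP machine, masking interrupts on one CPU does not stop threads running on the other CPUs, so mutual exclusion has to rest on atomic memory operations like this.

#include <stdatomic.h>

typedef struct {
    atomic_flag locked;                     /* clear = unlocked, set = locked */
} spinlock_t;

#define SPINLOCK_INITIALIZER { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)
{
    /* Atomically set the flag; if it was already set, some other CPU holds
       the lock, so keep retrying ("spinning") until it is released. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                                   /* busy-wait */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

Masking interrupts may still be needed on each individual CPU to guard against local interrupt handlers, but it is the atomic operation that excludes the other CPUs.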

Thread Scheduling Considerations

Operations that Change Thread Scheduling State

I would rather avoid using the term "spawn", as in the textbook, for thread creation. The term "spawn" has traditionally been used in Unix for process creation. For example, the following is taken from the combined POSIX/Unix 98 specifications:

The posix_spawn() and posix_spawnp() functions shall create a new process (child process) from the specified process image. The new process image shall be constructed from a regular executable file called the new process image file.
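For comparison with thread creation, here is a minimal sketch (assumed, not from the course materials) of creating a process with posix_spawn(); the /bin/echo path and the sparse error handling are only illustrative.

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    char *argv[] = { "echo", "hello from the child process", NULL };

    /* Create a new process running /bin/echo from the current process. */
    int err = posix_spawn(&pid, "/bin/echo", NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return 1;
    }
    waitpid(pid, NULL, 0);     /* wait for the spawned child to terminate */
    return 0;
}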

Synchronous vs. Asynchronous Events

All OS actions are triggered by some event. There are two kinds of such triggering events:

Synchronous
caused by normal execution of some thread
Asynchronous
caused by a hardware interrupt

Events that cause the thread to change state can be characterized similarly:

Synchronous
caused by execution of the same thread
Asynchronous
caused by a hardware interrupt or the execution of another thread

In general, writing software that can deal with asynchronous events is harder than writing software that only has to deal with synchronous events. The main problem with asynchronous event handling is that the event handler cannot assume much about what else is going on at the time it executes.

To see how hard this is, suppose the grader for this course did not tell you when your programming assignment would be collected. Suppose the grader had permission to access your files, and simply picked an arbitrary time to fetch a copy of your program from your home directory. How could you be sure that the version graded is one that is working? Suppose you kept two versions in different directories, and only copied from your working version to the directory where the grader would pick things up when you thought you had a good, working version. It could still be that the grader collects the program while you are copying, and so gets part of one version and part of the next version. It might not even be compilable!

For this reason, system designers generally keep asynchronous event handlers very simple. They usually just capture the minimum information necessary to record the event, in a simple data structure in memory, and let the rest of the processing be done synchronously, by a thread or process that is responsible for reading from that data structure and processing the event.
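A minimal sketch (assumed, not from the course code) of this pattern, using a POSIX signal handler: the asynchronous handler only records that the event happened, and the event is actually processed later, synchronously, at a point the program chooses.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_event = 0;

static void handler(int sig)
{
    (void) sig;
    got_event = 1;              /* record the event -- nothing more */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    for (;;) {
        pause();                /* wait for some event to arrive */
        if (got_event) {        /* synchronous processing of the recorded event */
            got_event = 0;
            printf("handling the event at a safe point\n");
        }
    }
}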

Level of Thread Blocking

There are two approaches to thread implementation with respect to blocking: a blocking system call may block the entire process, or it may block only the single thread that made the call.

Application Example: Remote Procedure Call (RPC) Using Threads


This demonstrates the usefulness of single-thread blocking.
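A rough sketch (assumed; sleep() merely stands in for a blocking remote call) of the idea: each RPC blocks only the thread that issued it, so two calls made from different threads of the same process proceed concurrently.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *do_rpc(void *arg)
{
    const char *server = arg;
    sleep(1);                           /* stands in for a blocking remote call */
    printf("reply received from %s\n", server);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, do_rpc, "server A");
    pthread_create(&t2, NULL, do_rpc, "server B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Total elapsed time is about 1 second, not 2, because blocking one
       thread did not block the whole process. */
    return 0;
}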

See also the explanation in the text of how Adobe Pagemaker uses threads to improve user responsiveness by decoupling asynchronous activities.

Level of Thread Scheduling

There are also two approaches to thread scheduling: scheduling can be done by a user-level library within the process, or by the operating system kernel.

Level of Thread Implementation (User vs. Kernel)


To see one of the advantages of a combined implementation strategy, look at the benchmark results comparing Solaris (combined) and Linux (kernel) threads, reported further below in these notes.

However, after the publication of those results, Solaris updates dropped the two-level model in favor of a single-level model, with one kernel thread per user thread. Industry practice seems to be converging toward this single implementation model, namely a one-to-one (1:1) mapping of user threads onto kernel-schedulable threads.

It is too bad that the players could not have agreed upon this before the POSIX thread standard was approved, since it would have resulted in a cleaner and more portable API.

VAX Performance Comparison

Operation     User-Level Threads   Kernel-Level Threads   Processes
null fork     34 us                948 us                 11,300 us
signal-wait   37 us                441 us                 1,840 us

The null fork operation consists of creating and destroying a new thread to execute a procedure that does nothing but return. This is intended to measure the overhead of thread creation.

The signal-wait operation consists of signaling another blocked thread to go, and then blocking in turn. This is intended to measure the overhead of thread synchronization.
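A rough sketch (assumed; the original study predates the POSIX thread API) of the signal-wait pattern being timed, using a pair of POSIX semaphores: each thread signals the other and then blocks waiting to be signaled back.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define ROUNDS 1000

static sem_t ping, pong;

static void *partner(void *arg)
{
    (void) arg;
    for (int i = 0; i < ROUNDS; i++) {
        sem_wait(&ping);        /* wait to be signaled ... */
        sem_post(&pong);        /* ... then signal back */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    sem_init(&ping, 0, 0);
    sem_init(&pong, 0, 0);
    pthread_create(&t, NULL, partner, NULL);
    for (int i = 0; i < ROUNDS; i++) {
        sem_post(&ping);        /* signal the other thread to go ... */
        sem_wait(&pong);        /* ... and block in turn */
    }
    pthread_join(t, NULL);
    printf("%d signal-wait round trips completed\n", ROUNDS);
    return 0;
}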

POSIX Threads

FSU Threads

Threads in Solaris (Sun Microsystems' Unix)

Support the POSIX API and semantics, with extensions

Solaris Threads Diagram


This combined implementation model gives the maximum flexibility to users, to configure their applications to take advantage of both logical and physical parallelism. However, making good use of this flexibility requires the programmer to have quite a bit of knowledge of both the application and the hardware it will run on, and to exercise quite a bit of judgement.

Implementation: Solaris vs. Traditional Unix Process Structures


Today, Unix means more than one thing. There was one original Unix operating system, but it evolved, and the name was then applied to many different operating system implementations. Unix is a registered trademark, originally owned by AT&T but currently managed by The Open Group. The name is applied to several generations of operating system specifications, and it can be applied to any operating system product for which The Open Group has licensed use of the name. Since Solaris and Linux both currently attempt to implement the same Unix specifications, they can both be called Unix operating systems in this sense. However, some books use the name Unix in a more restrictive sense, to mean a specific historical operating system, apparently Unix System V Release 4, which was a milestone in the development of AT&T Bell Laboratories' original line of Unix systems.

Thread Scheduling is Replicated At Library and Kernel Levels


Several Kinds of User-Level Events may Require Thread Switching

When one of these events occurs, the LWP that was assigned to the thread executes a user-level thread dispatcher routine, which looks for another runnable ULT.

This is in contrast to situations where the system blocks the LWP itself, in a kernel call. In that case, the LWP is tied up, and cannot switch to another ULT. For this reason, we need multiple LWP's, to prevent whole-process blocking.

Linux Threads

Support the POSIX threads API, but not the semantics.

Linux threads are implemented using the clone() system call, which creates a new task that has a copy of the attributes of the current process.

The child (clone) task shares with the parent (cloned) task:

It also ends up sharing some other attributes that POSIX and the SUS say should be replicated for each thread.
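A rough sketch (assumed, not from the course materials) of how the glibc clone() wrapper can be used to create such a thread-like task; the CLONE_* flags ask the kernel to share the address space, open files, filesystem context, and signal handlers with the parent rather than copy them.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    printf("clone child running: %s\n", (char *) arg);
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 1;

    /* The stack grows downward on most architectures, so pass its top.
       SIGCHLD lets the parent reap the child with waitpid(). */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
    pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, "shared with parent");
    if (pid == -1)
        return 1;

    waitpid(pid, NULL, 0);      /* reap the cloned task */
    free(stack);
    return 0;
}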

There is a new Native POSIX Thread Library implementation, called "NPTL", which is supposed to be much closer to the POSIX model.

Linux "Process" Attributes

The following abstract classes of information are attributes of a Linux process that are contained in the task_struct data structure. Each Linux thread has its own copy of this structure.

Unlike Solaris/SunOS, Linux was not designed from the ground up to support threads. The clone() call duplicates most of the process state, which includes the attributes listed here.

Some of these attributes are shared by all the threads in the process. Some are specific to the thread. Can you guess which are which?

Linux "Process" State

The process state tells whether the task is running (or ready to run), sleeping (interruptibly or uninterruptibly), stopped, or a zombie.

These states are defined in sched.h as:

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define TASK_ZOMBIE             4
#define TASK_STOPPED            8

Linux Thread State Diagram


Example Comparing Threads & Processes

See file null_fork.c. It is an example of a program similar to that used to derive the performance comparison results mentioned above.

Besides illustrating the obvious performance difference between process creation and thread creation, this example can serve as a review of the Unix process API, an introduction to the Unix thread API, and a side-by-side comparison of the two. Therefore, it is worth looking at the example in detail.

You should read through this code, and make sure you understand it. The system calls for process creation and termination should already be familiar, from the examples we covered previously.

New System Calls for Threads in null_fork.c

pthread.h
must be included to provide declarations of types and functions for working with threads
pthread_attr_init()
initializes a structure that is used to specify thread creation attributes
pthread_create()
creates a new thread to execute a given function
pthread_join()
waits for termination of a given thread
pthread_attr_destroy()
releases any system resources associated with a thread creation attribute
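The following minimal sketch (not null_fork.c itself, just an assumed outline) ties these calls together: it creates a thread to run a function that does nothing but return, waits for it, and cleans up the attribute object.

#include <pthread.h>
#include <stdio.h>

static void *null_function(void *arg)
{
    (void) arg;
    return NULL;                /* does nothing but return */
}

int main(void)
{
    pthread_t      thread;
    pthread_attr_t attr;

    pthread_attr_init(&attr);   /* default thread creation attributes */
    if (pthread_create(&thread, &attr, null_function, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(thread, NULL);        /* wait for the thread to terminate */
    pthread_attr_destroy(&attr);       /* release attribute resources */
    return 0;
}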

There are separate notes on the POSIX threads API that provide more information on these and other POSIX thread system service calls.

Clock call Introduced in null_fork.c

clock()
Fetches a count of clock ticks (interpreted here as microseconds) since the program started

There are several different Unix clock functions, of varying precision. We use this simple one to estimate the amount of time used by the program for certain segments of its execution. We really would like to know the amount of CPU time, but there is not yet a well standardized way of getting that information from the system, so we approximate what we want by taking the elapsed real time. If the time interval is short enough that there is no process switch, this will be essentially the same as the CPU time used by the process.
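A minimal sketch (assumed; null_fork.c itself may differ in detail) of the timing pattern: read clock() before and after a loop of iterations, then convert ticks to microseconds using CLOCKS_PER_SEC, which is 1,000,000 in the runs shown below.

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000

int main(void)
{
    printf("system clock ticks per second = %ld\n", (long) CLOCKS_PER_SEC);

    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        /* operation being measured goes here */
    }
    clock_t elapsed = clock() - start;

    double us = elapsed * (1000000.0 / CLOCKS_PER_SEC);
    printf("elapsed time = %ld ticks\n", (long) elapsed);
    printf("avg time per iteration = %.1f us\n", us / ITERATIONS);
    return 0;
}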

Output of null_fork on Solaris Threads

------------------------------
system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 6060000 ticks
elapsed time = 6060000 us
avg time per iteration = 6060 us
-----------threads------------
number of iterations = 1000
elapsed time = 110000 ticks
elapsed time = 110000 us
avg time per iteration = 110 us
------------------------------
performance ratio = 55
------------------------------

system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 6390000 ticks
elapsed time = 6390000 us
avg time per iteration = 6390 us
-----------threads------------
number of iterations = 1000
elapsed time = 110000 ticks
elapsed time = 110000 us
avg time per iteration = 110 us
------------------------------
performance ratio = 58
------------------------------

system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 6620000 ticks
elapsed time = 6620000 us
avg time per iteration = 6620 us
-----------threads------------
number of iterations = 1000
elapsed time = 100000 ticks
elapsed time = 100000 us
avg time per iteration = 100 us
------------------------------
performance ratio = 66

This output is from three runs of null_fork on a machine with four 450 MHz UltraSPARC II processors running the Solaris 8 operating system.

Here is some newer data, from the same machine running Solaris 9.

------------------------------
system clock ticks per second = 1000000
microseconds per clock tick = 1
----------processes-----------
number of iterations = 1000
elapsed time = 2610000 ticks
elapsed time = 2610000 us
avg time per iteration = 2610 us
-----------threads------------
number of iterations = 1000
elapsed time = 120000 ticks
elapsed time = 120000 us
avg time per iteration = 120 us
------------------------------
performance ratio = 21

The ratio is not quite as good. That is probably due to Solaris phasing out the multi-level thread implementation in favor of a 1:1 mapping of user threads to kernel-schedulable threads (LWPs).

Output of null_fork on Linux Threads

------------------------------
system clock ticks per second = 1000000
nanoseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 220000 ticks
elapsed time = 220000 us
avg time per iteration = 220 us
-----------threads------------
number of iterations = 1000
elapsed time = 50000 ticks
elapsed time = 50000 us
avg time per iteration = 50 us
------------------------------
performance ratio = 4
------------------------------
system clock ticks per second = 1000000
nanoseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 180000 ticks
elapsed time = 180000 us
avg time per iteration = 180 us
-----------threads------------
number of iterations = 1000
elapsed time = 60000 ticks
elapsed time = 60000 us
avg time per iteration = 60 us
------------------------------
performance ratio = 3
------------------------------
system clock ticks per second = 1000000
nanoseconds per clock tick = 1
----------processes-----------
iterations = 1000
number of iterations = 1000
elapsed time = 190000 ticks
elapsed time = 190000 us
avg time per iteration = 190 us
-----------threads------------
number of iterations = 1000
elapsed time = 40000 ticks
elapsed time = 40000 us
avg time per iteration = 40 us
------------------------------
performance ratio = 4

This output is from three runs of null_fork on a machine with two 550 MHz Pentium III processors running the Red Hat 7.2 GNU/Linux operating system.

Solaris versus Linux Performance Summary

OS/Threads     Processor Speed   Processes   Threads   Ratio
Linux 2.4.7    550 MHz           220 us      50 us     4:1
Linux 2.4.7    550 MHz           180 us      60 us     3:1
Linux 2.4.7    550 MHz           190 us      40 us     4:1
Solaris 8      450 MHz           6060 us     110 us    55:1
Solaris 8      450 MHz           6390 us     110 us    58:1
Solaris 8      450 MHz           6620 us     100 us    66:1

The table shows, in microseconds, some average process and thread creation/termination times reported by null_fork when run on two different operating systems. Since the results vary from one execution to the next, three runs are shown for each system.

Solaris versus Linux Performance Questions

Observe that the ratio of process creation time to thread creation time (rightmost column) is about 4:1 for Linux threads and about 60:1 for Solaris threads. Taking the cost of creating a process as a norm, this says the cost of creating a Solaris thread is about 1/15 that of creating a Linux thread. That difference must be mainly due to Linux threads being kernel threads: creation of a new Linux thread is a kernel operation, and it involves much of the same work as creation of a new process. In contrast, Solaris threads are implemented via a combination of kernel-level threads (LWPs) and user-level threads. The benchmark that generated these numbers only required the creation of a user-level thread, so we are seeing the difference in cost between user-level and kernel-level thread creation. This difference is one reason Solaris makes use of two levels: where user-level threads will do, they clearly are more efficient.

Windows 2000 Process Object & Thread Object


Windows 2000 Thread States

Microkernels

Comparison to Conventional Layered OS Model


Benefits of a Microkernel Organization

Benefits of Microkernel Organization

Microkernel Design

T. P. Baker.