The I/O Subsystem



Connections between CPU scheduling and the I/O Subsystem

Introduction to I/O Subsystem

The main reason for the complexity of the I/O subsystem is the huge and continually expanding number and variety of devices that must be supported, each with corresponding device-specific code.

I/O device interfaces are generally not well documented by the manufacturers, and increasingly even the manufacturers do not have internal documentation. The reasons include:

Categories of I/O Devices

Differences in I/O Devices Affect their Treatment by the OS

Models of I/O Interaction

Evolution of the I/O Function

  1. Processor directly controls a peripheral device
  2. Controller or I/O module is added
  3. Controller or I/O module with interrupts
  4. Direct Memory Access
  5. I/O module is a separate processor ("I/O channel")
  6. I/O processor is a full-fledged computer, with its own OS

Note: The latter model of I/O processor was already in use at least as early as 1970 for mainframe computers. It has only recently made its way down to microprocessor-based computing.

Three Different I/O Interaction Models

                                                No Interrupts    Use Interrupts
 Data transferred to/from memory via processor  Programmed I/O   Interrupt-driven I/O
 Data transferred to/from memory directly                        Direct Memory Access (DMA)

Logic of the Three I/O Interaction Models

Direct Memory Access

DMA and Interrupt Breakpoints During an Instruction Cycle

DMA does not need to wait for an instruction to complete, since (unlike an interrupt) there is no effect on the flow of instruction execution.

Single-bus, detached DMA

If the DMA module must use the system bus to get to the I/O device, there is a bottleneck.

Single-bus, Integrated DMA-I/O

If there is a path between DMA module and I/O module that does not include the system bus the contention is reduced, and the number of required busy (stolen) system bus cycles can be cut. One way is by integrating the DMA and I/O functions in a single module. (Can you think of a specific example of this?)

I/O Bus

This can be carried further, by providing a separate bus between the DMA controller module and a collection of I/O devices. (Can you think of a specific example of this?)

Operating System I/O Design Issues

A Model of I/O Organization

I/O System Layers for a Stream Device

In reality, each of these may be divided into further layers, especially to handle classes of devices. For example, we might have a hierarchy of Block Device -> Disk Device -> SCSI Device -> Specific SCSI Controller.

I/O Buffering

Single Buffer

No Buffering

Single Buffering

Double Buffer

Double Buffering

Circular Buffer

Recall the example of the Producer-Consumer synchronization problem.

Circular Buffering

Notice that time spent copying data from the OS buffer to the user buffer can be saved if we use memory mapping.

Disk Performance Parameters

Timing of a Disk I/O Transfer

Disk Performance Parameters

Disk Performance Parameters

Disk Scheduling Policies


Last-In, First-Out (LIFO)

First-In, First-Out (FIFO)

Shortest Service Time First (SSTF)

  • Select the disk I/O request that requires the least movement of the disk arm from its current position
  • Always choose the minimum Seek time


SCAN

  • Arm moves in one direction only, satisfying all outstanding requests until it reaches the last track in that direction
  • Direction is then reversed


C-SCAN (Circular SCAN)

  • Restricts scanning to one direction only
  • When the last track has been visited in one direction, the arm is returned to the opposite end of the disk and the scan begins again



FSCAN

  • Another way to avoid "stickiness"
  • Uses two queues
  • While one queue is being serviced by a scan, all new requests are placed in the other queue

Summary of Disk Scheduling Policies

Note that within the policies we have considered you can distinguish trade-offs between maximizing overall disk utilization (and I/O throughput), and providing some quality of service to specific processes.

Note also that the environment in which disk scheduling by the OS really paid off does not quite exist anymore, at least in smaller systems. When all disk I/O was done under direct control of the OS, the OS could implement scheduling policies such as the above. As more intelligence has been added to disk controllers and disk drives themselves, control over the real order of disk operations may be taken over by a device external to the OS. Deciding whether and how to schedule disk I/O within the OS becomes more problematic. Some factors that complicate the situation include:

If you detect a trend here, toward relying on multi-layered caches more than scheduling, you may be right.

RAID (Redundant Array of Inexpensive Disks)

Redundancy, Reliability, Performance

                                One Unit    N Units
 Data throughput                r           r * N
 Probability of single failure  p           p * N
 Probability of total failure   p           p^N

The situation here with multiple disks is analogous to the situation with multiple CPUs in an SMP system.

Through parallel operation of multiple disks, we can achieve higher data throughput, up to the bandwidth limit of our other system components such as the I/O and memory buses.

However, adding more units also increases the probability that one of the units will fail.

Fortunately, the probability of total system failure goes down faster. Therefore, if we can find a way to use some of the disk redundancy to allow us to recover from failures we can maintain or improve total system reliability.

Data Mapping for a RAID Level 0 Array

The different types of RAID are identified by numbers. The numbers are sometimes called "levels", but that does not mean they are numbered hierarchically according to any measure of performance or reliability. The numbers have more to do with the order in which the various types of RAID were invented.

The figure illustrates the basic idea, which applies in all cases: to implement a virtual disk, whose actual data is distributed over several actual disks. In this case, it is RAID 0, which seeks only parallelism. The data is divided up into strips, and the strips are distributed over the drives.

RAID 0 (non-redundant)

What is gained by distributing strips over disks, as compared to distributing entire files?

The answer depends on the kind of computing, i.e., whether we want to read/write single large files very fast, versus reading/writing multiple independent files. The main use of this kind of RAID is for supercomputers, for transferring huge blocks of data. There may be no gain in performance at all, if one is reading or writing small blocks of data.

RAID 1 (a.k.a. Mirroring)

This type of RAID is mainly done for reliability, to tolerate disk drive failures. It also helps with read performance, since read operations on different files may proceed in parallel. It is practical for small systems (two disks) where reliability is needed but RAID 5 cannot be afforded.

RAID 2 (redundancy through Hamming code)

This type of RAID does not appear to have been successful. It is complicated to implement, and the kind of correction it provides is more than we need. The virtue of the Hamming code is the ability to detect and correct local bit-level errors in the data read from a drive. This kind of error detection is already done internally by modern disk drives, so there is no need for the RAID system to do the same.

It is unlikely that a disk drive will produce bad data on a single read or write operation, but continue to function normally otherwise. When we do have a disk failure the failure is usually more catastrophic, involving the entire disk (e.g. bearing failure) or an entire disk surface. (e.g., head failure). The other types of RAID handle major failures equally well, and are simpler.

However, it is interesting to look for a moment at Hamming codes, and the use of redundancy for detecting and correcting errors in general.

Using Parity to Detect Errors

Hamming Code

Hamming Code uses extra parity bits to allow the identification (and correction) of a single error. Creating an encoded word is done as follows:

To encode 10101100:

    position:  1  2  3  4  5  6  7  8  9 10 11 12
               p  p  d  p  d  d  d  p  d  d  d  d   (p = parity bit, d = data bit)
    encoded:   1  0  1  1  0  1  0  1  1  0  1  1

First, we lay out the data word, reserving spaces for the parity bits in the bit positions that are powers of two (1, 2, 4, 8, ...). All other bit positions (3, 5, 6, 7, 9, 10, ...) are available to hold data.

Each parity bit is set to achieve even parity for a different subsequence of the bits, chosen according to an overlapping binary pattern, as follows:

To correct 101101111011:

    received:   1  0  1  1  0  1  1  1  1  0  1  1
    corrected:  1  0  1  1  0  1  0  1  1  0  1  1

    failing parity bits: 1 + 2 + 4 = 7

The way to verify and correct an encoded number is to verify each parity bit. Write down all the incorrect parity bits, and add up their positions. The sum of the positions of the incorrect bits gives the location of the bad bit.

For example, suppose the encoded word of the example above were corrupted from 101101011011 into 101101111011. We verify the parity bits and discover that parity bits 1, 2, and 4 do not match. This tells us that the error is in bit 7. We correct it and get back the original word.

Using Parity to Reconstruct Data for Other RAID Types

 M   K   M xor K   (M xor K) xor K   (M xor K) xor M
 0   0      0             0                 0
 0   1      1             0                 1
 1   0      1             1                 0
 1   1      0             1                 1

The higher numbered types of RAID make use of a simpler error correction scheme, based on the exclusive or (xor) operation.

This is just one of the many uses of the exclusive or operation. Other applications include data structures (storing ambidirectional pointers) and cryptography (encryption-decryption). A useful property of this operation is that it is invertible: one can recover the value of either operand by combining the other operand with the xor of the pair. The table shows how the value of M can be recovered from the values of K and M xor K, and vice versa.

RAID 3 (bit-interleaved parity)

Creation of parity disk X4:

X4(i) = X3(i) xor X2(i) xor X1(i) xor X0(i)

Recreation of failed disk X1:

X1(i) = X4(i) xor X3(i) xor X2(i) xor X0(i)

This form of RAID achieves high data bandwidth and can tolerate loss of a single disk drive, but only allows one transaction at a time. Thus, the performance on independent concurrent I/O processes is worse than if we had not used RAID (instead using each disk as an independent drive).

RAID 4 (block-level parity)

Creation of parity disk X4:

X4(i) = X3(i) xor X2(i) xor X1(i) xor X0(i)

Update for a modification to one strip on disk X1:

X4'(i) = X3(i) xor X2(i) xor X1'(i) xor X0(i) xor X1(i) xor X1(i)
       = X4(i) xor X1(i) xor X1'(i)

This form of RAID allows concurrent independent writes on different disks.

The algebra shows how this works. When we update one strip, we only need to update two disks: the "original" disk and the parity disk.

However, the parity disk is still involved in every write operation.

RAID 4 (block-level parity)

The parity disk can become a bottleneck. It is also subject to the most wear, and so is the most likely disk to fail.

RAID 5 (block-level distributed parity)

In practice, this seems to be the preferred type of RAID for large systems.

RAID 6 (dual redundancy)

The benefit is higher reliability. Now we can tolerate two (2) disk failures without data loss.

Disk Cache

As mentioned above, there are actually several layers of disk caching, including cache inside the disk drive, cache inside the disk controller, and cache inside main memory. The general principles of replacement algorithms mentioned below could apply to any level.

Least Recently Used Buffer Replacement Scheme

Why is LRU practical as a disk cache replacement policy, but not as a VM page replacement policy? What is different here?

Least Frequently Used Buffer Replacement Scheme

The idea was to improve on LRU, but a "memory" problem crops up, which makes this policy perform very poorly.

Note the similarity of the problem here to a problem in processor scheduling: if we just use the average burst processing time, the average is dominated by ancient history. The solution there was to use an exponential average. In principle, the same technique could be used here. However, the techniques described by Stallings take a different approach, more analogous to the clock algorithms for page replacement.

LFU variant: "Frequency Based Allocation"

Note that the phrase "reference count" is used quite differently here from the way it is used in data structures. These counts represent instances of reference (not pointers) and they are never decremented.

Blocks in the "new" section are not eligible for replacement. Blocks in the "old" section are replaced, LFU-first. This is an improvement, but there is still a bias against newer blocks.

Addition of a "middle" section allows newer blocks a chance to build up their reference counts before they have to compete against older blocks.

Performance with this method is actually better than LRU.

Take a look at Figures 11.12 and 11.13 in the text, and the explanation that goes with them. We cannot reproduce those figures here. The publisher did not provide them (probably because they are used in the text by permission of another publisher). Pay attention to the comment about sensitivity of results to reference strings, and the incomparability of results from simulations with different reference strings.

I/O Subsystem in Traditional Unix Architecture

What real information can you get from this figure?

Go back to Chapter 2.5 and review what it says about the relationship of the I/O subsystem to the rest of the operating system, for the original Unix system, and then for SVR4 Unix.

UNIX I/O Buffering Types

 Device Type   Unbuffered I/O   Buffer Cache   Character Queue
 disk               X                 X
 tape               X                 X
 terminal                                            X
 comm. line                                          X
 printer            X                                X

Unix uses two kinds of buffered I/O, as well as unbuffered I/O.

Block devices are treated differently from character devices in that a block that has been written may be re-read from the buffer. Character device buffering follows the producer-consumer model. Block device buffering follows the readers-writers model.

UNIX SVR4 Buffered I/O

We will look at the Unix block buffer scheme in more detail below.

Unix Block I/O Buffer Scheme

The Unix buffer cache scheme uses a hash table to look up disk blocks in the cache.

What is the purpose of this scheme?

How does allocating more RAM for disk cache affect the performance of the system?

How does this relate to paged virtual memory?

BSD Buffer Header

The figure shows the approximate contents of a 4.3 BSD Unix block buffer header.

An interesting feature is that the real memory allocated to a block buffer can vary from 512 bytes to 8 Kbytes, but the virtual memory size is always 8 Kbytes. The system uses the VM system to map only the number of pages that the buffer actually needs. Real memory is "moved" around from one virtual buffer to another, as needed.

BSD Buffer Hash Table & Free Lists

The chained hash table is used to find out whether a given block of data on disk is also in a memory buffer.

There are four free lists. The locked buffers always stay in the cache. This feature was intended for "superblocks" of mounted filesystems, but was never used because of a deadlock problem with processes trying to share access to superblocks. The LRU list is ordered by recency of use. The AGE list contains buffers moved from the LRU list, that have not been accessed recently. The EMPTY list contains buffer headers that have no real memory allocated to them. When a buffer is needed, it ordinarily is taken from the AGE list, and if that list is empty one is taken from the head of the LRU list.

Unification with VM Paging System

Observe that the disk block buffering functionality and the design considerations are very similar to those of the paging system that is needed to support virtual memory. Some simplicity and economy of design can be realized by using some of the same software for both. Modern Unix systems, like Linux and Solaris, do this.

I/O Subsystem in the Linux Architecture

There is an excellent book, Linux Device Drivers that is available for downloading and printing over the Internet, free of charge. You can learn a lot about the whole operating system from this book. It covers the interfaces to other parts of the operating system, because device drivers interact with the other parts of the operating system. I recommend you read at least part of it.

The diagram, from that book, gives a good overall view of how the device drivers and I/O system relate to the rest of the OS.

Logical Structure of Linux Device Drivers

  1. Actions initiated by a running process (synchronously) via a kernel call
  2. Actions initiated by a hardware device (asynchronously) via an interrupt
    1. "Top half": called asynchronously, a direct hardware interrupt handler
      • used to notify driver of completion of an I/O operation
      • cannot access most kernel data structures, since it can preempt the kernel
      • generally kept very short and simple
    2. "Bottom half": called by the dispatcher, in response to flag set by the top half
      • can use spinlocks to access all kernel data structures,
        and so can unblock process that was waiting for I/O completion
      • can be longer, since other interrupts are not blocked

Linux device drivers can be separated into three parts, according to how the execution is initiated.

System Call Interface Hierarchy

The actions of an I/O device driver that are initiated by the action of a process begin with a system call. The first part of the control path of a call to the write() function in Linux 2.14 is shown in the figure. More details, including links to the full code, are given below.

System Call Interface Hierarchy: Specifics

  1. library subprogram call -- e.g., write()
  2. kernel trap:
    mov    $0x4,%eax
    int    $0x80
  3. kernel trap handler -- e.g., see system_call in entry.S of Linux kernel 2.14 source code:
    	call *SYMBOL_NAME(sys_call_table)(,%eax,4)
  4. call table -- e.g., see sys_call_table in entry.S of Linux kernel 2.14 source code:
    	.long SYMBOL_NAME(sys_ni_syscall)	/* 0  -  old "setup()" system call*/
    	.long SYMBOL_NAME(sys_exit)
    	.long SYMBOL_NAME(sys_fork)
    	.long SYMBOL_NAME(sys_read)
    	.long SYMBOL_NAME(sys_write)
  5. sys_write in read_write.c of Linux kernel 2.14 source code:
    asmlinkage ssize_t sys_write(unsigned int fd, const char * buf, size_t count)
       if (file->f_op && (write = file->f_op->write) != NULL)
          ret = write(file, buf, count, &file->f_pos);

Observe that at two points the flow of control goes through pointers to functions. The first point where this happens is inside the system_call() function, which uses the value of a register (EAX, on the x86 architecture) to select a function pointer from the vector sys_call_table. The second point where this happens is inside the sys_write() function, which uses the value of file->f_op->write to call the (you could say "overloaded" or "polymorphic") write function that is appropriate to the particular class of file.

Kernel Modules

Linux provides a framework for adding device drivers, based on kernel modules. A kernel module is a kernel component that can be loaded and unloaded dynamically. Besides making it possible to develop and compile device drivers independently from the rest of the kernel, it also allows the size of the kernel memory footprint to be reduced, by loading only those modules that are needed for the set of devices that are installed and active at any one time.

I/O Subsystem in Win2K Architecture

How much real information can you get from a figure such as this one?

Go back to Section 2.5 and review what it says about the relationship of the I/O subsystem to the rest of the Windows 2000 operating system.

Windows 2000 I/O

© 2002 T. P. Baker & Florida State University. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without written permission. (Last updated by $Author: cop4610 $ on $Date: 2002/09/26 20:06:06 $.)