|COP 4610: OPERATING SYSTEMS & CONCURRENT PROGRAMMING|
The following notes provide more detail on the Unix filesystem internals than is provided by the textbook. Most of that material comes from M. J. Bach's Design of the Unix Operating System.
The notes also provide more detail on the Unix filesystem than is provided by the textbook. Most of that material comes from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.
Note that the term "blocking" was used under process management in the sense of preventing a thread from executing for some reason. Here, it is also used in another sense, meaning how data is organized into contiguous blocks.
As in other chapters, Stallings bounces around between several different viewpoints on files and filesystems, organizing his presentation by areas of functionality.
He tries to be general, most of the time, to explain each of the principles and implementation techniques in its simplest, isolated, form. As a result nothing here exactly matches what is done in a real operating system. Real systems are not so simple or pure. For example, we have already mentioned that in modern Unix and Windows operating systems the same block buffering and disk I/O mechanisms are used for file I/O and the implementation of virtual memory (for page swapping).
When we looked at the I/O system, we saw how the Unix filesystem fits between the the system call layer and the I/O system. The block buffering system and the device drivers are part of the I/O system.
This figure shows how the file management subsystem fits into the Windows 2000 system architecture. Notice that the disk cache manager and the virtual memory manager sit between the NFTFS driver and the lower levels of the I/O system. In this respect, Windows 2000 is similar to Linux, Solaris, and other modern Unix systems.
It is easier to split out and understand the functions of the file management subsystem if we take a simpler view, corresponding to the way operating systems were structured before the introduction of virtual memory.
The diagram in the Stallings text tries to distinguish the file management system and the "operating system". Since the file management system is part of the operating system, Stallings really means the rest of the operating system.
There is a fuzzy line on the side toward the application program. An operating system may provide more or less support for various access methods and internal file structures, and an application (typically a database managment system) may bypass the OS filesystem entirely, by going directly to the I/O layer API of the OS. For example, Unix provides "raw" disk I/O operations for applications that need direct access. Normally, one should not mix raw access and filesystem access on the same device, just as one should not mix buffered and unbuffered I/O on the same file. (In general, mixing higher and lower level abstractions "voids the warranty" of the higher level abstraction, since the integrity of the implementation of the higher level abstraction depends invariants which are usally not documented and which can be violated by direct calls to the next lower layer.)
These principles apply both to the implementation of files, by the OS, and the organization of data within files, by applications.
We will look at some of these issues in more detail, and finally look at the Unix and Windows 2000 filesytem data structures and algorithms. While reading about these specific systems you should think about how each of the above design factors has been taken into account, what trade-offs that have been made, and the degree to which the design works well for various kinds of access.
This study of filesystem implementation design is useful for more than understanding operating systems. The issues and principles here apply also to many other computer applications, and exemplify some important fundamental engineering design principles.
Note that the above is the traditional view of a file. It applies to what we might call a "normal", "regular" or "real" file.
In the context of Chapter 12 of Stallings and in these notes, the term "file" is used to refer to a disk file, unless we say something specific to the contrary.
For uniformity and convenience in naming objects, operating systems have come to use the operating sytem's filesystem name space to hold many other kinds of objects, that do not fit the conventional description of a file. For some examples, see the list given here, and the discussion of the Unix filesystem, further below.
The operating system may also treat as "files" other objects (that don't necessarily even have names), if it makes sense to perform similar operations on them to the ones we normall perform on files, such as opening, closing, reading, and writing.
This follows the object-oriented mode of thinking, in which ojects are classified according to the operations (a.k.a. "methods") that can be applied to them. In this sense, "files" actually need to be divided up into a variety of classes, and traditional files form just one sub-class. The actual classification system will, of course, depend on the specific operating system.
This is knowledge you should already have, from a data structures course. If you do not already know this, it is time to learn it now.
Incidental to the notion of fields and records is the mechanics of how they are accessed, including the treatment of variable lengths and variations in format. In files, it is generally assumed that a record will be laid out sequentially. In the simplest case, the layout of fields within a record is entirely static. The fields are addressed by static offsets, and their lengths are also static. Where variable lengths are involved, the length of a variable-length field must itself be stored as a field, or a delimited must be used. Where multiple variable length fields occur in a record, additional fields (with static offsets) are required to give the offsets of the variable-length fields of the record. In the more general case, the record may have multiple variants, in which case one or more tag fields are provided, to identify the variant.
Here we do not mean "variable" in the sense that the size of a given field may be changed after it is written. We mean that the size of the field may vary, from one record to another.
The size of a field may be specified explicitly, or specified implicitly by means of a terminator. Neither one necessarily takes less space. Using a terminator rules out the terminator as a valid data value (unless we are prepared to encode or pad all data values using extra bits). Therefore, the explicit size approach is preferred.
When a record contains variable-size fields, the data of the variable length fields is placed at the end of the record. For each field (fixed or variable size) there is a fixed part. For the fixed size fields, the fixed part contains the data. For the variable size fields, the fixed part contains the information needed to locate the data. This includes at least an offset. It may also include a size.
If the data parts of variable-sized fields are always laid out in the same order as the fixed parts, one can eliminate the size or terminator, by using the next offset.
If we want to implement a class of record that may have different sets of fields, we need a way to distinguish between the different record variants (types of object within the class). This is done by starting each record with a tag, i.e., an integer value that identifies the type of the record. The tag value tells us which of a finite number of known field layout formats should be applied to the record. Another word for a tag is a "type ID".
By the way, all of the preceding can be applied to record structures in memory as well as to record structures on files. For objects in memory, we can go a bit further. If the object is a full object in the OOP sense, some fields may be pointers to functions (methods) that apply to the object. Alternatively, the method pointers may belong to a separate object (a class descriptor), that can be looked up using the tag or an explicit pointer within the object.
This taxonomy of file operations best fits a structure view of a file as a sequence of records, as compared to the Unix/Linux view of a file as a sequence of bytes.
The material on file structures (coming up next) is important for everyone to know. It has more to do with application programming than operating systems. That is, in modern operating systems you are likely to see, the internal organization (sequential, indexed sequential, etc.) of ordinary files is up to the application, though the system may provide libraries to support various access methods. You should see this material again whenever you take a course on databases, if the course deals with database internals.
This material is relevant to the study of operating systems in that it motivates the design choices regarding file storage allocation and layout. That is, the operating system chooses how files are laid out physically on devices. How files are laid out physically will in turn determine how efficiently the various access methods can be using the operating system. At the extreme, high performance database systems bypass the operating system's file implementation entirely, since they can get better performance by designing their own layout of data on disk.
You should be able to see how this concept led naturally to hierachical index structures with still faster access times, such as the B-tree.
Basic directoy concepts should already be very familiar to you. We will summarize the evolution to the present commonly accepted practice in operating systems, of providing a tree-structured (hierarchical) directory namespace.
This two-level model appeared in some early operating systems, but has been left behind by modern systems.
This picture differs from the Unix directory system in an important respect. Can you identify it?
The notion of access rights should already be fairly familiar to you, from your experience using files in operating systems.
You should be familiar with some Unix API operations for synchronizing file access. What are they? For more detail see the discussion of Unix specifics, further below.
We may have some wasted space, due to inexact fit (internal fragmentation). However, we can calculate the position of a block simply, and we only need a single size of buffer.
By the way, the main context here is disk files. However, the discussion of blocking alternatives applies to both disk and tape storage. Due to the much slower access time of tape devices, blocking decisions have more impact on tape than on disk. Random access on tape is rather impractical, but even sequential access is inefficient if the tape must be stopped and restarted between records. If the tape is stopped, it needs to be backed up before restarting, to get the tape back up to speed again. For optimium performance, the tape should be able to "stream", i.e., the CPU should process the data at least as fast as it can come in (or generate the data at least as fast as it can be written, in the case of output), so that the tape drive never stops. If this cannot be achieved, we would at least like to keep to a minimum the number of stops and restarts. Longer tape blocks generally are better in this respect. On the other hand they require more buffer space. We must have room for at least two buffers.
This allows variable-sized blocks. It avoids wasted space, by splitting logical records. A price is complexity in buffering, and sometimes multiple I/O operations to access a single record.
If record sizes are small enough, unspanned blocking is possible. This solves the problem of extra I/O operations, at the cost of internal fragmentation.
Allocation of storage to files is an OS function. The degree to which the user is required to provide information or influence the allocation may vary. Earlier operating systems invovled the user more. Real time operating system still provide a high degree of user control over file allocation, in order to provide more predictability and more efficiency. For example, a user may preallocate contiguous extents of storage, to allow for faster sequential access and to prevent the possibility of time delays for disk storage allocation during the operation of the application.
Like the other parts of an operating system, the filesystem can be viewed from several different perspectives.
By now, you should all be very familiar with the user views. We will review them briefly in the following notes, but try to move on quickly to the implementation views.
The starting point for users is a abstract view. One way to approach this is via terminology, naming the components and their relationships.
The filesystem namespace is also used to keep track of other kinds of objects that do not otherwise behave like files. For example, there can be named semaphores and named shared memory objects.
Unix systems specify file access permissions via permission bits associated with a file.
Unix systems specify file access permissions via permission bits associated with a file.
These can be modified, usign the chmod() system call.
The group permission bits apply (only) to the members of the owner-group of the file other than the user-owner.
The other-user permission bits apply (only) to the users other than the user-owner and the members of the owner-group of the file.
Try using the shell command ls -la to look at the permnission bits and owners of various files.
"Advisory" means that it is up to the application to honor file locks. The OS keeps track of who has what locked, and informs the application, but does not prevent a process from accessing a file that is locked by another process.
The Solaris (SunOS) man-pages for chmod and fcntl aslo mention "mandatory locking". Can you find out more about what this is?
Ordinary users can view and manipulate files using shell commands, like the following:
> pwd /faculty/baker
> ls -l . total 20 drwxr-xr-x 2 baker fac 4096 Jan 6 11:25 bin -rw------- 1 baker webmast 32 Apr 9 14:34 dead.letter drwxr-xr-x 17 baker fac 4096 Apr 8 05:20 public_html -rw-r--r-- 1 baker fac 23 Jan 9 05:17 setup drwxr-xr-x 3 baker fac 4096 Jan 6 15:43 tools
Try using ls to look at some non-file objects in the feilsystem namespace, e.g.
[baker@ada]$ ls -lat /dev/hda brw-rw---- 1 root disk 3, 0 Apr 11 2002 /dev/hda
[baker@ada]$ ls -lat /proc/27410 total 0 dr-xr-xr-x 3 baker wheel 0 Nov 11 14:58 . -r--r--r-- 1 baker wheel 0 Nov 11 14:58 cmdline -r--r--r-- 1 baker wheel 0 Nov 11 14:58 cpu lrwxrwxrwx 1 baker wheel 0 Nov 11 14:58 cwd -> /home/baker -r-------- 1 baker wheel 0 Nov 11 14:58 environ lrwxrwxrwx 1 baker wheel 0 Nov 11 14:58 exe -> /usr/local/bin/ssh2 dr-x------ 2 baker wheel 0 Nov 11 14:58 fd -r--r--r-- 1 baker wheel 0 Nov 11 14:58 maps -rw------- 1 baker wheel 0 Nov 11 14:58 mem -r--r--r-- 1 baker wheel 0 Nov 11 14:58 mounts lrwxrwxrwx 1 baker wheel 0 Nov 11 14:58 root -> / -r--r--r-- 1 baker wheel 0 Nov 11 14:58 stat -r--r--r-- 1 baker wheel 0 Nov 11 14:58 statm -r--r--r-- 1 baker wheel 0 Nov 11 14:58 status
You have learned how to manipulate files from application programs, using sytem calls, like the following:
The concept of mounting a filesystem comes from magnetic tapes. A tape reel is called a volume. In order to access the contents, the reel must be mounted. After the physical mounting of the reel on the tape drive, by the operator, the operating sytsem logically mounts the volume. The logical mounting involves initializing system data structures that identify the drive as being on line and identify the volume that resides on the drive. The mounting operation is the analog for an entire filesystem of the opening operation on a single file.
Partitions of fixed disk drives are always physically mounted, but they must be logically mounted by the file management system before the operating system can access the files on them. In the Unix and Windows 2000 operating systems, mounting a disk volume has the effect "grafting" it into the filesystem directory tree. The "root" directory of the volume is identified with an existing directory in the filesytem hierarchy. After that, software that uses the directory system can access the contents of the volume transparently, without being aware of the boundaries between volumes, or transitions between different types of filesystems.
#include <sys/types.h> #include <sys/mount.h> #include <sys/mntent.h> int mount (const char *spec, /* path to a device (block special file) */ const char *dir, /* path to the directory on which to mount */ int mflag, /* bit vector to select mounting options */ char *fstype, /* type of filesystem */ char *dataptr, /* filesystem-specific data (optional) */ int datalen, char *optptr, /* filesystem-specific options (optional) */ int optlen);
The mount operation can be called via a system API call. The interface shown here is for Solaris (SunOS). The API for Linux is slightly different. (This is not an area standardization has reached yet). On all Unix systems there is also a utility program, also called mount, that can be called from a shell to perform the same operation. Obviously, these are privileged operations.
Observe that this operation is "polymorphic" or "overloaded", for all the different types of filesystems. Certain parameters are common to all members of the "class" filesystem. Other parameters are specific to the particular type of filesystem.
You can see that there are many filesystem types. You may recognize the traditional DOS FAT filesystem type (msdos), the Windows 95 filesytem type (vfat), the Windows NTFS filesystem type (ntfs), the Network File System (NFS) type (nfs), the proc filesystem type (proc), the ISO CD-ROM filesytem type (iso9660), and the native Linux filesystem types (ext, ext2, ext3).
Linux hides differences between filesystem types by a virtual filesystem (VFS) layer.
linux>more /etc/mtab /dev/hda5 / ext3 rw 0 0 none /proc proc rw 0 0 /dev/hda1 /boot ext3 rw 0 0 none /dev/pts devpts rw,gid=5,mode=620 0 0 none /dev/shm tmpfs rw 0 0 /dev/hda8 /www/vhosts reiserfs rw 0 0
linux>more /etc/fstab LABEL=/ / ext3 defaults 1 1 LABEL=/boot /boot ext3 defaults 1 2 none /dev/pts devpts gid=5,mode=620 0 0 none /proc proc defaults 0 0 none /dev/shm tmpfs defaults 0 0 /dev/hda2 swap swap defaults 0 0 /dev/sda1 swap swap defaults 0 0 /dev/cdrom /mnt/cdrom iso9660 noauto,owner,kudzu,ro 0 0 /dev/fd0 /mnt/floppy auto noauto,owner,kudzu 0 0 #/dev/sda2 /www reiserfs defaults 0 0 /dev/hda8 /www/vhosts ext3 defaults 1 1
At system initialization time, the file /etc/fstab is read to determine which filesystems to mount. Each row of the table provides a set of parameters to the mount operation. The system file /etc/mtab keeps track of which files are currently mounted.
The starting point for understanding how the filesystem is implemented is an abstract view of the implementation.
Each filesystem occupies a continguous range of disk sectors, called a partition, and is organized internally as follows:
For fault tolerance, redundant copies of the superblock are stored, on different cylinders and different platters of the disk drive. This reduces the chance that a disk media failure will result in corruption of the entire file system.
Copies of the inodes of a mounted filesystem are kept in RAM, for efficiency. Updates are done on the in-RAM copies, which are periodically copied out to disk.
What is not in the inode, that you might expect would be?
Why is this information not kept on disk?
What is the purpose of each of the above data items?
What happens if the system goes down and the in-RAM copies of the inodes are not written out?
Direct and Indirect blocks are also buffered in RAM. When we are lucky they also stay in memory, if there is enough disk cache space.
Block buffers are looked up using a hash table, with chaining.
The actual buffers are represented in the hash table by their headers.
The "data area" of the buffer header is what actually holds a disk block's worth of data
What is the purpose of this scheme?
How does allocating more RAM for disk cache affect the performance of the system?
How does this relate to paged virtual memory?
Each file has one inode, which may point to direct and indirect blocks.
How may there may be "holes" in a file?
Why might a core dump file have holes?
What are the four segments of a process virtual memory?
How might a sparse file be useful in constructing a database with efficient lookup?
How does the read access time depend on the position of a block in the file?
What kinds of fragmentation-like performance degrading effects is such a filesystem subject to?
How is performance affected by the system's choice of logical block size (as compared to physical block size of the device)?
A list of some of the free inodes is kept in the superblock, for efficiency.
When this list is empty, the disk is searched for more free inodes (filetype = 0), starting where the last free inode was found.
Free disk blocks are kept on a linked list, in which each free block is viewed as an array of pointers to other free blocks. (This is not a tree, though. Why not?)
How is this different from the free buffer list?
Is this list on disk or in RAM?
If you have studied databases before, some of the concepts here will already be familiar to you, since the problem here is essentially the same as the transaction problem encountered in the implementation of databases. In fact, since filesystems came before databases, this is where the problem and the foundational solutions originated. The filesystem internal data structures and directory structure of an OS amount to a specific kind of database. Updating the filesystem, with the more complex (and higher performance) forms of filesystems, involves modification to more than one disk block, and while the modifications are in progress the data structures on the disk may not be internally consistent. Since everything else in the system depends on these structures, being able to restore consistency after a system crash is critical.
All of these are concerned mainly with preserving the integrity of the fileystem internal structures, not user data. That is, the objective is to prevent inconsistencies in the directory structure and the lower level structures used to access individual files and disguish allocated from free space on the disk. Loss of all or portions of individual files is still possible.
With the Unix and NTFS filesytems, a user or application program can improve reliability, if desired, by forcing disk buffer cache flushing (e.g., see fsync API) at critical points.
The program fsck checks a filesystem for inconsistencies, and then attempts to repair them.
How could each of these situations arise?
How might each of these situations be repaired?
e.g.: Situation (1) could happen if the kernel frees a disk block in a file, returns the block number to the in-RAM copy of the super block, and allocates the block to a new file; the kernel writes the inode and blocks of the new file to disk, but crashes before updating the inode of the old file to the disk. One possible repair would be to duplicate the block and assign the duplicate to one of the inodes. This would probably result in the old file containing a block of garbage, since the data in the block is coming from the new file. An alternative would be to simply delete the block from the least recently modified inode, leaving a hole in the file. This is one of the hardest cases to repair satisfactorily. Some of the other cases are much simpler and more likely to come out right.
We do not have time to go into the implementation code in detail. However, you might like to glance at some of the Linux filesystem implementation code. You can download and browse the full Linux source tree, or just look at the following two sample files:
Like most modern operating systems, Windows 2000 supports several different types of filesystem. We will look at two of them in more detail.
Two copies of the FAT are kept, for reliability. File Allocation Table (FAT) keeps track of which blocks are in which file.
Directory entry points to first FAT entry of file. Each FAT entry points to next cluster in the file, if any.
The diagram is adapted from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.
|Volume Size (MB)||Cluster Size (KB)|
|Volume Size||Cluster Size (KB)|
|32 MB - 8 GB||4|
|8 GB - 16 GB||8|
|16 GB - 32 GB||16|
The data are from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.
Additional informatoin on NFTS can be found at http://linux-ntfs.sourceforge.net/ntfs/index.html.
|0||$Mft - MFT|
|1||$MftMirr - MFT mirror|
|2||$LogFile - log file|
|3||$Volume - volume file|
|4||$AttrDef - attribute definition table|
|5||\ - root directory|
|6||$Bitmap - volume cluster allocation file|
|7||$Boot - boot sector|
|8||$BadClus - bad cluster file|
|9||$Secure - security settings file|
|10||$UpCase - uppercase character mapping|
|11||$Extend - extended metadata directory|
|16||User files and directories|
These are implemented as files, but are used to keep information about the filesystem.
This is our "second resort", for files that do not fit entirely into one MFT file record. The size of file that may be represented in this way is limited by the size of the file and the degree of fragmentation of the filesystem's free storage.
If we need more runs than can fit inside one MFT file record, we need to switch over to the "third resort". That is to split up the attributes (including the data attribute header) into more than one MFT record. This is done by adding an "attributes list" attribute. The attributes list attribute contains the name, type code, and MFT file record reference for each attribute of the file.
Note that Microsoft uses the term fragmentation in a extended way from the usage up to now in Stallings. Up to now, we have spoken only of storage being fragmented. MS says a file is fragmented if it is broken into non-contiguous blocks. This is neither "internal fragmentation" nor "external fragmentation" in the senses used by Stallings for storage management. The connection to fragmentation is that if the disk's free storage is fragmented (external fragmentation sense), we will be forced to allocate the storage of a file in noncontiguous blocks, rendering the file fragmented (in the MS sense).
If a directory is small enough, the directory entries are stored directly within the MFT record of the directory. Each directory entry contains:
The duplicate data from the MFT record is speeds directory content listing.
Directory entries are packed, so the number that fit depends on the lengths of the filenames.
If the directory entries don't all fit in the MFT record of the directory, some are put into index buffers.
The figure is simplified in several respects. The filenames "filen" are just examples, chosen to indicate that the filenames are in lexicographic order. The file entries are packed into the VCNs, so that each VCN can typically hold 20 or 30 of them, rather than just one.
The index allocation attribute of the directory specifies the mapping from LCN's to VCN's, that is used to find thei index buffers on the disk. The bitmap attribute specifies whether each VCN of an index buffer is in use.
How does this compare with the Unix directory structure described above?
The NTFS directory index structures are B+ trees. They keep the data in sorted order, and provide O(logkn) access, where n is between 20 and 30. B+ trees are normally covered in database courses, since they are one of the basic implementation techniques for databases. If you have not seen B+ trees before, you might want to look at http://www.seanster.com/BplusTree/BplusTree.html for a tutorial on them.
The B+ tree indexin scheme can be used for indexing on other attributes, besides filenames. For example, a file can have an $OBJECT_ID attribute. There is an API that allows opening a file using its object ID. There is a mapping from object IDs to file reference numbers in the \$EXtend\$ObjeID metadata file. This is another B+ tree.
Logging is not unique to Windows NT/2000. It has been a feature of mainframe operating systems for years. The newer Unix/Linux filesystems (e.g., Linux ext3fs) also support logging.
This is a standard technique from databases. The checkpoint records limit the distance it is necessary to go back into the log file, to recover after a crash. In turn, this allows some of the log file storage to be recovered, so that the log file "normally" does not fill up.
However, there is still a possibility of the log file space (which is preallocated) overflowing. This is possible with the NFTS. When it is about to happen, the OS essentially shuts down all other filesystem activities (file creation and deletion), and performs and emergency flush operation on the buffer cache. Note that this "short pause for I/O processing" could be horrible for a criticial realtime system.
If you are following this, you should be able to answer the following question:
Since the "redo" records contain enough information to perform the real disk I/O operation, why are we "doing the operation twice"? That is, we first write the redo record, and later write the actual disk record. These are obviously redundant. Why can't we just write the real disk record, and do it only once?
|© 2002 T. P. Baker & Florida State University. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without written permission. (Last updated by $Author: cop4610 $ on $Date: 2002/09/26 20:06:06 $.)|