COP 4610: OPERATING SYSTEMS & CONCURRENT PROGRAMMING

File Management

 

Topics


The following notes provide more detail on the Unix filesystem internals than is provided by the textbook. Most of that material comes from M. J. Bach's Design of the Unix Operating System.

The notes also provide more detail on the Windows 2000 filesystem than is provided by the textbook. Most of that material comes from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.

Note that the term "blocking" was used under process management in the sense of preventing a thread from executing for some reason. Here, it is also used in another sense, meaning how data is organized into contiguous blocks.


Multiple Views


As in other chapters, Stallings bounces around between several different viewpoints on files and filesystems, organizing his presentation by areas of functionality.

He tries to be general, most of the time, to explain each of the principles and implementation techniques in its simplest, isolated, form. As a result nothing here exactly matches what is done in a real operating system. Real systems are not so simple or pure. For example, we have already mentioned that in modern Unix and Windows operating systems the same block buffering and disk I/O mechanisms are used for file I/O and the implementation of virtual memory (for page swapping).


File Management


Objectives for a File Management System


File System in Traditional Unix Architecture


When we looked at the I/O system, we saw how the Unix filesystem fits between the system call layer and the I/O system. The block buffering system and the device drivers are part of the I/O system.


File System Software Architecture - Windows NTFS Components

This figure shows how the file management subsystem fits into the Windows 2000 system architecture. Notice that the disk cache manager and the virtual memory manager sit between the NTFS driver and the lower levels of the I/O system. In this respect, Windows 2000 is similar to Linux, Solaris, and other modern Unix systems.


File System Software Architecture - a Simple Model

It is easier to split out and understand the functions of the file management subsystem if we take a simpler view, corresponding to the way operating systems were structured before the introduction of virtual memory.


Device Drivers


Basic File System


Basic I/O Supervisor


Logical I/O


Access Method


File Management & OS Concerns


The diagram in the Stallings text tries to distinguish the file management system and the "operating system". Since the file management system is part of the operating system, Stallings really means the rest of the operating system.

There is a fuzzy line on the side toward the application program. An operating system may provide more or less support for various access methods and internal file structures, and an application (typically a database management system) may bypass the OS filesystem entirely, by going directly to the I/O layer API of the OS. For example, Unix provides "raw" disk I/O operations for applications that need direct access. Normally, one should not mix raw access and filesystem access on the same device, just as one should not mix buffered and unbuffered I/O on the same file. (In general, mixing higher and lower level abstractions "voids the warranty" of the higher level abstraction, since the integrity of the implementation of the higher level abstraction depends on invariants which are usually not documented and which can be violated by direct calls to the next lower layer.)


File Management Functions


Criteria for File Organization


These principles apply both to the implementation of files, by the OS, and the organization of data within files, by applications.


General Design Considerations for the Filesystem Implementation


We will look at some of these issues in more detail, and finally look at the Unix and Windows 2000 filesystem data structures and algorithms. While reading about these specific systems you should think about how each of the above design factors has been taken into account, what trade-offs have been made, and the degree to which the design works well for various kinds of access.

This study of filesystem implementation design is useful for more than understanding operating systems. The issues and principles here apply also to many other computer applications, and exemplify some important fundamental engineering design principles.


Terms Used with Files

Field
Record
File
Database
Volume

Note that the above is the traditional view of a file. It applies to what we might call a "normal", "regular" or "real" file.


Real Files versus Other Non-File Objects


In the context of Chapter 12 of Stallings and in these notes, the term "file" is used to refer to a disk file, unless we say something specific to the contrary.

For uniformity and convenience in naming objects, operating systems have come to use the operating system's filesystem name space to hold many other kinds of objects that do not fit the conventional description of a file. For some examples, see the list given here, and the discussion of the Unix filesystem, further below.

The operating system may also treat as "files" other objects (that don't necessarily even have names), if it makes sense to perform on them operations similar to the ones we normally perform on files, such as opening, closing, reading, and writing.

This follows the object-oriented mode of thinking, in which objects are classified according to the operations (a.k.a. "methods") that can be applied to them. In this sense, "files" actually need to be divided up into a variety of classes, and traditional files form just one sub-class. The actual classification system will, of course, depend on the specific operating system.


Handling Variable-Length Objects & Sub-Objects


This is knowledge you should already have, from a data structures course. If you do not already know this, it is time to learn it now.

Incidental to the notion of fields and records is the mechanics of how they are accessed, including the treatment of variable lengths and variations in format. In files, it is generally assumed that a record will be laid out sequentially. In the simplest case, the layout of fields within a record is entirely static. The fields are addressed by static offsets, and their lengths are also static. Where variable lengths are involved, the length of a variable-length field must itself be stored as a field, or a delimiter must be used. Where multiple variable length fields occur in a record, additional fields (with static offsets) are required to give the offsets of the variable-length fields of the record. In the more general case, the record may have multiple variants, in which case one or more tag fields are provided, to identify the variant.


Variable Length Fields


Here we do not mean "variable" in the sense that the size of a given field may be changed after it is written. We mean that the size of the field may vary, from one record to another.

The size of a field may be specified explicitly, or specified implicitly by means of a terminator. Neither one necessarily takes less space. Using a terminator rules out the terminator as a valid data value (unless we are prepared to encode or pad all data values using extra bits). Therefore, the explicit size approach is preferred.
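As a concrete sketch of the explicit-size approach, the following hypothetical C routines store a field as a two-byte length prefix followed by its data. Because no terminator is needed, the data may contain any byte value, which is exactly the advantage claimed above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical explicit-size encoding: each field is stored as a
   two-byte big-endian length prefix followed by its data bytes. */
size_t put_field(uint8_t *buf, const void *data, uint16_t len) {
    buf[0] = (uint8_t)(len >> 8);
    buf[1] = (uint8_t)(len & 0xFF);
    memcpy(buf + 2, data, len);
    return 2 + (size_t)len;          /* bytes consumed in the buffer */
}

/* Decode a field: store its length and return a pointer to its data. */
const uint8_t *get_field(const uint8_t *buf, uint16_t *len) {
    *len = (uint16_t)((buf[0] << 8) | buf[1]);
    return buf + 2;
}
```

A delimited encoding would instead have to reserve one byte value as the terminator, or escape it wherever it occurs in the data.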


Variable Size Records


When a record contains variable-size fields, the data of the variable length fields is placed at the end of the record. For each field (fixed or variable size) there is a fixed part. For the fixed size fields, the fixed part contains the data. For the variable size fields, the fixed part contains the information needed to locate the data. This includes at least an offset. It may also include a size.


A More Compact Way


If the data parts of variable-sized fields are always laid out in the same order as the fixed parts, one can eliminate the size or terminator, by using the next offset.
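The layout above, including the compact "next offset" trick, can be sketched in C as follows (a toy illustration, not any particular system's record format). The fixed part is a table of offsets with one extra end-marker entry, so each field's size is recovered from the following field's offset:

```c
#include <string.h>

#define NFIELDS 3

/* Record layout: a fixed table of offsets (one extra entry marking the
   end of the data), followed by the variable-length field data laid out
   in order. No per-field size or terminator is stored. */
struct record {
    unsigned short offsets[NFIELDS + 1];
    char data[256];
};

void build_record(struct record *r, const char *fields[NFIELDS],
                  const unsigned short lens[NFIELDS]) {
    unsigned short pos = 0;
    for (int i = 0; i < NFIELDS; i++) {
        r->offsets[i] = pos;
        memcpy(r->data + pos, fields[i], lens[i]);
        pos += lens[i];
    }
    r->offsets[NFIELDS] = pos;   /* end marker gives last field's size */
}

/* Size of field i is just the difference of adjacent offsets. */
unsigned short field_len(const struct record *r, int i) {
    return r->offsets[i + 1] - r->offsets[i];
}
```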


Variant Records


If we want to implement a class of record that may have different sets of fields, we need a way to distinguish between the different record variants (types of object within the class). This is done by starting each record with a tag, i.e., an integer value that identifies the type of the record. The tag value tells us which of a finite number of known field layout formats should be applied to the record. Another word for a tag is a "type ID".

By the way, all of the preceding can be applied to record structures in memory as well as to record structures on files. For objects in memory, we can go a bit further. If the object is a full object in the OOP sense, some fields may be pointers to functions (methods) that apply to the object. Alternatively, the method pointers may belong to a separate object (a class descriptor), that can be looked up using the tag or an explicit pointer within the object.
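In C, a variant record is conventionally expressed as a tagged union. Here is a hypothetical two-variant example; the tag field selects which layout of the union is valid, and code dispatches on it:

```c
/* Hypothetical two-variant record: the tag (type ID) selects the layout. */
enum rec_tag { REC_POINT = 1, REC_CIRCLE = 2 };

struct rec {
    enum rec_tag tag;           /* tag field identifying the variant */
    union {
        struct { int x, y; } point;
        struct { int x, y, radius; } circle;
    } u;
};

/* Toy dispatch on the tag: how many fields does this variant carry? */
int rec_dim(const struct rec *r) {
    switch (r->tag) {
    case REC_POINT:  return 2;   /* x, y */
    case REC_CIRCLE: return 3;   /* x, y, radius */
    default:         return 0;   /* unknown variant */
    }
}
```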


Typical File Operations


This taxonomy of file operations best fits a structured view of a file as a sequence of records, as compared to the Unix/Linux view of a file as a sequence of bytes.


File Organization


The material on file structures (coming up next) is important for everyone to know. It has more to do with application programming than operating systems. That is, in modern operating systems you are likely to see, the internal organization (sequential, indexed sequential, etc.) of ordinary files is up to the application, though the system may provide libraries to support various access methods. You should see this material again whenever you take a course on databases, if the course deals with database internals.

This material is relevant to the study of operating systems in that it motivates the design choices regarding file storage allocation and layout. That is, the operating system chooses how files are laid out physically on devices. How files are laid out physically will in turn determine how efficiently the various access methods can be implemented using the operating system. At the extreme, high performance database systems bypass the operating system's file implementation entirely, since they can get better performance by designing their own layout of data on disk.


The "Pile"

  • Data are collected in the order they arrive
  • Purpose is to accumulate a mass of data and save it
  • Records may have different fields
  • No file structure
  • Records must be self-describing
  • Description techniques include:
    • record and field delimiters
    • record and field tags
    • record and field lengths
    • record and field names
  • Record access is by exhaustive search

Sequential File

  • Fixed format used for records (normally)
  • Records are the same length (normally)
  • All fields the same (order and length) from one record to the next
  • Field names and lengths are attributes of the file
  • One field is the key field
  • Uniquely identifies the record
  • Records are stored in key sequence
  • New records are placed temporarily in a log file or transaction file
  • Batch update is performed to merge the log file with the master file

Indexed Sequential File

  • Index provides a lookup capability to quickly reach the vicinity of the desired record
  • Contains key field and a pointer to the main file
  • Index is searched to find highest key value that is equal or less than the desired key value
  • Search continues in the main file at the location indicated by the pointer
  • New records are added to an overflow file
  • Record in main file that precedes it is updated to contain a pointer to the new record
  • The overflow is merged with the main file during a batch update
  • Multiple indexes for the same key field can be set up to increase efficiency

You should be able to see how this concept led naturally to hierarchical index structures with still faster access times, such as the B-tree.
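The index search described above (find the highest key value that is equal or less than the desired key) is a floor lookup over a sorted table, which can be sketched with a binary search (illustrative code, not from any particular system):

```c
/* Index entry: a key and a pointer (block number) into the main file. */
struct index_entry { int key; int block; };

/* Return the position of the entry with the highest key <= want,
   or -1 if every key is greater. Entries must be sorted by key.
   The search then continues sequentially in the main file from
   idx[result].block. */
int index_lookup(const struct index_entry *idx, int n, int want) {
    int lo = 0, hi = n - 1, best = -1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (idx[mid].key <= want) { best = mid; lo = mid + 1; }
        else                      { hi = mid - 1; }
    }
    return best;
}
```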


Comparison of sequential and indexed sequential


Indexed File

  • Uses multiple indexes for different key fields
  • May contain an exhaustive index that contains one entry for every record in the main file
  • May contain a partial index

Direct or Hashed File


File Directories


Basic directory concepts should already be very familiar to you. We will summarize the evolution to the present commonly accepted practice in operating systems, of providing a tree-structured (hierarchical) directory namespace.


Simple Structure for a Directory


Two-level Scheme for a Directory


This two-level model appeared in some early operating systems, but has been left behind by modern systems.


Hierarchical, or Tree-Structured Directory

  • Master directory with user directories underneath it
  • Each user directory may have subdirectories and files as entries
  • Files can be located by following a path from the root, or master, directory down various branches
  • This is the pathname for the file
  • Can have several files with the same file name as long as they have unique path names
  • Current directory is the working directory
  • Files are referenced relative to the working directory

Hierarchical, or Tree-Structured Directory: More Detailed View


This picture differs from the Unix directory system in an important respect. Can you identify it?


Access Rights


The notion of access rights should already be fairly familiar to you, from your experience using files in operating systems.


Controlling Simultaneous Access


You should be familiar with some Unix API operations for synchronizing file access. What are they? For more detail see the discussion of Unix specifics, further below.


Data Blocking


Fixed Blocking


We may have some wasted space, due to inexact fit (internal fragmentation). However, we can calculate the position of a block simply, and we only need a single size of buffer.
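The simple position calculation can be sketched as follows. Records per block comes from integer division, and the remainder is the internally fragmented space at the end of each block:

```c
/* With fixed blocking, locating a record is pure arithmetic:
   records per block = block_size / rec_size (integer division;
   the leftover bytes in each block are internal fragmentation). */
void locate_record(long rec_no, long rec_size, long block_size,
                   long *block, long *offset) {
    long per_block = block_size / rec_size;
    *block  = rec_no / per_block;               /* which block */
    *offset = (rec_no % per_block) * rec_size;  /* where in the block */
}
```

For example, with 120-byte records in 512-byte blocks, four records fit per block and 32 bytes per block are wasted.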

By the way, the main context here is disk files. However, the discussion of blocking alternatives applies to both disk and tape storage. Due to the much slower access time of tape devices, blocking decisions have more impact on tape than on disk. Random access on tape is rather impractical, but even sequential access is inefficient if the tape must be stopped and restarted between records. If the tape is stopped, it needs to be backed up before restarting, to get the tape back up to speed again. For optimum performance, the tape should be able to "stream", i.e., the CPU should process the data at least as fast as it can come in (or generate the data at least as fast as it can be written, in the case of output), so that the tape drive never stops. If this cannot be achieved, we would at least like to keep to a minimum the number of stops and restarts. Longer tape blocks generally are better in this respect. On the other hand they require more buffer space. We must have room for at least two buffers.


Variable Blocking: Spanned


This allows variable-sized blocks. It avoids wasted space, by splitting logical records. A price is complexity in buffering, and sometimes multiple I/O operations to access a single record.


Variable Blocking: Unspanned


If record sizes are small enough, unspanned blocking is possible. This solves the problem of extra I/O operations, at the cost of internal fragmentation.


Secondary Storage Management


Allocation of storage to files is an OS function. The degree to which the user is required to provide information or influence the allocation may vary. Earlier operating systems involved the user more. Real-time operating systems still provide a high degree of user control over file allocation, in order to provide more predictability and more efficiency. For example, a user may preallocate contiguous extents of storage, to allow for faster sequential access and to prevent the possibility of time delays for disk storage allocation during the operation of the application.


Preallocation


Methods of File Allocation


Contiguous Allocation

  • Single set of blocks is allocated to a file at the time of creation
  • Only a single entry in the file allocation table:
    (starting block, length of the file)
  • External fragmentation will occur,
    requiring periodic compaction

Contiguous File Allocation (After Compaction)


Chained Allocation

  • Allocation on basis of individual block
  • Each block contains a pointer to the next block in the chain
  • Only a single entry in the file allocation table:
    (starting block, length of file)
  • No external fragmentation
  • Best for sequential files
  • No accommodation of the principle of locality
    Periodic consolidation improves performance

Chained Allocation (after Consolidation)


Indexed Allocation with Block Portions

  • File allocation table contains a separate one-level index for each file
  • The index has one entry for each portion allocated to the file
  • The file allocation table contains the block number of the index

Indexed Allocation with Variable-Length Portion


Free Space Management (Disk Allocation Table)


Unix File System Views

Like the other parts of an operating system, the filesystem can be viewed from several different perspectives.

By now, you should all be very familiar with the user views. We will review them briefly in the following notes, but try to move on quickly to the implementation views.


Abstract View: Concepts and Terms

The starting point for users is an abstract view. One way to approach this is via terminology, naming the components and their relationships.


Abstract View: Directories


The filesystem namespace is also used to keep track of other kinds of objects that do not otherwise behave like files. For example, there can be named semaphores and named shared memory objects.


Abstract View: Visualization of Directories


Abstract View: Visualization of Filesystem Namespace


File Sharing




File Access Permissions


Unix systems specify file access permissions via permission bits associated with a file.

These can be modified, using the chmod() system call.

The group permission bits apply (only) to the members of the owner-group of the file other than the user-owner.

The other-user permission bits apply (only) to the users other than the user-owner and the members of the owner-group of the file.

Try using the shell command ls -la to look at the permission bits and owners of various files.


Unix Management of Simultaneous Access to Files


"Advisory" means that it is up to the application to honor file locks. The OS keeps track of who has what locked, and informs the application, but does not prevent a process from accessing a file that is locked by another process.

The Solaris (SunOS) man-pages for chmod and fcntl also mention "mandatory locking". Can you find out more about what this is?


Abstract View: Types of Files

Ordinary/Regular file
Contains text or binary data.
Executable if the "execute bit" is set.
A "magic number" inside the file, or the filename extension (e.g. ".c"), is sometimes used to guess the contents.
Directory
Contains links to other files, including directories.
Named pipe
A FIFO buffer.
Device special file
An interface to a hardware device.
Has a major and a minor device number.
Classified as block or character.

Shell View

Ordinary users can view and manipulate files using shell commands, like the following:

> pwd
/faculty/baker
> ls -l .
total 20
drwxr-xr-x    2 baker    fac          4096 Jan  6 11:25 bin
-rw-------    1 baker    webmast        32 Apr  9 14:34 dead.letter
drwxr-xr-x   17 baker    fac          4096 Apr  8 05:20 public_html
-rw-r--r--    1 baker    fac            23 Jan  9 05:17 setup
drwxr-xr-x    3 baker    fac          4096 Jan  6 15:43 tools
> rm dead.letter

Try using ls to look at some non-file objects in the filesystem namespace, e.g.

[baker@ada]$ ls -lat /dev/hda
brw-rw----    1 root     disk       3,   0 Apr 11  2002 /dev/hda
[baker@ada]$ ls -lat /proc/27410
total 0
dr-xr-xr-x    3 baker    wheel           0 Nov 11 14:58 .
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 cmdline
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 cpu
lrwxrwxrwx    1 baker    wheel           0 Nov 11 14:58 cwd -> /home/baker
-r--------    1 baker    wheel           0 Nov 11 14:58 environ
lrwxrwxrwx    1 baker    wheel           0 Nov 11 14:58 exe -> /usr/local/bin/ssh2
dr-x------    2 baker    wheel           0 Nov 11 14:58 fd
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 maps
-rw-------    1 baker    wheel           0 Nov 11 14:58 mem
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 mounts
lrwxrwxrwx    1 baker    wheel           0 Nov 11 14:58 root -> /
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 stat
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 statm
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 status

API View

You have learned how to manipulate files from application programs, using system calls, like the following:


Example: Unix API for Synchronizing Access to Files
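
As a sketch of the advisory locking API, the following POSIX fcntl() call places an exclusive lock on an entire file. A cooperating process's own F_SETLK request would then fail, but a process that never asks is not prevented from accessing the file:

```c
#include <fcntl.h>
#include <unistd.h>

/* Place an advisory exclusive (write) lock on a whole file.
   Returns 0 on success, -1 if another process holds a conflicting
   lock (or on error). */
int lock_whole_file(int fd) {
    struct flock fl = {0};
    fl.l_type = F_WRLCK;     /* exclusive; F_RDLCK would be a shared lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* length 0 means "to end of file" */
    return fcntl(fd, F_SETLK, &fl);   /* non-blocking; F_SETLKW waits */
}
```

The lock is released explicitly with F_UNLCK, or automatically when the process closes the file or exits.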


Filesystems & Mounting


The concept of mounting a filesystem comes from magnetic tapes. A tape reel is called a volume. In order to access the contents, the reel must be mounted. After the physical mounting of the reel on the tape drive, by the operator, the operating system logically mounts the volume. The logical mounting involves initializing system data structures that identify the drive as being on line and identify the volume that resides on the drive. The mounting operation is the analog for an entire filesystem of the opening operation on a single file.

Partitions of fixed disk drives are always physically mounted, but they must be logically mounted by the file management system before the operating system can access the files on them. In the Unix and Windows 2000 operating systems, mounting a disk volume has the effect of "grafting" it into the filesystem directory tree. The "root" directory of the volume is identified with an existing directory in the filesystem hierarchy. After that, software that uses the directory system can access the contents of the volume transparently, without being aware of the boundaries between volumes, or transitions between different types of filesystems.


Filesystems & Mounting

#include <sys/types.h>
#include <sys/mount.h>
#include <sys/mntent.h>

int mount
  (const char *spec, /* path to a device (block special file) */
   const char *dir,  /* path to the directory on which to mount */
   int mflag,        /* bit vector to select mounting options */
   char *fstype,     /* type of filesystem */
   char *dataptr,    /* filesystem-specific data (optional) */
   int datalen,    
   char *optptr,     /* filesystem-specific options (optional) */
   int optlen);

The mount operation can be called via a system API call. The interface shown here is for Solaris (SunOS). The API for Linux is slightly different. (This is not an area that standardization has reached yet.) On all Unix systems there is also a utility program, also called mount, that can be called from a shell to perform the same operation. Obviously, these are privileged operations.

Observe that this operation is "polymorphic" or "overloaded", for all the different types of filesystems. Certain parameters are common to all members of the "class" filesystem. Other parameters are specific to the particular type of filesystem.


Generic Filesystem Mounting Flags


Filesystem Types Supported by Linux

adfs     affs     autofs   coda     coherent cramfs
devpts   efs      ext      ext2     ext3     hfs
hpfs     iso9660  jfs      minix    msdos    ncpfs
nfs      ntfs     proc     qnx4     reiserfs romfs
smbfs    sysv     tmpfs    udf      ufs      umsdos
vfat     xenix    xfs      xiafs

You can see that there are many filesystem types. You may recognize the traditional DOS FAT filesystem type (msdos), the Windows 95 filesystem type (vfat), the Windows NTFS filesystem type (ntfs), the Network File System (NFS) type (nfs), the proc filesystem type (proc), the ISO CD-ROM filesystem type (iso9660), and the native Linux filesystem types (ext, ext2, ext3).

Linux hides differences between filesystem types by a virtual filesystem (VFS) layer.


Linux System Configuration Files for Mounting

linux>more /etc/mtab
/dev/hda5 / ext3 rw 0 0
none /proc proc rw 0 0
/dev/hda1 /boot ext3 rw 0 0
none /dev/pts devpts rw,gid=5,mode=620 0 0
none /dev/shm tmpfs rw 0 0
/dev/hda8 /www/vhosts reiserfs rw 0 0
linux>more /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
LABEL=/boot             /boot                   ext3    defaults        1 2
none                    /dev/pts                devpts  gid=5,mode=620  0 0
none                    /proc                   proc    defaults        0 0
none                    /dev/shm                tmpfs   defaults        0 0
/dev/hda2               swap                    swap    defaults        0 0
/dev/sda1		swap			swap	defaults	0 0
/dev/cdrom              /mnt/cdrom              iso9660 noauto,owner,kudzu,ro 0 0
/dev/fd0                /mnt/floppy             auto    noauto,owner,kudzu 0 0
#/dev/sda2		/www			reiserfs	defaults 0 0
/dev/hda8		/www/vhosts		ext3	defaults	1 1

At system initialization time, the file /etc/fstab is read to determine which filesystems to mount. Each row of the table provides a set of parameters to the mount operation. The system file /etc/mtab keeps track of which files are currently mounted.
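Each fstab row is just six whitespace-separated fields: device, mount point, filesystem type, options, dump flag, and fsck pass number. A toy parser for one row (not how the real mount utility does it) might look like:

```c
#include <string.h>

/* Split one /etc/fstab row, in place, into its six fields:
   device, mount point, fs type, options, dump flag, fsck pass.
   Returns the number of fields found. */
int split_fstab_line(char *line, char *fields[6]) {
    int n = 0;
    for (char *tok = strtok(line, " \t\n"); tok != NULL && n < 6;
         tok = strtok(NULL, " \t\n"))
        fields[n++] = tok;
    return n;
}
```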


Abstract Internal View

The starting point for understanding how the filesystem is implemented is an abstract view of the implementation.


Traditional Unix Filesystem (Volume) Structure

Each filesystem occupies a contiguous range of disk sectors, called a partition, and is organized internally as follows:

For fault tolerance, redundant copies of the superblock are stored, on different cylinders and different platters of the disk drive. This reduces the chance that a disk media failure will result in corruption of the entire file system.


Contents of Superblock


Contents of an Inode

Copies of the inodes of a mounted filesystem are kept in RAM, for efficiency. Updates are done on the in-RAM copies, which are periodically copied out to disk.


What is not in the inode, that you might expect would be?


Extra Contents of in-RAM Copies of Inodes


Why is this information not kept on disk?

What is the purpose of each of the above data items?

What happens if the system goes down and the in-RAM copies of the inodes are not written out?


Block Buffering

Direct and indirect blocks are also buffered in RAM. With luck, they stay in memory, if there is enough disk cache space.


Block Buffer Lookup Table

Block buffers are looked up using a hash table, with chaining.

The actual buffers are represented in the hash table by their headers.

The "data area" of the buffer header is what actually holds a disk block's worth of data.
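A minimal sketch of this lookup scheme (hash on device and block number, chain within buckets; the free list, locking, and replacement policy are omitted):

```c
#include <stddef.h>

#define NBUCKETS 64

/* Buffer headers are hashed on (device, block number) and chained
   within each bucket. */
struct buf_header {
    int dev, blkno;
    struct buf_header *hash_next;   /* chain within the bucket */
    char data[512];                 /* the disk block's worth of data */
};

static struct buf_header *hash_table[NBUCKETS];

static unsigned bucket(int dev, int blkno) {
    return ((unsigned)dev * 31u + (unsigned)blkno) % NBUCKETS;
}

/* Return the cached buffer for (dev, blkno), or NULL on a cache miss,
   in which case the block would have to be read from disk. */
struct buf_header *buf_lookup(int dev, int blkno) {
    struct buf_header *b = hash_table[bucket(dev, blkno)];
    while (b != NULL && !(b->dev == dev && b->blkno == blkno))
        b = b->hash_next;
    return b;
}

void buf_insert(struct buf_header *b) {
    unsigned h = bucket(b->dev, b->blkno);
    b->hash_next = hash_table[h];
    hash_table[h] = b;
}
```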


What is the purpose of this scheme?

How does allocating more RAM for disk cache affect the performance of the system?

How does this relate to paged virtual memory?


Relationship of inodes, Direct Blocks, and Indirect Blocks


Each file has one inode, which may point to direct and indirect blocks.

How may there be "holes" in a file?

Why might a core dump file have holes?

What are the four segments of a process virtual memory?

How might a sparse file be useful in constructing a database with efficient lookup?

How does the read access time depend on the position of a block in the file?

What kinds of fragmentation-like performance degrading effects is such a filesystem subject to?

How is performance affected by the system's choice of logical block size (as compared to physical block size of the device)?
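The access-time question can be made concrete with a little arithmetic. Assuming the traditional layout of 10 direct pointers followed by one single-, one double-, and one triple-indirect block, this sketch counts how many indirect blocks must be read before reaching the data block for a given byte offset, which is why access cost grows with position in the file:

```c
/* Number of indirect-block reads needed before the data block itself,
   assuming 10 direct pointers, then single-, double-, and
   triple-indirect blocks (the traditional Unix inode layout). */
int indirections(long long offset, long blksize, long ptrs_per_blk) {
    long long blk = offset / blksize;       /* logical block number */
    if (blk < 10) return 0;                 /* direct block */
    blk -= 10;
    if (blk < ptrs_per_blk) return 1;       /* single indirect */
    blk -= ptrs_per_blk;
    if (blk < (long long)ptrs_per_blk * ptrs_per_blk)
        return 2;                           /* double indirect */
    return 3;                               /* triple indirect */
}
```

With 1 KB blocks and 256 pointers per block, the first 10 KB of a file costs no extra reads, while bytes past roughly 64 MB cost two extra reads each (before caching).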


List of Free Inodes

A list of some of the free inodes is kept in the superblock, for efficiency.

When this list is empty, the disk is searched for more free inodes (filetype = 0), starting where the last free inode was found.


List of Free Disk Blocks


Free disk blocks are kept on a linked list, in which each free block is viewed as an array of pointers to other free blocks. (This is not a tree, though. Why not?)

How is this different from the free buffer list?

Is this list on disk or in RAM?


System Crashes


If you have studied databases before, some of the concepts here will already be familiar to you, since the problem here is essentially the same as the transaction problem encountered in the implementation of databases. In fact, since filesystems came before databases, this is where the problem and the foundational solutions originated. The filesystem internal data structures and directory structure of an OS amount to a specific kind of database. Updating the filesystem, with the more complex (and higher performance) forms of filesystems, involves modification to more than one disk block, and while the modifications are in progress the data structures on the disk may not be internally consistent. Since everything else in the system depends on these structures, being able to restore consistency after a system crash is critical.


Approaches to Dealing with Inconsistencies


All of these are concerned mainly with preserving the integrity of the filesystem internal structures, not user data. That is, the objective is to prevent inconsistencies in the directory structure and the lower level structures used to access individual files and distinguish allocated from free space on the disk. Loss of all or portions of individual files is still possible.

With the Unix and NTFS filesystems, a user or application program can improve reliability, if desired, by forcing disk buffer cache flushing (e.g., see fsync API) at critical points.


Unix Disk Error Recovery: fsck

The program fsck checks a filesystem for inconsistencies, and then attempts to repair them.

  1. block belongs to more than one inode
  2. block belongs to an inode and the list of free inodes
  3. block is not on free list and not in a file
  4. non-zero link count but not in any directories
  5. free inode found in directory
  6. in general: more/less directory links than link count
  7. format of inode incorrect
  8. count of free blocks in super block does not match the number on disk
  9. count of free inodes in super block does not match the number on disk

How could each of these situations arise?

How might each of these situations be repaired?

e.g.: Situation (1) could happen if the kernel frees a disk block in a file, returns the block number to the in-RAM copy of the super block, and allocates the block to a new file; the kernel writes the inode and blocks of the new file to disk, but crashes before updating the inode of the old file to the disk. One possible repair would be to duplicate the block and assign the duplicate to one of the inodes. This would probably result in the old file containing a block of garbage, since the data in the block is coming from the new file. An alternative would be to simply delete the block from the least recently modified inode, leaving a hole in the file. This is one of the hardest cases to repair satisfactorily. Some of the other cases are much simpler and more likely to come out right.


Implementation Code View

We do not have time to go into the implementation code in detail. However, you might like to glance at some of the Linux filesystem implementation code. You can download and browse the full Linux source tree, or just look at the following two sample files:


Windows NT/2000 File System Support


Like most modern operating systems, Windows 2000 supports several different types of filesystem. We will look at two of them in more detail.


FAT Filesystem Organization


The File Allocation Table (FAT) keeps track of which blocks are in which file. Two copies of the FAT are kept, for reliability.


FAT Entry Chaining


Directory entry points to first FAT entry of file. Each FAT entry points to next cluster in the file, if any.

The diagram is adapted from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.
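The chain-following logic can be sketched as a walk over an in-memory FAT array. Here FAT_EOF stands in for the real end-of-chain marker, whose actual value differs between FAT12, FAT16, and FAT32:

```c
#define FAT_EOF (-1)   /* stand-in for the real end-of-chain marker */

/* Follow a FAT cluster chain: fat[c] holds the number of the next
   cluster of the file, or FAT_EOF at the last cluster.
   Returns the number of clusters in the file. */
int chain_length(const int fat[], int start) {
    int n = 0;
    for (int c = start; c != FAT_EOF; c = fat[c])
        n++;
    return n;
}
```

Note that reaching cluster N of a file requires following N entries from the start; there is no way to jump directly to the middle of a chain.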


FAT Directory Entry Components


FAT 12 Filesystem


FAT 16 Filesystem


FAT 32 Filesystem


The data are from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.

Additional information on NTFS can be found at http://linux-ntfs.sourceforge.net/ntfs/index.html.


Windows NT/2000 File System (NTFS): Key Features


NTFS Volume & File Structure


NTFS Volume Layout


Per-Volume Metadata Files for NTFS

Position  File
0         $Mft - MFT
1         $MftMirr - MFT mirror
2         $LogFile - log file
3         $Volume - volume file
4         $AttrDef - attribute definition table
5         \ - root directory
6         $Bitmap - volume cluster allocation file
7         $Boot - boot sector
8         $BadClus - bad cluster file
9         $Secure - security settings file
10        $UpCase - uppercase character mapping
11        $Extend - extended metadata directory
12-15     Unused
16        User files and directories

These are implemented as files, but are used to keep information about the filesystem.


Master File Table


File Record Attribute Structure


File Reference Number


How to Handle Longer Files/Attributes


Filename Attribute


File with Short (Resident) Data Attribute


File with Longer (Nonresident) Data Attribute

  • VCN = "Virtual Cluster Number"
  • LCN = "Logical Cluster Number"

This is our "second resort", for files that do not fit entirely into one MFT file record. The size of file that may be represented in this way is limited by the space for run entries in the MFT file record and by the degree of fragmentation of the filesystem's free storage.

If we need more runs than can fit inside one MFT file record, we need to switch over to the "third resort". That is to split up the attributes (including the data attribute header) into more than one MFT record. This is done by adding an "attributes list" attribute. The attributes list attribute contains the name, type code, and MFT file record reference for each attribute of the file.

Note that Microsoft uses the term fragmentation in an extended way, compared with the usage up to now in Stallings. Up to now, we have spoken only of storage being fragmented. MS says a file is fragmented if it is broken into non-contiguous blocks. This is neither "internal fragmentation" nor "external fragmentation" in the senses used by Stallings for storage management. The connection is that if the disk's free storage is fragmented (in the external fragmentation sense), we will be forced to allocate the storage of a file in noncontiguous blocks, rendering the file fragmented (in the MS sense).


Directory - Index Root Attribute


If a directory is small enough, the directory entries are stored directly within the MFT record of the directory. Each directory entry contains:

The duplicated data from the file's own MFT record speeds up directory content listings.

Directory entries are packed, so the number that fit depends on the lengths of the filenames.


Larger Directory - Index Root Attribute


If the directory entries don't all fit in the MFT record of the directory, some are put into index buffers.

The figure is simplified in several respects. The filenames "filen" are just examples, chosen to indicate that the filenames are in lexicographic order. The file entries are packed into the VCNs, so that each VCN can typically hold 20 or 30 of them, rather than just one.

The index allocation attribute of the directory specifies the mapping from VCNs to LCNs that is used to find the index buffers on the disk. The bitmap attribute specifies whether each VCN of an index buffer is in use.


NTFS Has Built-in Support for Indexing

How does this compare with the Unix directory structure described above?


The NTFS directory index structures are B+ trees. They keep the data in sorted order, and provide O(log_k n) access for n directory entries, where the branching factor k (the number of entries per index buffer) is typically between 20 and 30. B+ trees are normally covered in database courses, since they are one of the basic implementation techniques for databases. If you have not seen B+ trees before, you might want to look at http://www.seanster.com/BplusTree/BplusTree.html for a tutorial on them.

The B+ tree indexing scheme can be used for indexing on other attributes, besides filenames. For example, a file can have an $OBJECT_ID attribute. There is an API that allows opening a file using its object ID. There is a mapping from object IDs to file reference numbers in the \$Extend\$ObjId metadata file. This is another B+ tree.


NTFS Crash Recovery


Write-Ahead Logging


Logging is not unique to Windows NT/2000. It has been a feature of mainframe operating systems for years. The newer Unix/Linux filesystems (e.g., Linux ext3fs) also support logging.


Log Record Types


This is a standard technique from databases. The checkpoint records limit the distance it is necessary to go back into the log file, to recover after a crash. In turn, this allows some of the log file storage to be recovered, so that the log file "normally" does not fill up.

However, there is still a possibility of the log file space (which is preallocated) overflowing. This is possible with NTFS. When it is about to happen, the OS essentially shuts down all other filesystem activities (file creation and deletion), and performs an emergency flush operation on the buffer cache. Note that this "short pause for I/O processing" could be horrible for a critical realtime system.


NTFS Recovery Phases


If you are following this, you should be able to answer the following question:

Since the "redo" records contain enough information to perform the real disk I/O operation, why are we "doing the operation twice"? That is, we first write the redo record, and later write the actual disk record. These are obviously redundant. Why can't we just write the real disk record, and do it only once?

© 2002 T. P. Baker & Florida State University. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without written permission. (Last updated by $Author: cop4610 $ on $Date: 2002/09/26 20:06:06 $.)