files.html

COP 4610: OPERATING SYSTEMS & CONCURRENT PROGRAMMING

File Management

Topics

Introduction
OS architecture (software module breakdown and organization)
Design issues for the implementation

General terms & concepts
Field and record structures
File organization & access methods
Directories
File data blocking
Unix filesystem externals
Unix filesystem internals
Windows 2000 filesystem internals

The following notes provide more detail on the Unix filesystem internals than is provided by the textbook. Most of that material comes from M. J. Bach's Design of the Unix Operating System.

The notes also provide more detail on the Unix filesystem than is provided by the textbook. Most of that material comes from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.

Note that the term "blocking" was used under process management in the sense of preventing a thread from executing for some reason. Here, it is also used in another sense, meaning how data is organized into contiguous blocks.

Multiple Views

User (via shell or API) it Implementor Views
- User view
  - abstractions behind OS-supported services
    - abstract objects, attributes, relastionships
    - abstract functions/operations
  - API, shell syntax
- Implementor view
  - software architecture, organization
  - data structures, algorithms, performance effects
Abstract versus Concrete Views
Generalities versus Specifics

As in other chapters, Stallings bounces around between several different viewpoints on files and filesystems, organizing his presentation by areas of functionality.

He tries to be general, most of the time, to explain each of the principles and implementation techniques in its simplest, isolated, form. As a result nothing here exactly matches what is done in a real operating system. Real systems are not so simple or pure. For example, we have already mentioned that in modern Unix and Windows operating systems the same block buffering and disk I/O mechanisms are used for file I/O and the implementation of virtual memory (for page swapping).

File Management

File management subsystem is considered part of the operating system
Applications view data in terms of files:
- Input to applications is by means of a file
- Output is saved in a file for long-term storage
File management subsystem is responsible for implementing this file abstraction, with all its ramifications, for the user
Programmer does not need to develop file management software

Objectives for a File Management System

Meet the data management needs and requirements of the user
Guarantee that the data in the file are valid
Optimize performance
Provide I/O support for a variety of storage device types
Minimize or eliminate the potential for lost or destroyed data
Provide I/O support for multiple users
Provide a standardized set of I/O interface routines
(OS may support several types of hardware devices and several types of filesystems,
but should hide the differences from users)
Organize code in such a way as to share as much as possible

File System in Traditional Unix Architecture

When we looked at the I/O system, we saw how the Unix filesystem fits between the the system call layer and the I/O system. The block buffering system and the device drivers are part of the I/O system.

File System Software Architecture - Windows NTFS Components

This figure shows how the file management subsystem fits into the Windows 2000 system architecture. Notice that the disk cache manager and the virtual memory manager sit between the NFTFS driver and the lower levels of the I/O system. In this respect, Windows 2000 is similar to Linux, Solaris, and other modern Unix systems.

File System Software Architecture - a Simple Model

It is easier to split out and understand the functions of the file management subsystem if we take a simpler view, corresponding to the way operating systems were structured before the introduction of virtual memory.

Device Drivers

Lowest level
Communicates directly with peripheral devices
Responsible for starting I/O operations on a device
Processes the completion of an I/O request

Basic File System

Physical I/O
Deals with exchanging blocks of data
Concerned with the placement of blocks
Concerned with buffering blocks in main memory

Basic I/O Supervisor

Responsible for file I/O initiation and termination
Control structures are maintained
Concerned with scheduling access to optimize performance

Logical I/O

Enables users and applications to access records
Provides general-purpose record I/O capability
Maintains basic data about file

Access Method

Reflect different file structures
Different ways to store and process data

File Management & OS Concerns

The diagram in the Stallings text tries to distinguish the file management system and the "operating system". Since the file management system is part of the operating system, Stallings really means the rest of the operating system.

There is a fuzzy line on the side toward the application program. An operating system may provide more or less support for various access methods and internal file structures, and an application (typically a database managment system) may bypass the OS filesystem entirely, by going directly to the I/O layer API of the OS. For example, Unix provides "raw" disk I/O operations for applications that need direct access. Normally, one should not mix raw access and filesystem access on the same device, just as one should not mix buffered and unbuffered I/O on the same file. (In general, mixing higher and lower level abstractions "voids the warranty" of the higher level abstraction, since the integrity of the implementation of the higher level abstraction depends invariants which are usally not documented and which can be violated by direct calls to the next lower layer.)

File Management Functions

Identify and locate a selected file
Use a directory to describe the location of all files plus their attributes
On a shared system describe user access control
Blocking for access to files
Allocate files to free blocks
Manage free storage for available blocks

Criteria for File Organization

Rapid access

Needed when accessing a single record
Not needed for batch mode

Ease of update

exception: File on CD-ROM will not be updated, so this is not a concern

Economy of storage

Should be minimum redundancy in the data
Redundancy can be used to speed access such as an index

Simple maintenance
Reliability

These principles apply both to the implementation of files, by the OS, and the organization of data within files, by applications.

General Design Considerations for the Filesystem Implementation

Design of filesystem layout is a data structuring problem
Data structures in memory have uniform access time, but structures on disk do not
Disk access times are about 1,000,000 times slower than RAM access times
Sequential access reduces access time
Disk granularity of transfer is larger than RAM
Long block transfers reduce average access time
One class of decisions is what to keep in RAM, versus what to keep only on disk
- structures only on disk
- structures only in RAM
- structures on disk cached in RAM
Another class of decisions is how to lay data out on disk:

sequential (contiguous)
linked
- chained
- indexed

Factors to consider in disk layout include:
- speed of access (sequential, random)
- speed of allocation (contiguous, linked)
- wasted storage (fragmentation)
- robustness to system crashes

We will look at some of these issues in more detail, and finally look at the Unix and Windows 2000 filesytem data structures and algorithms. While reading about these specific systems you should think about how each of the above design factors has been taken into account, what trade-offs that have been made, and the degree to which the design works well for various kinds of access.

This study of filesystem implementation design is useful for more than understanding operating systems. The issues and principles here apply also to many other computer applications, and exemplify some important fundamental engineering design principles.

Terms Used with Files

Field

Basic element of data
Contains a single value
Characterized by its length and data type

Record

Collection of related fields
Treated as a unit
Example: employee record

File

Collection of similar records
Treated as a single entity
Have unique file names (maybe)
May restrict access

Database

Collection of related data
Relationships exist among elements
Accessed by multiple applications, via database management software

Volume

Collection of files, structured as a filesystem
Usually a disk partition
May also be a tape

Note that the above is the traditional view of a file. It applies to what we might call a "normal", "regular" or "real" file.

Real Files versus Other Non-File Objects

In the filesystem namespace
- /dev - device descriptors
- /proc - process state information
- fifos
- Unix semaphores
- etc.
Outside the filesystem namespace
- sockets
- pipes
- POSIX semaphores

In the context of Chapter 12 of Stallings and in these notes, the term "file" is used to refer to a disk file, unless we say something specific to the contrary.

For uniformity and convenience in naming objects, operating systems have come to use the operating sytem's filesystem name space to hold many other kinds of objects, that do not fit the conventional description of a file. For some examples, see the list given here, and the discussion of the Unix filesystem, further below.

The operating system may also treat as "files" other objects (that don't necessarily even have names), if it makes sense to perform similar operations on them to the ones we normall perform on files, such as opening, closing, reading, and writing.

This follows the object-oriented mode of thinking, in which ojects are classified according to the operations (a.k.a. "methods") that can be applied to them. In this sense, "files" actually need to be divided up into a variety of classes, and traditional files form just one sub-class. The actual classification system will, of course, depend on the specific operating system.

Handling Variable-Length Objects & Sub-Objects

Length fields
Delimiter (a.k.a. terminator, demarcation symbol, or sentinel)

This is knowledge you should already have, from a data structures course. If you do not already know this, it is time to learn it now.

Incidental to the notion of fields and records is the mechanics of how they are accessed, including the treatment of variable lengths and variations in format. In files, it is generally assumed that a record will be laid out sequentially. In the simplest case, the layout of fields within a record is entirely static. The fields are addressed by static offsets, and their lengths are also static. Where variable lengths are involved, the length of a variable-length field must itself be stored as a field, or a delimited must be used. Where multiple variable length fields occur in a record, additional fields (with static offsets) are required to give the offsets of the variable-length fields of the record. In the more general case, the record may have multiple variants, in which case one or more tag fields are provided, to identify the variant.

Variable Length Fields

Here we do not mean "variable" in the sense that the size of a given field may be changed after it is written. We mean that the size of the field may vary, from one record to another.

The size of a field may be specified explicitly, or specified implicitly by means of a terminator. Neither one necessarily takes less space. Using a terminator rules out the terminator as a valid data value (unless we are prepared to encode or pad all data values using extra bits). Therefore, the explicit size approach is preferred.

Variable Size Records

When a record contains variable-size fields, the data of the variable length fields is placed at the end of the record. For each field (fixed or variable size) there is a fixed part. For the fixed size fields, the fixed part contains the data. For the variable size fields, the fixed part contains the information needed to locate the data. This includes at least an offset. It may also include a size.

A More Compact Way

If the data parts of variable-sized fields are always laid out in the same order as the fixed parts, one can eliminate the size or terminator, by using the next offset.

Variant Records

If we want to implement a class of record that may have different sets of fields, we need a way to distinguish between the different record variants (types of object within the class). This is done by starting each record with a tag, i.e., an integer value that identifies the type of the record. The tag value tells us which of a finite number of known field layout formats should be applied to the record. Another word for a tag is a "type ID".

By the way, all of the preceding can be applied to record structures in memory as well as to record structures on files. For objects in memory, we can go a bit further. If the object is a full object in the OOP sense, some fields may be pointers to functions (methods) that apply to the object. Alternatively, the method pointers may belong to a separate object (a class descriptor), that can be looked up using the tag or an explicit pointer within the object.

Typical File Operations

Retrieve_All
Retrieve_One
Retrieve_Next
Retrieve_Previous
Insert_One
Delete_One
Update_One
Retrieve_Few

This taxonomy of file operations best fits a structure view of a file as a sequence of records, as compared to the Unix/Linux view of a file as a sequence of bytes.

File Organization

Pile
Sequential File
Indexed Sequential File
Indexed File
Direct, or Hashed File

The material on file structures (coming up next) is important for everyone to know. It has more to do with application programming than operating systems. That is, in modern operating systems you are likely to see, the internal organization (sequential, indexed sequential, etc.) of ordinary files is up to the application, though the system may provide libraries to support various access methods. You should see this material again whenever you take a course on databases, if the course deals with database internals.

This material is relevant to the study of operating systems in that it motivates the design choices regarding file storage allocation and layout. That is, the operating system chooses how files are laid out physically on devices. How files are laid out physically will in turn determine how efficiently the various access methods can be using the operating system. At the extreme, high performance database systems bypass the operating system's file implementation entirely, since they can get better performance by designing their own layout of data on disk.

The "Pile"

Data are collected in the order they arrive
Purpose is to accumulate a mass of data and save it
Records may have different fields
No file structure
Records must self-describing
Description techniques include:
- record and field delimiters
- record and field tags
- record and field lengths
- record and field names
Record access is by exhaustive search

Sequential File

Fixed format used for records (normally)
Records are the same length (normally)
All fields the same (order and length) from one record to the next
Field names and lengths are attributes of the file
One field is the key field
Uniquely identifies the record
Records are stored in key sequence
New records are placed temporarily in a log file or transaction file
Batch update is performed to merge the log file with the master file

Indexed Sequential File

Index provides a lookup capability to quickly reach the vicinity of the desired record
Contains key field and a pointer to the main file
Indexed is searched to find highest key value that is equal or less than the desired key value
Search continues in the main file at the location indicated by the pointer
New records are added to an overflow file
Record in main file that precedes it is updated to contain a pointer to the new record
The overflow is merged with the main file during a batch update
Multiple indexes for the same key field can be set up to increase efficiency

You should be able to see how this concept led naturally to hierachical index structures with still faster access times, such as the B-tree.

Comparison of sequential and indexed sequential

Example: a file contains 1 million records
On average 500,000 accesses are required to find a record in a sequential file
If an index contains 1000 entries, it will take on average 500 accesses to find the key, followed by 500 accesses in the main file. Now on average it is 1000 accesses

Indexed File

Uses multiple indexes for different key fields
May contain an exhaustive index that contains one entry for every record in the main file
May contain a partial index

Direct or Hashed File

Directly access a block at a known address
Key field required for each record
Commonly used for high-performance databases

File Directories

Directory contains information about files

Attributes
Location
Ownership

Directory itself is a file owned by the operating system
Provides mapping between file names and the files themselves

Basic directoy concepts should already be very familiar to you. We will summarize the evolution to the present commonly accepted practice in operating systems, of providing a tree-structured (hierarchical) directory namespace.

Simple Structure for a Directory

List of entries, one for each file
Sequential file with the name of the file serving as the key
Provides no help in organizing the files
Forces user to be careful not to use the same name for two different files

Two-level Scheme for a Directory

One directory for each user and a master directory
Master directory contains entry for each user

Provides address and access control information

Each user directory is a simple list of files for that user
Still provides no help in structuring collections of files

This two-level model appeared in some early operating systems, but has been left behind by modern systems.

Hierarchical, or Tree-Structured Directory

Master directory with user directories underneath it
Each user directory may have subdirectories and files as entries
Files can be located by following a path from the root, or master, directory down various branches
This is the pathname for the file
Can have several files with the same file name as long as they have unique path names
Current directory is the working directory
Files are referenced relative to the working directory

Hierarchical, or Tree-Structured Directory: More Detailed View

This picture differs from the Unix directory system in an important respect. Can you identify it?

Access Rights

None

User may not know of the existence of the file
User is not allowed to read the user directory that includes the file

Knowledge

User can only determine that the file exists and who its owner is

Execution

The user can load and execute a program but cannot copy it

Reading

The user can read the file for any purpose, including copying and execution

Appending

The user can add data to the file but cannot modify or delete any of the file's contents

Updating

The user can modify, deleted, and add to the file's data. This includes creating the file, rewriting it, and removing all or part of the data

Changing protection

User can change access rights granted to other users

Deletion

User can delete the file

Owner

Has all rights previously listed
May grant rights to others using the following classes of users

Specific user
User groups
All for public files

The notion of access rights should already be fairly familiar to you, from your experience using files in operating systems.

Controlling Simultaneous Access

Exclusive: user locks entire file when it is to be updated
Shared: user locks individual records during the update
Mutual exclusion and deadlock are issues for shared access
(see the Readers-Writers problem)

You should be familiar with some Unix API operations for synchronizing file access. What are they? For more detail see the discussion of Unix specifics, further below.

Data Blocking

How are (logical) records broken up into blocks?
Should blocks be fixed or variable length?
Fixed length is most common:
- simplifies I/O, buffer allocation, organization on secondary storage
- only tape files are likely to use variable length blocking
Size of block, relative to record size?
- bigger means fewer I/O operations, if access is sequential, or otherwise local
- larger means larger buffers (more total memory) needed

Fixed Blocking

We may have some wasted space, due to inexact fit (internal fragmentation). However, we can calculate the position of a block simply, and we only need a single size of buffer.

By the way, the main context here is disk files. However, the discussion of blocking alternatives applies to both disk and tape storage. Due to the much slower access time of tape devices, blocking decisions have more impact on tape than on disk. Random access on tape is rather impractical, but even sequential access is inefficient if the tape must be stopped and restarted between records. If the tape is stopped, it needs to be backed up before restarting, to get the tape back up to speed again. For optimium performance, the tape should be able to "stream", i.e., the CPU should process the data at least as fast as it can come in (or generate the data at least as fast as it can be written, in the case of output), so that the tape drive never stops. If this cannot be achieved, we would at least like to keep to a minimum the number of stops and restarts. Longer tape blocks generally are better in this respect. On the other hand they require more buffer space. We must have room for at least two buffers.

Variable Blocking: Spanned

This allows variable-sized blocks. It avoids wasted space, by splitting logical records. A price is complexity in buffering, and sometimes multiple I/O operations to access a single record.

Variable Blocking: Unspanned

If record sizes are small enough, unspanned blocking is possible. This solves the problem of extra I/O operations, at the cost of internal fragmentation.

Secondary Storage Management

Space must be allocated to files
Someone must keep track of the space available for allocation

Allocation of storage to files is an OS function. The degree to which the user is required to provide information or influence the allocation may vary. Earlier operating systems invovled the user more. Real time operating system still provide a high degree of user control over file allocation, in order to provide more predictability and more efficiency. For example, a user may preallocate contiguous extents of storage, to allow for faster sequential access and to prevent the possibility of time delays for disk storage allocation during the operation of the application.

Preallocation

Need the maximum size for the file at the time of creation
Difficult to reliably estimate the maximum potential size of the file
User tends to overestimate file size so as not to run out of space
Still important for real time operating systems, for
- predictable I/O access time
- no storage allocation delays

Methods of File Allocation

Contiguous allocation
Chained allocation
Indexed allocation

Contiguous Allocation

Single set of blocks is allocated to a file at the time of creation
Only a single entry in the file allocation table:
(starting block, length of the file)
External fragmentation will occur,
requiring periodic compaction

Contiguous File Allocaton (After Compaction)

Chained Allocation

Allocation on basis of individual block
Each block contains a pointer to the next block in the chain
Only single entry in the file allocation table:
(starting block, length of file)
No external fragmentation
Best for sequential files
No accommodation of the principle of locality
Periodic consoliation improves performance

Chained Allocation (after Consolidation)

Indexed Allocation with Block Portions

File allocation table contains a separate one-level index for each file
The index has one entry for each portion allocated to the file
The file allocation table contains block number for the index

Indexed Allocation with Variable-Length Portion

Free Space Management (Disk Allocation Table)

bit table/vector
- one bit per block
- good for finding contiguous groups of blocks
- needs to be resident in memory for scanning
- problems with memory size and slow scanning time if filesystem is large
chained free portions
- good for contiguous groups of blocks
- fragmentation a problem, without compaction
indexing
- treats free space as a file with index table
- index points to free portions
free block list
- list of free block numbers stored on disk (maybe as a file)
- only head and tail need to be resident (buffered) in memory

Unix File System Views

Like the other parts of an operating system, the filesystem can be viewed from several different perspectives.

External/User Views
- Abstract User View
- Shell View
- API View
Internal/Implementor Views
- Abstract Internal View
- Implementation Code View

By now, you should all be very familiar with the user views. We will review them briefly in the following notes, but try to move on quickly to the implementation views.

Abstract View: Concepts and Terms

The starting point for users is a abstract view. One way to approach this is via terminology, naming the components and their relationships.

file
file attributes:
- last modification date, last access date
- owner-user, owner-group
- permission bits
- size

Abstract View: Directories

are files
contain named entries
collectively define the filesystem namespace
entries come in different kinds:

ordinary link: refers to a file
(a file may be referred to by more than one link)
symbolic (soft) link: gives a path
(may or may not have a corresponding file)

The filesystem namespace is also used to keep track of other kinds of objects that do not otherwise behave like files. For example, there can be named semaphores and named shared memory objects.

Abstract View: Visualization of Directories

Abstract View: Visualization of Filesystem Namespace

File Sharing

In multiuser system, we want to allow files to be shared among users
Two issues

Access rights/permissions
Management of simultaneous access

Unix systems specify file access permissions via permission bits associated with a file.

File Access Permissions

owner-user: rwx
members of owner-group: rwx
other users: rwx
set-user-id
set-group-id
if file is executed as a program it takes on userid of file's owner-user
set-group-id
if file is executed as a program it takes on groupid of file's owner-group
sticky bit
prevents deletion of file by non-owner, via write permission on file

Unix systems specify file access permissions via permission bits associated with a file.

These can be modified, usign the chmod() system call.

The group permission bits apply (only) to the members of the owner-group of the file other than the user-owner.

The other-user permission bits apply (only) to the users other than the user-owner and the members of the owner-group of the file.

Try using the shell command ls -la to look at the permnission bits and owners of various files.

Unix Management of Simultaneous Access to Files

File locking (advisory)
File record locking (advisory)

"Advisory" means that it is up to the application to honor file locks. The OS keeps track of who has what locked, and informs the application, but does not prevent a process from accessing a file that is locked by another process.

The Solaris (SunOS) man-pages for chmod and fcntl aslo mention "mandatory locking". Can you find out more about what this is?

Abstract View: Types of Files

Ordinary/Regularfile

Contains text or binary data.
Executable if the "execute bit" set.
"Magic number" inside the file,
or filename extension of link (e.g. ".c") sometimes used to guess contents.

Directory

Contains links to other files, including directories

Named pipe

A FIFO buffer

Device special file

An interface to a hardware device.
Has a major and a minor device number.
Classified as:

Block special: accessed by data blocks
Character special: accessed character-by-character

Shell View

Ordinary users can view and manipulate files using shell commands, like the following:

> pwd
/faculty/baker

> ls -l .
total 20
drwxr-xr-x    2 baker    fac          4096 Jan  6 11:25 bin
-rw-------    1 baker    webmast        32 Apr  9 14:34 dead.letter
drwxr-xr-x   17 baker    fac          4096 Apr  8 05:20 public_html
-rw-r--r--    1 baker    fac            23 Jan  9 05:17 setup
drwxr-xr-x    3 baker    fac          4096 Jan  6 15:43 tools

rm dead.letter

Try using ls to look at some non-file objects in the feilsystem namespace, e.g.

[baker@ada]$ ls -lat /dev/hda
brw-rw----    1 root     disk       3,   0 Apr 11  2002 /dev/hda

[baker@ada]$ ls -lat /proc/27410
total 0
dr-xr-xr-x    3 baker    wheel           0 Nov 11 14:58 .
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 cmdline
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 cpu
lrwxrwxrwx    1 baker    wheel           0 Nov 11 14:58 cwd -> /home/baker
-r--------    1 baker    wheel           0 Nov 11 14:58 environ
lrwxrwxrwx    1 baker    wheel           0 Nov 11 14:58 exe -> /usr/local/bin/ssh2
dr-x------    2 baker    wheel           0 Nov 11 14:58 fd
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 maps
-rw-------    1 baker    wheel           0 Nov 11 14:58 mem
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 mounts
lrwxrwxrwx    1 baker    wheel           0 Nov 11 14:58 root -> /
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 stat
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 statm
-r--r--r--    1 baker    wheel           0 Nov 11 14:58 status

API View

You have learned how to manipulate files from application programs, using sytem calls, like the following:

open
close
read
write
link
unlink

Example: Unix API for Synchronizing Access to Files

flock (int fd, int operation)
- advisory
- entire file
- shared (LOCK_SH) or exclusive (LOCK_EX)
lockf (int fd, int mcd, off_t len);
- advisory
- a portion of a file (specified by offset and length)
- exclusive lock, only
fcntl (int fd, int cmd, struct flock *lock)

advisory
all or a portion of a file
shared or exclusive
can find out user and process id of lock holder

Filesystems & Mounting

volume containing filesytem must be mounted to access its files
the filesytem on the volume is grafted into the system directory tree
mount operation specifies:
- identity of physical volume (device family, device number, partition)
- location where the filesystem should be inserted into the system directory hierarchy
options controlling how the filesystem is to be accessed, e.g.,
- what type of filesystem
- accessibility: read-only, read-write, etc.

The concept of mounting a filesystem comes from magnetic tapes. A tape reel is called a volume. In order to access the contents, the reel must be mounted. After the physical mounting of the reel on the tape drive, by the operator, the operating sytsem logically mounts the volume. The logical mounting involves initializing system data structures that identify the drive as being on line and identify the volume that resides on the drive. The mounting operation is the analog for an entire filesystem of the opening operation on a single file.

Partitions of fixed disk drives are always physically mounted, but they must be logically mounted by the file management system before the operating system can access the files on them. In the Unix and Windows 2000 operating systems, mounting a disk volume has the effect "grafting" it into the filesystem directory tree. The "root" directory of the volume is identified with an existing directory in the filesytem hierarchy. After that, software that uses the directory system can access the contents of the volume transparently, without being aware of the boundaries between volumes, or transitions between different types of filesystems.

Filesystems & Mounting

#include <sys/types.h>
#include <sys/mount.h>
#include <sys/mntent.h>

int mount
  (const char *spec, /* path to a device (block special file) */
   const char *dir,  /* path to the directory on which to mount */
   int mflag,        /* bit vector to select mounting options */
   char *fstype,     /* type of filesystem */
   char *dataptr,    /* filesystem-specific data (optional) */
   int datalen,    
   char *optptr,     /* filesystem-specific options (optional) */
   int optlen);

The mount operation can be called via a system API call. The interface shown here is for Solaris (SunOS). The API for Linux is slightly different. (This is not an area standardization has reached yet). On all Unix systems there is also a utility program, also called mount, that can be called from a shell to perform the same operation. Obviously, these are privileged operations.

Observe that this operation is "polymorphic" or "overloaded", for all the different types of filesystems. Certain parameters are common to all members of the "class" filesystem. Other parameters are specific to the particular type of filesystem.

Generic Filesystem Mounting Flags

MS_DATA - look at dataptr argument for filesystem-specific data
MS_OPTIONSTR - look at optptr argument for filesystem-specific options
MS_READONLY - mount for reading only (Solaris, Linux)
MS_NOSUID - do not allow set-user-ID and set-group-ID programs to execute (Solaris, Linux)
MS_REMOUNT - remounts a read-only file system as read-write (Solaris, Linux)
MS_OVERLAY - allow mounting over an already-mounted filesystem (Solaris)
(makes the other system inaccessible)
MS_GLOBAL - mount filesystem globally if on a cluster (Solaris)
MS_NODEV - no access to device special files (Linux)
MS_NOEXEC - no program execution (Linux)
MS_SYNCHRONOUS - write are syncyhronized at once (Linux)
MS_MANDLOCK - allow mandatory locks (Linux)
MS_NOATIME - not not update access times (Linux)
MS_BIND - bind subtree elsewhere (Linux)
MS_MOVE - atomically move subtree (Linux)

Filesystem Types Supported by Linux

adfs	affs	autofs	coda	coherent	cramfs
devpts	efs	ext	ext2	ext3	hfs
hpfs	iso9660	jfs	minix	msdos	ncpfs
nfs	ntfs	proc	qnx4	reiserfs	romfs
smbfs	sysv	tmpfs	udf	ufs	umsdos
vfat	xenix	xfs	xiafs

You can see that there are many filesystem types. You may recognize the traditional DOS FAT filesystem type (msdos), the Windows 95 filesytem type (vfat), the Windows NTFS filesystem type (ntfs), the Network File System (NFS) type (nfs), the proc filesystem type (proc), the ISO CD-ROM filesytem type (iso9660), and the native Linux filesystem types (ext, ext2, ext3).

Linux hides differences between filesystem types by a virtual filesystem (VFS) layer.

Linux System Configuration Files for Mounting

linux>more /etc/mtab
/dev/hda5 / ext3 rw 0 0
none /proc proc rw 0 0
/dev/hda1 /boot ext3 rw 0 0
none /dev/pts devpts rw,gid=5,mode=620 0 0
none /dev/shm tmpfs rw 0 0
/dev/hda8 /www/vhosts reiserfs rw 0 0

linux>more /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
LABEL=/boot             /boot                   ext3    defaults        1 2
none                    /dev/pts                devpts  gid=5,mode=620  0 0
none                    /proc                   proc    defaults        0 0
none                    /dev/shm                tmpfs   defaults        0 0
/dev/hda2               swap                    swap    defaults        0 0
/dev/sda1		swap			swap	defaults	0 0
/dev/cdrom              /mnt/cdrom              iso9660 noauto,owner,kudzu,ro 0 0
/dev/fd0                /mnt/floppy             auto    noauto,owner,kudzu 0 0
#/dev/sda2		/www			reiserfs	defaults 0 0
/dev/hda8		/www/vhosts		ext3	defaults	1 1

At system initialization time, the file /etc/fstab is read to determine which filesystems to mount. Each row of the table provides a set of parameters to the mount operation. The system file /etc/mtab keeps track of which files are currently mounted.

Abstract Internal View

The starting point for understanding how the filesystem is implemented is an abstract view of the implementation.

Traditional Unix Filesystem (Volume) Structure

Each filesystem occupies a continguous range of disk sectors, called a partition, and is organized internally as follows:

For fault tolerance, redundant copies of the superblock are stored, on different cylinders and different platters of the disk drive. This reduces the chance that a disk media failure will result in corruption of the entire file system.

Contents of Superblock

the size of the filesystem
the number of free blocks in the file system
a list of free blocks available on the file system
the index of the next free block in the free block list
the size of the inode list
the number of free inodes in the file system
a list of free inodes in the file system
the index of the next free inode in the free inode list
lock fields for the free block and free inode lists
a flag indicating that the super block has been modified

Contents of an Inode

file access permissions
file owner IDs (individual and group)
file type (regular, directory, character special, block special, FIFO)
file access times (file modified, file accessed, inode modified)
incoming link count
table of contents (disk addresses of direct and indirect blocks -- see below)
file size

Copies of the inodes of a mounted filesystem are kept in RAM, for efficiency. Updates are done on the in-RAM copies, which are periodically copied out to disk.

What is not in the inode, that you might expect would be?

Extra Contents of in-RAM Copies of Inodes

status of in-RAM copy
- locked?
- process waiting for unlock?
- differs from disk copy?
- mount point?
logical device number of filesystem containing file
inode number
pointers to other in-RAM inodes
- next inode in free list
- next inode in hash queue
reference count, indicating number of instances of the file that are active

Why is this information not kept on disk?

What is the purpose of each of the above data items?

What happens if the system goes down and the in-RAM copies of the inodes are not written out?

Block Buffering

Direct and Indirect blocks are also buffered in RAM. When we are lucky they also stay in memory, if there is enough disk cache space.

Block Buffer Lookup Table

Block buffers are looked up using a hash table, with chaining.

The actual buffers are represented in the hash table by their headers.

The "data area" of the buffer header is what actually holds a disk block's worth of data

What is the purpose of this scheme?

How does allocating more RAM for disk cache affect the performance of the system?

How does this relate to paged virtual memory?

Relationship of inodes, Direct Blocks, and Indirect Blocks

Each file has one inode, which may point to direct and indirect blocks.

How may there may be "holes" in a file?

Why might a core dump file have holes?

What are the four segments of a process virtual memory?

How might a sparse file be useful in constructing a database with efficient lookup?

How does the read access time depend on the position of a block in the file?

What kinds of fragmentation-like performance degrading effects is such a filesystem subject to?

How is performance affected by the system's choice of logical block size (as compared to physical block size of the device)?

List of Free Inodes

A list of some of the free inodes is kept in the superblock, for efficiency.

When this list is empty, the disk is searched for more free inodes (filetype = 0), starting where the last free inode was found.

List of Free Disk Blocks

Free disk blocks are kept on a linked list, in which each free block is viewed as an array of pointers to other free blocks. (This is not a tree, though. Why not?)

How is this different from the free buffer list?

Is this list on disk or in RAM?

System Crashes

Typical system crash causes: power failure, OS software defect
Some I/O operations in progress may not complete
Disk contents may be left in inconsistent state
e.g., directory entry for file is removed, but file space is not yet moved to the free list
OS must recover without serious loss of data
For Unix, this is done by the fsck utility

If you have studied databases before, some of the concepts here will already be familiar to you, since the problem here is essentially the same as the transaction problem encountered in the implementation of databases. In fact, since filesystems came before databases, this is where the problem and the foundational solutions originated. The filesystem internal data structures and directory structure of an OS amount to a specific kind of database. Updating the filesystem, with the more complex (and higher performance) forms of filesystems, involves modification to more than one disk block, and while the modifications are in progress the data structures on the disk may not be internally consistent. Since everything else in the system depends on these structures, being able to restore consistency after a system crash is critical.

Approaches to Dealing with Inconsistencies

"Write through" file systems
- MSDOS FAT is an example
- filesytem structure data is immediately written to disk
- filesystem itself is always consistent
"Careful write" file systems (e.g., Digital VAX VMS)
- I/O is buffered, to improve efficiency
- disk write operations ordered carefully
- system crashes produce predictable, noncritical inconsistencies
- filesystem can fix any inconsistencies, without urgency
- suboperations need to be written to disk serially, FIFO
"Lazy write" file systems (e.g., OS/2 HPFS, older Unix systems)
- I/O buffering is further optimized
- file modifications are written to buffer cache in one order
- operations are reordered, to optimize disk performance
- performance is gained, at the cost of recovery
"Recoverable" file systems (e.g., NTFS, ext3, Reiser)
- uses logging/journaling techniques, originally developed for transaction processing

All of these are concerned mainly with preserving the integrity of the fileystem internal structures, not user data. That is, the objective is to prevent inconsistencies in the directory structure and the lower level structures used to access individual files and disguish allocated from free space on the disk. Loss of all or portions of individual files is still possible.

With the Unix and NTFS filesytems, a user or application program can improve reliability, if desired, by forcing disk buffer cache flushing (e.g., see fsync API) at critical points.

Unix Disk Error Recovery: fsck

The program fsck checks a filesystem for inconsistencies, and then attempts to repair them.

block belongs to more than one inode
block belongs to an inode and the list of free inodes
block is not on free list and not in a file
non-zero link count but not in any directories
free inode found in directory
in general: more/less directory links than link count
format of inode incorrect
count of free blocks in super block does not match the number on disk
count of free inodes in super block does not match the number on disk

How could each of these situations arise?

How might each of these situations be repaired?

e.g.: Situation (1) could happen if the kernel frees a disk block in a file, returns the block number to the in-RAM copy of the super block, and allocates the block to a new file; the kernel writes the inode and blocks of the new file to disk, but crashes before updating the inode of the old file to the disk. One possible repair would be to duplicate the block and assign the duplicate to one of the inodes. This would probably result in the old file containing a block of garbage, since the data in the block is coming from the new file. An alternative would be to simply delete the block from the least recently modified inode, leaving a hole in the file. This is one of the hardest cases to repair satisfactorily. Some of the other cases are much simpler and more likely to come out right.

Implementation Code View

We do not have time to go into the implementation code in detail. However, you might like to glance at some of the Linux filesystem implementation code. You can download and browse the full Linux source tree, or just look at the following two sample files:

include/LInux/fs.h includes the declaration of the inode structure.
kernel/fs/inode.c contains functions that are used to manipulate inodes, such as update_atime, the function that updates the last access time of a file.

Windows NT/2000 File System Support

CDFS - for CD-ROM (legacy, superseded by UDF)
UDF - Universal Disk Format
FAT12, FAT15, FAT32 - legacy formats from DOS
classified by number of bits used to represent a cluster number
- used for floppy and hard disks
- FATn uses n bits to identify a cluster
- cluster ~= block (a multiple of the device sector size)
NTFS - the Win2K native filesystem

Like most modern operating systems, Windows 2000 supports several different types of filesystem. We will look at two of them in more detail.

FAT Filesystem Organization

Two copies of the FAT are kept, for reliability. File Allocation Table (FAT) keeps track of which blocks are in which file.

FAT Entry Chaining

Directory entry points to first FAT entry of file. Each FAT entry points to next cluster in the file, if any.

The diagram is adapted from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.

FAT Directory Entry Components

basic directory entry contains
- file name
- size
- starting cluster
- time stamps (last accessed, created, etc.)
optional longer-format filename

FAT 12 Filesystem

limited to 4096 clusters per partition
with maximum cluster size of 8 KB, max volume size is 34 MB
default format for floppy disks

FAT 16 Filesystem

limited to 2¹⁶ (65,536) clusters per partition
with maximum cluster size of 64 KB, max volume size is 4 GB
Default FAT16 Cluster Sizes

Volume Size (MB) Cluster Size (KB)

0-32 1/2

33-64 1

65-128 2

129-256 4

257-511 8

512-1023 16

1024-2047 32

2048-4095 64

Volume Size (MB)	Cluster Size (KB)
0-32	1/2
33-64	1
65-128	2
129-256	4
257-511	8
512-1023	16
1024-2047	32
2048-4095	64

FAT 32 Filesystem

Can handle larger numbers of clusters
Root directory is not stored at a predefined location on the volume, and does not have upper limit on its size
Boot sector is duplicated
Max file size is still 4 GB
Default Cluster Sizes

Volume Size Cluster Size (KB)

32 MB - 8 GB 4

8 GB - 16 GB 8

16 GB - 32 GB 16

32 GB 32

Volume Size	Cluster Size (KB)
32 MB - 8 GB	4
8 GB - 16 GB	8
16 GB - 32 GB	16
32 GB	32

The data are from Inside Microsoft Windows 2000 (3rd Edition), by Microsoft Press.

Additional informatoin on NFTS can be found at http://linux-ntfs.sourceforge.net/ntfs/index.html.

Windows NT/2000 File System (NTFS): Key Features

Recoverability
Security
Large disks and large files
Multiple data streams
General indexing facility - essentially, a built-in database engine

NTFS Volume & File Structure

Sector: normally 512 bytes
Cluster: basic unit of allocation
- size 2^k, some power of two
- contiguous set of sectors
- determines volume size: bigger cluster <:-> bigger volume
Volume: logical partition on disk
- filesystem information
- collection of files
- additional unallocated space

NTFS Volume Layout

Partition boot sector
- boot info & code
- volume layout & file system strucuturs
MFT (Master File Table)
- list of all contents
- organized as set of rows in relational database
- each row describes a folder
System Files
- MFT2 (mirror of 1st three rows of MFT), for tolerating single-sector failure
- Log file - list of transaction steps, for recovery
- Cluster bit map - shows which clusters are in use
- Attribute definition table - shows the types of attributes supported on this volume, whether they can be indexed, and whether they can be recovered

Per-Volume Metadata Files for NTFS

Position	File
0	$Mft - MFT
1	$MftMirr - MFT mirror
2	$LogFile - log file
3	$Volume - volume file
4	$AttrDef - attribute definition table
5	\ - root directory
6	$Bitmap - volume cluster allocation file
7	$Boot - boot sector
8	$BadClus - bad cluster file
9	$Secure - security settings file
10	$UpCase - uppercase character mapping
11	$Extend - extended metadata directory
12	Unused
13	Unused
14	Unused
15	Unused
16	User files and directories

These are implemented as files, but are used to keep information about the filesystem.

Master File Table

an array of file records, each 1KB in size
each file record can describe a file
if file is small, all of it may fit in one file record
each record contains, at least:
- "standard information", e.g., read-only, archive, time stamps, hard link count
- $FILE_NAME - may have multiple filename attributes
  - e.g., long NTFS file name, and short name autogenerated for access by MS-DOS
- $data - the "content" of the file

File Record Attribute Structure

NTFS I/O reads and writes attribute streams
certain "standard attributes" are predefined by the OS
other attributes may be defined by the user
streams within a file may be identified by type, with optional name if there is more than one stream of a type
an attribute stream is Microsoft's terminology for what appears to be a field (potentially variable-length):
- a numeric "type code" (a tag) identifies the type of the attribute
  attributes within an MFT file record are ordered by "type code"
- "value" of attribute is its byte stream
- name of attribute (optional)
e.g., the value of the $DATA attribute of a file is whatever data bytes the user stored in the file

File Reference Number

used to identify a file with a volume (same purpose as a Unix "inode" number)
consists of a pair: (file number, sequence number)
file number is position in MFT of the file record
sequence number distinguishes reuses of file records,
for internal consistency checks

How to Handle Longer Files/Attributes

every attribute begins with a header
attributes may be nonresident (decided by the filesystem, not the user)
header is always resident, and contains:
- if attribute is resident: (offset from header to value, length of value)
- if attribute is nonresident: "info needed to locate the value on disk"
only two standard attributes can grow: (1) data; (2) attribute list

Filename Attribute

File with Short (Resident) Data Attribute

File with Longer (Nonresident) Data Attribute

VCN = "Virtual Cluster Number"
LCN = "Logical Cluster Number"

This is our "second resort", for files that do not fit entirely into one MFT file record. The size of file that may be represented in this way is limited by the size of the file and the degree of fragmentation of the filesystem's free storage.

If we need more runs than can fit inside one MFT file record, we need to switch over to the "third resort". That is to split up the attributes (including the data attribute header) into more than one MFT record. This is done by adding an "attributes list" attribute. The attributes list attribute contains the name, type code, and MFT file record reference for each attribute of the file.

Note that Microsoft uses the term fragmentation in a extended way from the usage up to now in Stallings. Up to now, we have spoken only of storage being fragmented. MS says a file is fragmented if it is broken into non-contiguous blocks. This is neither "internal fragmentation" nor "external fragmentation" in the senses used by Stallings for storage management. The connection to fragmentation is that if the disk's free storage is fragmented (external fragmentation sense), we will be forced to allocate the storage of a file in noncontiguous blocks, rendering the file fragmented (in the MS sense).

Directory - Index Root Attribute

If a directory is small enough, the directory entries are stored directly within the MFT record of the directory. Each directory entry contains:

filename
file reference to the file's MFT record
time stamps, size (duplicated from the MFT record)

The duplicate data from the MFT record is speeds directory content listing.

Directory entries are packed, so the number that fit depends on the lengths of the filenames.

Larger Directory - Index Root Attribute

If the directory entries don't all fit in the MFT record of the directory, some are put into index buffers.

The figure is simplified in several respects. The filenames "filen" are just examples, chosen to indicate that the filenames are in lexicographic order. The file entries are packed into the VCNs, so that each VCN can typically hold 20 or 30 of them, rather than just one.

The index allocation attribute of the directory specifies the mapping from LCN's to VCN's, that is used to find thei index buffers on the disk. The bitmap attribute specifies whether each VCN of an index buffer is in use.

NTFS Has Built-in Support for Indexing

Filename is just one kind of file attribute that can be used as a key for indexing
Optimized for large directories
Uses B+ tree data structure
Can be searched much faster than a linear structure can be scanned

How does this compare with the Unix directory structure described above?

The NTFS directory index structures are B+ trees. They keep the data in sorted order, and provide O(log_kn) access, where n is between 20 and 30. B+ trees are normally covered in database courses, since they are one of the basic implementation techniques for databases. If you have not seen B+ trees before, you might want to look at http://www.seanster.com/BplusTree/BplusTree.html for a tutorial on them.

The B+ tree indexin scheme can be used for indexing on other attributes, besides filenames. For example, a file can have an $OBJECT_ID attribute. There is an API that allows opening a file using its object ID. There is a mapping from object IDs to file reference numbers in the \$EXtend\$ObjeID metadata file. This is another B+ tree.

NTFS Crash Recovery

technique used is "write ahead" transaction logging
guarantees filesystem structure will not be corrupted
does not guarantee user data (or internal structure of individual user files) will not be lost
like Unix, requires user explicitly flush cache to achieve consistent snapshot

Write-Ahead Logging

Log modifications to filesystems structure before doing modifications
Modify the volume, working on the cached copy only
The cache prompts the log file system to flush the logfile to disk
After the log file is up to date on disk, the cache manager flushes the cached volume changes to disk
Ensures all changes to disk copy of volume are first recorded on the log file.
After a crash, "undo" and "redo" records in log file can be used to roll back or complete any transactions that were in progress at the time the system crashed.
Log file can be zeroed out after cache is dumped completely.

Logging is not unique to Windows NT/2000. It has been a feature of mainframe operating systems for years. The newer Unix/Linux filesystems (e.g., Linux ext3fs) also support logging.

Log Record Types

update record
- written continually, for each logical disk update (sub)operation
- redo information - how to reapply one suboperation if system crashes before transaction is flushed from cache
- undo information - how to reverese one suboperation of a transaction that was only partially logged (not "committed") at the time of a crash
transaction table record
- describes active transactions, that have been started but not yet committed
- suboperations of these active transactions must be removed from disk during recovery
dirty page table record
- describes pages in the cache that have not yet been written to disk
- this data must be flushed to disk during recovery
checkpoint record
- written periodically (every 5 seconds)
- forces write of transaction & dirty page table records
- indicates LSNs of most recent transaction & dirty page table records

This is a standard technique from databases. The checkpoint records limit the distance it is necessary to go back into the log file, to recover after a crash. In turn, this allows some of the log file storage to be recovered, so that the log file "normally" does not fill up.

However, there is still a possibility of the log file space (which is preallocated) overflowing. This is possible with the NFTS. When it is about to happen, the OS essentially shuts down all other filesystem activities (file creation and deletion), and performs and emergency flush operation on the buffer cache. Note that this "short pause for I/O processing" could be horrible for a criticial realtime system.

NTFS Recovery Phases

Analysis
- starts from last checkpoint
- updates transaction & dirty page tables
- e.g., removes committed transactions from dirty page table
- finds LSN of the oldest update record that has not been carried out on disk
- at end, transaction table contains LSNs of uncommitted transactions & dirty page table contains LSNs of unflushed cache records
Redoing transactions
- rodoes uncompleted update operations, in the cache
Undoing transactions
- rolls back any transactions that were not yet committed at the time of the crash

If you are following this, you should be able to answer the following question:

Since the "redo" records contain enough information to perform the real disk I/O operation, why are we "doing the operation twice"? That is, we first write the redo record, and later write the actual disk record. These are obviously redundant. Why can't we just write the real disk record, and do it only once?

© 2002 T. P. Baker & Florida State University. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without written permission. (Last updated by $Author: cop4610 $ on $Date: 2002/09/26 20:06:06 $.)