COP 4610: Operating Systems & Concurrent Programming

Linux Virtual Memory, Segmentation & Layout

 

These notes cover some of the specific details of memory management for Linux. Some of the information comes from The Linux Kernel Book, by Card, Dumas, & Mevel (John Wiley & Sons, 1997). Other information comes from the Linux man-pages, articles on the Web, and experiments done on the Red Hat 7.3 Linux distribution.

For information on the memory management API supported by Linux, see the notes on the Unix memory managment API.


Topics


Virtual Memory Layout

See also program segments.c, and its output, for more detail on process address space layout in Linux.


The kernel needs to allocate contiguous regions of the virtual address space of each process, even in a virtual memory system. By implementing virtual memory we solved the problem of allocating contiguous regions of the system's real memory to processes, so eliminated the need for compaction and complex placement algorithms. However, we still have the problem of finding contiguous ranges of addresses within the virtual address space of each process, for the various logical segments (stack, load modules, etc.) of the process.

The problem of allocating contiguous ranges is less critical in virtual memory, since we have a fairly huge space to work with. That means we can afford to allow large free ranges between regions that we might want to grow, like the stack and data areas of the process.

Linux simplifies the virtual address management problem by using a standard layout for the main logical segments of each process.

The figure shows how Linux allocates regions of the virtual address space of each process to various logical segments.

The "heap" out of which malloc() allocates memory is above the "bss" (uninitialized) portion of the data segment.

Observe that there is a region of virtual addresses between the stack and uninitialized data segments. This allows the two segments to grow toward one another, under control of the process, via the function brk() or sbrk().

Some of the range of virtual addresses above between these two segments is also used for shared memory and memory-mapping of files, as illustrated by the program examples/mmap/memshare.c.

Similarly, there is space below the text segment, so that the runtime dynamic loader can enlarge the text segment as it loads new modules.


Changing the Size of the Data Segment

#include <unistd.h>
int brk(void *endds);
void *sbrk(intptr_t incr);

A process can change the size of its data segment using the brk and sbrk system calls.

From the Unix man-pages:

The brk() and sbrk() functions are used to change dynamically the amount of space allocated for the calling process's data segment (see exec(2)). The change is made by resetting the process's break value and allocating the appropriate amount of space. The break value is the address of the first location beyond the end of the data segment. The amount of allocated space increases as the break value increases. Newly allocated space is set to zero. If, however, the same memory space is reallocated to the same process its contents are undefined.
The brk() function sets the break value to endds and changes the allocated space accordingly.
The sbrk() function adds incr bytes to the break value and changes the allocated space accordingly. The incr value can be negative, in which case the amount of allocated space is decreased.

The /proc "File" System

dr-xr-xr-x  108 root     root            0 Aug 30 12:23 .
  drwxr-xr-x   22 root     root         4096 Sep 16 13:21 ..
  dr-xr-xr-x    3 root     root            0 Oct 23 11:52 1
  dr-xr-xr-x    3 www      www             0 Oct 23 11:52 10594
  dr-xr-xr-x    3 www      www             0 Oct 23 11:52 10600
...
  dr-xr-xr-x    3 cop4610  cop4610         0 Oct 23 11:52 6992
  dr-xr-xr-x    3 www      www             0 Oct 23 11:52 10602
  dr-xr-xr-x    3 www      www             0 Oct 23 11:52 10606
  dr-xr-xr-x    3 cop4610  cop4610         0 Oct 23 11:52 10616
  dr-xr-xr-x    3 root     root            0 Oct 23 11:52 137
  dr-xr-xr-x    3 root     root            0 Oct 23 11:52 14969
  dr-xr-xr-x    3 sprague  fac             0 Oct 23 11:52 14971
  dr-xr-xr-x    3 sprague  fac             0 Oct 23 11:52 14979
  dr-xr-xr-x    3 root     fac             0 Oct 23 11:52 15329
  dr-xr-xr-x    3 root     root            0 Oct 23 11:52 15333
...

Linux (and other Unix-like systems) use the file system namespace (starting with "/") to keep track of many kinds of entities besides files. For example, hardware devices descriptors can be found in the directory /dev. Information about currently existing processes is kept in the directory /proc. There is a subdirectory for each process, named by the process id (translated to a character string).


/proc/6992 Directory

  dr-xr-xr-x    3 cop4610  cop4610         0 Oct 23 11:56 .
  dr-xr-xr-x  107 root     root            0 Aug 30 12:23 ..
  -r--r--r--    1 cop4610  cop4610         0 Oct 23 11:56 cmdline
  -r--r--r--    1 cop4610  cop4610         0 Oct 23 11:56 cpu
  lrwxrwxrwx    1 cop4610  cop4610         0 Oct 23 11:56 cwd -> /www/faculty/cop4610/public_html/restricted/notes
  -r--------    1 cop4610  cop4610         0 Oct 23 11:56 environ
  lrwxrwxrwx    1 cop4610  cop4610         0 Oct 23 11:56 exe -> /usr/bin/emacs
  dr-x------    2 cop4610  cop4610         0 Oct 23 11:56 fd
  -r--r--r--    1 cop4610  cop4610         0 Oct 23 11:56 maps
  -rw-------    1 cop4610  cop4610         0 Oct 23 11:56 mem
  lrwxrwxrwx    1 cop4610  cop4610         0 Oct 23 11:56 root -> /
  -r--r--r--    1 cop4610  cop4610         0 Oct 23 11:56 stat
  -r--r--r--    1 cop4610  cop4610         0 Oct 23 11:56 statm
  -r--r--r--    1 cop4610  cop4610         0 Oct 23 11:56 status

Within the directory for each process are a number of "files", including one named maps that describes the current memory region mapping for the process.


/proc/processid/maps

08048000-08188000 r-xp 00000000 03:05 327928     /usr/bin/emacs
08188000-08423000 rw-p 00140000 03:05 327928     /usr/bin/emacs
08423000-086a4000 rwxp 00000000 00:00 0
40000000-40013000 r-xp 00000000 03:05 1240325    /lib/ld-2.2.5.so
40013000-40014000 rw-p 00013000 03:05 1240325    /lib/ld-2.2.5.so
40014000-4004e000 r-xp 00000000 03:05 1191708    /usr/X11R6/lib/libXaw3d.so.7.0
4004e000-40054000 rw-p 0003a000 03:05 1191708    /usr/X11R6/lib/libXaw3d.so.7.0
40054000-40067000 rw-p 00000000 00:00 0
40067000-4007b000 r-xp 00000000 03:05 1191573    /usr/X11R6/lib/libXmu.so.6.2
4007b000-4007c000 rw-p 00014000 03:05 1191573    /usr/X11R6/lib/libXmu.so.6.2
4007c000-400c4000 r-xp 00000000 03:05 1191585    /usr/X11R6/lib/libXt.so.6.0
400c4000-400c8000 rw-p 00047000 03:05 1191585    /usr/X11R6/lib/libXt.so.6.0
400c8000-400cf000 r-xp 00000000 03:05 1191555    /usr/X11R6/lib/libSM.so.6.0
400cf000-400d0000 rw-p 00007000 03:05 1191555    /usr/X11R6/lib/libSM.so.6.0
400d0000-400e4000 r-xp 00000000 03:05 1191551    /usr/X11R6/lib/libICE.so.6.3
400e4000-400e5000 rw-p 00013000 03:05 1191551    /usr/X11R6/lib/libICE.so.6.3
400e5000-400e7000 rw-p 00000000 00:00 0
400e7000-400f3000 r-xp 00000000 03:05 1191565    /usr/X11R6/lib/libXext.so.6.4
400f3000-400f4000 rw-p 0000c000 03:05 1191565    /usr/X11R6/lib/libXext.so.6.4
400f4000-400f5000 r--p 00000000 03:05 1289283    /usr/lib/locale/en_US.iso885915/LC_IDENTIFICATION
400f5000-400fb000 r--s 00000000 03:05 2040184    /usr/lib/gconv/gconv-modules.cache
400fb000-400fc000 r--p 00000000 03:05 1289284    /usr/lib/locale/en_US.iso885915/LC_MEASUREMENT
400fc000-400fd000 r--p 00000000 03:05 1289288    /usr/lib/locale/en_US.iso885915/LC_TELEPHONE
400fd000-400fe000 r--p 00000000 03:05 1289282    /usr/lib/locale/en_US.iso885915/LC_ADDRESS
400fe000-400ff000 r--p 00000000 03:05 1289286    /usr/lib/locale/en_US.iso885915/LC_NAME
400ff000-40100000 r--p 00000000 03:05 1289287    /usr/lib/locale/en_US.iso885915/LC_PAPER
40100000-40101000 r--p 00000000 03:05 1305602    /usr/lib/locale/en_US.iso885915/LC_MESSAGES/SYS_LC_MESSAGES
40101000-40102000 r--p 00000000 03:05 1289285    /usr/lib/locale/en_US.iso885915/LC_MONETARY
40102000-40103000 r--p 00000000 03:05 1289289    /usr/lib/locale/en_US.iso885915/LC_TIME
40103000-40144000 r-xp 00000000 03:05 342941     /usr/lib/libtiff.so.3.5
40144000-40146000 rw-p 00041000 03:05 342941     /usr/lib/libtiff.so.3.5
40146000-40147000 rw-p 00000000 00:00 0
40147000-40164000 r-xp 00000000 03:05 342933     /usr/lib/libjpeg.so.62.0.0
40164000-40165000 rw-p 0001d000 03:05 342933     /usr/lib/libjpeg.so.62.0.0
40165000-40185000 r-xp 00000000 03:05 342940     /usr/lib/libpng.so.2.1.0.12
40185000-40186000 rw-p 0001f000 03:05 342940     /usr/lib/libpng.so.2.1.0.12
40186000-40192000 r-xp 00000000 03:05 342902     /usr/lib/libz.so.1.1.3
40192000-40194000 rw-p 0000b000 03:05 342902     /usr/lib/libz.so.1.1.3
40194000-401b5000 r-xp 00000000 03:05 2056327    /lib/i686/libm-2.2.5.so
401b5000-401b6000 rw-p 00020000 03:05 2056327    /lib/i686/libm-2.2.5.so
401b6000-401bd000 r-xp 00000000 03:05 343016     /usr/lib/libungif.so.4.1.0
401bd000-401be000 rw-p 00006000 03:05 343016     /usr/lib/libungif.so.4.1.0
401be000-401cc000 r-xp 00000000 03:05 1191579    /usr/X11R6/lib/libXpm.so.4.11
401cc000-401cd000 rw-p 0000d000 03:05 1191579    /usr/X11R6/lib/libXpm.so.4.11
401cd000-401ce000 rw-p 00000000 00:00 0
401ce000-402a0000 r-xp 00000000 03:05 1191557    /usr/X11R6/lib/libX11.so.6.2
402a0000-402a3000 rw-p 000d1000 03:05 1191557    /usr/X11R6/lib/libX11.so.6.2
402a3000-402d8000 r-xp 00000000 03:05 342855     /usr/lib/libncurses.so.5.2
402d8000-402e1000 rw-p 00035000 03:05 342855     /usr/lib/libncurses.so.5.2
402e1000-402e3000 r-xp 00000000 03:05 1240338    /lib/libdl-2.2.5.so
402e3000-402e4000 rw-p 00001000 03:05 1240338    /lib/libdl-2.2.5.so
402e4000-402e5000 rw-p 00000000 00:00 0
402e5000-402eb000 r--p 00000000 03:05 1289291    /usr/lib/locale/en_US.iso885915/LC_COLLATE
402eb000-402ec000 r--p 00000000 03:05 1289290    /usr/lib/locale/en_US.iso885915/LC_NUMERIC
402ec000-40317000 r--p 00000000 03:05 1289292    /usr/lib/locale/en_US.iso885915/LC_CTYPE
40318000-4031a000 r-xp 00000000 03:05 571313     /usr/X11R6/lib/X11/locale/common/xlcDef.so.2
4031a000-4031b000 rw-p 00001000 03:05 571313     /usr/X11R6/lib/X11/locale/common/xlcDef.so.2
4031c000-4031e000 r-xp 00000000 03:05 2040136    /usr/lib/gconv/ISO8859-15.so
4031e000-4031f000 rw-p 00001000 03:05 2040136    /usr/lib/gconv/ISO8859-15.so
40326000-4032f000 r-xp 00000000 03:05 1240358    /lib/libnss_files-2.2.5.so
4032f000-40330000 rw-p 00009000 03:05 1240358    /lib/libnss_files-2.2.5.so
40330000-4034b000 r-xp 00000000 03:05 571312     /usr/X11R6/lib/X11/locale/common/ximcp.so.2
4034b000-4034d000 rw-p 0001a000 03:05 571312     /usr/X11R6/lib/X11/locale/common/ximcp.so.2
4034d000-40356000 r-xp 00000000 03:05 571317     /usr/X11R6/lib/X11/locale/common/xomGeneric.so.2
40356000-40357000 rw-p 00008000 03:05 571317     /usr/X11R6/lib/X11/locale/common/xomGeneric.so.2
40357000-40368000 rw-p 00000000 00:00 0
42000000-4212c000 r-xp 00000000 03:05 2056325    /lib/i686/libc-2.2.5.so
4212c000-42131000 rw-p 0012c000 03:05 2056325    /lib/i686/libc-2.2.5.so
42131000-42135000 rw-p 00000000 00:00 0
bffbb000-c0000000 rwxp fffbc000 00:00 0

The Linux system file /proc/processid/maps describes the current memory mapping for memory regions of process processid. The example above is from an execution of the text editor program emacs.

The first two fields give the range of addresses that defines the region. The third field gives the permissions set for that region. (The character 'p' indicates that the region may be shared with other processes.)

The other fields indicate the mapping of the region, if any. The first of these is the shifting/offset of the region within the object. The second indicates the device. The third indicates the inode. The fourth indicates the pathname.

You may notice that each load module has two or three regions of memory associated with it. For example:

08048000-08188000 r-xp 00000000 03:05 327928     /usr/bin/emacs
08188000-08423000 rw-p 00140000 03:05 327928     /usr/bin/emacs
08423000-086a4000 rwxp 00000000 00:00 0

The first region, which is readable and executable (r-x), must contain the code.

The second region, which is readable and writeable but not executable (rw-), appears to for be the loader-initialized data.

The third region, which is readable and writeable, and in one case also executable, appears to be for the uninitialized data.

Notice that the dynamically loaded (shareable) libraries are mapped in the range of addresses starting at 40000000. This happens to be the same area where memory-mapped files are mapped. That is not a coincidence.

40014000-4004e000 r-xp 00000000 03:05 1191708    /usr/X11R6/lib/libXaw3d.so.7.0
4004e000-40054000 rw-p 0003a000 03:05 1191708    /usr/X11R6/lib/libXaw3d.so.7.0
40054000-40067000 rw-p 00000000 00:00 0

Notice that a library generally will have its own data segments, for initialized and uninitialized static data.

The runtime stack segment mapping is shown below:

bffbb000-c0000000 rwxp fffbc000 00:00 0

Notice that it is executable, as well as readable and writeable. If it were not executable, stack crashing would not be so easy.


x86 Segmentation


These details apply only to Linux on the Intel x86 architecture, as of the time and version that these notes were first written, in 2002. They are provided by way of illustration, only. Not only will they be different for different architectures; they are also likely to change as Linux evolves on the x86 family.

Linux makes use of the segment registers cs and ds to access the code and data segments, respectively. The register values are swapped when the process traps into kernel mode, and again when it returns back to user mode.

All processes have the same segment register values. This keeps memory management simpler.


Linux Paging


For the X86 (32-bit) architecture, Linux uses a 2-level page table scheme. For the Alpha processor (64-bit) is uses a three-level scheme. For portability, most of the code assumes a three-level scheme. For the x86, the middle-level table has only one entry.


Linux 3-Level Paging


Why does a 64-bit address require 3 levels of paging?

The key concept here is sparsity. If we needed to map the entire virtual address space, the 3-level page table would take up more memory than a 1-level page table. However, we do not map more than a small portion of the virtual address space. Normally, a process will have only a few small ranges of pages mapped. For example, if you examine the example /proc/processid/maps (process region list) above, you will see that most of the process address space is not mapped, i.e., the entire ranges 000000-0081000, 086a4000-40000000, and 42135000-bfffbb00, and much of the range c0000000-ffffffff are not mapped (and so do not require page table entries). By not having to store the middle or lowest level page tables for the unmapped regions, a 3-level scheme saves memory over a 2-level scheme.


x86 Paging Parameters


Linux Page Descriptors (Page Frame Table Entries)

Entries in table mem_map in mm/memory.c, of type struct page, in linux/mm.h:


Linux Page Status Bits

Declared in linux/mm.h:


Linux Page Sharing


Linux allows processes to share pages. This results in a huge gain in performance. Sharing code pages can reduce the number of disk reads necessary to load code into memory, and allows us to have more resident processes without an excessive page fault rate. Sharing data pages between parent and child processes after a fork speeds up the fork operation, and saves a lot of wasted work if the child process quickly does an exec operation (which ends up reinitializing the data segment).


Linux Page Cache


Unix-based operating systems all implement a caching system in main memory for recently accessed blocks of data from files. This cache, which we will consider again a bit later, under the topic of I/O and filesystems, allows the OS to reduce the number of disk I/O operations it must do. That is, if a block of data has been read from or written to disk recently, the corresponding data is likely to still be in main memory, in the disk block cache. If a program requests to read the same block of data again while it is in cache, there is no need to do a disk I/O operation. Similarly, if a block of data is rewritten before the cache copy is flushed out to the disk device, disk write operations may also be avoided.

Modern Unix-based operating systems have unified the treatment of page-swapping with other block-oriented I/O, so that they all use the same pool of block buffers. Given that files can be mapped to memory regions and each process image has a corresponding disk file (the swap area for the process), there is not much differnce between what the system needs to do for a page frame that is holding part of a process image and what it nees to do for a page frame that is holding a portion of the data of a file. The same software can do the moving of both kinds of data between page frames and disk files.

Implementation of disk block buffering requires a way to find out which page frame (if any) is currently holding a given block of data, given the identity of the file and offset in the file of the block of data. Unix systems implement this associative mapping using the fastest known (O(1)) data structure for associative mappings, i.e., a hash table. The block buffering scheme uses a hash table with chained overflow. Two pointers are provided in each page frame descriptor, to chain together the descriptors of pages whose current content hashes to the same location in the hash table.


Linux Page Allocation


Why would we like blocks of pages to be contiguous in real memory?

Disk I/O can be done more efficiently on contiguous blocks of pages. This is especially important for DMA I/O. But DMA may be limited to certain ranges of addresses (first 16 Mbytes on Intel x86). Therefore, Linux keeps two different buddy systems, one for DMA-suitable page frames, and one for the rest.


Linux Page Replacement


Who does the incrementing of the age variable? How is this done efficiently?

Stallings and The Linux Kernel Book imply the age is a count of page references. That is not true. What Linux keeps is only an approximation.

It is impossible for the operating system itself (in software) to count page references, for several good reasons. One reason is that executing the software to increment the count would itself make memory references, etc.. Therefore, the actual page reference counting must be done by the hardware. The hardware generally has only a use bit and a modified bit. The Linux system uses the use bit to approximate a count of the number of "recent" references to the page.

There are several schemes for extracting a more precise history of use from the a single use bit. The basic idea behind them is similar to the clock algorithm. The system periodically scans through the page frames, resetting the hardware-supported use bit to zero. As it does this scan, it saves the old values of the use bits into a data structure for each page frame. One version of this scheme keeps a short bit vector containing the last n (FIFO) values of the use bit for each frame. The Linux version keeps something like a count of the number of times the system has scanned the frame and found the use bit set, minus the number of times the system has scanned the frame and found the use bit not set. A larger number means the frame has been referenced often, in recent time.

To see exactly how this works, one can look at the Linux source code.


Page Frame Aging

static inline void age_page_up(struct page *page)
{page->age = min((int) (page->age + PAGE_AGE_ADV), PAGE_AGE_MAX); 
}
static inline void age_page_down(struct page *page)
{page->age -= min(PAGE_AGE_DECL, (int)page->age);
}

The above functions are from file mm/vmscan.c. The function age_page_up updates the he age field of the page frame structure to reflect a nonzero use bit. The function age_page_down handles the other case. The code below shows where these functions are called.


Page Frame Aging (continued)

int refill_inactive_zone(struct zone_struct * zone, int priority)
{
	int maxscan = zone->active_pages >> priority;
	int target = inactive_high(zone);
	struct list_head * page_lru;
	int nr_deactivated = 0;
	struct page * page;

	/* Take the lock while messing with the list... */
	spin_lock(&pagemap_lru_lock);
	while (maxscan-- && !list_empty(&zone->active_list)) {
		page_lru = zone->active_list.prev;
		page = list_entry(page_lru, struct page, lru);

		/* Wrong page on list?! (list corruption, should not happen) */
		if (unlikely(!PageActive(page))) {
			printk("VM: refill_inactive, wrong page on list.\n");
			list_del(page_lru);
			nr_active_pages--;
			continue;
		}
		
		/* Needed to follow page->mapping */
		if (TryLockPage(page)) {
			list_del(page_lru);
			list_add(page_lru, &zone->active_list);
			continue;
		}

		/*
		 * If the object the page is in is not in use we don't
		 * bother with page aging.  If the page is touched again
		 * while on the inactive_clean list it'll be reactivated.
		 * From here until the end of the current iteration
		 * both PG_locked and the pte_chain_lock are held.
		 */
		pte_chain_lock(page);
		if (!page_mapping_inuse(page)) {
			pte_chain_unlock(page);
			UnlockPage(page);
			drop_page(page);
			continue;
		}

		/*
		 * Do aging on the pages.
		 */
		if (page_referenced(page)) {
			age_page_up(page);
		} else {
			age_page_down(page);
		}

		/* 
		 * If the page age is 'hot' and the process using the
		 * page doesn't exceed its RSS limit we keep the page.
		 * Otherwise we move it to the inactive_dirty list.
		 */
		if (page->age && !page_over_rsslimit(page)) {
			list_del(page_lru);
			list_add(page_lru, &zone->active_list);
		} else {
			deactivate_page_nolock(page);
			if (++nr_deactivated > target) {
				pte_chain_unlock(page);
				UnlockPage(page);
				goto done;
			}
		}
		pte_chain_unlock(page);
		UnlockPage(page);

		/* Low latency reschedule point */
		if (current->need_resched) {
			spin_unlock(&pagemap_lru_lock);
			schedule();
			spin_lock(&pagemap_lru_lock);
		}
	}

done:
	spin_unlock(&pagemap_lru_lock);

	return nr_deactivated;
}

The code above is from file mm/vmscan.c.

The function refill_inactive_zone scans a portion of the active page list of a zone to find unused pages, which will then be moved to the inactive list. The parameter priority specifies how much to scan.

The page frame descriptor field age reflects how many times the page has been referenced. Apparently, page_referenced fetches the use-bit, which is maintained by the hardware. If this is nonzero, meaning the page has been referenced recently, the age is incremented (but not above PAGE_AGE_MAX). If the page has not been referenced recently, the age is decremented (but not below zero) In this way, the age reflects how many times the page has been referenced in recent history.


Linux Kernel Memory Allocation


Linux Virtual Memory Region Mapping


The kernel has a collection of data structures that it uses to keep track of the active virtual memory regions for each process. These include a per-process address space descriptor, which has a link to a list of per-region Virtual Memory Area (VMA) region descriptors.


Linux Virtual Memory Region Descriptor


The above is a partial list of the fields of struct vm_area_struct from the Linux kernel, version 2.4.18.

  • The Linux Kernel Book says that region descriptors are also organized in an AVL tree, for speed of lookup (O(log n)), and says the following additional fields are defined:

    I could not find these fields in the kernel sources for Linux 2.4.18. I don't know whether they were in an earlier version of Linux, or are in one of the newer versions (Linux 2.5.x).

    If true, this is a very nice application of AVL trees. If you look at the example list of process regions you will see that there are quite a few of them. At times the system needs to determine whether a given page is mapped, find a region of virtual memory that is not mapped, or find which file (if any) is mapped to a region of memory. To avoid doing a linear search of the entire linked list of memory region descriptors, the system uses the AVL tree.

    © 2002 T. P. Baker & Florida State University. Information obtained from The Linux Kernel Book, by Card, Dumas, & Mevel (John Wiley & Sons). No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without written permission. (Last updated by $Author: cop4610 $ on $Date: 2002/09/26 20:06:06 $.)