Linux Kernel & Device Driver Programming

Ch 15 - Memory Mapping and DMA


Sub-Topics

Memory Mapping

  • Translation of address issued by some device (e.g., CPU or I/O device) to address sent out on memory bus (physical address)
  • Mapping is performed by memory management units
  • CPU(s) and I/O devices may have different (or no) memory management units
    • No MMU means direct (trivial) mapping
  • Memory mapping is implemented by the MMU(s) using page (translation) tables stored in memory
  • The OS is responsible for defining the mappings, by managing the page tables

AGP and PCI Express graphics cards use a Graphics Address Remapping Table (GART), which is one example of an IOMMU. See the Wiki article on IOMMU for more detail on memory mapping with I/O devices.

Address Mapping Function (Review)

Unmapped Pages

The mapping is sparse. Some pages are unmapped.


Why don't we map all pages?

Unmapped Pages

Pages may be mapped to locations on devices.


And some pages may be mapped to both.

MMU Function (Review)

Possible Handler Actions

  1. Map the page to a valid physical memory location
    • May require creating a page table entry
    • May require bringing data into memory from a device
  2. Treat the event as an error (e.g., SIGSEGV)
  3. Pass the exception on to a device-specific handler
    • The device's fault method

Linux Page Tables Have 4 Levels


How does having multiple levels save memory?
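
One way to approach the question is to work the arithmetic. The sketch below assumes an x86-64 style layout (four levels of 9 index bits each plus a 12-bit page offset, with 8-byte entries); these numbers are an assumption for illustration and differ on other architectures. It splits a virtual address into its four table indices (pgd, pud, pmd, pte order) and compares the cost of one flat table with the cost of the tables needed to map a single page.

```c
#include <stdint.h>

/* Assumed layout: 4 levels, 9 index bits per level, 12 offset bits. */
#define LEVELS         4
#define BITS_PER_LEVEL 9
#define OFFSET_BITS    12

/* Split a virtual address into per-level table indices (top level first)
 * and the byte offset within the page. */
static void split_va(uint64_t va, unsigned idx[LEVELS], unsigned *offset)
{
    *offset = (unsigned)(va & ((1u << OFFSET_BITS) - 1));
    for (int level = 0; level < LEVELS; level++) {
        int shift = OFFSET_BITS + BITS_PER_LEVEL * (LEVELS - 1 - level);
        idx[level] = (unsigned)((va >> shift) & ((1u << BITS_PER_LEVEL) - 1));
    }
}

/* Bytes needed by one flat table with an 8-byte entry for every page
 * of the 48-bit address space. */
static uint64_t flat_table_bytes(void)
{
    uint64_t pages = 1ULL << (LEVELS * BITS_PER_LEVEL);  /* 2^36 pages */
    return pages * 8;
}

/* Bytes needed by the multi-level scheme to map ONE page: one table of
 * 512 eight-byte entries at each of the 4 levels. */
static uint64_t tables_for_one_page_bytes(void)
{
    return (uint64_t)LEVELS * (1u << BITS_PER_LEVEL) * 8;
}
```

Under these assumptions the flat table needs 2^39 bytes (512 GiB) no matter how little is mapped, while the multi-level tables for a single mapped page need 16 KiB. Lower-level tables are allocated only for the parts of the address space that are actually in use.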

Linux Page Tables

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

Kernel 2.6 provides functions that allow drivers to ignore page table internals, so these details can change without affecting drivers.

Don't confuse page table entries with struct page.

Linux has page tables for internal use, even if the hardware doesn't require them. For example, apparently, the PowerPC hardware does not access the page tables at all. It uses some memory and a hashing scheme to cache recently used TLB entries, but if that misses, the software handler needs to explicitly load the TLB by doing the actual page table lookup.

Kernel Memory Mapping



On a 32-bit machine the Linux kernel cannot use the whole virtual address range for kernel logical addresses. This is a design choice: the range of virtual addresses is split 3:1 between user virtual addresses and kernel virtual addresses.

The reason for this is to avoid changing page table entries when a process traps into the kernel, and to allow the kernel to copy data to and from user space efficiently (without the overhead of manipulating page tables). That is, the process already has page table entries to map all of the kernel memory. These pages are protected while in user mode, but while executing in kernel mode they can all be accessed. The other 3 GB of virtual address space are available to the user-mode process for its own (swappable) code, data, etc. In effect, pages of real memory may have up to two distinct active page table entries that refer to them, for (1) a kernel memory page and (2) a user memory page.
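
The arithmetic behind the split can be sketched in user space. The code below assumes the classic 32-bit x86 configuration, in which kernel logical addresses start at 3 GB (PAGE_OFFSET is 0xC0000000) and differ from physical addresses by that constant. The names demo_pa, demo_va, and is_user_address are hypothetical stand-ins; in the kernel the conversions are done by the __pa() and __va() macros.

```c
#include <stdint.h>

/* Assumption: 3:1 split on 32-bit x86; other configurations differ. */
#define DEMO_PAGE_OFFSET 0xC0000000u

/* Model of __pa(): kernel logical address -> physical address. */
static uint32_t demo_pa(uint32_t kernel_logical)
{
    return kernel_logical - DEMO_PAGE_OFFSET;
}

/* Model of __va(): physical address -> kernel logical address. */
static uint32_t demo_va(uint32_t physical)
{
    return physical + DEMO_PAGE_OFFSET;
}

/* An address is in the user part of the split if it lies below 3 GB. */
static int is_user_address(uint32_t vaddr)
{
    return vaddr < DEMO_PAGE_OFFSET;
}
```

For example, physical address 0x00100000 (where the kernel code begins in the /proc/iomem listing later in these notes) would have kernel logical address 0xC0100000 under this assumption.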

Kernel Logical Addresses


Examples of PAGE_OFFSET values:

Kernel Logical Addresses

Physical Address Extension (PAE)

See Wiki article on PAE for more detail.

Linux "High" vs. "Low" Memory

Applies only to machines with larger physical than virtual address space.



Do not confuse Linux low memory here with DOS/Windows 0-640K "low memory" range.

It seems that some Linux writers also use the term "low memory" for kernel memory, which can be confusing.

The following output of /proc/meminfo on a Pentium III machine with 1 GB of RAM shows the split between high and low memory as follows:

MemTotal:      1030888 kB
HighTotal:      131008 kB
LowTotal:       899880 kB
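
The three figures above are consistent: HighTotal plus LowTotal equals MemTotal. A minimal sketch of checking this in a program follows; meminfo_kb is a hypothetical helper that scans meminfo-style text for a key, and it assumes the "Key: value kB" line format shown above.

```c
#include <stdio.h>
#include <string.h>

/* Find "<key>: <value> kB" in meminfo-style text.
 * Returns the value in kB, or -1 if the key is not present. */
static long meminfo_kb(const char *text, const char *key)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line && *line) {
        long value;
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':' &&
            sscanf(line + keylen + 1, "%ld", &value) == 1)
            return value;
        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line)
            line++;
    }
    return -1;
}
```

On a live system the text would be read from /proc/meminfo; the HighTotal and LowTotal lines appear only on kernels built with high memory support.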

Kernel Virtual Addresses

In a machine whose virtual addresses do not permit addressing all of physical memory (e.g., a 32-bit machine with PAE), part of the kernel memory is not mapped in the above fashion, so that we can access "high memory".


Diagram from LDD3 Book

types of addresses figure

Page Size Symbolic Constants

What are the advantages of larger page size? Disadvantages?

How does page size relate to the size of physical memory supported, and the amount of space occupied by page tables?
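
A small sketch may help with these questions. It models the relationship between the kernel's PAGE_SHIFT, PAGE_SIZE, and PAGE_MASK constants in user space, assuming 4 KiB pages (the DEMO_ names and the helper functions are hypothetical; in the kernel the real constants come from <asm/page.h> and depend on the architecture).

```c
/* Assumed value for the demo: 4 KiB pages, as on 32-bit x86. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Page frame number containing an address. */
static unsigned long addr_to_pfn(unsigned long addr)
{
    return addr >> DEMO_PAGE_SHIFT;
}

/* Byte offset of an address within its page. */
static unsigned long page_offset_of(unsigned long addr)
{
    return addr & ~DEMO_PAGE_MASK;
}

/* Round a length up to a whole number of pages. */
static unsigned long page_align_up(unsigned long len)
{
    return (len + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK;
}

/* Number of page table entries needed to map 'bytes' of memory
 * with pages of size 2^page_shift. */
static unsigned long ptes_needed(unsigned long bytes, unsigned page_shift)
{
    return (bytes + (1UL << page_shift) - 1) >> page_shift;
}
```

Mapping 1 GB takes 262144 entries with 4 KiB pages but only 131072 with 8 KiB pages, so larger pages shrink the page tables. On the other hand, page_align_up(1) is a full 4096 bytes, which shows the cost: every allocation is rounded up to a whole page, and the waste grows with the page size.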

struct page

Describes a page of physical memory.


Do not confuse the struct page with a page table entry. A struct page entry corresponds to a page of real memory, and seems to correspond to what is generically called a "page frame table" entry in OS textbooks.

struct page pointers ↔ virtual addresses

kmap() & kunmap()

Some Page Table Operations

Device drivers should not need to use the above functions, because of the generic memory mapping services described below.


Kernel 2.6 provides functions that allow drivers to ignore page table internals, so the above details can now change without affecting drivers. However, the following is important to drivers.

Virtual Memory Areas

Process Memory Map

Virtual Memory Regions

From http://duartes.org/gustavo/blog/post/2009/02. I recommend reading the entire article.


Please read Gustavo Duarte's blog post "Page Cache, the Affair Between Memory and Files". It contains two diagrams, which I have reproduced here for reference in class while explaining the way Linux manages virtual-to-physical address mappings.

The article goes into more detail and includes links to Linux kernel code on the LXR site. It also explains how memory-mapped I/O works and how it is integrated into the paged virtual memory system.

Virtual Memory Area Mapping Descriptors

From http://duartes.org/gustavo/blog/post/2009/02. I recommend reading the entire article.

Example of VMA's of a Process

Output of /proc/.../maps for an emacs editing session:

vm_start-vm_end        vm_pgoff       inode
↓                 vm_page_prot  major:minor      image
↓                 ↓    ↓        ↓     ↓          ↓
baker@websrv: cat /proc/24692/maps
08048000-08195000 r-xp 00000000 09:02 294678     /usr/bin/emacs
08195000-0842f000 rw-p 0014c000 09:02 294678     /usr/bin/emacs
0842f000-08827000 rwxp 00000000 00:00 0          zero-mapped BSS for emacs
40000000-40012000 r-xp 00000000 09:02 211751     /lib/ld-2.2.93.so
40012000-40013000 rw-p 00012000 09:02 211751     /lib/ld-2.2.93.so
40013000-40052000 r-xp 00000000 09:02 896219     /usr/X11R6/lib/libXaw3d.so.7.0
40052000-40058000 rw-p 0003e000 09:02 896219     /usr/X11R6/lib/libXaw3d.so.7.0
40058000-4006b000 rw-p 00000000 00:00 0          BSS for libXaw3d
4006b000-40080000 r-xp 00000000 09:02 896170     /usr/X11R6/lib/libXmu.so.6.2
40080000-40081000 rw-p 00015000 09:02 896170     /usr/X11R6/lib/libXmu.so.6.2
40081000-400cf000 r-xp 00000000 09:02 896182     /usr/X11R6/lib/libXt.so.6.0
400cf000-400d3000 rw-p 0004d000 09:02 896182     /usr/X11R6/lib/libXt.so.6.0
400d3000-400db000 r-xp 00000000 09:02 896152     /usr/X11R6/lib/libSM.so.6.0
400db000-400dc000 rw-p 00007000 09:02 896152     /usr/X11R6/lib/libSM.so.6.0
400dc000-400f0000 r-xp 00000000 09:02 896148     /usr/X11R6/lib/libICE.so.6.3
400f0000-400f1000 rw-p 00013000 09:02 896148     /usr/X11R6/lib/libICE.so.6.3
400f1000-400f3000 rw-p 00000000 00:00 0
400f3000-40100000 r-xp 00000000 09:02 896162     /usr/X11R6/lib/libXext.so.6.4
40100000-40101000 rw-p 0000c000 09:02 896162     /usr/X11R6/lib/libXext.so.6.4
40102000-40104000 r-xp 00000000 09:02 130365     /usr/X11R6/lib/X11/locale/common/xlcDef.so.2
40104000-40105000 rw-p 00001000 09:02 130365     /usr/X11R6/lib/X11/locale/common/xlcDef.so.2
40106000-40108000 r-xp 00000000 09:02 928554     /usr/lib/gconv/ISO8859-1.so
40108000-40109000 rw-p 00001000 09:02 928554     /usr/lib/gconv/ISO8859-1.so
40109000-40149000 r-xp 00000000 09:02 309641     /usr/lib/libtiff.so.3.5
40149000-4014b000 rw-p 0003f000 09:02 309641     /usr/lib/libtiff.so.3.5
4014b000-4014c000 rw-p 00000000 00:00 0
4014c000-40169000 r-xp 00000000 09:02 309545     /usr/lib/libjpeg.so.62.0.0
40169000-4016a000 rw-p 0001c000 09:02 309545     /usr/lib/libjpeg.so.62.0.0
4016a000-40193000 r-xp 00000000 09:02 309633     /usr/lib/libpng12.so.0.1.2.5
40193000-40194000 rw-p 00028000 09:02 309633     /usr/lib/libpng12.so.0.1.2.5
40194000-401a0000 r-xp 00000000 09:02 309632     /usr/lib/libz.so.1.1.4
401a0000-401a2000 rw-p 0000b000 09:02 309632     /usr/lib/libz.so.1.1.4
401a2000-401c3000 r-xp 00000000 09:02 1042441    /lib/i686/libm-2.2.93.so
401c3000-401c4000 rw-p 00021000 09:02 1042441    /lib/i686/libm-2.2.93.so
401c4000-401cb000 r-xp 00000000 09:02 309702     /usr/lib/libungif.so.4.1.0
401cb000-401cc000 rw-p 00007000 09:02 309702     /usr/lib/libungif.so.4.1.0
401cc000-401da000 r-xp 00000000 09:02 896176     /usr/X11R6/lib/libXpm.so.4.11
401da000-401db000 rw-p 0000d000 09:02 896176     /usr/X11R6/lib/libXpm.so.4.11
401db000-401dc000 rw-p 00000000 00:00 0
401dc000-402b7000 r-xp 00000000 09:02 896154     /usr/X11R6/lib/libX11.so.6.2
402b7000-402ba000 rw-p 000da000 09:02 896154     /usr/X11R6/lib/libX11.so.6.2
402ba000-402f0000 r-xp 00000000 09:02 309587     /usr/lib/libncurses.so.5.2
402f0000-402f9000 rw-p 00035000 09:02 309587     /usr/lib/libncurses.so.5.2
402f9000-402fb000 r-xp 00000000 09:02 211764     /lib/libdl-2.2.93.so
402fb000-402fc000 rw-p 00001000 09:02 211764     /lib/libdl-2.2.93.so
402fc000-402fd000 rw-p 00000000 00:00 0
402fd000-404bc000 r--p 00000000 09:02 390925     /usr/lib/locale/locale-archive
404bc000-404c5000 r-xp 00000000 09:02 211784     /lib/libnss_files-2.2.93.so
404c5000-404c6000 rw-p 00008000 09:02 211784     /lib/libnss_files-2.2.93.so
404c6000-404e2000 r-xp 00000000 09:02 130364     /usr/X11R6/lib/X11/locale/common/ximcp.so.2
404e2000-404e4000 rw-p 0001b000 09:02 130364     /usr/X11R6/lib/X11/locale/common/ximcp.so.2
404e4000-404ea000 r--s 00000000 09:02 928609     /usr/lib/gconv/gconv-modules.cache
404ea000-404f3000 r-xp 00000000 09:02 130369     /usr/X11R6/lib/X11/locale/common/xomGeneric.so.2
404f3000-404f4000 rw-p 00008000 09:02 130369     /usr/X11R6/lib/X11/locale/common/xomGeneric.so.2
42000000-42126000 r-xp 00000000 09:02 1042439    /lib/i686/libc-2.2.93.so
42126000-4212b000 rw-p 00126000 09:02 1042439    /lib/i686/libc-2.2.93.so
4212b000-4212f000 rw-p 00000000 00:00 0          BSS for libc
bffcb000-c0000000 rwxp fffcc000 00:00 0          stack segment


Can you make sense of all the VMA's above?

vsyscall is a page of kernel functions mapped into user space. These are functions that do things that may not require supervisor privilege (like getting the time of day on some systems). Putting them in user space lowers the overhead of calls.
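
To experiment with listings like the one above, it can help to parse them. The following is a minimal sketch, not a complete parser: struct map_entry and parse_maps_line are hypothetical names, and the format string assumes the column layout shown above (address range, permissions, offset, device, inode, optional path). Note that the offset column is printed in bytes, i.e., vm_pgoff shifted left by PAGE_SHIFT.

```c
#include <stdio.h>
#include <string.h>

struct map_entry {
    unsigned long start, end;   /* vm_start, vm_end */
    char perms[5];              /* e.g. "r-xp" */
    unsigned long offset;       /* offset into the mapped object, in bytes */
    unsigned int major, minor;  /* device numbers (printed in hex) */
    unsigned long inode;        /* 0 for anonymous mappings */
    char path[256];             /* empty if the line has no image name */
};

/* Parse one line of /proc/<pid>/maps. Returns 0 on success, -1 on error. */
static int parse_maps_line(const char *line, struct map_entry *e)
{
    int n;

    e->path[0] = '\0';
    n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255[^\n]",
               &e->start, &e->end, e->perms, &e->offset,
               &e->major, &e->minor, &e->inode, e->path);
    return (n >= 7) ? 0 : -1;   /* the path field is optional */
}
```

Applied to the first line of the emacs listing, this gives a text mapping of 0x14d000 bytes (end minus start) with permissions r-xp.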

struct vm_area_struct

vm_operations_struct.vm_ops


The populate() method has been phased out.

The nopage() and nopfn() methods are obsolescent. The new method that replaces both is fault().

Uses of Memory Mapping by Device Drivers

A device driver is likely to use memory mapping for two main purposes:

Using mmap() to Access Device (I/O) Memory

The following I/O device mappings are from /proc/.../maps for an X server process

08048000-081bc000 r-xp 00000000 09:00 742770     /usr/X11R6/bin/XFree86
081bc000-081ee000 rw-p 00174000 09:00 742770     /usr/X11R6/bin/XFree86
081ee000-08d57000 rwxp 00000000 00:00 0
40000000-40015000 r-xp 00000000 09:00 32269      /lib/ld-2.3.2.so
40015000-40016000 rw-p 00014000 09:00 32269      /lib/ld-2.3.2.so
40016000-40017000 rw-p 00000000 00:00 0
40017000-40027000 rw-s fe5e0000 09:00 96996      /dev/mem
4002d000-40039000 r-xp 00000000 09:00 371074     /usr/lib/libz.so.1.1.4
40039000-4003b000 rw-p 0000b000 09:00 371074     /usr/lib/libz.so.1.1.4
4003b000-4005c000 r-xp 00000000 09:00 212428     /lib/tls/libm-2.3.2.so
4005c000-4005d000 rw-p 00020000 09:00 212428     /lib/tls/libm-2.3.2.so
4005d000-40064000 r-xp 00000000 09:00 32396      /lib/libpam.so.0.75
40064000-40065000 rw-p 00007000 09:00 32396      /lib/libpam.so.0.75
40065000-40066000 rw-p 00000000 00:00 0
40066000-40069000 r-xp 00000000 09:00 33817      /lib/libdl-2.3.2.so
40069000-4006a000 rw-p 00002000 09:00 33817      /lib/libdl-2.3.2.so
4006a000-4006c000 r-xp 00000000 09:00 34152      /lib/libpam_misc.so.0.75
4006c000-4006d000 rw-p 00001000 09:00 34152      /lib/libpam_misc.so.0.75
4006d000-4006e000 rw-p 00000000 00:00 0
4006e000-40079000 r-xp 00000000 09:00 34945      /lib/libnss_files-2.3.2.so
40079000-4007a000 rw-p 0000a000 09:00 34945      /lib/libnss_files-2.3.2.so
4007a000-400cd000 rw-p 00000000 00:00 0
400cd000-400dd000 rw-s 000a0000 09:00 96996      /dev/mem
400dd000-4013d000 rw-s 00000000 00:04 71958528   /SYSV00000000 (deleted)
401fe000-40b91000 rw-p 00121000 00:00 0
42000000-4212f000 r-xp 00000000 09:00 212241     /lib/tls/libc-2.3.2.so
4212f000-42132000 rw-p 0012f000 09:00 212241     /lib/tls/libc-2.3.2.so
42132000-42134000 rw-p 00000000 00:00 0
42134000-44134000 rw-s fc000000 09:00 96996      /dev/mem
bffaf000-c0000000 rwxp fffb0000 00:00 0


Distinguish the above, which are user virtual address mappings implemented by drivers, from the (lower level) mappings of I/O addresses to kernel logical addresses implemented by the I/O system, like the following from /proc/iomem. Three of these ranges are exported to the X server via (higher level) memory mapping to user space: the ones starting at 000a0000, fc000000, and fe5e0000, which match the offsets of the three /dev/mem mappings in the listing above.

00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000c8000-000c8fff : Extension ROM
000c9000-000cafff : Extension ROM
000f0000-000fffff : System ROM
00100000-3ffeffff : System RAM
  00100000-0025122b : Kernel code
  0025122c-0034b1e3 : Kernel data
3fff0000-3fffefff : ACPI Tables
3ffff000-3fffffff : ACPI Non-volatile Storage
f3fff000-f3ffffff : ServerWorks CNB20HE Host Bridge
f4000000-f5ffffff : ServerWorks CNB20HE Host Bridge
f6200000-fe2fffff : PCI Bus #01
  fa000000-fbffffff : Number 9 Computer Company Revolution 4
  fc000000-fdffffff : Number 9 Computer Company Revolution 4
fe500000-fe5fffff : PCI Bus #01
  fe5e0000-fe5effff : Number 9 Computer Company Revolution 4
  fe5ff000-fe5fffff : Number 9 Computer Company Revolution 4
fe900000-fe9fffff : Intel Corp. 82557/8/9 [Ethernet Pro 100]
  fe900000-fe9fffff : e100
feaed000-feaedfff : Intel Corp. 82557/8/9 [Ethernet Pro 100]
  feaed000-feaedfff : e100
feaee000-feaeefff : ServerWorks OSB4/CSB5 OHCI USB Controller
  feaee000-feaeefff : usb-ohci
feaef000-feaeffff : Adaptec AHA-7850
  feaef000-feaeffff : aic7xxx
feafe000-feafefff : Adaptec AIC-7899P U160/m
  feafe000-feafefff : aic7xxx
feaff000-feafffff : Adaptec AIC-7899P U160/m (#2)
  feaff000-feafffff : aic7xxx
febfc000-febfffff : Promise Technology, Inc. 20268
fec00000-fec01fff : reserved
fee00000-fee00fff : reserved
fff80000-ffffffff : reserved

The mmap() Interfaces

User-level API function:

void *mmap (caddr_t start, size_t len, int prot, int flags, int fd, off_t offset);

Driver-level file operation:

int (*mmap) (struct file *filp, struct vm_area_struct *vma);
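
From the user's side, the call looks the same whether the file descriptor refers to a regular file or a device. The sketch below maps a regular temporary file, since that can be run without any special hardware; mmap_roundtrip is a hypothetical helper name, and the path is chosen by the caller. With a device file, the same mmap() call is what eventually invokes the driver-level mmap method.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write 'text' (non-empty) to a new file at 'path', map the file, and copy
 * the mapped bytes into 'out' (which must hold strlen(text) + 1 bytes).
 * Returns 0 on success, -1 on failure. The file is removed afterwards. */
static int mmap_roundtrip(const char *path, const char *text, char *out)
{
    size_t len = strlen(text);
    char *p;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        unlink(path);
        return -1;
    }
    /* Ask the kernel to create a VMA backed by the file. */
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                /* the mapping remains valid after close() */
    unlink(path);
    if (p == MAP_FAILED)
        return -1;
    memcpy(out, p, len);      /* ordinary loads; no read() system call */
    out[len] = '\0';
    munmap(p, len);
    return 0;
}
```

The first access to the mapped page faults, and the kernel resolves the fault through the VMA that mmap() set up, which is the mechanism the rest of this chapter looks at from the driver's side.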

Implementing the mmap() Method in a Driver

  1. build suitable page tables for the address range
    two ways:
    1. right away, using remap_pfn_range or vm_insert_page
    2. later (on demand), using the fault() VMA method
  2. replace vma->vm_ops with a new set of operations, if necessary

The remap_pfn_range() Kernel Function

int remap_pfn_range (struct vm_area_struct *vma, unsigned long from,
        unsigned long pfn, unsigned long size, pgprot_t prot);
int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
        unsigned long pfn, unsigned long size, pgprot_t prot);

As usual, these return 0 for success, and a negative error code for failure.

io_remap_pfn_range() is for the case when pfn refers to I/O memory.

It may be necessary to do something machine-dependent in the prot to disable caching of specific VMAs. This is architecture-dependent. See pgprot_noncached and treatment of i386 video frame buffer memory protection in drivers/video/fbmem.c.

A simple implementation of mmap()

static struct vm_operations_struct simple_remap_vm_ops = {
   .open =  simple_vma_open,
   .close = simple_vma_close,
};

int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
   if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                       vma->vm_end - vma->vm_start,
                       vma->vm_page_prot))
      return -EAGAIN;
   vma->vm_ops = &simple_remap_vm_ops;
   simple_vma_open(vma);  /* does nothing but print out a message */
   return 0;
}


This is from example module simple.c. It does a simple linear 1:1 mapping of physical memory into a user address space. (This is not a serious example. I think you would not ordinarily want to do this. The range of addresses passed by the kernel in vma is a range of unused virtual addresses in the user space; the physical pages mapped there are selected by vma->vm_pgoff, that is, by the offset argument the caller gave to mmap(), so the driver maps whatever physical range the user asks for.)

The explicit call to simple_vma_open() is necessary in the driver here because the system will not call it on the call to mmap().

remap_pfn_range() can only be used for reserved (always resident) pages, and for pages above the top of physical memory. Otherwise, it would not be safe for a device driver to map them to user space. It can be used to remap high PCI buffers and ISA memory.

It cannot be used to remap pages of "conventional addresses, including ones you obtain by calling get_free_page()". If you attempt to map such pages the user process will see a page of zeroes.

To remap RAM you want to use the fault() method, described further below.

Narrowing the Mapped Region

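unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physical = simple_region_start + off;
/* assumes simple_region_start is already page-aligned */
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long psize;
if (off >= simple_region_size) return -EINVAL;  /* offset starts beyond the region */
psize = simple_region_size - off;
if (vsize > psize) return -EINVAL;              /* mapping would run past the end */
/* remap_pfn_range() takes a page frame number, not a physical address */
if (remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize,
                    vma->vm_page_prot)) ...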
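
The bounds check is plain arithmetic, so it can be exercised outside the kernel. The following is a sketch under stated assumptions: check_region is a hypothetical helper, pages are assumed to be 4 KiB, and the region start is assumed to be page-aligned. It returns the first page frame number that would be handed to remap_pfn_range().

```c
#include <errno.h>

#define DEMO_PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Decide whether a request to map 'vsize' bytes, starting 'pgoff' pages into
 * a device region of 'region_size' bytes at 'region_start', fits inside the
 * region. On success, store the first page frame number in *pfn and return 0;
 * otherwise return -EINVAL. */
static int check_region(unsigned long region_start, unsigned long region_size,
                        unsigned long pgoff, unsigned long vsize,
                        unsigned long *pfn)
{
    unsigned long off = pgoff << DEMO_PAGE_SHIFT;

    if (off >= region_size)           /* offset starts beyond the region */
        return -EINVAL;
    if (vsize > region_size - off)    /* mapping would run past the end */
        return -EINVAL;
    *pfn = (region_start + off) >> DEMO_PAGE_SHIFT;
    return 0;
}
```

Checking the offset before subtracting matters because the quantities are unsigned: if the offset exceeded the region size, the subtraction would wrap around to a huge value and the size test would pass by mistake.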

Using fault()

int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);


This interface has changed since the LDD3 book was written. See the lwn.net article on fault handler for explanation.

The fault() method of a VMA is called when a user process attempts to access a page in the VMA that is not present in memory.

Return value is a word of flags that gives details about how the fault was handled:

Flags in

The handler "must locate and return the struct page pointer that refers to the page the user wanted" (with fault(), the pointer is stored in vmf->page rather than returned). It must also "take care to increment the usage count for the page it returns by calling the get_page macro".

Note that there is no need to call put_page() (to decrement the page reference count) in this case, since the system will do that automatically for all pages when it deletes the VMA. It is essential that we increment the count here, to prevent that automatic decrement from prematurely putting the page on the free list.

Example Using fault()

static int simple_vma_fault(struct vm_area_struct *vma,
                struct vm_fault *vmf)
{
        struct page *pageptr;
        unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
        unsigned long address = (unsigned long) vmf->virtual_address;
        unsigned long physaddr = address - vma->vm_start + offset;
        unsigned long pageframe = physaddr >> PAGE_SHIFT;
        if (!pfn_valid(pageframe))
                return VM_FAULT_SIGBUS;
        pageptr = pfn_to_page(pageframe);
        printk (KERN_NOTICE "---- Fault, off %lx pageframe %lx\n", offset, pageframe);
        printk (KERN_NOTICE "page->index = %ld mapping %p\n", pageptr->index, pageptr->mapping);
        get_page(pageptr);
        vmf->page = pageptr;
        return 0;
}


This example is still not good style. The lwn.net article on fault says of the virtual_address field:

... anybody who is tempted to use that field should be prepared to justify that use to a crowd of skeptical kernel developers. Most handlers should not care where the page lives in user space, and use of virtual_address will make it impossible to support nonlinear VMAs. So, if at all possible, virtual_address should be ignored. If your code only uses pgoff, it should also set the VM_CAN_NONLINEAR flag in the VMA's vm_flags field to let the kernel know that it is playing by the rules.

I don't yet have enough confidence in my understanding of vmf->pgoff to rewrite the example using that. I believe it is the offset of the faulted page, in pages, within the mapped object; that is, it already includes vma->vm_pgoff rather than being relative to the start of the vm area.
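
The relationship between the two approaches can be checked with ordinary arithmetic. The sketch below rests on an assumption worth verifying against your kernel source: that for a linear mapping the kernel computes vmf->pgoff as the page index of the faulting address within the area, plus vma->vm_pgoff. The struct demo_vma type and both functions are hypothetical user-space stand-ins, and 4 KiB pages are assumed.

```c
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_MASK  (~((1UL << DEMO_PAGE_SHIFT) - 1))

/* Stand-in for the two VMA fields the computation needs. */
struct demo_vma {
    unsigned long vm_start;  /* first user virtual address of the area */
    unsigned long vm_pgoff;  /* offset of the area in the mapped object, in pages */
};

/* Page frame as computed in the example above, from the faulting address. */
static unsigned long pfn_from_address(const struct demo_vma *vma,
                                      unsigned long address)
{
    unsigned long offset = vma->vm_pgoff << DEMO_PAGE_SHIFT;
    unsigned long physaddr = address - vma->vm_start + offset;
    return physaddr >> DEMO_PAGE_SHIFT;
}

/* Page offset as a linear mapping is assumed to report it in vmf->pgoff. */
static unsigned long linear_pgoff(const struct demo_vma *vma,
                                  unsigned long address)
{
    return (((address & DEMO_PAGE_MASK) - vma->vm_start) >> DEMO_PAGE_SHIFT)
           + vma->vm_pgoff;
}
```

If the assumption holds, the two functions agree for every address in the area, so the example handler could use vmf->pgoff directly as the page frame number and never look at virtual_address.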

Note that pfn_to_page() assumes there is a corresponding kernel logical address, and so it does not work for high memory. In particular, this is true of PCI memory, which is mapped above the highest system memory. This is the reason for the check pfn_valid().

If the fault() method (formerly nopage()) is left as NULL, the page is mapped to a copy-on-write page that reads as zero. (So, there is no segmentation fault.)

If the fault() method cannot map the page, it returns VM_FAULT_SIGBUS (page out of range) or VM_FAULT_OOM (out of memory). These replace the NOPAGE_SIGBUS and NOPAGE_OOM return values of nopage().

With nopage(), the specific type of fault was returned via the parameter type, if that was non-null. With fault(), it is part of the returned flags word.

Preventing Extension of the Mapping for a Device

int simple_fault(struct vm_area_struct *vma,
    struct vm_fault *vmf)
{
    return VM_FAULT_SIGBUS;
}


The LDD3 book says that between kernel versions 2.4 and 2.6 support was dropped for the remap() method, which allowed a driver to respond explicitly to the mremap() system call (used to extend a memory mapping). Instead, the system would quietly process such calls without notifying the driver.

If a driver does not want the user to be able to extend (using mremap()) a region of memory that has been mapped earlier (using mmap()), the driver can define the fault() method (nopage() in older kernels) so that any access to the extension delivers a SIGBUS to the process. Otherwise, the user will simply see a copy-on-write page of zeros.

It is not clear whether this is still a valid concern, as I have not yet been able to find any instances in the 2.6.25 kernel where this is done.

Warning

The mmap examples in the LDD3 text are not correct. They need to be updated for the transition from nopage() to fault().

I have updated one example, sculld. You can see my full code for sculld by following this link.

A Slightly More Complete Example

static int sculld_vma_fault(struct vm_area_struct *vma,
                              struct vm_fault *vmf)
{
        struct sculld_dev *ptr, *dev = vma->vm_private_data;
        int result = VM_FAULT_SIGBUS;
        struct page *page;
        void * pageptr = NULL;
        pgoff_t pgoff = vmf->pgoff;  

        down(&dev->sem);
        printk (KERN_NOTICE "sculld_vma_fault: pgoff   = %lx\n", pgoff);
        /* dev->size is in bytes, pgoff is in pages */
        if (((unsigned long)pgoff << PAGE_SHIFT) >= dev->size) goto out;

        /*
         * Now retrieve the sculld device from the list, then the page.
         * If the device has holes, the process receives a SIGBUS when
         * accessing the hole.
         */
        for (ptr = dev; ptr && pgoff >= dev->qset;) {
                ptr = ptr->next;
                pgoff -= dev->qset;
        }
        if (ptr && ptr->data) pageptr = ptr->data[pgoff];
        if (!pageptr) goto out; /* hole or end-of-file */
        /* got it, now convert pointer to a struct page and increment the count */
        page = virt_to_page(pageptr);
        get_page(page);
        vmf->page = page;
        result = 0;
  out:
        up(&dev->sem);
        return result;
}
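
The list walk in the middle of the handler is easy to get wrong, so here is a user-space sketch of just that part. struct demo_dev is a hypothetical, simplified stand-in for struct sculld_dev (only the data, next, and qset fields are modeled), and demo_lookup mirrors the loop above without the locking or the struct page conversion.

```c
#include <stddef.h>

/* Each list node holds up to 'qset' page pointers in 'data';
 * a NULL slot or a NULL array represents a hole. */
struct demo_dev {
    void **data;
    struct demo_dev *next;
    int qset;
};

/* Return the buffer backing page number 'pgoff' of the device, or NULL
 * if that page lies in a hole or past the end of the list. */
static void *demo_lookup(struct demo_dev *dev, unsigned long pgoff)
{
    struct demo_dev *ptr;

    /* Each node covers dev->qset pages, so skip one node per qset pages. */
    for (ptr = dev; ptr && pgoff >= (unsigned long)dev->qset; ) {
        ptr = ptr->next;
        pgoff -= dev->qset;
    }
    if (ptr && ptr->data)
        return ptr->data[pgoff];
    return NULL;
}
```

With two nodes of two pages each, page 2 resolves to the first slot of the second node, and any page number past the end of the list resolves to NULL, which the real handler turns into VM_FAULT_SIGBUS.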

A Slightly More Complete Example (continued)

struct vm_operations_struct sculld_vm_ops = {
        .open =     sculld_vma_open,
        .close =    sculld_vma_close,
        .fault =    sculld_vma_fault,
};
int sculld_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct inode *inode = filp->f_dentry->d_inode;

        printk (KERN_NOTICE "sculld: mmap starting\n");
        /* refuse to map if order is not 0 */
        if (sculld_devices[iminor(inode)].order)
                return -ENODEV;

        /* don't do anything here: "fault" will set up page table entries */
        vma->vm_ops = &sculld_vm_ops;
        vma->vm_flags |= VM_RESERVED;
        vma->vm_private_data = filp->private_data;
        sculld_vma_open(vma);
        printk (KERN_NOTICE "sculld: mmap done\n");
        return 0;
}

The above is only for mapping kernel logical addresses. For other kernel virtual addresses, you need to use vmalloc_to_page() instead of virt_to_page().


I still have not updated scullp, scullv, etc.

The example in the bttv driver (below) is actually more realistic, since it makes nontrivial use of reference counts in the vm_close() method.

A Special Case: Remapping I/O Memory


Recall that we used ioremap_nocache() to map the PCI-card's I/O memory into the kernel memory space. Exporting it to user space via mmap() would be an instance of this special case.

Recurrence of Patterns

Even though Microsoft's Windows operating systems have a different API for device drivers, the same underlying operations, such as memory mapping, still need to be done. If you read device driver code for Windows, you will therefore see some of the same programming patterns, and you can probably make some sense of the following fragment of code from a Windows NT driver, which performs a function similar to a Linux fault method.

Memory mapping in Windows NT(TM)

static PUCHAR getMappedAddress(
   IN unsigned baseAddr, // User-mode address to convert
   IN INTERFACE_TYPE interfaceType, // PCI, ISA
   IN unsigned busNum,
   IN unsigned bytesNeeded, // Bytes needed at baseAddr
   OUT int *pRetCode) // Extended error info
{
#define MEM_SPACE 0 // 0 => memory space, 1 => I/O space
   PHYSICAL_ADDRESS translatedAddress, physicalAddress; // physical address to map 
   PUCHAR mappedBaseAddr = NULL;
   ULONG memType = MEM_SPACE; // Resource is a memory address, not a port
   BOOLEAN nRc;
   // Reformat base address for function we need to use
   physicalAddress.HighPart = 0;
   physicalAddress.LowPart = baseAddr;
   nRc = HalTranslateBusAddress(
      interfaceType, 
      busNum,
      physicalAddress,
      &memType,
      &translatedAddress);
   if(nRc == FALSE) { ...error recovery...
      }
   else {
      // Assume memType = MEM_SPACE
      mappedBaseAddr = MmMapIoSpace(translatedAddress, bytesNeeded, FALSE);
      if(mappedBaseAddr == NULL) { ...error recovery...
         }
      }
   return(mappedBaseAddr);
}


Whether the OS be Linux or MS Windows, the work that needs to be performed is still similar, and the internal mechanisms are likely to be similar. The code above is from a Windows NT driver for the PixelSmart video frame grabber.

This use of memory mapping is for the internal use of the driver, to map device memory into kernel memory. A driver may also want to map device memory into user space.

Observe that this seems to be closer to the remap_pfn_range() technique, and it seems to be mapping the actual device memory.

The HalTranslateBusAddress() function "translates a bus-relative physical address into the corresponding system physical address", and the MmMapIoSpace() function "maps a given physical address to nonpaged system space".

Performing Direct I/O


It is generally a good thing to avoid recopying data. The above is one way of doing this, i.e., by putting the data from the device directly into a user-space buffer. Another way of doing this is to put it into a kernel buffer and then map the kernel buffer into user space. Linux supports both of these methods for video stream input.

Examples from V4L and the bttv driver

The V4L video driver model supports two models of memory-mapped streaming I/O:

How to Use V4L2_MEMORY_MMAP Buffers

  1. request number and type of buffers desired, using ioctl VIDIOC_REQBUFS
  2. get the address of each buffer, using ioctl VIDIOC_QUERYBUF
  3. map the buffer into user space, using a call to mmap()
  4. queue all the buffers for input, using ioctl VIDIOC_QBUF
  5. start capturing data using the ioctl VIDIOC_STREAMON
  6. repeat for each frame:
    • wait for a buffer to dequeue, using ioctl VIDIOC_DQBUF, or use select() or poll()
    • process the data in the buffer
    • return the buffer to the driver, using ioctl VIDIOC_QBUF
  7. stop capturing data using the ioctl VIDIOC_STREAMOFF


The application xawtv uses the memory-mapped stream I/O mode to continuously capture video frames, in function v4l2_start_streaming.

How to Use V4L2_MEMORY_USERPTR Buffers

  1. tell the driver to expect user buffers, using ioctl VIDIOC_REQBUFS
  2. allocate some buffers
  3. pass the driver the addresses of the buffers, using calls to ioctl VIDIOC_QBUF
  4. start capturing data using the ioctl VIDIOC_STREAMON
  5. repeat for each frame:
    • wait for a buffer to dequeue, using ioctl VIDIOC_DQBUF, or use select() or poll()
    • process the data in the buffer
    • return the buffer to the driver, using ioctl VIDIOC_QBUF
  6. stop capturing data using the ioctl VIDIOC_STREAMOFF

Implementation of V4L2_MEMORY_MMAP Buffers

How to Use V4L2_MEMORY_MMAP Buffers


Most of the buffer manipulation for the bttv driver is implemented in a device-independent way, that is intended to be shared by other V4L2 drivers. That code is in files with generic-looking names, like video-buf.c.

Note that the PixelSmart (HRT) device does not support DMA, so to support the V4L2 streaming API the driver (instead of the hardware) would need to do the copying from device memory to RAM. If one were to try to re-use the generic V4L2 code to do this, the parts that refer to DMA might be a problem. However, it might be possible to simulate DMA in the driver, having the copy-work done by timer/IRQ handlers.

Use of Memory Mapping for Direct I/O with DMA

videobuf_iolock() does other interesting work. Notice the handling of the case where no user buffer is provided, and so I/O requires a "bounce" buffer.

The function videobuf_iolock() is called from vbi_buffer_prepare, which is one of the methods of videobuf_queue_ops.

Asynchronous I/O

Because of limited time, and because the asynchronous I/O kernel interfaces have changed since the LDD3 text was written, we will skip over driver support for asynchronous I/O. If time permits, we may come back to this at the end of the course.

Synchronous ("pulled") DMA Input Steps

  1. process calls read: driver allocates a DMA buffer, instructs the hardware to transfer the data, blocks the process
  2. hardware writes data to the DMA buffer, raises an interrupt when transfer is complete
  3. interrupt handler acknowledges the interrupt, and awakens the process
  4. the process reads the data from the buffer

e.g., for a disk read operation

Asynchronous ("pushed") DMA Input Steps

  1. hardware raises an interrupt to announce arrival of new data
  2. interrupt handler allocates buffer, tells hardware where to transfer data
  3. device writes data to the buffer, raises another interrupt when done
  4. interrupt handler dispatches new data, wakes up any relevant process, does housekeeping

e.g., for arrival of a packet on a network interface

For network devices, the process above may be simplified. The driver keeps a ring of buffers available, into which the device writes the data as it arrives. Interrupts are only needed to announce the arrival of new packets, or if the ring of buffers becomes empty or nearly empty.
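
The ring-of-buffers arrangement can be modeled in ordinary user-space C. The sketch below is only a simulation of the ownership hand-off between device and driver, not code from any real network driver. All names (rx_ring, ring_device_write, ring_driver_read) and the sizes are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4    /* hypothetical number of receive buffers */
#define BUF_SIZE  64   /* hypothetical size of each buffer in bytes */

struct rx_slot {
    unsigned char data[BUF_SIZE];
    size_t len;
    int owned_by_device;   /* 1: free for the device; 0: holds a packet */
};

struct rx_ring {
    struct rx_slot slot[RING_SIZE];
    unsigned int dev_next;   /* next slot the device will fill */
    unsigned int drv_next;   /* next slot the driver will consume */
};

/* Driver side: make every buffer available to the device. */
void ring_init(struct rx_ring *r)
{
    memset(r, 0, sizeof *r);
    for (unsigned int i = 0; i < RING_SIZE; i++)
        r->slot[i].owned_by_device = 1;
}

/*
 * Device side: store an arriving packet in the next free buffer.
 * Returns 0, or -1 if the packet is dropped (ring full or packet too long).
 */
int ring_device_write(struct rx_ring *r, const void *pkt, size_t len)
{
    struct rx_slot *s = &r->slot[r->dev_next];

    if (!s->owned_by_device || len > BUF_SIZE)
        return -1;
    memcpy(s->data, pkt, len);
    s->len = len;
    s->owned_by_device = 0;              /* hand the slot to the driver */
    r->dev_next = (r->dev_next + 1) % RING_SIZE;
    return 0;
}

/*
 * Driver side: copy out one pending packet and return its buffer to
 * the device. Returns the number of bytes copied, or -1 if none pending.
 */
long ring_driver_read(struct rx_ring *r, void *out, size_t outlen)
{
    struct rx_slot *s = &r->slot[r->drv_next];
    size_t len;

    if (s->owned_by_device)
        return -1;
    len = s->len < outlen ? s->len : outlen;
    memcpy(out, s->data, len);
    s->owned_by_device = 1;              /* buffer is free again */
    r->drv_next = (r->drv_next + 1) % RING_SIZE;
    return (long)len;
}
```

In this model a failed ring_device_write() corresponds to the device running out of free buffers, which is the condition a real device would report to the driver with an interrupt.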

Allocation of DMA Buffers

The reason getting large blocks of contiguous DMA memory is hard is that the physical memory becomes fragmented over time, as various blocks are allocated and deallocated.

Bus Addresses

Note that virt_to_bus() and bus_to_virt() are not sufficient in situations where an IOMMU must be programmed, or bounce buffers must be used. The generic DMA layer includes architecture-specific code to handle these ugly details.
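
One case where a bounce buffer is needed is a device that cannot drive all of the address bits, such as an ISA device limited to 24-bit addresses. The user-space sketch below models only that range check. It is not kernel code: ADDR_MASK, device_can_reach, and needs_bounce_buffer are hypothetical names, and the policy shown is a simplifying assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Low n bits set (n in 1..64), the same shape as a DMA address mask. */
#define ADDR_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Returns 1 if the whole range [bus_addr, bus_addr + len) lies at or
 * below `mask`, i.e. the device can address every byte of the buffer.
 */
int device_can_reach(uint64_t bus_addr, size_t len, uint64_t mask)
{
    if (bus_addr > mask)
        return 0;
    /* Compare without computing bus_addr + len, which could overflow. */
    return len == 0 || (uint64_t)(len - 1) <= mask - bus_addr;
}

/*
 * Hypothetical policy: the data must be copied through a bounce buffer
 * whenever the device cannot reach the caller's buffer directly.
 */
int needs_bounce_buffer(uint64_t bus_addr, size_t len, unsigned int addr_bits)
{
    return !device_can_reach(bus_addr, len, ADDR_MASK(addr_bits));
}
```

For example, a one-page buffer that ends exactly at the 16 MB boundary is reachable by a 24-bit device, while one that starts at 16 MB is not and would have to be bounced.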

Jobs of a DMA Device Driver

Generic DMA Layer

This area seems to have evolved a lot since the LDD3 book was last revised. For an explanation, see the improved Dynamic DMA Mapping Guide, which is newer than the 2.6.31 kernel documentation on this course's own LXR site.

It seems that the PCI-specific versions of the DMA support functions are being phased out, but they are still in use within the OS.

Specifying Address Range Addressable by HW

DMA Mappings

Designing a driver to use streaming DMA mappings is preferred, since they are more portable and can be implemented more efficiently.

Calls to Set Up Coherent DMA Mappings

See an example of the use of pci_alloc_consistent(), the PCI version of the above function, in the bttv driver.

Other examples are in the e1000 network driver, which calls it in the following functions:

There is not a lot of internal explanation of the BTTV driver, especially regarding the DMA I/O model. I found the following on the web:

The "risc" here is a simple language that is used to tell the BT848 how to do DMA.

You may also want to look at the definition of pci_alloc_consistent in cross-referenced source code. It will require quite a lot of drilling down through calls, and includes architecture-specific code.

Obsolete Material

The LDD3 text treatment of the following topics appears to be obsolete. Due to lack of time, I have not been able to provide updated notes, so I have left out these topics entirely, for now.

Instead, I am relying on the code in actual drivers for examples.

Example: DMA in BTTV Driver (on PCI Bus)

DMA for ISA Devices

Registering DMA Usage (only for ISA device)

int dad_open (struct inode *inode, struct file *filp)
{  struct dad_device *my_device;
   int error;
   /* ... */
   /* claim the interrupt line first */
   if ( (error = request_irq(my_device->irq, dad_interrupt,
                             SA_INTERRUPT, "dad", NULL)) )
      return error; /* or implement blocking open */
   /* then claim the DMA channel; release the IRQ on failure */
   if ( (error = request_dma(my_device->dma, "dad")) ) {
      free_irq(my_device->irq, NULL);
      return error; /* or implement blocking open */
   }
   /* ... */ return 0;
}
void dad_close (struct inode *inode, struct file *filp)
{  struct dad_device *my_device;
   /* ... */
   /* release in the reverse order of acquisition */
   free_dma(my_device->dma);
   free_irq(my_device->irq, NULL);
   /* ... */
}

From LDD3 - may be obsolete.

The /proc/dma file on a system with a DMA sound card installed:

1: Sound Blaster8
4: cascade

The last entry is a placeholder: on a system with two DMA controllers, channel 4 is used internally to cascade one controller into the other, so it is not available to peripherals.

Talking to DMA Controller (only for ISA device)

Low-level DMA Operations

A device driver should not need to use these if it uses the generic DMA layer.

Example Code

int dad_dma_prepare(int channel, int mode, unsigned int buf,
    unsigned int count)
{  unsigned long flags;
   flags = claim_dma_lock();     /* serialize access to the DMA controller */
   disable_dma(channel);         /* stop the channel while reprogramming it */
   clear_dma_ff(channel);        /* reset the flip-flop before 16-bit register writes */
   set_dma_mode(channel, mode);
   set_dma_addr(channel, virt_to_bus (buf));
   set_dma_count(channel, count);
   enable_dma(channel);
   release_dma_lock(flags);
   return 0;
}
int dad_dma_isdone(int channel)
{ int residue;
  unsigned long flags = claim_dma_lock();
  residue = get_dma_residue(channel);   /* bytes still to be transferred */
  release_dma_lock(flags);
  return (residue == 0);
}

From LDD3 - may be obsolete.