This file uses the W3C HTML Slidy format. The "a" key toggles between one-slide-at-a-time and single-page mode, and the "c" key toggles on and off the table of contents. The ← and → keys can be used to page forward and backward. For more help on controls see the "help?" link at the bottom.
AGP and PCI Express graphics cards use a Graphics Address Remapping Table (GART), which is one example of an IOMMU. See the Wiki article on IOMMU for more detail on memory mapping with I/O devices.
The mapping is sparse. Some pages are unmapped.

Why don't we map all pages?
Pages may be mapped to locations on devices.

And some pages may be mapped to both.

How does having multiple levels save memory?
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
Kernel 2.6 provides functions that allow drivers to ignore page table internals, so these details can change without affecting drivers.
Don't confuse page table entries with struct page.
Linux has page tables for internal use, even if the hardware doesn't require them. For example, apparently, the PowerPC hardware does not access the page tables at all. It uses some memory and a hashing scheme to cache recently used TLB entries, but if that misses, the software handler needs to explicitly load the TLB by doing the actual page

The reason the Linux kernel cannot use the whole virtual address range for kernel logical addresses on a 32-bit machine is a design choice, to split the range of virtual addresses 3:1 between user virtual addresses and kernel virtual addresses.
The reason for this is to avoid changing page table entries when a process traps into the kernel, and to allow the kernel to copy data to and from user space efficiently (without the overhead of manipulating page tables). That is, the process already has page table entries to map all of the kernel memory. These pages are protected while in user mode, but while executing in kernel mode they can all be accessed. The other 3 GB of virtual address space are available to the user-mode process for its own (swappable) code, data, etc. In effect, pages of real memory may have up to two distinct active page table entries that refer to them, for (1) a kernel memory page and (2) a user memory page.
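As a rough illustration of the 3:1 split, the sketch below models how a kernel logical address relates to a physical address on a 32-bit machine. It assumes PAGE_OFFSET is 0xC0000000. The names SIM_PAGE_OFFSET, sim_pa(), sim_va() and sim_is_user_address() are made up for this example; they only mimic the arithmetic behind the kernel's __pa() and __va() macros and are ordinary user-space functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for a 3:1 user/kernel split on a 32-bit machine. */
#define SIM_PAGE_OFFSET 0xC0000000u

/* Kernel logical address -> physical address (a simple subtraction). */
static uint32_t sim_pa(uint32_t kernel_logical)
{
    assert(kernel_logical >= SIM_PAGE_OFFSET);
    return kernel_logical - SIM_PAGE_OFFSET;
}

/* Physical address -> kernel logical address (a simple addition).
 * At most 1 GB of physical memory fits in the kernel's quarter. */
static uint32_t sim_va(uint32_t physical)
{
    assert(physical < 0x40000000u);  /* 1 GB */
    return physical + SIM_PAGE_OFFSET;
}

/* Nonzero if the address lies in the user (lower 3 GB) part. */
static int sim_is_user_address(uint32_t vaddr)
{
    return vaddr < SIM_PAGE_OFFSET;
}
```

For instance, under these assumptions physical address 0x00100000 corresponds to kernel logical address 0xC0100000, while 0x08048000 is a user address.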

Examples of PAGE_OFFSET values:
See Wiki article on PAE for more detail.
Applies only to machines with larger physical than virtual address space.

Do not confuse Linux low memory here with DOS/Windows 0-640K "low memory" range.
It seems that some Linux writers also use the term "low memory" for kernel memory, which can be confusing.
The following output of /proc/meminfo on a Pentium III machine with 1 GB of RAM shows the split between high and low memory as follows:
MemTotal: 1030888 kB
HighTotal: 131008 kB
LowTotal: 899880 kB
In a machine whose virtual addresses do not permit addressing all of physical memory (e.g., a 32-bit machine with PAE), part of the kernel memory is not mapped in the above fashion, so that we can access "high memory".


What are the advantages of larger page size? Disadvantages?
How does page size relate to the size of physical memory supported, and the amount of space occupied by page tables?
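One way to start on the questions above is with a little arithmetic. The sketch below assumes a single flat (one-level) page table; the function names and parameters are mine, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries a flat table needs to cover 2^va_bits bytes
 * of virtual address space with pages of 2^page_shift bytes. */
static uint64_t flat_table_entries(unsigned va_bits, unsigned page_shift)
{
    assert(page_shift <= va_bits && va_bits < 64);
    return (uint64_t)1 << (va_bits - page_shift);
}

/* Bytes occupied by that table, given the size of one entry. */
static uint64_t flat_table_bytes(unsigned va_bits, unsigned page_shift,
                                 unsigned entry_bytes)
{
    return flat_table_entries(va_bits, page_shift) * entry_bytes;
}
```

With a 32-bit address space and 4-byte entries, 4 KB pages (page_shift 12) need 2^20 entries, or 4 MB of table, while 4 MB pages (page_shift 22) need only 1024 entries, or 4 KB.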
Describes a page of physical memory.
Do not confuse the struct page with a page table entry. A struct page entry corresponds to a page of real memory, and seems to correspond to what is generically called a "page frame table" entry in OS textbooks.
Device drivers should not need to use the above functions, because of the generic memory mapping services described below.
Kernel 2.6 provides functions that allow drivers to ignore page table internals, so the above details can now change without affecting drivers. However, the following is important to drivers.

From http://duartes.org/gustavo/blog/post/2009/02. I recommend reading the entire article.
Please read Gustavo Duarte's blog post Page Cache, the Affair Between Memory and Files. It contains two diagrams, which I have reproduced here for reference in class while explaining the way Linux manages virtual-to-physical address mappings.
The article goes into more detail and includes links to Linux kernel code on the LXR site. It also explains how memory-mapped I/O works and how it is integrated into the paged virtual memory system.
Output of /proc/.../maps for an emacs editing session:
Columns: vm_start-vm_end, vm_page_prot, vm_pgoff, major:minor, inode, image

baker@websrv: cat /proc/24692/maps
08048000-08195000 r-xp 00000000 09:02 294678 /usr/bin/emacs
08195000-0842f000 rw-p 0014c000 09:02 294678 /usr/bin/emacs
0842f000-08827000 rwxp 00000000 00:00 0        zero-mapped BSS for emacs
40000000-40012000 r-xp 00000000 09:02 211751 /lib/ld-2.2.93.so
40012000-40013000 rw-p 00012000 09:02 211751 /lib/ld-2.2.93.so
40013000-40052000 r-xp 00000000 09:02 896219 /usr/X11R6/lib/libXaw3d.so.7.0
40052000-40058000 rw-p 0003e000 09:02 896219 /usr/X11R6/lib/libXaw3d.so.7.0
40058000-4006b000 rw-p 00000000 00:00 0        BSS for libXaw3d
4006b000-40080000 r-xp 00000000 09:02 896170 /usr/X11R6/lib/libXmu.so.6.2
40080000-40081000 rw-p 00015000 09:02 896170 /usr/X11R6/lib/libXmu.so.6.2
40081000-400cf000 r-xp 00000000 09:02 896182 /usr/X11R6/lib/libXt.so.6.0
400cf000-400d3000 rw-p 0004d000 09:02 896182 /usr/X11R6/lib/libXt.so.6.0
400d3000-400db000 r-xp 00000000 09:02 896152 /usr/X11R6/lib/libSM.so.6.0
400db000-400dc000 rw-p 00007000 09:02 896152 /usr/X11R6/lib/libSM.so.6.0
400dc000-400f0000 r-xp 00000000 09:02 896148 /usr/X11R6/lib/libICE.so.6.3
400f0000-400f1000 rw-p 00013000 09:02 896148 /usr/X11R6/lib/libICE.so.6.3
400f1000-400f3000 rw-p 00000000 00:00 0
400f3000-40100000 r-xp 00000000 09:02 896162 /usr/X11R6/lib/libXext.so.6.4
40100000-40101000 rw-p 0000c000 09:02 896162 /usr/X11R6/lib/libXext.so.6.4
40102000-40104000 r-xp 00000000 09:02 130365 /usr/X11R6/lib/X11/locale/common/xlcDef.so.2
40104000-40105000 rw-p 00001000 09:02 130365 /usr/X11R6/lib/X11/locale/common/xlcDef.so.2
40106000-40108000 r-xp 00000000 09:02 928554 /usr/lib/gconv/ISO8859-1.so
40108000-40109000 rw-p 00001000 09:02 928554 /usr/lib/gconv/ISO8859-1.so
40109000-40149000 r-xp 00000000 09:02 309641 /usr/lib/libtiff.so.3.5
40149000-4014b000 rw-p 0003f000 09:02 309641 /usr/lib/libtiff.so.3.5
4014b000-4014c000 rw-p 00000000 00:00 0
4014c000-40169000 r-xp 00000000 09:02 309545 /usr/lib/libjpeg.so.62.0.0
40169000-4016a000 rw-p 0001c000 09:02 309545 /usr/lib/libjpeg.so.62.0.0
4016a000-40193000 r-xp 00000000 09:02 309633 /usr/lib/libpng12.so.0.1.2.5
40193000-40194000 rw-p 00028000 09:02 309633 /usr/lib/libpng12.so.0.1.2.5
40194000-401a0000 r-xp 00000000 09:02 309632 /usr/lib/libz.so.1.1.4
401a0000-401a2000 rw-p 0000b000 09:02 309632 /usr/lib/libz.so.1.1.4
401a2000-401c3000 r-xp 00000000 09:02 1042441 /lib/i686/libm-2.2.93.so
401c3000-401c4000 rw-p 00021000 09:02 1042441 /lib/i686/libm-2.2.93.so
401c4000-401cb000 r-xp 00000000 09:02 309702 /usr/lib/libungif.so.4.1.0
401cb000-401cc000 rw-p 00007000 09:02 309702 /usr/lib/libungif.so.4.1.0
401cc000-401da000 r-xp 00000000 09:02 896176 /usr/X11R6/lib/libXpm.so.4.11
401da000-401db000 rw-p 0000d000 09:02 896176 /usr/X11R6/lib/libXpm.so.4.11
401db000-401dc000 rw-p 00000000 00:00 0
401dc000-402b7000 r-xp 00000000 09:02 896154 /usr/X11R6/lib/libX11.so.6.2
402b7000-402ba000 rw-p 000da000 09:02 896154 /usr/X11R6/lib/libX11.so.6.2
402ba000-402f0000 r-xp 00000000 09:02 309587 /usr/lib/libncurses.so.5.2
402f0000-402f9000 rw-p 00035000 09:02 309587 /usr/lib/libncurses.so.5.2
402f9000-402fb000 r-xp 00000000 09:02 211764 /lib/libdl-2.2.93.so
402fb000-402fc000 rw-p 00001000 09:02 211764 /lib/libdl-2.2.93.so
402fc000-402fd000 rw-p 00000000 00:00 0
402fd000-404bc000 r--p 00000000 09:02 390925 /usr/lib/locale/locale-archive
404bc000-404c5000 r-xp 00000000 09:02 211784 /lib/libnss_files-2.2.93.so
404c5000-404c6000 rw-p 00008000 09:02 211784 /lib/libnss_files-2.2.93.so
404c6000-404e2000 r-xp 00000000 09:02 130364 /usr/X11R6/lib/X11/locale/common/ximcp.so.2
404e2000-404e4000 rw-p 0001b000 09:02 130364 /usr/X11R6/lib/X11/locale/common/ximcp.so.2
404e4000-404ea000 r--s 00000000 09:02 928609 /usr/lib/gconv/gconv-modules.cache
404ea000-404f3000 r-xp 00000000 09:02 130369 /usr/X11R6/lib/X11/locale/common/xomGeneric.so.2
404f3000-404f4000 rw-p 00008000 09:02 130369 /usr/X11R6/lib/X11/locale/common/xomGeneric.so.2
42000000-42126000 r-xp 00000000 09:02 1042439 /lib/i686/libc-2.2.93.so
42126000-4212b000 rw-p 00126000 09:02 1042439 /lib/i686/libc-2.2.93.so
4212b000-4212f000 rw-p 00000000 00:00 0
stack segment
bffcb000-c0000000 rwxp fffcc000 00:00 0
vsyscall
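To help pick the listing above apart, here is a minimal user-space sketch that splits one line of /proc/.../maps into its fields with sscanf(). The type struct maps_entry and the function parse_maps_line() are hypothetical helpers written for this note, not part of any kernel or libc API. Note that the third column printed by the kernel is an offset in bytes and the device numbers are in hexadecimal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps (hypothetical helper type). */
struct maps_entry {
    unsigned long start, end;   /* vm_start, vm_end */
    char perms[5];              /* e.g. "r-xp" */
    unsigned long offset;       /* offset into the mapped object */
    unsigned int major, minor;  /* device holding the file */
    unsigned long inode;        /* 0 for anonymous mappings */
    char path[256];             /* empty for anonymous mappings */
};

/* Returns 0 on success, -1 if the line does not look like a maps line. */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
    int n;

    assert(line != NULL && e != NULL);
    e->path[0] = '\0';
    n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
               &e->start, &e->end, e->perms, &e->offset,
               &e->major, &e->minor, &e->inode, e->path);
    /* The path is optional, so seven converted fields are enough. */
    return (n >= 7) ? 0 : -1;
}
```

Anonymous regions (BSS, heap, stack) show up with device 00:00, inode 0, and an empty path.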
Can you make sense of all the VMAs above?
vsyscall is a page of kernel functions mapped into user space. These are functions which do things that may not require supervisor privilege (like getting the time of day on some systems). Putting them in user space lowers the overhead of calls.
Implementing the mmap() method requires filling in a VMA structure in the address space of the process mapping the device.
The populate() method has been phased out.
The nopage() and nopfn() methods are obsolescent. The new method that replaces both is fault().
A device driver is likely to use memory mapping for two main purposes:
The following I/O device mappings are from /proc/.../maps for an X server process:
08048000-081bc000 r-xp 00000000 09:00 742770 /usr/X11R6/bin/XFree86
081bc000-081ee000 rw-p 00174000 09:00 742770 /usr/X11R6/bin/XFree86
081ee000-08d57000 rwxp 00000000 00:00 0
40000000-40015000 r-xp 00000000 09:00 32269 /lib/ld-2.3.2.so
40015000-40016000 rw-p 00014000 09:00 32269 /lib/ld-2.3.2.so
40016000-40017000 rw-p 00000000 00:00 0
40017000-40027000 rw-s fe5e0000 09:00 96996 /dev/mem
4002d000-40039000 r-xp 00000000 09:00 371074 /usr/lib/libz.so.1.1.4
40039000-4003b000 rw-p 0000b000 09:00 371074 /usr/lib/libz.so.1.1.4
4003b000-4005c000 r-xp 00000000 09:00 212428 /lib/tls/libm-2.3.2.so
4005c000-4005d000 rw-p 00020000 09:00 212428 /lib/tls/libm-2.3.2.so
4005d000-40064000 r-xp 00000000 09:00 32396 /lib/libpam.so.0.75
40064000-40065000 rw-p 00007000 09:00 32396 /lib/libpam.so.0.75
40065000-40066000 rw-p 00000000 00:00 0
40066000-40069000 r-xp 00000000 09:00 33817 /lib/libdl-2.3.2.so
40069000-4006a000 rw-p 00002000 09:00 33817 /lib/libdl-2.3.2.so
4006a000-4006c000 r-xp 00000000 09:00 34152 /lib/libpam_misc.so.0.75
4006c000-4006d000 rw-p 00001000 09:00 34152 /lib/libpam_misc.so.0.75
4006d000-4006e000 rw-p 00000000 00:00 0
4006e000-40079000 r-xp 00000000 09:00 34945 /lib/libnss_files-2.3.2.so
40079000-4007a000 rw-p 0000a000 09:00 34945 /lib/libnss_files-2.3.2.so
4007a000-400cd000 rw-p 00000000 00:00 0
400cd000-400dd000 rw-s 000a0000 09:00 96996 /dev/mem
400dd000-4013d000 rw-s 00000000 00:04 71958528 /SYSV00000000 (deleted)
401fe000-40b91000 rw-p 00121000 00:00 0
42000000-4212f000 r-xp 00000000 09:00 212241 /lib/tls/libc-2.3.2.so
4212f000-42132000 rw-p 0012f000 09:00 212241 /lib/tls/libc-2.3.2.so
42132000-42134000 rw-p 00000000 00:00 0
42134000-44134000 rw-s fc000000 09:00 96996 /dev/mem
bffaf000-c0000000 rwxp fffb0000 00:00 0
Distinguish the above, which are user virtual address mappings implemented by drivers, from the (lower level) mappings of I/O addresses to kernel logical addresses implemented by the I/O system, like the following from /proc/iomem. The entries beginning at 000a0000, fc000000, and fe5e0000 (shown in color on the slide) match the offsets of the three /dev/mem mappings above; they are exported to the X server via (higher level) memory mapping to user space.
00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000c8000-000c8fff : Extension ROM
000c9000-000cafff : Extension ROM
000f0000-000fffff : System ROM
00100000-3ffeffff : System RAM
  00100000-0025122b : Kernel code
  0025122c-0034b1e3 : Kernel data
3fff0000-3fffefff : ACPI Tables
3ffff000-3fffffff : ACPI Non-volatile Storage
f3fff000-f3ffffff : ServerWorks CNB20HE Host Bridge
f4000000-f5ffffff : ServerWorks CNB20HE Host Bridge
f6200000-fe2fffff : PCI Bus #01
  fa000000-fbffffff : Number 9 Computer Company Revolution 4
  fc000000-fdffffff : Number 9 Computer Company Revolution 4
fe500000-fe5fffff : PCI Bus #01
  fe5e0000-fe5effff : Number 9 Computer Company Revolution 4
  fe5ff000-fe5fffff : Number 9 Computer Company Revolution 4
fe900000-fe9fffff : Intel Corp. 82557/8/9 [Ethernet Pro 100]
  fe900000-fe9fffff : e100
feaed000-feaedfff : Intel Corp. 82557/8/9 [Ethernet Pro 100]
  feaed000-feaedfff : e100
feaee000-feaeefff : ServerWorks OSB4/CSB5 OHCI USB Controller
  feaee000-feaeefff : usb-ohci
feaef000-feaeffff : Adaptec AHA-7850
  feaef000-feaeffff : aic7xxx
feafe000-feafefff : Adaptec AIC-7899P U160/m
  feafe000-feafefff : aic7xxx
feaff000-feafffff : Adaptec AIC-7899P U160/m (#2)
  feaff000-feafffff : aic7xxx
febfc000-febfffff : Promise Technology, Inc. 20268
fec00000-fec01fff : reserved
fee00000-fee00fff : reserved
fff80000-ffffffff : reserved
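The correspondence between the two listings can be checked with a little arithmetic: for a mapping of /dev/mem, the offset column of /proc/.../maps is the first physical address covered, and the length is the end address minus the start address. The sketch below does that computation; struct phys_range, devmem_phys_range() and range_within() are names I made up for the illustration.

```c
#include <assert.h>

/* Physical range covered by a mapping of /dev/mem (hypothetical type). */
struct phys_range {
    unsigned long first;  /* first physical address covered */
    unsigned long last;   /* last physical address covered (inclusive) */
};

static struct phys_range devmem_phys_range(unsigned long vm_start,
                                           unsigned long vm_end,
                                           unsigned long file_offset)
{
    struct phys_range r;

    assert(vm_end > vm_start);
    r.first = file_offset;
    r.last = file_offset + (vm_end - vm_start) - 1;
    return r;
}

/* Nonzero if range r lies entirely inside [res_first, res_last]. */
static int range_within(struct phys_range r,
                        unsigned long res_first, unsigned long res_last)
{
    return r.first >= res_first && r.last <= res_last;
}
```

Applied to the X server mapping 42134000-44134000 with offset fc000000, this gives the physical range fc000000-fdffffff, which is exactly one of the Revolution 4 entries in /proc/iomem.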
User-level API function:
void *mmap (caddr_t start, size_t len, int prot, int flags, int fd, off_t offset);
Driver-level file operation:
int (*mmap) (struct file *filp, struct vm_area_struct *vma);
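From the user side, the mmap() call listed above can be exercised on an ordinary file. The following is a minimal sketch; read_via_mmap() is a helper name invented for this note, and the choice of PROT_READ with MAP_PRIVATE is just one reasonable combination. When the file descriptor refers to a device, it is the driver's mmap file operation that ends up servicing the same call.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map the first len bytes of the file at path read-only and copy them
 * into out. Returns 0 on success, -1 on any failure.
 */
static int read_via_mmap(const char *path, char *out, size_t len)
{
    int fd;
    void *p;

    assert(path != NULL && out != NULL && len > 0);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* start = NULL lets the kernel choose the user virtual address. */
    p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping survives the close */
    if (p == MAP_FAILED)
        return -1;
    memcpy(out, p, len);       /* touching the pages brings the data in */
    munmap(p, len);
    return 0;
}
```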
int remap_pfn_range (struct vm_area_struct *vma, unsigned long from, unsigned long pfn, unsigned long size, pgprot_t prot);
int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long from, unsigned long phys_addr, unsigned long size, pgprot_t prot);
As usual, these return 0 for success, and negative error code for failure.
io_remap_pfn_range() is for the case when phys_addr refers to I/O memory.
It may be necessary to do something machine-dependent in the prot to disable caching of specific VMAs. This is architecture-dependent. See pgprot_noncached and treatment of i386 video frame buffer memory protection in drivers/video/fbmem.c.
static struct vm_operations_struct simple_remap_vm_ops = {
    .open = simple_vma_open,
    .close = simple_vma_close,
};

int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                        vma->vm_end - vma->vm_start,
                        vma->vm_page_prot))
        return -EAGAIN;

    vma->vm_ops = &simple_remap_vm_ops;
    simple_vma_open(vma); /* does nothing but print out a message */
    return 0;
}
This is from example module simple.c. It does a simple linear 1:1 mapping of physical memory into a user address space. (This is not a serious example. I think you would not ordinarily want to do this. The range of addresses passed by the kernel in vma is a range of unused virtual addresses in the user space. The odds of this range corresponding to useful physical memory appear to be slim.)
The explicit call to simple_vma_open() is necessary in the driver here because the system will not call it on the call to mmap().
remap_pfn_range() can only be used for reserved (always resident) pages, and for pages above the top of physical memory. Otherwise, it would not be safe for a device driver to map them to user space. It can be used to remap high PCI buffers and ISA memory.
It cannot be used to remap pages of "conventional addresses, including ones you obtain by calling get_free_page()". If you attempt to map such pages the user process will see a page of zeroes.
To remap RAM you want to use the fault() method, described further below.
unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physical = simple_region_start + off; /* assumes simple_region_start is already page-aligned */
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long psize = simple_region_size - off;

if (vsize > psize)
    return -EINVAL;
/* remap_pfn_range() expects a page frame number, not a byte address */
if (remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize, vma->vm_page_prot))
    ...
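The range check in the fragment above can be tried out in user space. The sketch below is a model, not driver code: SIM_PAGE_SHIFT and sim_region_pfn() are invented names, and 4 KB pages are assumed. It also rejects a page offset that lies beyond the region; that extra test is my addition, since with unsigned arithmetic the subtraction of the offset from the region size would otherwise wrap around and let the mapping through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SIM_PAGE_SHIFT 12   /* assumes 4 KB pages */

/*
 * Model of the range check for mapping part of a device region.
 * region_start and region_size describe the region in bytes
 * (region_start is assumed page-aligned); pgoff is the page offset the
 * user asked for and vsize the length of the VMA in bytes.
 * On success stores the first page frame number in *pfn and returns 0.
 */
static int sim_region_pfn(unsigned long region_start,
                          unsigned long region_size,
                          unsigned long pgoff, unsigned long vsize,
                          unsigned long *pfn)
{
    unsigned long off;

    assert(pfn != NULL);
    /* Extra check: the offset itself must fall inside the region,
     * otherwise region_size - off below would wrap around. */
    if (pgoff > (region_size >> SIM_PAGE_SHIFT))
        return -EINVAL;
    off = pgoff << SIM_PAGE_SHIFT;
    if (vsize > region_size - off)
        return -EINVAL;     /* mapping would run past the region */
    *pfn = (region_start + off) >> SIM_PAGE_SHIFT;
    return 0;
}
```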
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
This interface has changed since the LDD3 book was written. See the lwn.net article on fault handler for explanation.
The fault() method of a VMA is called when a user process attempts to access a page in the VMA that is not present in memory.
Return value is a word of flags that gives details about how the fault was handled:
Flags in
It "must locate and return the struct page pointer that refers to the page the user wanted". It must also "take care to increment the usage count for the page it returns by calling the get_page macro".
Note that there is no need to call put_page() (to decrement the page reference count) in this case, since the system will do that automatically for all pages when it deletes the VMA. It is essential that we increment the count here, to prevent that automatic decrement from prematurely putting the page on the free list.
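The reference-counting argument above can be made concrete with a toy model. Everything here is a user-space stand-in: struct sim_page, sim_get_page() and sim_put_page() only imitate the bookkeeping of struct page, get_page() and put_page(), and the "freed" flag stands for the page going back on the free list.

```c
#include <assert.h>

/* Toy stand-in for struct page: just a usage count and a "freed" flag. */
struct sim_page {
    int count;
    int freed;
};

static void sim_get_page(struct sim_page *p)
{
    assert(!p->freed);
    p->count++;
}

/* Drop one reference; the page goes back to the free list at zero. */
static void sim_put_page(struct sim_page *p)
{
    assert(p->count > 0);
    if (--p->count == 0)
        p->freed = 1;
}

/* What a fault handler does: take a reference on behalf of the VMA. */
static struct sim_page *sim_fault(struct sim_page *p)
{
    sim_get_page(p);
    return p;
}

/* What the system does when the VMA goes away: drop that reference. */
static void sim_vma_teardown(struct sim_page *mapped)
{
    sim_put_page(mapped);
}
```

If the driver holds one reference of its own and the fault handler takes a second, tearing down the VMA leaves the driver's reference intact. If the handler skipped the increment, the same teardown would drop the count to zero and free a page the driver still believes it owns.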
static int simple_vma_fault(struct vm_area_struct *vma,
                            struct vm_fault *vmf)
{
    struct page *pageptr;
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
    unsigned long address = (unsigned long) vmf->virtual_address;
    unsigned long physaddr = address - vma->vm_start + offset;
    unsigned long pageframe = physaddr >> PAGE_SHIFT;

    if (!pfn_valid(pageframe))
        return VM_FAULT_SIGBUS;
    pageptr = pfn_to_page(pageframe);
    printk (KERN_NOTICE "---- Fault, off %lx pageframe %lx\n", offset, pageframe);
    printk (KERN_NOTICE "page->index = %ld mapping %p\n", pageptr->index, pageptr->mapping);
    get_page(pageptr);
    vmf->page = pageptr;
    return 0;
}
This example is still not good style. The lwn.net article on fault says of the virtual_address field:
... anybody who is tempted to use that field should be prepared to justify that use to a crowd of skeptical kernel developers. Most handlers should not care where the page lives in user space, and use of virtual_address will make it impossible to support nonlinear VMAs. So, if at all possible, virtual_address should be ignored. If your code only uses pgoff, it should also set the VM_CAN_NONLINEAR flag in the VMA's vm_flags field to let the kernel know that it is playing by the rules.
I don't yet have enough confidence in my understanding of vmf->pgoff to rewrite the example using that. I believe it is the offset of the faulted page relative to the start of the vm area.
Note that pfn_to_page() assumes there is a corresponding kernel logical address, and so it does not work for high memory. In particular, this is true of PCI memory, which is mapped above the highest system memory. This is the reason for the check pfn_valid().
If the nopage() method is left as NULL, the page is mapped to a copy-on-write page that reads as zero. (So, there is no segmentation fault.)
If the nopage() method cannot map the page, it returns NOPAGE_SIGBUS (page out of range) or NOPAGE_OOM (out of memory).
The specific type of fault is returned via the parameter type, if that is non-null.
int simple_fault(struct vm_area_struct *vma,
                 struct vm_fault *vmf)
{
    /* refuse every fault, so the process gets SIGBUS instead of a zero page */
    return VM_FAULT_SIGBUS;
}
The LDD3 book says that between kernel versions 2.4 and 2.6 support was dropped for the remap() method, which allowed a driver to respond explicitly to the mremap() system call (used to extend a memory mapping). Instead, the system would quietly process such calls without notifying the driver.
If a driver does not want the user to be able to extend (using mremap()) a region of memory that has been mapped earlier (using mmap()), the driver can define the nopage() method to generate a segmentation fault. Otherwise, the user will simply see a copy-on-write page of zeros.
It is not clear whether this is still a valid concern, as I have not yet been able to find any instances in the 2.6.25 kernel where this is done.
The mmap examples in the LDD3 text are not correct. They need to be updated for the transition from nopage() to fault().
I have updated one example, sculld. You can see my full code for sculld by following this link.
static int sculld_vma_fault(struct vm_area_struct *vma,
                            struct vm_fault *vmf)
{
    struct sculld_dev *ptr, *dev = vma->vm_private_data;
    int result = VM_FAULT_SIGBUS;
    struct page *page;
    void *pageptr = NULL;
    pgoff_t pgoff = vmf->pgoff;

    down(&dev->sem);
    printk (KERN_NOTICE "sculld_vma_fault: pgoff = %lx\n", pgoff);
    if (pgoff >= dev->size)
        goto out;
    /*
     * Now retrieve the sculld device from the list, then the page.
     * If the device has holes, the process receives a SIGBUS when
     * accessing the hole.
     */
    for (ptr = dev; ptr && pgoff >= dev->qset;) {
        ptr = ptr->next;
        pgoff -= dev->qset;
    }
    if (ptr && ptr->data)
        pageptr = ptr->data[pgoff];
    if (!pageptr)
        goto out; /* hole or end-of-file */
    /* got it, now convert pointer to a struct page and increment the count */
    page = virt_to_page(pageptr);
    get_page(page);
    vmf->page = page;
    result = 0;
out:
    up(&dev->sem);
    return result;
}

struct vm_operations_struct sculld_vm_ops = {
    .open = sculld_vma_open,
    .close = sculld_vma_close,
    .fault = sculld_vma_fault,
};

int sculld_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct inode *inode = filp->f_dentry->d_inode;

    printk (KERN_NOTICE "sculld: mmap starting\n");
    /* refuse to map if order is not 0 */
    if (sculld_devices[iminor(inode)].order)
        return -ENODEV;
    /* don't do anything here: "fault" will set up page table entries */
    vma->vm_ops = &sculld_vm_ops;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_private_data = filp->private_data;
    sculld_vma_open(vma);
    printk (KERN_NOTICE "sculld: mmap done\n");
    return 0;
}
The above is only for mapping kernel logical addresses. For other kernel virtual addresses, you need to use vmalloc_to_page() instead of virt_to_page().
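The list walk in sculld_vma_fault() is easy to get wrong, so it may help to run it in user space. The sketch below uses a simplified, hypothetical struct sim_dev in place of struct sculld_dev (only the fields the walk touches) and a function sim_lookup_page() that returns the page pointer, or NULL where the fault handler would return VM_FAULT_SIGBUS.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a sculld list node (hypothetical layout). */
struct sim_dev {
    void **data;            /* array of qset page pointers, or NULL */
    struct sim_dev *next;   /* next node in the list */
    unsigned long qset;     /* pages per node (read from the head) */
};

/*
 * Find the page backing page offset pgoff, or NULL for a hole or
 * an offset past the end of the list.
 */
static void *sim_lookup_page(struct sim_dev *dev, unsigned long pgoff)
{
    struct sim_dev *ptr = dev;

    assert(dev != NULL && dev->qset > 0);
    /* Skip whole nodes until pgoff indexes into the current one. */
    while (ptr && pgoff >= dev->qset) {
        ptr = ptr->next;
        pgoff -= dev->qset;
    }
    if (ptr && ptr->data)
        return ptr->data[pgoff];
    return NULL;
}
```

With two nodes of two pages each, page offsets 0 and 1 resolve in the first node, offsets 2 and 3 in the second, and anything beyond falls off the end of the list.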
I still have not updated scullp, scullv, etc.
The example in the bttv driver (below) is actually more realistic, since it makes nontrivial use of reference counts in the vm_close() method.
Recall that we used ioremap_nocache() to map the PCI-card's I/O memory into the kernel memory space. Exporting it to user space via mmap() would be an instance of this special case.
Even though Microsoft's Windows operating systems have a different API for device drivers, the same underlying operations, such as memory mapping, still need to be done. If you read device driver code for Windows you will therefore see some of the same programming patterns, and you can probably make some sense of the following fragment of code from a Windows NT driver, which performs a function similar to a Linux fault method.
static PUCHAR getMappedAddress(
    IN unsigned baseAddr,            // User-mode address to convert
    IN INTERFACE_TYPE interfaceType, // PCI, ISA
    IN unsigned busNum,
    IN unsigned bytesNeeded,         // Bytes needed at baseAddr
    OUT int *pRetCode)               // Extended error info
{
#define MEM_SPACE 0 // 0 => memory space, 1 => I/O space
    PHYSICAL_ADDRESS translatedAddress, physicalAddress; // physical address to map
    PUCHAR mappedBaseAddr = NULL;
    ULONG memType = MEM_SPACE; // Resource is a memory address, not a port
    BOOLEAN nRc;

    // Reformat base address for function we need to use
    physicalAddress.HighPart = 0;
    physicalAddress.LowPart = baseAddr;
    nRc = HalTranslateBusAddress(
        interfaceType,
        busNum,
        physicalAddress,
        &memType,
        &translatedAddress);
    if(nRc == FALSE) {
        ...error recovery...
    }
    else {
        // Assume memType = MEM_SPACE
        mappedBaseAddr = MmMapIoSpace(translatedAddress, bytesNeeded, FALSE);
        if(mappedBaseAddr == NULL) {
            ...error recovery...
        }
    }
    return(mappedBaseAddr);
}
Whether the OS be Linux or MS Windows, the work that needs to be performed is still similar, and the internal mechanisms are likely to be similar. The code above is from a Windows NT driver for the PixelSmart video frame grabber.
This use of memory mapping is for the internal use of the driver, to map device memory into kernel memory. A driver may also want to map device memory into user space.
Observe that this seems to be closer to the remap_pfn_range() technique, and it seems to be mapping the actual device memory.
The HalTranslateBusAddress() function "translates a bus-relative physical address into the corresponding system physical address", and the MmMapIoSpace() function "maps a given physical address to nonpaged system space".
down_read(&current->mm->mmap_sem);
if (result=get_user_pages(current, current->mm, ...) ...
up_read(&current->mm->mmap_sem);
if (! PageReserved(page)) set_page_dirty (page);
It is generally a good thing to avoid recopying data. The above is one way of doing this, i.e., by putting the data from the device directly into a user-space buffer. Another way of doing this is to put it into a kernel buffer and then map the kernel buffer into user space. Linux supports both of these methods for video stream input.
The V4L video driver model supports two models of memory-mapped streaming I/O:
The application xawtv uses the memory-mapped stream I/O mode to continuously capture video frames, in function v4l2_start_streaming.
Most of the buffer manipulation for the bttv driver is implemented in a device-independent way, that is intended to be shared by other V4L2 drivers. That code is in files with generic-looking names, like video-buf.c.
Note that the PixelSmart (HRT) device does not support DMA, so to support the V4L2 streaming API the driver (instead of the hardware) would need to do the copying from device memory to RAM. If one were to try to re-use the generic V4L2 code to do this, the parts that refer to DMA might be a problem. However, it might be possible to simulate DMA in the driver, having the copy-work done by timer/IRQ handlers.
videobuf_iolock() does other interesting work. Notice the handling of the case where no user buffer is provided, and so I/O requires a "bounce" buffer.
The function videobuf_iolock() is called from vbi_buffer_prepare, which is one of the methods of videobuf_queue_ops.
Due to the time limitation, and because the asynchronous I/O kernel interfaces have changed between the time the LDD3 text was written and the present, we will skip over driver support for asynchronous I/O. If time permits, we may come back to this at the end of the course.
e.g., for a disk read operation
e.g., for arrival of a packet on a network interface
For network devices, the process above may be simplified. The driver keeps a ring of buffers available, into which the device writes the data as it arrives. Interrupts are only needed to announce the arrival of new packets, or if the ring of buffers becomes empty or nearly empty.
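The ring-of-buffers idea can be sketched in user space as follows. This is only a model of the bookkeeping: the "device" and the "driver" are two plain functions, and the names struct rx_ring, ring_device_write() and ring_driver_read(), as well as the sizes, are invented for the illustration. A real driver would hand the device the bus addresses of DMA-capable buffers instead of copying with memcpy().

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE 4      /* number of buffers in the ring (arbitrary) */
#define BUF_BYTES 64     /* size of each buffer (arbitrary) */

struct rx_ring {
    char buf[RING_SIZE][BUF_BYTES];
    size_t len[RING_SIZE];
    unsigned head;       /* next slot the "device" will fill */
    unsigned tail;       /* next slot the "driver" will consume */
    unsigned filled;     /* slots holding unconsumed packets */
};

static void ring_init(struct rx_ring *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
}

/* Device side: store a packet. Returns -1 (packet dropped) when no
 * free buffer is left or the packet does not fit in one buffer. */
static int ring_device_write(struct rx_ring *r, const void *pkt, size_t n)
{
    if (r->filled == RING_SIZE || n > BUF_BYTES)
        return -1;
    memcpy(r->buf[r->head], pkt, n);
    r->len[r->head] = n;
    r->head = (r->head + 1) % RING_SIZE;
    r->filled++;
    return 0;
}

/* Driver side: take the oldest packet and hand its buffer back to the
 * ring. Returns the packet length, or -1 if the ring holds no packet. */
static long ring_driver_read(struct rx_ring *r, void *out)
{
    size_t n;

    if (r->filled == 0)
        return -1;
    n = r->len[r->tail];
    memcpy(out, r->buf[r->tail], n);
    r->tail = (r->tail + 1) % RING_SIZE;
    r->filled--;
    return (long)n;
}
```

The case where ring_device_write() fails for lack of a free buffer corresponds to the situation in the text where the ring of available buffers has run empty and the driver needs to be told.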
The reason getting large blocks of contiguous DMA memory is hard is that the physical memory becomes fragmented over time, as various blocks are allocated and deallocated.
Note that virt_to_bus() and bus_to_virt() are not sufficient in situations where an IOMMU must be programmed, or bounce buffers must be used. The generic DMA layer includes architecture-specific code to handle these ugly details.
This area seems to have evolved a lot since the LDD3 book was last revised. For an explanation, see the new and improved Dynamic DMA Mapping Guide, which is newer than the 2.6.31 kernel documentation on this course's own LXR site.
It seems that the PCI-specific versions of the DMA support functions are being phased out, but they are still in use within the OS.
int dma_set_mask(struct device *dev, u64 mask);
Designing a driver to use streaming DMA mappings is preferred, since it is more portable and can be implemented more efficiently.
void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, int flag);
See an example of the use of pci_alloc_consistent(), the PCI version of the above function, in the bttv driver.
Other examples are in the e1000 network driver, which calls it in the following functions:
There is not a lot of internal explanation of the BTTV driver, especially regarding the DMA I/O model. I found the following on the web:
The "risc" here is a simple language that is used to tell the BT848 how to do DMA.
You may also want to look at the definition of pci_alloc_consistent in cross-referenced source code. It will require quite a lot of drilling down through calls, and includes architecture-specific code.
The LDD3 text's treatment of the following topics appears to be obsolete. Due to lack of time, I have not been able to provide updated notes, so I have left out these topics entirely, for now.
Instead, I am relying on the code in actual drivers for examples.
int dad_open (struct inode *inode, struct file *filp)
{
    struct dad_device *my_device;
    int error;

    /* ... */
    if ( (error = request_irq(my_device->irq, dad_interrupt,
                              SA_INTERRUPT, "dad", NULL)) )
        return error; /* or implement blocking open */

    if ( (error = request_dma(my_device->dma, "dad")) ) {
        free_irq(my_device->irq, NULL);
        return error; /* or implement blocking open */
    }
    /* ... */
    return 0;
}

void dad_close (struct inode *inode, struct file *filp)
{
    struct dad_device *my_device;

    /* ... */
    free_dma(my_device->dma);
    free_irq(my_device->irq, NULL);
    /* ... */
}
From LDD3 - may be obsolete.
The /proc/dma file on a system with a DMA sound card installed:
1: Sound Blaster8
4: cascade
The last entry is a place-holder for the controller used to cascade the primary DMA controller into the slave controller, on a system with two DMA controllers.
A device driver should not need to use these if it uses the generic DMA layer.
int dad_dma_prepare(int channel, int mode, unsigned int buf,
                    unsigned int count)
{
    unsigned long flags;

    flags = claim_dma_lock();
    disable_dma(channel);
    clear_dma_ff(channel);
    set_dma_mode(channel, mode);
    set_dma_addr(channel, virt_to_bus (buf));
    set_dma_count(channel, count);
    enable_dma(channel);
    release_dma_lock(flags);
    return 0;
}

int dad_dma_isdone(int channel)
{
    int residue;
    unsigned long flags = claim_dma_lock();

    residue = get_dma_residue(channel);
    release_dma_lock(flags);
    return (residue == 0);
}
From LDD3 - may be obsolete.