Linux Kernel & Device Driver Programming

Ch 17b - Network Drivers

This file uses the W3C HTML Slidy format. The "a" key toggles between one-slide-at-a-time and single-page mode, and the "c" key toggles on and off the table of contents. The ← and → keys can be used to page forward and backward. For more help on controls see the "help?" link at the bottom.

Special Features of Network Devices

  • Do not look/act like files
  • Do not correspond to inodes in /dev
  • Read and write do not make sense
  • Multiple sockets and protocols can be multiplexed on a single NIC
  • Packets received asynchronously
  • "Pushed" from outside, rather than "pulled" from inside


Linux Network Subsystem

  • Protocol-independent
  • Driver and kernel interact one packet at a time
  • Separated into:
    • Protocol-independent networking infrastructure (e.g., for all packets)
    • Protocol-specific infrastructure (e.g., for TCP, or for UDP)
    • Generic device-class infrastructure (e.g., for all ethernet devices, or for all PCI devices)
    • Device-specific implementations (e.g., for just the e1000)


This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

Using generic code wherever possible results in many call-back interfaces:

Sometimes it is even worse, with both sides using pointers to call one another, or indirectly doing so, via chains of calls involving other subsystems.
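A minimal sketch of this call-back pattern, in plain C with hypothetical names (not the kernel's actual types): the generic layer holds a struct of function pointers that each driver fills in, much as a real driver fills in net_device_ops.

```c
#include <assert.h>

/* Hypothetical ops table, in the style of net_device_ops. */
struct my_dev_ops {
    int (*open)(void);
    int (*xmit)(int len);
};

/* A fake "driver" supplies its implementations. */
static int fake_open(void)  { return 0; }
static int fake_xmit(int l) { return l; }

static const struct my_dev_ops fake_ops = {
    .open = fake_open,
    .xmit = fake_xmit,
};

/* The "generic layer" calls down through the pointers, knowing
 * nothing about the particular device behind them. */
static int generic_transmit(const struct my_dev_ops *ops, int len)
{
    if (ops->open() != 0)
        return -1;
    return ops->xmit(len);
}
```

The generic code never names the driver's functions; the driver registers them once, and all later calls are indirect.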

Network Device Descriptor

The datatype struct net_device has many fields, including:

Registering A Network Device: e1000e Ethernet Driver

This driver handles a variety of ethernet controllers based on the Intel 82571, 82572, 82573, 82574, and 82583 chips. The main file is e1000e/netdev.c.

The module initialization function e1000_init_module calls pci_register_driver(&e1000_driver).

That structure contains several fields, including a pointer to a probe function.

static struct pci_driver e1000_driver = {
        .name        = e1000e_driver_name,
        .id_table    = e1000_pci_tbl,
        .probe       = e1000_probe,
        .remove      = __devexit_p(e1000_remove),
        .suspend     = e1000_suspend,
        .resume      = e1000_resume,
        .shutdown    = e1000_shutdown,
        .err_handler = &e1000_err_handler,
};

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

The probe function will be called when the module is inserted, and when udev detects an insertion event for a device that this driver has claimed to be able to manage, via the table e1000_pci_tbl.

PCI Driver Registration

The PCI registration process proceeds through a series of calls:

Driver Registration: bus_add_driver

Driver Registration: driver_attach

Applies __driver_attach to each device on the driver's bus, which

Driver Registration: really_probe

Driver & Device Descriptors


Driver Registration: e1000_probe

This function does a lot of things, most of which are specific to the device. We focus on the parts that relate to its role as a generic network device.

Driver Registration: alloc_etherdev

Driver Registration: e1000_probe

Opening/Closing A Network Device



See e1000_open for an example. Notice the up-call to the network layer function netif_start_queue.

Network Driver Interrupts (typical)

Socket Buffers

The datatype struct sk_buff is used to hold packets that go to or from sockets. It has quite a few fields, including some tagged as "important" by the LDD3 text. Unfortunately, the names of these fields appear to have changed since the book was written. The names of the "important" fields are:

sk_buff without Scatter-Gather

contiguous sk_buff layout diagram, without scatter gather

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

The book's comment about being prepared to have code that depends on the internals of type struct sk_buff be broken by future kernel releases is certainly true, but also applies to other aspects of the kernel. The differences even between versions 2.6.6 and 2.6.11 have been quite noticeable, as were the differences between version 2.6.11 and 2.6.25, and between 2.6.25 and 2.6.31.

Scatter-Gather I/O

sk_buff with Scatter-Gather

noncontiguous sk_buff layout diagram, with scatter gather

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

See also scatter-gather mappings under DMA I/O.

Scatter-gather allows sharing of common fields, like MAC-address, between packets.

Scatter-gather is also available through the user-level API; see the man page for sendmsg for specifics. At the user level, the scatter-gather mechanism is expressed by the following structures:

struct msghdr {
      void         * msg_name;     /* optional address */
      socklen_t    msg_namelen;    /* size of address */
      struct iovec * msg_iov;      /* scatter/gather array */
      size_t       msg_iovlen;     /* # elements in msg_iov */
      void         * msg_control;  /* ancillary data, see below */
      socklen_t    msg_controllen; /* ancillary data buffer len */
      int          msg_flags;      /* flags on received message */
};

struct iovec {
      void *iov_base;   /* Starting address */
      size_t iov_len;   /* Number of bytes */
};
It is up to the device driver to decide whether scatter-gather can be supported directly by the device, or whether to force a higher layer of the network implementation to copy scatter-gather structures provided by a user into contiguous ranges of memory.
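The user-level side can be exercised directly. The sketch below (userspace C, assuming POSIX sockets) gathers two separate buffers into one datagram with sendmsg() over a socketpair; the receiver sees a single contiguous message, regardless of how the kernel and device divided the gathering work.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gather two separate buffers into one datagram with sendmsg(). */
static ssize_t send_two_parts(int fd, const char *hdr, size_t hlen,
                              const char *body, size_t blen)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = hlen },
        { .iov_base = (void *)body, .iov_len = blen },
    };
    struct msghdr msg = { 0 };
    msg.msg_iov    = iov;   /* scatter/gather array */
    msg.msg_iovlen = 2;     /* # elements in msg_iov */
    return sendmsg(fd, &msg, 0);
}

/* Demo: the two pieces arrive as one contiguous datagram.
 * Returns the number of bytes received, or -1 on error. */
static int demo(char *out, size_t outlen)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (send_two_parts(sv[0], "HDR:", 4, "payload", 7) != 11)
        return -1;
    ssize_t n = read(sv[1], out, outlen);
    close(sv[0]);
    close(sv[1]);
    return (int)n;
}
```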

The LDD3 text says sending is less complicated than receiving, and so chooses to treat it first. However, sending is also less interesting, since it is synchronous, pushed I/O, similar to examples we have seen with other types of devices. Since we have limited time, we will look at the more interesting case first.

A driver that cannot handle scatter-gather output simply leaves the NETIF_F_SG bit clear in netdev->features. For an example of scatter-gather output, see the function e1000_tx_map in the e1000 driver. This function is responsible for setting up the DMA mapping for all the elements of a scatter-gather list. Note where this driver sets netdev->features to include the feature NETIF_F_SG, so that the network device layer knows that it may pass scatter-gather sk_buffs to this device.

Receiving a Packet

The end goal of the driver is to pass off each packet that it receives to the higher-level protocol-handling routines of the system.

There are two models of packet reception that a driver may implement

All the input processing builds up to this.

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

It may be surprising that polling is more efficient than interrupt-driven input, since historically interrupts were developed to address the inefficiency of polling. The reality is that we are talking about a hybrid approach, which involves both polling and interrupts and is superior to either of them alone.

Since the LDD3 text concentrates on the pure interrupt-driven case, in class we will concentrate on the polling approach, which is more effective, more interesting, and the recommended approach for all Linux network drivers.

Starting with kernel 2.6.6, the e100 driver (an earlier, more primitive chip than the e1000) had an option CONFIG_E100_NAPI to choose between interrupt-driven I/O and polling (a.k.a. "NAPI"), but by kernel 2.6.11 NAPI (hybrid of polling and interrupts) was the only option supported.


NAPI

  • Linux "Newer API" for network drivers
  • Combines benefits of polling and interrupts

We will not discuss the API presented in the LDD3 text, since it is no longer supported.

Image from

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

Pre-NAPI (interrupt-driven) Linux Networking

pre-napi control flow diagram

Data Flow of Pre-NAPI Linux Network Driver API

pre-napi data flow diagram

The diagrams above are reproduced from TCP/seminar/TDAQ-April02/

NAPI Model

  • Interrupt from NIC is normally disabled
  • Driver has thread that polls to see if there is work for the driver to do:
    1. Handling received (RX) packets:
      undo DMA mapping, parse link-level packet header, pass packet up to next level of protocol stack
    2. Cleaning up transmitted (TX) packets:
      undo DMA mapping, recycle buffer, possibly notify sender
  • If thread finds no work to do, it enables the interrupt
  • Interrupt handler disables interrupt and signals polling thread

NAPI Advantages

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

The Ethernet-HOWTO says:

When a card receives a packet from the network, what usually happens is that the card asks the CPU for attention by raising an interrupt. Then the CPU determines who caused the interrupt, and runs the card's driver interrupt handler which will in turn read the card's interrupt status to determine what the card wanted, and then in this case, run the receive portion of the card's driver, and finally exits.

Now imagine you are getting lots of Rx data, say 10 thousand packets per second all the time on some server. You can imagine that the above IRQ run-around into and out of the Rx portion of the driver adds up to a lot of overhead. A lot of CPU time could be saved by essentially turning off the Rx interrupt and just hanging around in the Rx portion of the driver, since it knows there is pretty much a steady flow of Rx work to do. This is the basic idea of NAPI.

For additional explanation, see:

struct napi_struct

Value NAPI_STATE_NPSVC provides support for the "Netpoll" option, i.e., pure polling.

Packet Reception in the e1000 Driver

The e1000 driver provides optional support for Netpoll (pure polling), but we will not discuss that further here.

Receive-Buffer Allocation

The data structure passed to netif_rx() is a struct sk_buff, which is allocated here:

skb = netdev_alloc_skb(netdev, bufsz);

The function netdev_alloc_skb() is called from e1000_alloc_rx_buffers, which allocates a socket buffer of size ETH_FRAME_LEN + VLAN_HLEN + ETH_FCS_LEN.

Receive Descriptor Ring

  • Driver
    • Allocates buffer
    • Maps it for DMA access
    • Places buffer address into the Receive Descriptor Queue Structure (a circular/ring buffer)
    • Advances the Tail pointer
  • Device
    • Fills the Head buffer
    • Puts some data into the corresponding Receive Descriptor
    • Advances the Head pointer
Receive Ring
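The head/tail protocol above can be sketched as a toy ring in plain C (hypothetical structures, not the real e1000 descriptors): the driver posts empty buffers at the tail, and the device fills them and advances the head.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8

/* Toy RX descriptor ring; each slot holds a DMA-mapped buffer address. */
struct rx_ring {
    void *buf[RING_SIZE];
    unsigned head;  /* next descriptor the device will fill */
    unsigned tail;  /* next descriptor the driver will post */
};

/* Driver side: post one empty buffer; -1 if the ring is full.
 * One slot is kept empty so full and empty can be told apart. */
static int driver_post_buffer(struct rx_ring *r, void *dma_buf)
{
    unsigned next = (r->tail + 1) % RING_SIZE;
    if (next == r->head)
        return -1;
    r->buf[r->tail] = dma_buf;
    r->tail = next;            /* real hardware: write the tail register */
    return 0;
}

/* Device side: fill one posted buffer; NULL if none are pending. */
static void *device_fill_buffer(struct rx_ring *r)
{
    if (r->head == r->tail)
        return NULL;           /* no buffers posted */
    void *b = r->buf[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    return b;
}
```

The real hardware communicates head and tail through device registers rather than shared C variables, but the producer/consumer logic is the same.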

Receive Descriptor (RDESC)

Receive Descriptor

This is the "legacy" format. There is also an extended version, which allows the device to provide additional information.

Receive Buffers (Split Mode)

  • Optionally, the device can split the packet header into a separate buffer
  • This requires an extended RDESC format, with room for multiple buffer addresses
Receive Ring

The e1000e driver does support this option, but we will not discuss any of these extended formats further here.

Transmit Descriptor Ring

  • Driver
    • Allocates buffer
    • Maps it for DMA access
    • Places buffer address and other fields into the Transmit Descriptor Queue Structure (a circular/ring buffer)
    • Advances the Tail pointer
  • Device
    • Transmits data from the Head buffer
    • Writes some status into the corresponding Transmit Descriptor
    • Advances the Head pointer
Transmit Ring

Transmit Descriptor

Transmit Descriptor

Again, this is the "legacy" format. There are also several extended versions, which allow the device to perform additional functions, offloading some of the work that would otherwise need to be done by the operating system.

e1000 Interrupts

For more details on the e1000 chips, see the Intel documentation.

Network Device Polling

In net_dev_init, which is called during system initialization to initialize the network device subsystem, the function net_rx_action is attached to the softirq NET_RX_SOFTIRQ.

When the function net_rx_action is called, in response to this softirq, it runs through a per-CPU softnet_data queue, executing the poll method of each device on the queue.
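That loop can be modeled in miniature (toy types; the real code is in net/core/dev.c): walk a list of devices that have scheduled polling, calling each one's poll method with a budget.

```c
#include <assert.h>

/* Toy stand-in for a device on the softnet_data poll list. */
struct poll_dev {
    int pending;                                 /* packets waiting */
    int (*poll)(struct poll_dev *d, int budget); /* the poll method */
};

static int simple_poll(struct poll_dev *d, int budget)
{
    int done = 0;
    while (done < budget && d->pending > 0) {
        d->pending--;
        done++;
    }
    return done;
}

/* Run every queued device once, as net_rx_action does per softirq;
 * returns the total number of packets processed. */
static int rx_action(struct poll_dev **list, int n, int budget)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += list[i]->poll(list[i], budget);
    return total;
}
```

The budget caps how long one softirq invocation can run, so a flood of packets on one device cannot starve the rest of the system.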

e1000 Polling

In the case of the e1000, the polling function is e1000_clean. The most interesting parts of this function are the actual polling call

adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget);

and the NAPI logic

if (work_done < budget) {
   if (likely(adapter->itr_setting & 3))
      e1000_set_itr(adapter);
   napi_complete(napi);
   if (!test_bit(__E1000_DOWN, &adapter->flags))
      e1000_irq_enable(adapter);
}
See also the corresponding code, which calls the polling function, in net_rx_action.

e1000 Polling

The function called by adapter->clean_rx(...) is one of the three different functions that may be assigned in e1000_configure_rx, depending on the MTU size.

The most typical one is e1000_clean_rx_irq.

After many steps, it gets to netif_receive_skb(skb), and from there to deliver_skb, which calls the delivery method appropriate to the protocol:

return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
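The shape of that dispatch is easy to model in plain C (hypothetical types, not the kernel's packet_type machinery): each protocol registers a handler keyed by an ethertype-like id, and delivery calls through the function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy packet and protocol-handler table. */
struct packet { unsigned short proto; int len; };

struct packet_handler {
    unsigned short proto;
    int (*func)(struct packet *p);  /* delivery method */
};

/* Fake protocol handlers; return values chosen only so the
 * test can tell which one ran. */
static int ip_rcv(struct packet *p)  { return p->len; }
static int arp_rcv(struct packet *p) { return -p->len; }

static struct packet_handler ptype_table[] = {
    { 0x0800, ip_rcv },   /* IPv4 */
    { 0x0806, arp_rcv },  /* ARP */
};

/* Look up the protocol and call through the pointer, as
 * deliver_skb() does with pt_prev->func(). */
static int deliver(struct packet *p)
{
    for (size_t i = 0; i < sizeof ptype_table / sizeof ptype_table[0]; i++)
        if (ptype_table[i].proto == p->proto)
            return ptype_table[i].func(p);
    return 0;  /* no handler registered: drop */
}
```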

Sending a Packet

Sending a packet starts with an application-level call, like send() or sendto(). After the system determines the route and assembles the packet with the appropriate headers, it eventually makes an internal call to the ndo_start_xmit method of the actual device.

For the e1000 driver, the actual function bound to this link is e1000_xmit_frame, which calls e1000_tx_queue to actually enqueue the frame on the transmit-buffer ring of the device.


Review of Uses of Memory Mapping in the e1000 Driver

Review of Uses of Interrupts in the e1000 Driver