Linux Kernel & Device Driver Programming

Ch 17b - Network Drivers

 

Special Features of Network Devices


Linux Network Subsystem


The goal of using generic code wherever possible results in many call-back interfaces:

Sometimes it is even worse, with both sides using pointers to call one another, or indirectly doing so, via chains of calls involving other subsystems.
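The call-back style can be illustrated with a small user-space model. This is only a sketch: the names toy_netdev, toy_netdev_ops, toy_dev_open, and toy_dev_close are hypothetical and do not exist in the kernel. Grouping the pointers into a separate table is a simplification; the kernel versions discussed here keep such pointers directly in struct net_device.

```c
#include <assert.h>
#include <stddef.h>

struct toy_netdev;

/* Table of driver-supplied call-backs (hypothetical names). */
struct toy_netdev_ops {
    int (*open)(struct toy_netdev *dev);
    int (*stop)(struct toy_netdev *dev);
};

struct toy_netdev {
    const char *name;
    int up;                            /* maintained by the generic layer */
    int hw_enabled;                    /* maintained by the driver */
    const struct toy_netdev_ops *ops;  /* how generic code reaches the driver */
};

/* Driver side: device-specific work only. */
int toy_drv_open(struct toy_netdev *dev) { dev->hw_enabled = 1; return 0; }
int toy_drv_stop(struct toy_netdev *dev) { dev->hw_enabled = 0; return 0; }

const struct toy_netdev_ops toy_drv_ops = {
    .open = toy_drv_open,
    .stop = toy_drv_stop,
};

/* Generic side: common bookkeeping, then a call through the pointer. */
int toy_dev_open(struct toy_netdev *dev)
{
    int err;

    if (dev->up)
        return 0;
    assert(dev->ops != NULL && dev->ops->open != NULL);
    err = dev->ops->open(dev);
    if (err == 0)
        dev->up = 1;
    return err;
}

int toy_dev_close(struct toy_netdev *dev)
{
    int err;

    if (!dev->up)
        return 0;
    err = dev->ops->stop(dev);
    if (err == 0)
        dev->up = 0;
    return err;
}
```

The generic functions never name the driver; they only follow the pointers the driver installed, which is what lets one body of generic code serve many devices.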


Network Device Descriptor

The datatype struct net_device has many fields, including:


Registering A Network Device

This is done by a call to register_netdev, which calls register_netdevice.


Example: Initialization of E100 Ethernet Driver

e100.c handles a variety of ethernet controllers based on Intel chips, including the 82557, 82558, 82559, 82550, 82551, and 82562 devices.

e100_init_module calls pci_module_init (pci_register_driver in later kernels) with a reference to a variable of type struct pci_driver called e100_driver.

That structure contains several fields, including a pointer to a probe function, e100_probe.

The e100_probe() function obtains an object of type struct nic, which holds references to a net_device and a pci_dev. The nic structure is allocated as the private area of the net_device, and a pointer to it is accessible from the net_device structure via netdev_priv.

The e100_probe() function calls alloc_etherdev, which calls alloc_etherdev_mq(), which in turn calls alloc_netdev_mq.

Much of the work of alloc_netdev_mq() is done by a function parameter that is passed in, named setup. In the current case, this is the function ether_setup.
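The allocation pattern described above (one allocation holding the generic structure plus a driver-private area, with a setup function passed in as a parameter) can be modeled in user space. This is a minimal sketch, not the kernel code: every mini_ name is hypothetical, the structure fields are invented for illustration, and the 32-byte alignment of the private area is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MINI_ALIGN 32u   /* assumed alignment of the private area */

struct mini_netdev {
    char name[16];
    unsigned int mtu;
    unsigned int addr_len;
};

/* Driver-private state, playing the role of struct nic (fields invented). */
struct mini_nic {
    struct mini_netdev *netdev;   /* back pointer to the generic structure */
    int irq;
};

/* Size of the generic structure rounded up to a multiple of MINI_ALIGN. */
static size_t mini_aligned_size(void)
{
    return (sizeof(struct mini_netdev) + MINI_ALIGN - 1)
           & ~(size_t)(MINI_ALIGN - 1);
}

/* The private area starts right after the rounded-up generic structure. */
void *mini_netdev_priv(struct mini_netdev *dev)
{
    return (char *)dev + mini_aligned_size();
}

/* Plays the role of ether_setup: fills in Ethernet-style defaults. */
void mini_ether_setup(struct mini_netdev *dev)
{
    dev->mtu = 1500;
    dev->addr_len = 6;
}

/* One zeroed allocation for both parts; type-specific work goes to setup. */
struct mini_netdev *mini_alloc_netdev(size_t sizeof_priv, const char *name,
                                      void (*setup)(struct mini_netdev *))
{
    struct mini_netdev *dev;

    assert(setup != NULL);
    dev = calloc(1, mini_aligned_size() + sizeof_priv);
    if (dev == NULL)
        return NULL;
    strncpy(dev->name, name, sizeof(dev->name) - 1);
    setup(dev);
    return dev;
}
```

A caller would use it roughly as `dev = mini_alloc_netdev(sizeof(struct mini_nic), "eth%d", mini_ether_setup); nic = mini_netdev_priv(dev);` and release both parts with a single `free(dev)`.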

The function e100_probe() initializes the netdev structure returned by alloc_etherdev(), including pointers to methods such as open and stop, and does a lot of other initialization, including all the generic initialization for a PCI device that uses DMA.

One of the interesting features is the setting up of a timer with the handler e100_watchdog.


Opening/Closing A Network Device

open

stop

See e100_open for example. It does most of the work in e100_up.


Socket Buffers

The datatype struct sk_buff is used to hold packets that go to or from sockets. It has quite a few fields, including some that are tagged as "important" by the text. Unfortunately, the names of these fields appear to have changed. The names are:


sk_buff without Scatter-Gather

contiguous sk_buff layout diagram, without scatter gather

The book's comment about being prepared to have code that depends on the internals of type struct sk_buff be broken by future kernel releases is certainly true, but may also apply to other aspects of the kernel. The differences even between versions 2.6.6 and 2.6.11 were quite noticeable, as were the differences between version 2.6.11 and 2.6.25.
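The contiguous layout can be modeled in user space with four pointers into a single buffer. The sketch below assumes the buffer pointers are named head, data, tail, and end; the tskb_ functions are hypothetical stand-ins that only mimic the pointer arithmetic of skb_reserve, skb_put, and skb_push, and are not the kernel implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct tskb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid packet data */
    unsigned char *tail;  /* end of the valid packet data */
    unsigned char *end;   /* end of the allocated buffer */
    size_t len;           /* always tail - data */
};

struct tskb *tskb_alloc(size_t size)
{
    struct tskb *skb = malloc(sizeof(*skb));

    if (skb == NULL)
        return NULL;
    skb->head = malloc(size);
    if (skb->head == NULL) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;   /* empty: no headroom, no data */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

void tskb_free(struct tskb *skb)
{
    free(skb->head);
    free(skb);
}

/* Leave n bytes of headroom in an empty buffer (compare skb_reserve). */
void tskb_reserve(struct tskb *skb, size_t n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Grow the data area at the tail; returns where the new bytes go
 * (compare skb_put). */
unsigned char *tskb_put(struct tskb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Grow the data area into the headroom, e.g. to prepend a header
 * (compare skb_push). */
unsigned char *tskb_push(struct tskb *skb, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```

Reserving headroom first is what allows a lower layer to prepend its header later without moving the payload.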


Scatter-Gather I/O


sk_buff with Scatter-Gather

noncontiguous sk_buff layout diagram, with scatter-gather

See also scatter-gather mappings under DMA I/O.

Scatter-gather allows sharing of common fields, like MAC-address, between packets.

Scatter-gather is also implemented by the user API. See the man-page for sendmsg for specifics. From the user level, the scatter-gather mechanism is implemented by the following structures:

struct msghdr {
    void         *msg_name;       /* optional address */
    socklen_t     msg_namelen;    /* size of address */
    struct iovec *msg_iov;        /* scatter/gather array */
    size_t        msg_iovlen;     /* # elements in msg_iov */
    void         *msg_control;    /* ancillary data, see below */
    socklen_t     msg_controllen; /* ancillary data buffer len */
    int           msg_flags;      /* flags on received message */
};

struct iovec {
    void   *iov_base;   /* Starting address */
    size_t  iov_len;    /* Number of bytes */
};

It is up to the device driver to decide whether scatter-gather can be supported directly by the device, or whether to force a higher layer of the network implementation to copy scatter-gather structures provided by a user into contiguous ranges of memory.
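From the user side, a gather send looks like the following. This is a minimal sketch using sendmsg over a local AF_UNIX datagram socket pair so that it runs without a network; the function name gather_roundtrip is hypothetical, and whether any copy is avoided below the socket layer depends on the protocol and driver involved.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Send hdr and payload as ONE datagram from two separate buffers,
 * then receive it into the contiguous buffer out.
 * Returns the number of bytes received, or -1 on error.
 */
ssize_t gather_roundtrip(const char *hdr, const char *payload,
                         char *out, size_t outlen)
{
    int sv[2];
    struct iovec iov[2];
    struct msghdr msg;
    ssize_t sent, got = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Two pieces that are not adjacent in memory. */
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = (void *)payload;
    iov[1].iov_len = strlen(payload);

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    sent = sendmsg(sv[0], &msg, 0);
    if (sent == (ssize_t)(iov[0].iov_len + iov[1].iov_len))
        got = recv(sv[1], out, outlen, 0);

    close(sv[0]);
    close(sv[1]);
    return got;
}
```

The receiver sees a single message; the fact that the sender assembled it from two buffers is invisible to it.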


The text says sending is less complicated than receiving, and so chooses to treat it first. However, sending is also less interesting, since it is synchronous, pushed I/O, similar to examples we have seen with other types of devices. Since we have limited time, we will look at the more interesting case first.

The E100 driver does not handle scatter-gather output. (It does not set the NETIF_F_SG bit in netdev->features.) For an example of scatter-gather see the function e1000_tx_map in the E1000 driver. This function is responsible for setting up the DMA mapping for all the elements of a scatter-gather list. Note where this driver sets netdev->features to include the feature NETIF_F_SG, so that the network device layer knows that this driver can accept scatter-gather sk_buffs for this device.


Receiving a Packet

The end goal of the driver is to pass off each packet that it receives to the higher-level protocol handling routines of the system.

There are two models of packet reception that a driver may implement: interrupt-driven and polled (NAPI).

All the input processing builds up to this.


It may be surprising that "polling" is more efficient than interrupt-driven input, since historically interrupts were developed to address the inefficiency of polling. The reality is that we are talking about a hybrid approach, which involves both polling and interrupts and is superior to either of them alone.

Since the text concentrates on the pure interrupt-driven case, in class we will concentrate on the polling approach, which is more effective, more interesting, and the recommended approach for all Linux network drivers.

Starting with kernel 2.6.6, the E100 driver had an option CONFIG_E100_NAPI to choose between interrupt-driven I/O and polling (a.k.a. "NAPI"), but by kernel 2.6.11 NAPI (polling) was the only option supported.


Pre-NAPI (interrupt-driven) Linux Networking

pre-napi control flow diagram

Data Flow of Pre-NAPI Linux Network Driver API

pre-napi data flow diagram

The diagrams above are reproduced from piters.home.cern.ch/piters/TCP/seminar/TDAQ-April02/TDAQ-April02.ps.


NAPI -- the "New API" for Network Device Drivers


The Ethernet-HOWTO says:

When a card receives a packet from the network, what usually happens is that the card asks the CPU for attention by raising an interrupt. Then the CPU determines who caused the interrupt, and runs the card's driver interrupt handler which will in turn read the card's interrupt status to determine what the card wanted, and then in this case, run the receive portion of the card's driver, and finally exits.

Now imagine you are getting lots of Rx data, say 10 thousand packets per second all the time on some server. You can imagine that the above IRQ run-around into and out of the Rx portion of the driver adds up to a lot of overhead. A lot of CPU time could be saved by essentially turning off the Rx interrupt and just hanging around in the Rx portion of the driver, since it knows there is pretty much a steady flow of Rx work to do. This is the basic idea of NAPI.
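The idea in the passage above can be simulated in user space. The sketch below is a toy model, not driver code: the sim_ names are hypothetical, and "interrupts" are just function calls. It shows the interrupt handler doing no receive work beyond switching the device into polled mode, and the poll routine switching back once it finds the ring drained.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_nic {
    int rx_pending;        /* frames waiting in the device's receive ring */
    bool rx_irq_enabled;   /* may the device interrupt on reception? */
    bool poll_scheduled;   /* is the device waiting to be polled? */
    int irq_count;         /* interrupts actually taken */
    int delivered;         /* frames passed up the stack */
};

/* Interrupt handler: mask further Rx interrupts and ask to be polled. */
static void sim_intr(struct sim_nic *nic)
{
    nic->irq_count++;
    nic->rx_irq_enabled = false;
    nic->poll_scheduled = true;
}

/* A frame arrives; an interrupt is raised only if Rx interrupts are on. */
void sim_frame_arrives(struct sim_nic *nic)
{
    nic->rx_pending++;
    if (nic->rx_irq_enabled)
        sim_intr(nic);
}

/*
 * Poll routine: handle at most budget frames. If fewer than budget
 * frames were found, the ring is empty, so return to interrupt mode.
 * Returns the number of frames handled.
 */
int sim_poll(struct sim_nic *nic, int budget)
{
    int work = 0;

    assert(budget > 0);
    while (work < budget && nic->rx_pending > 0) {
        nic->rx_pending--;
        nic->delivered++;
        work++;
    }
    if (work < budget) {
        nic->poll_scheduled = false;
        nic->rx_irq_enabled = true;
    }
    return work;
}
```

In this model a burst of 100 frames costs one interrupt and two poll calls with a budget of 64, rather than 100 interrupts.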

For additional explanation, see:


Packet Reception in the E100 Driver


Polling Function Registration

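The function e100_poll is registered as the NAPI poll method of netdev in the e100_probe function (via netif_napi_add in 2.6.25; earlier kernels assigned netdev->poll directly), and the polling weight (the maximum number of packets handled per poll call) is specified. The poll_controller method is a separate hook used by netpoll; in this driver it is bound to e100_netpoll.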


Receive-Buffer Allocation

The data structure passed up the stack (to netif_rx() in the interrupt-driven model, or to netif_receive_skb() under NAPI) is a struct sk_buff, which is allocated by calling netdev_alloc_skb().

The function netdev_alloc_skb() is called from e100_rx_alloc_skb(), which allocates a socket buffer of size determined by:

Device-dependent details: The "RFD" in the comments refers to a "Receive Frame Descriptor", and the "RFA" refers to the "Receive Frame Area", which is a linked list of RFDs. For output, there is a linked list of CFDs (Command Frame Descriptors), called the CBL (Command Block List). The other communication with the device is through a third memory area, called the CSR (Control/Status Registers).


Receive Frame Descriptors (Simple Mode)

rfd diagram

Receive Frame Descriptors (Flexible Mode)

cb diagram

RFD Contents

Each RFD (Receive Frame Descriptor) has the following fields:


Command Block List

cb diagram

The driver can add blocks to the end of the chain while the device is processing blocks earlier in the chain.


Command Block Contents

Each CB (Command Block) has the following fields:


E100 Interrupts

For more information on the E100 programming interface, see the Intel documentation.

All interrupts can be masked by the Mask bit in the SCB command word.


Back to the E100 Driver

The function e100_rx_alloc_skb() is called in two places, principally in e100_rx_alloc_list().

The function e100_rx_alloc_list() is called from several places, including e100_up().
It actually allocates objects of the datatype struct rx, which includes an sk_buff along with forward and backward links and a DMA address.

The function e100_up() is called from several places, principally from e100_open(), which is one of the standard network device entry points (exported methods).


Both struct rx and struct sk_buff include forward and backward links. How is each pair used?


Interrupt Handler

Besides allocating a list of receive frame descriptors, e100_up() installs an interrupt handler, e100_intr.

The function e100_intr() does several things, including reading the status of the NIC, acknowledging the interrupt, and calling __netif_rx_schedule_prep and then __netif_rx_schedule.

These two are actually wrappers for napi_schedule_prep and __napi_schedule. The former is part of a two-phase protocol for waking up the NAPI polling thread without accidentally running two copies of the polling routine (perhaps on different CPUs). The latter adds the device to this CPU's softnet_data poll_list and raises the softirq NET_RX_SOFTIRQ to indicate that the list needs polling. The handler for that softirq will do the next step in processing of the incoming packet.
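The two-phase protocol can be sketched in user space with an atomic test-and-set. This is an illustration of the idea only: the sched_ names are hypothetical, C11 atomic_flag stands in for the kernel's atomic bit operations, and the "poll list" is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sched_napi {
    atomic_flag scheduled;   /* set while the device is queued or polling */
    int times_queued;        /* stands in for insertion on the poll list */
};

/*
 * Phase 1: try to claim the right to schedule. The test-and-set is
 * atomic, so only one caller gets true until sched_complete() runs,
 * even if several CPUs race here.
 */
bool sched_prep(struct sched_napi *n)
{
    return !atomic_flag_test_and_set(&n->scheduled);
}

/* Phase 2: queue the device; called only by the winner of phase 1. */
void sched_queue(struct sched_napi *n)
{
    n->times_queued++;
}

/* Called by the poll routine when it has no more work. */
void sched_complete(struct sched_napi *n)
{
    atomic_flag_clear(&n->scheduled);
}

/* What an interrupt handler would do on each interrupt. */
void sched_from_irq(struct sched_napi *n)
{
    if (sched_prep(n))
        sched_queue(n);
}
```

A second interrupt that arrives before the poll completes fails phase 1 and does nothing, so the device is never queued twice.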


Why is no lock needed to add this device to the "poll list" for the current CPU?


Network Device Polling

In net_dev_init, which is called during system initialization to initialize the network device subsystem, the function net_rx_action is attached to the softirq NET_RX_SOFTIRQ.

When the function net_rx_action is called, in response to this softirq, it runs through a per-CPU softnet_data queue, executing the poll method of each device on the queue.
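The shape of that loop can be modeled in user space as follows. This is a simplified sketch under stated assumptions: the rxq_ names are hypothetical, the list is a small fixed-size FIFO rather than a kernel linked list, and the policy shown (a device that uses its whole weight goes back to the tail of the list, and the loop stops when an overall budget is spent) is a simplification of what the kernel does.

```c
#include <assert.h>
#include <stddef.h>

#define RXQ_MAX 8

struct rxq_dev {
    int pending;    /* frames waiting at the device */
    int weight;     /* most frames this device may handle per poll call */
    int handled;    /* frames handled so far */
};

struct rxq_list {
    struct rxq_dev *devs[RXQ_MAX];
    size_t head, count;            /* circular FIFO of devices to poll */
};

void rxq_add(struct rxq_list *l, struct rxq_dev *d)
{
    assert(l->count < RXQ_MAX);
    l->devs[(l->head + l->count) % RXQ_MAX] = d;
    l->count++;
}

static struct rxq_dev *rxq_pop(struct rxq_list *l)
{
    struct rxq_dev *d = l->devs[l->head];

    l->head = (l->head + 1) % RXQ_MAX;
    l->count--;
    return d;
}

/* The device's poll method: handle up to weight frames. */
static int rxq_poll(struct rxq_dev *d)
{
    int work;

    assert(d->weight > 0);
    work = d->pending < d->weight ? d->pending : d->weight;
    d->pending -= work;
    d->handled += work;
    return work;
}

/*
 * Poll devices in turn until the list is empty or the budget is spent.
 * Returns the total number of frames handled. Devices left on the
 * list would be picked up the next time the softirq runs.
 */
int rxq_action(struct rxq_list *l, int budget)
{
    int total = 0;

    while (l->count > 0 && budget > 0) {
        struct rxq_dev *d = rxq_pop(l);
        int work = rxq_poll(d);

        total += work;
        budget -= work;
        if (work == d->weight)   /* used its whole quota: may have more */
            rxq_add(l, d);
    }
    return total;
}
```

Re-queuing at the tail keeps one busy device from starving the others on the same CPU.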


E100 Polling

In the case of the e100, the polling function is e100_poll. The most interesting part of this function is the call to e100_rx_clean.

The function e100_rx_clean() calls e100_rx_indicate() for each received frame, and replenishes the receive sk_buff list for the device.

The function e100_rx_indicate() does several things, including setting the data and tail pointers of the sk_buff to indicate the actual beginning and end of the data (via calls to skb_reserve and skb_put), and setting the pkt_type (via eth_type_trans) and the (link layer) protocol field of the sk_buff. It finally calls netif_receive_skb().

The function netif_receive_skb() eventually calls deliver_skb(), which does the actual delivery to the protocol-specific handler, via the function pt_prev->func().


The loop structure in netif_receive_skb is interesting. Why is pt_prev passed to deliver_skb rather than ptype? If you are not yet familiar with the Linux kernel struct list_head usage (covered in Ch 11 under "Linked Lists"), this may be a good time to learn how and why this kind of loop works.
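One way to study the "deliver to the previous match" pattern is with a user-space model. The sketch below is hypothetical throughout (the pt_ names, the integer packet types, and the explicit counters are all invented for illustration); it assumes only that each handler consumes one reference to the packet and that delivery to a non-final handler takes an extra reference first.

```c
#include <assert.h>
#include <stddef.h>

struct pt_pkt {
    int type;
    int users;        /* reference count; the caller holds one */
    int increments;   /* extra references taken, counted for illustration */
};

struct pt_handler {
    int type;         /* packet type this handler wants */
    int calls;        /* how many packets it has received */
};

/* A handler consumes one reference to the packet. */
static void pt_func(struct pt_handler *h, struct pt_pkt *p)
{
    h->calls++;
    p->users--;
}

/* Deliver to a handler known NOT to be the last: take a reference first. */
static void pt_deliver(struct pt_handler *prev, struct pt_pkt *p)
{
    p->users++;
    p->increments++;
    pt_func(prev, p);
}

/* Returns the number of handlers that received the packet. */
int pt_receive(struct pt_handler *tab, size_t n, struct pt_pkt *p)
{
    struct pt_handler *pt_prev = NULL;
    int delivered = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tab[i].type != p->type)
            continue;
        if (pt_prev != NULL) {     /* a later match exists for pt_prev */
            pt_deliver(pt_prev, p);
            delivered++;
        }
        pt_prev = &tab[i];         /* remember, but do not deliver yet */
    }
    if (pt_prev != NULL) {
        pt_func(pt_prev, p);       /* last match takes the caller's reference */
        delivered++;
    } else {
        p->users--;                /* no taker: drop the caller's reference */
    }
    return delivered;
}
```

Deferring each delivery by one step is what lets the loop recognize the final recipient, so that with k matching handlers only k - 1 extra references are taken.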

The function netif_receive_skb conditionally calls netpoll_rx(), which appears to directly handle ARP and UDP packets, but seems to simply return for TCP packets. This seems to be an optional optimization to cut down on the overhead of passing certain packets up through the protocol hierarchy.

Because we have limited classroom time, we have concentrated on the e100 driver (a complex real driver) instead of the toy driver snull provided in the text. Although the textbook examples are now mostly obsolete, I have updated them enough that they will at least compile with the 2.6.25 kernel, and it may still help to review the network driver concepts in that simpler context.

In particular, it may be useful to now look back at the snull driver, starting with snull_rx, snull_regular_interrupt, and snull_napi_interrupt.


Sending a Packet

Sending of a packet starts with an application-level call, such as send() or sendto(). After the system determines the route and assembles the packet with the appropriate headers, it eventually makes an internal call to the netdev->hard_start_xmit method of the actual device.

For the e100 driver, the actual function bound to this link is e100_xmit_frame(), which calls e100_exec_cb(), which calls e100_exec_cmd() to finally deliver a command to the controller. Most of the code in these functions is device-specific. Since this is a DMA device, the data is passed to and from the device indirectly, by giving pointers to memory-mapped buffers containing the commands.

Observe the call to netif_stop_queue from e100_xmit_frame() in the case that the packet used up the last available queue space for this device.
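The flow-control idea can be sketched with a small user-space ring. This is a toy model with hypothetical ring_ names and a four-entry ring; the boolean queue_stopped stands in for the effect of netif_stop_queue and netif_wake_queue, and the device's completion of commands is simulated by a function call.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 4

struct ring_nic {
    int cb_to_use;       /* next free command block (driver fills this one) */
    int cb_to_clean;     /* oldest block the device has not yet completed */
    int cbs_avail;       /* number of free blocks */
    bool queue_stopped;  /* true: upper layer must not call ring_xmit */
};

void ring_init(struct ring_nic *nic)
{
    nic->cb_to_use = 0;
    nic->cb_to_clean = 0;
    nic->cbs_avail = RING_SIZE;
    nic->queue_stopped = false;
}

/* Queue one frame. Returns 0 on success, -1 if no block was free. */
int ring_xmit(struct ring_nic *nic)
{
    if (nic->cbs_avail == 0) {
        nic->queue_stopped = true;
        return -1;
    }
    nic->cb_to_use = (nic->cb_to_use + 1) % RING_SIZE;
    nic->cbs_avail--;
    if (nic->cbs_avail == 0)
        nic->queue_stopped = true;   /* that was the last free block */
    return 0;
}

/* The device finished done blocks: reclaim them and restart the queue. */
void ring_tx_clean(struct ring_nic *nic, int done)
{
    assert(done >= 0);
    while (done-- > 0 && nic->cbs_avail < RING_SIZE) {
        nic->cb_to_clean = (nic->cb_to_clean + 1) % RING_SIZE;
        nic->cbs_avail++;
    }
    if (nic->queue_stopped && nic->cbs_avail > 0)
        nic->queue_stopped = false;
}
```

Stopping the queue as soon as the last block is used, rather than on the next failed attempt, means the upper layer normally never sees a refused packet.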


This driver does not address the short packet information leakage problem discussed in the text. That is OK, because the comments say "hardware padding of short packets to the minimum packet size is enabled".


Timeouts


Review of Uses of Memory Mapping in the E100 Driver


Review of Uses of Interrupts in the E100 Driver

© 2004-2008 T. P. Baker. ($Id: ch17.html,v 1.1 2008/04/28 12:41:35 baker Exp baker $)