| Linux Kernel & Device Driver Programming |
The goal of using generic code wherever possible results in many call-back interfaces:
Sometimes it is even worse, with both sides using pointers to call one another, or indirectly doing so, via chains of calls involving other subsystems.
The datatype struct net_device has many fields, including:
This is done by a call to register_netdev, which calls register_netdevice.
e100.c handles a variety of ethernet controllers based on Intel chips, including the 82557, 82558, 82559, 82550, 82551, and 82562 devices.
e100_init_module calls pci_module_init with a reference to a variable of type struct pci_driver called e100_driver.
That structure contains several fields, including a pointer to a probe function, e100_probe.
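The registration-then-probe pattern can be sketched in user-space C. This is only an illustration of the call-back idea, not the kernel's PCI code: every toy_ name and every ID value below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCI types; not the real kernel layout. */
struct toy_pci_id  { unsigned vendor, device; };
struct toy_pci_dev { unsigned vendor, device; void *drvdata; };

struct toy_pci_driver {
    const char *name;
    const struct toy_pci_id *id_table;  /* terminated by a {0, 0} entry */
    int (*probe)(struct toy_pci_dev *dev, const struct toy_pci_id *id);
};

/* Bus side: offer every unbound, matching device to the driver's probe(). */
int toy_register_driver(const struct toy_pci_driver *drv,
                        struct toy_pci_dev *devs, size_t ndevs)
{
    int bound = 0;

    assert(drv != NULL && drv->probe != NULL && drv->id_table != NULL);
    for (size_t i = 0; i < ndevs; i++) {
        if (devs[i].drvdata != NULL)
            continue;                   /* already claimed */
        for (const struct toy_pci_id *id = drv->id_table; id->vendor; id++) {
            if (id->vendor == devs[i].vendor && id->device == devs[i].device) {
                if (drv->probe(&devs[i], id) == 0)
                    bound++;
                break;
            }
        }
    }
    return bound;                       /* number of devices claimed */
}

/* Driver side: an ID table (made-up values), a probe function, and the
 * structure that ties them together. */
static const struct toy_pci_id toy_ids[] = {
    { 0x1234, 0x0001 }, { 0x1234, 0x0002 }, { 0, 0 }
};
static int toy_private_state;

int toy_probe(struct toy_pci_dev *dev, const struct toy_pci_id *id)
{
    (void)id;
    dev->drvdata = &toy_private_state;  /* claim the device */
    return 0;
}

const struct toy_pci_driver toy_driver = {
    .name = "toy", .id_table = toy_ids, .probe = toy_probe,
};
```

The generic side never names toy_probe directly; it only follows the pointer in the driver structure, which is the same shape as the relationship between the PCI core and e100_probe.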
The e100_probe() function allocates an object of type struct nic, which contains a net_device and a reference to a pci_dev. A pointer to this nic structure is accessible from the net_device structure via the macro netdev_priv.
The e100_probe() function calls alloc_etherdev, which calls alloc_etherdev_mq(), which in turn calls alloc_netdev_mq.
Much of the work of alloc_netdev_mq() is done by a function parameter that is passed in, named setup. In the current case, this is the function ether_setup.
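A rough user-space sketch of this allocation scheme follows. It assumes (as a simplification) that the private area is placed directly after the generic structure at an aligned offset, and that the setup parameter fills in the type-specific defaults. The toy_ names, the alignment value, and the stored priv_offset field are all hypothetical; only the MTU default of 1500 is meant to mirror what ether_setup does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_ALIGN 32u   /* assumed alignment of the private area */

struct toy_netdev {
    char     name[16];
    unsigned mtu;
    size_t   priv_offset;   /* byte offset of the driver-private area */
};

/* Round n up to the next multiple of TOY_ALIGN. */
static size_t toy_align_up(size_t n)
{
    return (n + TOY_ALIGN - 1) & ~((size_t)TOY_ALIGN - 1);
}

/* Counterpart of netdev_priv(): find the private area of a device. */
void *toy_netdev_priv(struct toy_netdev *dev)
{
    return (char *)dev + dev->priv_offset;
}

/* One zeroed allocation holds the generic part and the private part;
 * the setup call-back then applies the defaults for the device type. */
struct toy_netdev *toy_alloc_netdev(size_t sizeof_priv, const char *name,
                                    void (*setup)(struct toy_netdev *))
{
    size_t off = toy_align_up(sizeof(struct toy_netdev));
    struct toy_netdev *dev;

    assert(name != NULL && setup != NULL);
    dev = calloc(1, off + sizeof_priv);
    if (dev == NULL)
        return NULL;
    dev->priv_offset = off;
    strncpy(dev->name, name, sizeof(dev->name) - 1);
    setup(dev);
    return dev;
}

/* Plays the role of ether_setup: Ethernet defaults. */
void toy_ether_setup(struct toy_netdev *dev)
{
    dev->mtu = 1500;
}

/* Plays the role of struct nic: the driver's private state. */
struct toy_nic {
    struct toy_netdev *netdev;
    int msg_enable;
};
```

A driver would call toy_alloc_netdev(sizeof(struct toy_nic), ...) once, then use toy_netdev_priv() wherever it needs its own state; a single free() of the device releases both parts.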
The function e100_probe() initializes the netdev structure returned by alloc_etherdev(), including pointers to methods such as open and stop, and does a lot of other initialization, including all the generic initialization for a PCI device that uses DMA.
One of the interesting features is the setting up of a timer with the handler e100_watchdog.
open
stop
See e100_open for example. It does most of the work in e100_up.
The datatype struct sk_buff is used to hold packets that go to or from sockets. It has quite a few fields, including some that are tagged as "important" by the text. Unfortunately, the names of these fields appear to have changed. The names are:
The book's comment about being prepared to have code that depends on the internals of type struct sk_buff be broken by future kernel releases is certainly true, but may also apply to other aspects of the kernel. The differences even between versions 2.6.6 and 2.6.11 were quite noticeable, as were the differences between version 2.6.11 and 2.6.25.
See also scatter-gather mappings under DMA I/O.
Scatter-gather allows sharing of common fields, like MAC-address, between packets.
Scatter-gather is also implemented by the user API. See the man-page for sendmsg for specifics. From the user level, the scatter-gather mechanism is implemented by the following structures:
struct msghdr {
    void         *msg_name;       /* optional address */
    socklen_t     msg_namelen;    /* size of address */
    struct iovec *msg_iov;        /* scatter/gather array */
    size_t        msg_iovlen;     /* # elements in msg_iov */
    void         *msg_control;    /* ancillary data, see below */
    socklen_t     msg_controllen; /* ancillary data buffer len */
    int           msg_flags;      /* flags on received message */
};
struct iovec {
    void  *iov_base;              /* Starting address */
    size_t iov_len;               /* Number of bytes */
};
It is up to the device driver to decide whether scatter-gather can be supported directly by the device, or whether to force a higher layer of the network implementation to copy scatter-gather structures provided by a user into contiguous ranges of memory.
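From the user side, a gather send might look like the following sketch. The helper name gather_roundtrip is made up for this example, and a local AF_UNIX datagram socketpair is used only so that the result can be read back without a network; the calls themselves (socketpair, sendmsg, recv) are the standard ones.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send a header and a payload that live in two separate buffers with a
 * single sendmsg() call, then read the message back from the other end
 * of a socketpair. Returns the number of bytes received, or -1. */
ssize_t gather_roundtrip(const char *hdr, const char *payload,
                         char *out, size_t outlen)
{
    int sv[2];
    struct iovec iov[2];
    struct msghdr msg;
    ssize_t sent, got;

    assert(hdr != NULL && payload != NULL && out != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Two pieces, gathered by the kernel into one datagram. */
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)payload;
    iov[1].iov_len  = strlen(payload);

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov    = iov;
    msg.msg_iovlen = 2;

    sent = sendmsg(sv[0], &msg, 0);
    got  = (sent < 0) ? -1 : recv(sv[1], out, outlen, 0);

    close(sv[0]);
    close(sv[1]);
    return got;
}
```

The application never copies the two pieces into one buffer. Whether the kernel or the device ends up doing that copy is exactly the decision described above.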
The text says sending is less complicated than receiving, and so chooses to treat it first. However, sending is also less interesting, since it is synchronous, pushed I/O, similar to examples we have seen with other types of devices. Since we have limited time, we will look at the more interesting case first.
The E100 driver does not handle scatter-gather output. (It does not set the NETIF_F_SG bit in netdev->features.) For an example of scatter-gather see the function e1000_tx_map in the E1000 driver. This function is responsible for setting up the DMA mapping for all the elements of a scatter-gather list. Note where this driver sets netdev->features to include the feature NETIF_F_SG, so that the network device layer knows that this driver expects scatter-gather sk_buffs to this device.
The end goal of the driver is to pass off each packet that it receives, to the higher-level protocol handling routines of the system.
There are two models of packet reception that a driver may implement: interrupt-driven reception and polling (NAPI).
All the input processing builds up to this.
It may be surprising that "polling" is more efficient than interrupt-driven input, since historically interrupts were developed to address the inefficiency of polling. The reality is that we are talking about a hybrid approach, which involves both polling and interrupts and is superior to either of them alone.
Since the text concentrates on the pure interrupt-driven case, in class we will concentrate on the polling approach, which is more effective, more interesting, and the recommended approach for all Linux network drivers.
Starting with kernel 2.6.6, the E100 driver had an option CONFIG_E100_NAPI to choose between interrupt-driven I/O and polling (a.k.a. "NAPI"), but by kernel 2.6.11 NAPI (polling) was the only option supported.


The diagrams above are reproduced from piters.home.cern.ch/piters/TCP/seminar/TDAQ-April02/TDAQ-April02.ps.
The Ethernet-HOWTO says:
When a card receives a packet from the network, what usually happens is that the card asks the CPU for attention by raising an interrupt. Then the CPU determines who caused the interrupt, and runs the card's driver interrupt handler which will in turn read the card's interrupt status to determine what the card wanted, and then in this case, run the receive portion of the card's driver, and finally exits.
Now imagine you are getting lots of Rx data, say 10 thousand packets per second all the time on some server. You can imagine that the above IRQ run-around into and out of the Rx portion of the driver adds up to a lot of overhead. A lot of CPU time could be saved by essentially turning off the Rx interrupt and just hanging around in the Rx portion of the driver, since it knows there is pretty much a steady flow of Rx work to do. This is the basic idea of NAPI.
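The hybrid scheme can be modelled in a few lines of user-space C. This is a toy simulation, not driver code: the structure, its fields, and the function names are invented for illustration, and the "device" is just a counter of pending frames.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_nic {
    int  rx_pending;      /* frames waiting in the device's receive ring */
    bool irq_enabled;     /* device may raise an Rx interrupt */
    bool poll_scheduled;  /* driver has asked to be polled */
    int  irq_count;       /* interrupts actually taken */
    int  delivered;       /* frames handed to the protocol layers */
};

void toy_nic_init(struct toy_nic *nic)
{
    nic->rx_pending = 0;
    nic->irq_enabled = true;
    nic->poll_scheduled = false;
    nic->irq_count = 0;
    nic->delivered = 0;
}

/* A frame arrives. An interrupt is taken only if Rx interrupts are not
 * masked; the handler masks them and asks for polling instead. */
void toy_frame_arrives(struct toy_nic *nic)
{
    nic->rx_pending++;
    if (nic->irq_enabled) {
        nic->irq_count++;
        nic->irq_enabled = false;
        nic->poll_scheduled = true;
    }
}

/* Poll routine: handle at most 'budget' frames. If the ring was drained
 * before the budget ran out, return to interrupt mode. */
int toy_poll(struct toy_nic *nic, int budget)
{
    int work = 0;

    assert(budget > 0);
    if (!nic->poll_scheduled)
        return 0;
    while (work < budget && nic->rx_pending > 0) {
        nic->rx_pending--;
        nic->delivered++;
        work++;
    }
    if (work < budget) {
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return work;
}
```

In this model a burst of 100 frames costs one interrupt and seven polls with a budget of 16, rather than 100 interrupts; under light load every frame still gets an immediate interrupt.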
For additional explanation, see:
The function e100_poll is bound as the poll method of netdev in the e100_probe function, and the polling weight (priority) is specified.
The data structure passed to netif_rx() is a struct sk_buff, which is allocated by calling netdev_alloc_skb().
The function netdev_alloc_skb() is called from e100_rx_alloc_skb(), which allocates a socket buffer of size determined by:
Device-dependent details: The "RFD" in the comments refers to a "Receive Frame Descriptor", and the "RFA" refers to the "Receive Frame Area", which is a linked list of RFDs. For output, there is a linked list of CFDs (Command Frame Descriptors), called the CBL (Command Block List). The other communication with the device is through a third memory area, called the CSR (Control/Status Registers).


Each RFD (Receive Frame Descriptor) has the following fields:

The driver can add blocks to the end of the chain while the device is processing blocks earlier in the chain.
Each CB (Command Block) has the following fields:
For more information on the E100 programming interface, see the Intel documentation.
All interrupts can be masked by the Mask bit in the SCB command word.
The function e100_rx_alloc_skb() is called in two places, principally in e100_rx_alloc_list().
The function e100_rx_alloc_list() is called from several places, including e100_up(). It actually allocates objects of the datatype struct rx, which includes an sk_buff along with forward and backward links and a DMA address.
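As a sketch of the data structure only, the following user-space code allocates an array of descriptors in one call and links them into a ring with forward and backward pointers. The toy_rx type and its field names are assumptions made for this example; the real struct rx and the per-slot sk_buff allocation and DMA mapping are not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rx. */
struct toy_rx {
    struct toy_rx *next, *prev;  /* ring links */
    void          *skb;          /* would point at this slot's buffer */
    unsigned long  dma_addr;     /* would hold the bus address of the buffer */
};

/* Allocate 'count' descriptors as one zeroed array and link them into a
 * circular, doubly linked ring. The caller releases the ring with free(). */
struct toy_rx *toy_rx_alloc_list(size_t count)
{
    struct toy_rx *rxs;

    assert(count > 0);
    rxs = calloc(count, sizeof(*rxs));
    if (rxs == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        rxs[i].next = (i + 1 < count) ? &rxs[i + 1] : &rxs[0];
        rxs[i].prev = (i == 0) ? &rxs[count - 1] : &rxs[i - 1];
    }
    return rxs;
}
```

Because the ring is circular, the driver can keep "next to clean" and "next to refill" cursors that simply follow the next pointers without ever testing for the end of the array.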
The function e100_up() is called from several places, principally from e100_open(), which is one of the standard network device entry points (exported methods).
Both struct rx and struct sk_buff include forward and backward links. How is each pair used?
Besides allocating a list of receive frame descriptors, e100_up() installs an interrupt handler, e100_intr.
The function e100_intr() does several things, including reading the status of the NIC, acknowledging the interrupt, and calling __netif_rx_schedule_prep and then __netif_rx_schedule.
These two are actually wrappers for napi_schedule_prep and __napi_schedule. The former is part of a two-phase protocol for waking up the NAPI polling thread without accidentally running two copies of the polling routine (perhaps on different CPUs). The latter adds the device to this CPU's softnet_data poll_list and raises the softirq NET_RX_SOFTIRQ to indicate that the list needs polling. The handler for that softirq will do the next step in processing of the incoming packet.
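The two-phase protocol can be approximated in user space with a C11 atomic flag. This is a sketch of the idea (an atomic test-and-set decides who may schedule), not the kernel implementation; the toy_ names are hypothetical and the per-CPU poll list is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_napi {
    atomic_flag sched;        /* set while a poll is scheduled or running */
    int         times_queued; /* stands in for "added to the poll list" */
};

void toy_napi_init(struct toy_napi *n)
{
    assert(n != NULL);
    atomic_flag_clear(&n->sched);
    n->times_queued = 0;
}

/* Phase 1: try to claim the instance. Exactly one caller can win until
 * toy_napi_complete() releases the claim. */
bool toy_schedule_prep(struct toy_napi *n)
{
    return !atomic_flag_test_and_set(&n->sched);
}

/* Phase 2: called only by the winner of phase 1. */
void toy_schedule(struct toy_napi *n)
{
    n->times_queued++;
}

/* Called by the poll routine when it has no more work to do. */
void toy_napi_complete(struct toy_napi *n)
{
    atomic_flag_clear(&n->sched);
}
```

An interrupt handler would write `if (toy_schedule_prep(&n)) toy_schedule(&n);`. A second interrupt that arrives before the poll completes loses phase 1 and does nothing, so the device is never queued twice.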
Why is no lock needed to add this device to the "poll list" for the current CPU?
In net_dev_init, which is called during system initialization to initialize the network device subsystem, the function net_rx_action is attached to the softirq NET_RX_SOFTIRQ.
When the function net_rx_action is called, in response to this softirq, it runs through a per-CPU softnet_data queue, executing the poll method of each device on the queue.
In the case of the e100, the polling function is e100_poll. The most interesting part of this function is the call to e100_rx_clean.
The function e100_rx_clean() calls e100_rx_indicate() for each received frame, and replenishes the receive sk_buff list for the device.
The function e100_rx_indicate() does several things, including setting the data and tail pointers of the sk_buff to indicate the actual beginning and end of the data (via calls to skb_reserve and skb_put), and setting the pkt_type (via eth_type_trans) and the (link layer) protocol field of the sk_buff. It finally calls netif_receive_skb().
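The pointer adjustments are easier to follow on a stripped-down model. The toy_skb type below is a hypothetical simplification that keeps only the four buffer pointers and the length; the two functions imitate what skb_reserve and skb_put do to those pointers, with assertions in place of the kernel's overflow handling.

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid data */
    unsigned char *tail;  /* end of the valid data */
    unsigned char *end;   /* end of the allocated buffer */
    size_t         len;   /* tail - data */
};

void toy_skb_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end  = buf + size;
    skb->len  = 0;
}

/* Like skb_reserve(): skip n bytes at the front of an empty buffer by
 * moving data and tail forward together. */
void toy_skb_reserve(struct toy_skb *skb, size_t n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put(): declare n more bytes at the tail to be valid data and
 * return a pointer to the first of them. */
unsigned char *toy_skb_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len  += n;
    return old_tail;
}
```

If the device wrote a 16-byte descriptor followed by a 60-byte frame into the buffer (both sizes chosen only for illustration), then toy_skb_reserve(&skb, 16) followed by toy_skb_put(&skb, 60) leaves data and tail bracketing exactly the frame, with no copying.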
The function netif_receive_skb() eventually calls deliver_skb(), which does the actual delivery to the protocol-specific handler, via the function pt_prev->func().
The loop structure in netif_receive_skb is interesting. Why is pt_prev passed to deliver_skb rather than ptype? If you are not yet familiar with the Linux kernel struct list_head usage (covered in Ch 11 under "Linked Lists"), this may be a good time to learn how and why this kind of loop works.
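One way to study the loop is to model it with a reference count. The sketch below is a simplified reading of the pattern, not the kernel source: an array replaces the list_head list, every handler consumes one reference, and all toy_ names are hypothetical. The point to observe is that delivery always lags one handler behind the scan, so only the non-final handlers need an extra reference.

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb   { int users; int freed; };
struct toy_ptype { int wanted; int calls; };

/* Drop one reference; the buffer is released when the last one goes. */
void toy_kfree_skb(struct toy_skb *skb)
{
    assert(skb->users > 0);
    if (--skb->users == 0)
        skb->freed++;
}

/* Every protocol handler consumes the reference it is given. */
void toy_handler(struct toy_skb *skb, struct toy_ptype *pt)
{
    pt->calls++;
    toy_kfree_skb(skb);
}

/* Deliver to a handler that is known not to be the last one: take an
 * extra reference first, so the buffer survives for later handlers. */
void toy_deliver_skb(struct toy_skb *skb, struct toy_ptype *pt_prev)
{
    skb->users++;
    toy_handler(skb, pt_prev);
}

/* Returns the number of extra references that had to be taken. */
int toy_receive(struct toy_skb *skb, struct toy_ptype *types, size_t n)
{
    struct toy_ptype *pt_prev = NULL;
    int extra_refs = 0;

    for (size_t i = 0; i < n; i++) {
        if (!types[i].wanted)
            continue;
        if (pt_prev != NULL) {          /* a later match exists */
            toy_deliver_skb(skb, pt_prev);
            extra_refs++;
        }
        pt_prev = &types[i];            /* defer delivery by one step */
    }
    if (pt_prev != NULL)
        toy_handler(skb, pt_prev);      /* last match gets the original */
    else
        toy_kfree_skb(skb);             /* nobody wanted the packet */
    return extra_refs;
}
```

With three interested handlers the count is bumped only twice, and in the common case of a single handler it is never bumped at all.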
The function netif_receive_skb conditionally calls netpoll_rx(), which appears to directly handle ARP and UDP packets, but seems to simply return for TCP packets. This seems to be an optional optimization to cut down on the overhead of passing certain packets up through the protocol hierarchy.
Because we have limited classroom time, we have concentrated on the e100 driver (a complex real driver) instead of the toy driver snull provided in the text. Although the textbook examples are now mostly obsolete, I have updated them enough that they will at least compile with the 2.6.25 kernel, and it may still help to review the network driver concepts in that simpler context.
In particular, it may be useful to now look back at the snull driver, starting with snull_rx, snull_regular_interrupt, and snull_napi_interrupt.
Sending of a packet starts with an application-level call, like send() or sendto(). After the system determines the route and assembles the packet with the appropriate headers, it eventually makes an internal call to the netdev->hard_start_xmit method of the actual device.
For the e100 driver, the actual function bound to this link is e100_xmit_frame(), which calls e100_exec_cb(), which calls e100_exec_cmd() to finally deliver a command to the controller. Most of the code in these functions is device-specific. Since this is a DMA device, the data is passed to and from the device indirectly, by giving pointers to memory-mapped buffers containing the commands.
Observe the call to netif_stop_queue from e100_xmit_frame() in the case that the packet used up the last available queue space for this device.
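The flow-control handshake between the driver and the upper layer can be modelled as follows. This is a toy with a made-up ring size and invented names; the boolean field stands in for the queue state that netif_stop_queue and netif_wake_queue manipulate.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_TX_RING 4   /* assumed number of transmit slots */

struct toy_txq {
    int  in_flight;     /* slots handed to the device, not yet completed */
    bool stopped;       /* upper layer must not call xmit while set */
};

/* Transmit entry point. Returns 0 if the frame was queued, -1 if the
 * caller should not have called (queue stopped or ring full). */
int toy_xmit_frame(struct toy_txq *q)
{
    if (q->stopped || q->in_flight == TOY_TX_RING)
        return -1;
    q->in_flight++;
    if (q->in_flight == TOY_TX_RING)
        q->stopped = true;      /* this frame took the last slot */
    return 0;
}

/* Transmit-complete path: the device has finished 'completed' frames,
 * so slots are free again and the queue can be restarted. */
void toy_tx_clean(struct toy_txq *q, int completed)
{
    assert(completed >= 0 && completed <= q->in_flight);
    q->in_flight -= completed;
    if (q->stopped && q->in_flight < TOY_TX_RING)
        q->stopped = false;
}
```

Stopping the queue when the last slot is taken, rather than when the next frame fails, means the upper layer normally never sees a rejected transmit.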
This driver does not address the short packet information leakage problem discussed in the text. That is OK, because the comments say "hardware padding of short packets to the minimum packet size is enabled".
A call to pci_iomap(pdev, ...) is used to provide memory-mapped access to the device registers.
The receive frame descriptors (struct rx) are allocated using kcalloc() (contiguous/array allocation).
Each receive frame descriptor (RFD) contains a pointer to an sk_buff, which is allocated by a call to dev_alloc_skb(). This is the pointer that is passed to pci_map_single().
A call to pci_map_single(...) is used to make each sk_buff accessible to the device, in e1000_tx_map, e1000_clean_rx_irq_ps, and e1000_alloc_rx_buffers_ps. For the i386, this seems to do very little, just calling flush_write_buffers() and then virt_to_phys(), since the buffer's virtual address is assumed to lie already in the range of pages that are always mapped into kernel virtual memory.
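For addresses in that always-mapped range, the translation is plain arithmetic. The sketch below assumes the default i386 split, in which kernel virtual addresses start at 0xC0000000; the toy_ names are hypothetical, and the code models only the arithmetic, not the real virt_to_phys.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: default 3 GB user / 1 GB kernel split on i386. */
#define TOY_PAGE_OFFSET 0xC0000000u

/* Kernel virtual address in the direct-mapped range -> physical address. */
uint32_t toy_virt_to_phys(uint32_t vaddr)
{
    assert(vaddr >= TOY_PAGE_OFFSET);
    return vaddr - TOY_PAGE_OFFSET;
}

/* The inverse mapping, valid only for directly mapped physical memory. */
uint32_t toy_phys_to_virt(uint32_t paddr)
{
    assert(paddr < UINT32_MAX - TOY_PAGE_OFFSET + 1u);
    return paddr + TOY_PAGE_OFFSET;
}
```

On a platform with an IOMMU, or for buffers outside the direct-mapped range, the mapping call has real work to do, which is why drivers go through the DMA mapping interface instead of doing this subtraction themselves.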
The E100 driver does not handle scatter-gather output. For an example of memory mapping to implement scatter-gather see the function e1000_tx_map in the E1000 driver. This function is responsible for setting up the DMA mapping for all the elements of a scatter-gather list.
request_irq is called in e100_up to bind the handler e100_intr.
e100_intr reads the status of the device and acknowledges the interrupt to the device. If the device has hit "receive no resource" (meaning it is out of receive buffer space), a flag is set to indicate that the device is not receiving any more. In all cases, netif_rx_schedule is called to make certain an effort is under way to process the packets already received.
| © 2004-2008 T. P. Baker. ($Id: ch17.html,v 1.1 2008/04/28 12:41:35 baker Exp baker $) |