|
This file uses the W3C HTML Slidy format. The "a" key toggles between one-slide-at-a-time and single-page mode, and the "c" key toggles on and off the table of contents. The ← and → keys can be used to page forward and backward. For more help on controls see the "help?" link at the bottom. |
|
|
|
Image from http://www.checkcomputersetup.com/hardware/computer_network_adapter.html
|
|
Image adapted from http://www.linuxjournal.com/article/4896
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The goal of using generic code wherever possible leads to many call-back interfaces:
Sometimes it is even worse, with both sides using pointers to call one another, either directly or indirectly via chains of calls involving other subsystems.
The datatype struct net_device has many fields, including:
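A rough sketch of a few of those fields, abbreviated for illustration (names as in the 2.6.3x-era kernels discussed here; the real declaration in linux/netdevice.h is much longer):

struct net_device {
    char name[IFNAMSIZ];          /* interface name, e.g. "eth0" */
    unsigned long state;          /* device state flags */
    unsigned int flags;           /* IFF_UP, IFF_BROADCAST, ... */
    unsigned int mtu;             /* maximum transfer unit */
    unsigned char *dev_addr;      /* hardware (MAC) address */
    unsigned long features;       /* NETIF_F_SG, NETIF_F_HW_CSUM, ... */
    const struct net_device_ops *netdev_ops;  /* driver callbacks: ndo_open, ndo_start_xmit, ... */
    /* ... many more fields omitted ... */
};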
This driver handles a variety of Ethernet controllers based on the Intel 82571, 82572, 82573, 82574, and 82583 chips. The main file is e1000e/netdev.c.
The module initialization function e1000_init_module calls
pci_register_driver(&e1000_driver);
That structure contains several fields, including a pointer to a probe function.
static struct pci_driver e1000_driver = {
    .name = e1000e_driver_name,
    .id_table = e1000_pci_tbl,
    .probe = e1000_probe,
    .remove = __devexit_p(e1000_remove),
    .suspend = e1000_suspend,
    .resume = e1000_resume,
    .shutdown = e1000_shutdown,
    .err_handler = &e1000_err_handler
};
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The probe function will be called when the module is inserted, and when udev detects an insertion event for a device that this driver has claimed to be able to manage in the table e1000_pci_tbl.
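For illustration, such a device-ID table looks roughly as follows; the vendor/device IDs here are placeholder examples, not the driver's actual list:

#include <linux/module.h>
#include <linux/pci.h>

static const struct pci_device_id example_pci_tbl[] = {
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x105E) },  /* placeholder device ID */
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x10D3) },  /* placeholder device ID */
    { }  /* terminating entry */
};
/* Lets udev/modprobe map a hotplugged device to this module: */
MODULE_DEVICE_TABLE(pci, example_pci_tbl);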
The PCI registration process proceeds through a series of calls:
pci_register_driver(&e1000_driver);
__pci_register_driver(driver, THIS_MODULE, KBUILD_MODNAME);
driver_register(&drv->driver);
bus_add_driver(drv);
bus_add_driver calls
kobject_init_and_add(&priv->kobj, &driver_ktype, NULL, "%s", drv->name);
to link the driver into various kernel object structures, and then calls
driver_attach(drv);
kobject_uevent(&priv->kobj, KOBJ_ADD);
driver_attach applies __driver_attach to each device on the driver's bus. For each device for which
drv->bus->match(dev, drv)
succeeds, really_probe is called, which calls
driver_sysfs_add(dev);
and then dev->bus->probe or drv->probe,
which in our case is the function
e1000_probe
Image from http://book.chinaunix.net/special/ebook/oreilly/Understanding_Linux_Network_Internals/
e1000_probe
This function does a lot of things, most of which are specific to the device. We focus on the parts that relate to its role as a generic network device.
netdev = alloc_etherdev(sizeof(struct e1000_adapter));
netdev_priv(netdev)
alloc_etherdev expands to
alloc_etherdev_mq(sizeof(struct e1000_adapter), 1);
which calls
alloc_netdev_mq(sizeof(struct e1000_adapter), "eth%d", ether_setup, 1);
which calls
setup(dev);
through the function parameter that was passed in, which in this case is ether_setup.
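The reason for passing sizeof(struct e1000_adapter) is that the driver's private state is co-allocated just past the struct net_device, and netdev_priv returns a pointer to it. A minimal sketch, with a hypothetical private structure standing in for struct e1000_adapter:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

struct example_priv {    /* hypothetical stand-in for struct e1000_adapter */
    int some_state;
};

static int example_alloc(void)
{
    struct net_device *netdev;
    struct example_priv *priv;

    netdev = alloc_etherdev(sizeof(struct example_priv));  /* one allocation for both */
    if (!netdev)
        return -ENOMEM;
    priv = netdev_priv(netdev);  /* points just past the struct net_device */
    priv->some_state = 0;
    /* ... fill in netdev fields, then register_netdev(netdev) ... */
    return 0;
}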
Among the net_device methods that e1000_probe installs are
ndo_open
ndo_stop
See e1000_open for example. Notice the up-call to the network layer function netif_start_queue.
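For a sense of the shape of these methods, here is a minimal hypothetical open/stop pair in the spirit of e1000_open, showing the netif_start_queue up-call:

#include <linux/netdevice.h>

static int example_open(struct net_device *netdev)
{
    /* ... set up rings, request the IRQ, enable the hardware ... */
    netif_start_queue(netdev);  /* up-call: tell the stack it may now queue packets */
    return 0;
}

static int example_stop(struct net_device *netdev)
{
    netif_stop_queue(netdev);   /* stop the stack from queuing further packets */
    /* ... disable the hardware, free the IRQ and rings ... */
    return 0;
}

static const struct net_device_ops example_netdev_ops = {
    .ndo_open = example_open,
    .ndo_stop = example_stop,
    /* .ndo_start_xmit etc. would also be filled in */
};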
The datatype struct sk_buff is used to hold packets that go to or from sockets. It has quite a few fields, including some tagged as "important" by the LDD3 text. Unfortunately, the names of these fields appear to have changed since the book was written. The names of the "important" fields are:
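A rough sketch of those fields as they appear in 2.6.3x-era kernels (abbreviated; LDD3's h, nh, and mac have become transport_header, network_header, and mac_header):

struct sk_buff {
    struct net_device *dev;          /* device we arrived on / leave by */
    sk_buff_data_t transport_header; /* was "h" in LDD3 */
    sk_buff_data_t network_header;   /* was "nh" in LDD3 */
    sk_buff_data_t mac_header;       /* was "mac" in LDD3 */
    sk_buff_data_t tail;             /* end of data currently in the buffer */
    sk_buff_data_t end;              /* end of the allocated buffer */
    unsigned char *head;             /* start of the allocated buffer */
    unsigned char *data;             /* start of data currently in the buffer */
    unsigned int len;                /* total bytes of packet data */
    unsigned int data_len;           /* bytes in the paged (scatter-gather) part */
    __be16 protocol;                 /* packet protocol, e.g. ETH_P_IP */
    /* ... many more fields omitted ... */
};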
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The book's comment about being prepared to have code that depends on the internals of type struct sk_buff be broken by future kernel releases is certainly true, but also applies to other aspects of the kernel. The differences even between versions 2.6.6 and 2.6.11 have been quite noticeable, as were the differences between version 2.6.11 and 2.6.25, and between 2.6.25 and 2.6.31.
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
See also scatter-gather mappings under DMA I/O.
Scatter-gather allows common fields, such as the MAC address, to be shared between packets.
Scatter-gather is also implemented by the user API. See the man-page for sendmsg for specifics. From the user level, the scatter-gather mechanism is implemented by the following structures:
struct msghdr {
void * msg_name; /* optional address */
socklen_t msg_namelen; /* size of address */
struct iovec * msg_iov; /* scatter/gather array */
size_t msg_iovlen; /* # elements in msg_iov */
void * msg_control; /* ancillary data, see below */
socklen_t msg_controllen; /* ancillary data buffer len */
int msg_flags; /* flags on received message */
};
struct iovec {
void *iov_base; /* Starting address */
size_t iov_len; /* Number of bytes */
};
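As a user-level illustration, the following sketch sends one UDP datagram gathered from two separate buffers; the destination address and port are placeholders:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int send_two_part_datagram(int sock)  /* sock: a UDP socket */
{
    char header[] = "HDR:";
    char body[] = "payload";
    struct iovec iov[2] = {
        { .iov_base = header, .iov_len = sizeof(header) - 1 },
        { .iov_base = body,   .iov_len = sizeof(body) - 1 },
    };
    struct sockaddr_in dst;
    struct msghdr msg;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);                      /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder address */

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &dst;
    msg.msg_namelen = sizeof(dst);
    msg.msg_iov = iov;      /* the scatter/gather array */
    msg.msg_iovlen = 2;

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;  /* kernel gathers both parts */
}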
It is up to the device driver to decide whether scatter-gather can be supported directly by the device, or whether to force a higher layer of the network implementation to copy scatter-gather structures provided by a user into contiguous ranges of memory.
The LDD3 text says sending is less complicated than receiving, and so chooses to treat it first. However, sending is also less interesting, since it is synchronous, pushed I/O, similar to examples we have seen with other types of devices. Since we have limited time, we will look at the more interesting case first.
Whether a driver handles scatter-gather output is advertised by the NETIF_F_SG bit in netdev->features. For an example of scatter-gather handling, see the function e1000_tx_map in the e1000 driver. This function is responsible for setting up the DMA mapping for all the elements of a scatter-gather list. Note where this driver sets netdev->features to include the feature NETIF_F_SG, so that the network device layer knows it may pass scatter-gather sk_buffs to this device.
The end goal of the driver is to pass off each packet that it receives to the higher-level protocol-handling routines of the system.
There are two models of packet reception that a driver may implement: pure interrupt-driven reception, and NAPI, a hybrid of interrupts and polling.
All the input processing builds up to this.
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
It may be surprising that polling is more efficient than interrupt-driven input, since historically interrupts were developed to address the inefficiency of polling. The reality is that we are talking about a hybrid approach, which involves both polling and interrupts and is superior to either of them alone.
Since the LDD3 text concentrates on the pure interrupt-driven case, in class we will concentrate on the polling approach, which is more effective, more interesting, and the recommended approach for all Linux network drivers.
Starting with kernel 2.6.6, the e100 driver (for an earlier, more primitive chip than the e1000) had an option CONFIG_E100_NAPI to choose between interrupt-driven I/O and polling (a.k.a. "NAPI"), but by kernel 2.6.11 NAPI (the hybrid of polling and interrupts) was the only option supported.
|
We will not discuss the API presented in the LDD3 text, since it is no longer supported. |
Image from http://www.goodexperience.com/tib/archives/signs/
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The diagrams above are reproduced from piters.home.cern.ch/piters/TCP/seminar/TDAQ-April02/TDAQ-April02.ps.
|
|
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The Ethernet-HOWTO says:
When a card receives a packet from the network, what usually happens is that the card asks the CPU for attention by raising an interrupt. Then the CPU determines who caused the interrupt, and runs the card's driver interrupt handler which will in turn read the card's interrupt status to determine what the card wanted, and then in this case, run the receive portion of the card's driver, and finally exits.
Now imagine you are getting lots of Rx data, say 10 thousand packets per second all the time on some server. You can imagine that the above IRQ run-around into and out of the Rx portion of the driver adds up to a lot of overhead. A lot of CPU time could be saved by essentially turning off the Rx interrupt and just hanging around in the Rx portion of the driver, since it knows there is pretty much a steady flow of Rx work to do. This is the basic idea of NAPI.
For additional explanation, see:
struct napi_struct has a field state.
The state value NAPI_STATE_NPSVC provides support for the "Netpoll" option, i.e., pure polling.
netif_napi_add(netdev, &adapter->napi, e1000_clean, 64);
adapter->clean_rx = e1000_clean_rx_irq;
adapter->alloc_rx_buf = e1000_alloc_rx_buffers;
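For orientation, a NAPI poll method registered this way has roughly the following shape (a hypothetical sketch; the helper functions stand in for driver-specific work such as e1000_clean's ring processing):

#include <linux/netdevice.h>

/* Hypothetical stand-ins for driver-specific work: */
static int example_rx_clean(int budget) { return 0; /* packets processed */ }
static void example_enable_rx_interrupts(void) { }

static int example_poll(struct napi_struct *napi, int budget)
{
    int work_done = example_rx_clean(budget);  /* process up to budget rx packets */

    if (work_done < budget) {
        /* Ring drained: leave polling mode and let the device
         * interrupt us again when new packets arrive. */
        napi_complete(napi);
        example_enable_rx_interrupts();
    }
    /* Returning work_done == budget keeps this device on the poll list. */
    return work_done;
}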
The e1000 driver provides optional support for Netpoll (pure polling), but we will not discuss that further here.
The data structure passed to netif_rx() is a struct sk_buff, which is allocated here:
skb = netdev_alloc_skb(netdev, bufsz);
The function netdev_alloc_skb() is called from
e1000_alloc_rx_buffers
which allocates a socket buffer of size
ETH_FRAME_LEN + VLAN_HLEN + ETH_FCS_LEN.
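Putting this together with the DMA mapping shown later, the refill step looks roughly like this (a fragment in the spirit of e1000_alloc_rx_buffers; error handling omitted):

skb = netdev_alloc_skb(netdev, bufsz);  /* bufsz = ETH_FRAME_LEN + VLAN_HLEN + ETH_FCS_LEN */
buffer_info->skb = skb;
buffer_info->dma = pci_map_single(pdev, skb->data,
                                  adapter->rx_buffer_len,
                                  PCI_DMA_FROMDEVICE);  /* device may now DMA into skb->data */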
|
|

This is the "legacy" format. There is also an extended version, which allows the device to provide additional information.
|
|
The e1000e driver does support this option, but we will not discuss any of these extended formats further here.
|
|

Again, this is the "legacy" format. There are also several extended versions, which allow the device to perform additional functions, offloading some of the work that otherwise would need to be done by the operating system.
|
|
For more details on the e1000 chips, see the Intel documentation.
In net_dev_init, which is called during system initialization to initialize the network device subsystem, the function net_rx_action is attached to the softirq NET_RX_SOFTIRQ.
When the function net_rx_action is called, in response to this softirq, it runs through a per-CPU softnet_data queue, executing the poll method of each device on the queue.
In the case of the e1000, the polling function is e1000_clean. The most interesting parts of this function are the actual polling call
adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget);
and the NAPI logic
if (work_done < budget) {
    if (likely(adapter->itr_setting & 3))
        e1000_set_itr(adapter);
    napi_complete(napi);
    if (!test_bit(__E1000_DOWN, &adapter->flags))
        e1000_irq_enable(adapter);
}
See also the corresponding code, which calls the polling function, in net_rx_action.
The function called through adapter->clean_rx(...) is one of three different functions that may be assigned in e1000_configure_rx, depending on the MTU size.
The most typical one is e1000_clean_rx_irq.
After many steps, it gets to netif_receive_skb(skb), and from there to deliver_skb, which calls the delivery method appropriate to the protocol
return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
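The function pointer pt_prev->func was installed when a protocol registered itself with dev_add_pack; for instance, IPv4 registers ip_rcv for ETH_P_IP this way. A hypothetical sketch of such a registration:

#include <linux/if_ether.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int example_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
    /* ... protocol-specific input processing ... */
    kfree_skb(skb);
    return 0;
}

static struct packet_type example_packet_type = {
    .type = __constant_htons(ETH_P_IP),  /* which ethertype we want */
    .func = example_rcv,                 /* what deliver_skb will call */
};

/* During protocol initialization: dev_add_pack(&example_packet_type); */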
Sending of a packet starts with an application-level call, such as send() or sendto(). After the system determines the route and assembles the packet with the appropriate headers, it eventually makes an internal call to the ndo_start_xmit method of the actual device.
For the e1000 driver, the actual function bound to this link is e1000_xmit_frame, which calls e1000_tx_queue to actually enqueue the frame on the transmit-buffer ring of the device.
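The general shape of such a transmit method is sketched below (hypothetical; the ring-full test and enqueue helper stand in for e1000_xmit_frame's real work):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical stand-ins for driver-specific work: */
static int example_ring_full(struct net_device *netdev) { return 0; }
static void example_tx_queue(struct net_device *netdev, struct sk_buff *skb) { }

static int example_start_xmit(struct sk_buff *skb, struct net_device *netdev)
{
    if (example_ring_full(netdev)) {
        /* No free descriptors: stop the queue and ask the stack
         * to retry this packet later. */
        netif_stop_queue(netdev);
        return NETDEV_TX_BUSY;
    }
    example_tx_queue(netdev, skb);  /* place the frame on the tx ring */
    return NETDEV_TX_OK;
}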
In e1000_probe, the device's registers are mapped into kernel virtual address space:
adapter->hw.hw_addr = ioremap(mmio_start, mmio_len);
In e1000_alloc_rx_buffers, each receive buffer is mapped for DMA:
buffer_info->dma = pci_map_single(pdev, skb->data, adapter->rx_buffer_len, PCI_DMA_FROMDEVICE);
The interrupt handler is registered by:
err = request_irq(adapter->pdev->irq, &e1000_intr, IRQF_SHARED, netdev->name, netdev);
The handler, e1000_intr, calls napi_schedule_prep to make certain that an effort is under way to process the packets already received.
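The interrupt-side half of this NAPI handshake has roughly the following shape (a hypothetical sketch in the spirit of e1000_intr; masking the device interrupt and locating the napi structure are driver-specific):

#include <linux/interrupt.h>
#include <linux/netdevice.h>

static void example_disable_rx_interrupts(void *dev_id) { }  /* hypothetical */

static irqreturn_t example_intr(int irq, void *dev_id)
{
    struct napi_struct *napi = dev_id;  /* stand-in: the driver locates its napi_struct */

    if (napi_schedule_prep(napi)) {
        /* We were the first to notice new work: mask the device's
         * interrupt and schedule the poll method to run from softirq. */
        example_disable_rx_interrupts(dev_id);
        __napi_schedule(napi);
    }
    return IRQ_HANDLED;
}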