Linux Kernel & Device Driver Programming

Ch 16 - Block Drivers

This file uses the W3C HTML Slidy format. The "a" key toggles between one-slide-at-a-time and single-page mode, and the "c" key toggles on and off the table of contents. The ← and → keys can be used to page forward and backward. For more help on controls see the "help?" link at the bottom.

Note: There have been changes to the block-device driver API since the time the LDD3 book was last updated. The examples in the book will no longer compile. In particular, the function end_request() has been replaced.

In addition to the sbull example driver from the text, these notes use some actual Linux block device drivers, including brd (the RAM-disk driver) and sd (the SCSI-disk driver). The advantage of these over sbull is that they are real drivers that are kept up to date with the Linux kernel. A disadvantage is that they are more complicated.

What is a block device?

Registration

These calls have very little effect in the 2.6 kernel, and may be removed altogether in time.

Block device operations (struct block_device_operations)

int (*open) (struct inode *, struct file *);
int (*release) (struct inode *, struct file *);
int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
long (*unlocked_ioctl) (struct file *, unsigned, unsigned long);
long (*compat_ioctl) (struct file *, unsigned, unsigned long);
int (*direct_access) (struct block_device *, sector_t, unsigned long *);
int (*media_changed) (struct gendisk *);
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
struct module *owner;

For example, see the declaration of sd_fops in sd.c:

static struct block_device_operations sd_fops = {
        .owner                  = THIS_MODULE,
        .open                   = sd_open,
        .release                = sd_release,
        .ioctl                  = sd_ioctl,
        .getgeo                 = sd_getgeo,
#ifdef CONFIG_COMPAT
        .compat_ioctl           = sd_compat_ioctl,
#endif
        .media_changed          = sd_media_changed,
        .revalidate_disk        = sd_revalidate_disk,
};

The gendisk structure

Used by the kernel to represent a disk device or disk partition.

struct gendisk {
  int major;                  /* major number of driver */
  int first_minor;            
  int minors;                 /* maximum number of minors, = 1 for
                                 disks that can't be partitioned. */
  char disk_name[32];         /* name of major driver */
  struct hd_struct **part;    /* [indexed by minor] */
  int part_uevent_suppress;
  struct block_device_operations *fops;
  struct request_queue *queue;
  void *private_data;
  sector_t capacity;          /* count of 512-byte sectors */
  int flags;                  /*
      GENHD_FL_REMOVABLE               = removable media
      GENHD_FL_CD                      = CD-ROM         
      GENHD_FL_SUPPRESS_PARTITION_INFO = no /proc/partitions */
  struct device *driverfs_dev;
  struct kobject *holder_dir;
  struct kobject *slave_dir;
  struct timer_rand_state *random;
  int policy;
  atomic_t sync_io;               /* RAID */
  unsigned long stamp;
  int in_flight;
#ifdef  CONFIG_SMP
  struct disk_stats *dkstats;
#else
  struct disk_stats dkstats;
#endif
  struct work_struct async_notify;
};

This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.

The del_disk function mentioned in the text seems to have been renamed to del_gendisk.

Example: aic7xxx SCSI Disk Controller Driver Initialization

Example: the SD Driver

The function sd_probe is passed to function scsi_register_driver via the sd_template object, in a call from the module initialization routine init_sd. This calls driver_register (generic driver code), etc. Along one of the control paths, this eventually leads to a call to really_probe, which calls the actual probe function.

The probe function initializes the structures for the SCSI disk device.


Also look at the module initialization, in init_sd(), including:

Note that add_disk() requires that the driver be fully initialized and ready to run, since it calls methods of the driver, including fops->open() and fops->revalidate_disk(), and also generates hot-plug events.

Note that blk_alloc_queue() and blk_queue_make_request() are an alternate way of setting up a request queue, compared to the function blk_init_queue() used in the sbull example of LDD3.

Open and Release

Open:

Release:

Request Processing


The function blk_queue_make_request installs a custom request-handling function on a queue, for drivers that want to bypass the generic block layer and handle bios directly. One example is the RAM-disk driver, which installs its own request method, brd_make_request. Another is the sbull driver in the RM_NOQUEUE case, which uses the function sbull_make_request.

Example: sbull_request

static void sbull_request(request_queue_t *q)
{
  struct request *req;
  while ((req = elv_next_request(q)) != NULL) {
    struct sbull_dev *dev = req->rq_disk->private_data;
    if (!blk_fs_request(req)) {
      printk(KERN_NOTICE "Skip non-fs request\n");
      /* end_request(req, 0);  -- now obsolete */
      blk_end_request(req, -EIO, req->current_nr_sectors << 9);
      continue;
    }
    sbull_transfer(dev, req->sector, req->current_nr_sectors,
                   req->buffer, rq_data_dir(req));
    /* end_request(req, 1);  -- now obsolete */
    blk_end_request(req, 0, req->current_nr_sectors << 9);
  }
}

Basic Fields of struct request

Each request represents a range of contiguous sectors in the device address space, but they may go to/from different (non-contiguous) buffers in memory.

Request Queues in More Detail

Queue Creation and Deletion

request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);

Allocates a request queue and binds the request method and lock.

void blk_cleanup_queue(request_queue_t *);

Returns the queue to the system.

Queueing Functions

struct request *elv_next_request(request_queue_t *queue);

Returns the next request on the queue, and marks it as active.

void elv_dequeue_request(struct request_queue *q, struct request *rq);

Removes a request from the queue. (This is called from blkdev_dequeue_request().)

void elv_requeue_request(request_queue_t *q, struct request *rq);

Puts a dequeued request back onto the queue.

Queue Control Functions

void blk_stop_queue(request_queue_t *queue);
void blk_start_queue(request_queue_t *queue);

Allow driver to suspend and resume queueing.

void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);
void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);
void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);
void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);
void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);
void blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);
void blk_queue_dma_alignment(request_queue_t *queue, int mask);
void blk_queue_hardsect_size(request_queue_t *queue, unsigned short max);

Set parameters of queue.

bio Structure

How bio structs are chained

diagram from page 482

Iteration over bi_io_vec array

int segno;
struct bio_vec *bvec;
bio_for_each_segment(bvec, bio, segno) {
        /* Do something with this segment */
}

For a simple example, see the usage in brd_make_request.

For more complex examples, see the trace of how a disk write request is processed, at the end of this file.

bio Operations

Macros and inline functions are provided for operating on bio structures. A driver should use these, rather than operating directly on the structure.

#define __bio_kmap_atomic(bio, idx, kmtype) \
        (kmap_atomic(bio_iovec_idx((bio), (idx))->bv_page, kmtype) + \
         bio_iovec_idx((bio), (idx))->bv_offset)
#define __bio_kunmap_atomic(addr, kmtype) kunmap_atomic(addr, kmtype)

#define bio_iovec_idx(bio, idx) (&((bio)->bi_io_vec[(idx)]))
#define bio_iovec(bio)          bio_iovec_idx((bio), (bio)->bi_idx)
#define bio_page(bio)           bio_iovec((bio))->bv_page
#define bio_offset(bio)         bio_iovec((bio))->bv_offset
#define bio_segments(bio)       ((bio)->bi_vcnt - (bio)->bi_idx)
#define bio_sectors(bio)        ((bio)->bi_size >> 9)
#define bio_cur_sectors(bio)    (bio_iovec(bio)->bv_len >> 9)
#define bio_data(bio)           (page_address(bio_page((bio))) + bio_offset((bio)))
#define bio_barrier(bio)        ((bio)->bi_rw & (1 << BIO_RW_BARRIER))
#define bio_sync(bio)           ((bio)->bi_rw & (1 << BIO_RW_SYNC))
#define bio_failfast(bio)       ((bio)->bi_rw & (1 << BIO_RW_FAILFAST))
#define bio_rw_ahead(bio)       ((bio)->bi_rw &  (1 << BIO_RW_AHEAD))
#define bio_rw_meta(bio)        ((bio)->bi_rw &  (1 << BIO_RW_META))

static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags);
static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags);

Request Structure Fields

These are just some of the fields of struct request.

How request structs are queued

diagram from page 484

Barrier Requests

Whether to Retry a Failed Request

Request Completion Functions

int end_that_request_first(struct request *req, int error, int nr_bytes);

Indicates that the driver has finished with nr_bytes so far.

void end_that_request_last(struct request *req, int error);

It seems that in the 2.6.25 kernel the above functions are not intended to be called directly by drivers. Most drivers instead use the following function, or call it indirectly through end_request().

int __blk_end_request(struct request *rq, int error, unsigned int nr_bytes)
{
         if (blk_fs_request(rq) || blk_pc_request(rq)) {
                 if (__end_that_request_first(rq, error, nr_bytes))
                         return 1;
         }
         add_disk_randomness(rq->rq_disk);
         end_that_request_last(rq, error);
         return 0;
}

Working with bios

For example, see sbull_full_request.

That calls sbull_xfer_request.

And that calls sbull_transfer.

Block Requests and DMA

blk_rq_map_sg provides a way to set up DMA mappings for all the elements of a scatter-gather list.

For example, see the use in scsi_init_sgtable, called from scsi_init_io.

Doing without a Request Queue

For example, see use in sbull_make_request.

Command Pre-Preparation

typedef int (prep_rq_fn) (request_queue_t *, struct request *);
...
void blk_queue_prep_rq(request_queue_t *q, prep_rq_fn *pfn);

Tagged Command Queueing

int blk_queue_init_tags(request_queue_t *queue,
                             int depth,
                             struct blk_queue_tag *tags);

This is called at initialization time to register the driver as one that supports TCQ.

The functions to support TCQ are:

int blk_queue_start_tag(request_queue_t *queue, struct request *req);
struct request *blk_queue_find_tag(request_queue_t *queue, int tag);
void blk_queue_end_tag(request_queue_t *queue, struct request *req);
void blk_queue_free_tags(request_queue_t *queue);
int blk_queue_resize_tags(request_queue_t * queue, int new_depth);
void blk_queue_invalidate_tags(request_queue_t *queue);
struct blk_queue_tag *blk_init_tags(int depth);
void blk_free_tags(struct blk_queue_tag *queue);

Hierarchy of Disk Drivers

In general, block device drivers are likely to be stacked. For example, a software RAID driver might be stacked over a general-purpose SCSI driver, which is stacked over a specific device driver. The following is a possible sequence of calls through such a hierarchy, based on my browsing of the code. I did not have time to read through every line of code carefully enough to be absolutely certain, but I tried to follow what appeared to be the most likely (normal) control flow. This example gives some idea of how all the pieces fit together.

  1. User calls write (fd, buf, count), and traps into kernel.

  2. The trap is handled by the system call trap handler, in entry.S, which executes

      call *sys_call_table(,%eax,4)
  3. Entry 4 of the table sys_call_table, in syscall_table.S, contains the entry-point address of sys_write.

  4. sys_write calls vfs_write(file, buf, count, &pos).

  5. vfs_write calls file->f_op->write(file, buf, count, pos).

  6. Assuming the file that corresponds to the file parameter is in an ext2 filesystem, the dispatching vector would be ext2_file_operations, declared in linux/fs/ext2/file.c, whose address is stored into inode->i_fop by ext2_create. The relevant lines of ext2_file_operations are:

       .write = do_sync_write
            ...
       .aio_write = generic_file_aio_write

    So, this call would be to do_sync_write, a generic write routine in linux/fs/read_write.c.

  7. do_sync_write sets up a kiocb structure to hold the parameters, and then calls

       ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);

    Using the same dispatching vector as above, this translates to a call to generic_file_aio_write, defined in linux/mm/filemap.c.

  8. generic_file_aio_write locks the corresponding inode and calls

       __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
  9. __generic_file_aio_write_nolock does a number of checks and eventually calls

       generic_file_buffered_write(iocb, iov, nr_segs, pos, ppos, count, written);
  10. generic_file_buffered_write eventually calls generic_perform_write() which does a number of steps, including making sure the page(s) of the buffer to be written are currently resident in memory, eventually calls

       status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata);

    a_ops comes from file->f_mapping->a_ops. The corresponding dispatching table is ext2_aops, declared in linux/fs/ext2/inode.c. The relevant lines seem to be:

        .write_begin            = ext2_write_begin,
        .write_end              = generic_write_end,
  11. ... There seems to be a gap here, due to recent changes to the block I/O system. If those changes are not fundamental, there will eventually be a call to block_write_full_page().
  12. block_write_full_page does a number of preliminary steps, and then calls

       __block_write_full_page(inode, page, get_block, wbc);
  13. __block_write_full_page does a number of complicated steps, and eventually calls

       submit_bh(WRITE, bh);
  14. submit_bh allocates a bio structure, fills it in with the information about the page to be written, and eventually calls

       submit_bio(rw, bio);
  15. submit_bio eventually calls

       generic_make_request(bio);
  16. generic_make_request calls __generic_make_request(bio), which eventually calls

    q->make_request_fn(q, bio);

    This is the request method installed by some device driver for the queue. Suppose this is a RAID 1 (mirrored) device. In that case, the actual function called may be make_request defined in linux/drivers/md/raid1.c and referenced in raid1_personality.

  17. make_request identifies the devices to which the mirrored write should go, eventually creates new bio structures for each of the individual writes, merges them into a queue that will be processed by md_thread, and then calls

       md_wakeup_thread(mddev->thread);
  18. Time passes, until the thread is scheduled ...
  19. md_thread scans the queue of requests, and calls

       thread->run(thread->mddev);

    The run method for RAID1 was earlier bound to raid1d by a call to md_register_thread in run().

  20. raid1d eventually calls generic_make_request for each of the bio structures in its list.

       generic_make_request(bio);
  21. generic_make_request eventually calls

       q->make_request_fn(q, bio);
  22. ... The thread of control is difficult to follow here. Perhaps the call above is to dm_request, installed by alloc_dev.

  23. dm_request eventually calls

      __split_bio(md, bio);
  24. __split_bio eventually calls __clone_and_map, after first constructing a clone_info struct, ci.

      __clone_and_map(&ci);
  25. __clone_and_map eventually calls

      __map_bio(ti, clone, tio);
  26. __map_bio eventually calls

      ti->type->map(ti, clone, &tio->info);

    This structure appears to be declared for RAID1 as mirror_target:

       static struct target_type mirror_target = {
           .name    = "mirror",
           .version = {1, 0, 2},
           .module  = THIS_MODULE,
           .ctr     = mirror_ctr,
           .dtr     = mirror_dtr,
           .map     = mirror_map,
           .end_io  = mirror_end_io,
           .postsuspend = mirror_postsuspend,
           .resume  = mirror_resume,
           .status  = mirror_status,
       };
  27. mirror_map eventually calls

      queue_bio(ms, bio, rw);
  28. queue_bio apparently puts the request onto the queue of bios to be served.

... At this point, we again have difficulty following the flow of control. ...

Example: Processing of a SCSI Request

The following picks up processing at a lower level.

  1. During device initialization, the request method for the queue is set to scsi_request_fn().

  2. scsi_request_fn dequeues a request, translates it into a SCSI command, and eventually dispatches the command to the next lower level driver:

       rtn = scsi_dispatch_cmd(cmd);
  3. scsi_dispatch_cmd does more work, and eventually calls:

       rtn = host->hostt->queuecommand(cmd, scsi_done);

    Suppose the actual SCSI host controller is an aic7xxx. Then the function called would be ahc_linux_queue.

  4. ahc_linux_queue eventually calls

       ahc_linux_run_command(ahc, dev, cmd);
  5. ahc_linux_run_command sets up a data structure for the SCSI command block (scb), and finally calls

       ahc_queue_scb(ahc, scb);
  6. ahc_queue_scb adds the scb to the in-memory list of scb's (which is accessed via DMA by the adapter) and eventually tells the adapter about it

      ahc_outb(ahc, HNSCB_QOFF, ahc->qinfifonext);

This is the end of the first of the two steps of the write operation. The second step, which begins when the device notifies the aic7xxx driver that the scb is done, is another story...