
Asynchronous I/O on linux

or: Welcome to hell.

"Asynchronous I/O" essentially refers to the ability of a process to perform input/output on multiple sources at one time. For instance a process may be processing a file on disk as well as serving a client over the network, essentially unrelated tasks; the question is: which should it give priority? If the process was to read from the network socket, it could be waiting ("blocking") quite a while before the data arrived, time which could have been spent processing the file. On the other hand the opposite could also be true (especially if the file resides on a slow medium such as a floppy disk, or a networked filesystem).

Updated: 15/7/2005

In general, asynchronous I/O revolves around two capabilities: the ability to determine that data is available (in the case of a network connection, terminal, or certain other devices), and the ability to determine that a pending I/O operation has completed. The first case can be generalised to include being able to determine that a device or network connection is ready to receive new data. All of these things are asynchronous events; that is, they can happen at an arbitrary time during program execution.

There are several ways to deal with these asynchronous events on linux; all of them presently have at least some minor problems, mainly due to limitations in the kernel.

"open" in non-blocking mode

It is possible to open a file (or device) in "non-blocking" mode by using the O_NONBLOCK option in the call to open. You can also set non-blocking mode on an already open file using the fcntl call. Both of these options are documented in the GNU libc documentation.
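As a minimal sketch of both approaches (the helper function names here are my own, not part of any API):

   #include <fcntl.h>

   /* Open a file or device in non-blocking mode from the start. */
   int open_nonblocking(const char *path)
   {
       return open(path, O_RDONLY | O_NONBLOCK);
   }

   /* Set non-blocking mode on an already-open file descriptor. */
   int set_nonblocking(int fd)
   {
       int flags = fcntl(fd, F_GETFL);
       if (flags == -1)
           return -1;
       return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
   }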

The result of opening a file in non-blocking mode is that calls to read() and write() will return with an error if they are unable to proceed immediately, i.e. if there is no data available to read (yet) or the write buffer is full. In itself this almost solves the initial problem of asynchronous I/O, since it is possible to continuously loop through the interesting file descriptors and check for available input (or check for readiness for output) simply by attempting a read. This technique is called polling and is problematic primarily because it needlessly consumes CPU time - that is, the program never blocks, even when no input or output is possible on any file descriptor.
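To illustrate the problem, a sketch of such a polling loop might look like the following (fds, nfds and process_data() are hypothetical):

   #include <unistd.h>

   void process_data(int fd, char *buf, ssize_t n);  /* hypothetical handler */

   /* Busy-polling over a set of non-blocking descriptors: this loop
      never blocks, so it consumes CPU even when nothing is ready. */
   void busy_poll(int *fds, int nfds)
   {
       for (;;) {
           for (int i = 0; i < nfds; i++) {
               char buf[4096];
               ssize_t n = read(fds[i], buf, sizeof buf);
               if (n > 0)
                   process_data(fds[i], buf, n);
               /* n == -1 with errno == EAGAIN just means "nothing yet" */
           }
       }
   }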

A more subtle problem with non-blocking I/O is that it generally doesn't work with regular files (this is true on linux). That is, opening a regular file in non-blocking mode has no effect: a read will always actually read some of the file, even if the program must block in order to do so. In some cases this may not be important, seeing as file I/O is generally fast enough so as to not cause long blocking periods. However, I see it as a general weakness of the technique.

Note that O_NONBLOCK also causes the open() call itself to be non-blocking for certain types of device (modems are the primary example in the GNU libc documentation). Unfortunately, there doesn't seem to be a mechanism by which you can execute an open() call in a truly non-blocking manner for all files.

The select() function

The select() function is documented in the libc manual. As noted there, a file descriptor for a regular file is considered ready for reading if it's not at end-of-file, and is always considered ready for writing. This is problematic, to some degree, as it is not true that reading from or writing to a regular file is always close to instantaneous. Consider the case of network mounted shares or floppy discs (even CD-ROMs) - long delays can occur while blocking on reads or writes to files on these devices.
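For reference, a minimal usage sketch, watching a single descriptor for readability with a five second timeout:

   #include <sys/select.h>

   /* Wait up to five seconds for fd to become readable. Returns the
      select() result: 1 if fd is ready, 0 on timeout, -1 on error. */
   int wait_readable(int fd)
   {
       fd_set readfds;
       struct timeval tv = { 5, 0 };

       FD_ZERO(&readfds);
       FD_SET(fd, &readfds);
       return select(fd + 1, &readfds, NULL, NULL, &tv);
   }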

Another problem with select() is the limited number of file descriptors which it can handle. Also, select() is not interruptible by signals, so using it to wait for I/O means that signal notification gets delayed until either an I/O event or the timeout occurs. I'm not sure if signal handling occurs while the select() call is blocking - most likely it does. However, a signal handler can interrupt the select() call only with difficulty (it must cause an I/O readiness event on one of the fds that the select() call is waiting on).

The poll() function

The poll() function, not documented in the GNU libc manual, is an alternative to select() which uses a variable sized array to hold the relevant file descriptors instead of a fixed size structure. This removes the limit on the number of file descriptors.

   #include <sys/poll.h>

   int poll(struct pollfd *ufds, unsigned int nfds, int timeout);

The structure struct pollfd is defined as:

   struct pollfd {
       int fd;        // the relevant file descriptor
       short events;  // events we are interested in
       short revents; // events which occur will be marked here
   };

The events and revents members are bitmasks containing a combination of any of the following values:

POLLIN - there is data available to be read
POLLPRI - there is urgent data to read
POLLOUT - writing now will not block

If the XOpen feature test macros are set (e.g. _XOPEN_SOURCE), the following are also available. Although they have different bit values, the meanings are essentially the same:

POLLRDNORM - data is available to be read
POLLRDBAND - there is urgent data to read
POLLWRNORM - writing now will not block
POLLWRBAND - writing now will not block

Just to be clear on this, when it is possible to write to an fd without blocking, all three of POLLOUT, POLLWRNORM and POLLWRBAND will be generated. There is no functional distinction between these values.

The following is also enabled for GNU source:

POLLMSG - a system message is available; this is used for dnotify and possibly other functions. If POLLMSG is set then POLLIN and POLLRDNORM will also be set.

The following additional values are not useful in events but may be returned in revents, i.e. they are implicitly polled:

POLLERR - an error condition has occurred
POLLHUP - hangup or disconnection of communications link
POLLNVAL - file descriptor is not open

The nfds argument should provide the size of the ufds array, and the timeout is specified in milliseconds.

The return from poll() is the number of file descriptors for which a watched event occurred (that is, an event which was set in the events field in the struct pollfd structure, or which was one of POLLERR, POLLHUP or POLLNVAL). The return may be 0 if the timeout was reached. The return is -1 if an error occurred, in which case errno will be set to one of the following:

EBADF - a bad file descriptor was given
ENOMEM - there was not enough memory to allocate file descriptor tables, necessary for poll() to function.
EFAULT - the specified array was not contained in the calling process's address space.
EINTR - a signal was received while waiting for events.

EINVAL - the nfds value is ridiculously large, that is, larger than the number of fds the process may have open. Note that this implies it may be unwise to add the same fd to the listen set twice.

Note that while poll(), unlike select(), is interruptible by a signal, it is still awkward to wait for signals in conjunction with waiting for I/O readiness events: a race condition occurs when attempting to do this, because the signal may be handled just before the poll() function is called.

The poll() call is inefficient for large numbers of file descriptors, because the kernel must scan the list provided by the process each time poll() is called, and the process must then scan the list to determine which descriptors were active. Also, poll() exhibits the same problems in dealing with regular files as select() does.
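Before moving on, a minimal usage sketch pulling the above together (fd1 and fd2 are hypothetical descriptors):

   #include <sys/poll.h>

   /* Watch two descriptors for input, with a one second timeout. */
   void watch_two(int fd1, int fd2)
   {
       struct pollfd ufds[2];

       ufds[0].fd = fd1;
       ufds[0].events = POLLIN;
       ufds[1].fd = fd2;
       ufds[1].events = POLLIN | POLLPRI;

       if (poll(ufds, 2, 1000) > 0) {
           if (ufds[0].revents & POLLIN)
               ;  /* data available on fd1 */
           if (ufds[1].revents & (POLLIN | POLLPRI))
               ;  /* normal or urgent data on fd2 */
       }
   }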

The SIGIO signal

A simple solution to the problem of wasting CPU time when using non-blocking mode is to also set the O_ASYNC mode, which causes a SIGIO signal to be generated when a socket/terminal is ready for I/O. Thus a process can use sleep(), pause() or sigsuspend() to avoid consuming CPU time while waiting for input. Like non-blocking I/O itself, however, the SIGIO mechanism does not work for regular files.

The GNU libc documentation has some information on using SIGIO. It tells how you can use the F_SETOWN argument to fcntl() in order to specify which process should receive the SIGIO signal for a given file descriptor. However, it does not mention that on linux you can also use F_SETSIG to specify an alternative signal, including a realtime signal. Usage is as follows:

   fcntl(fd, F_SETSIG, signum);

... where fd is the file descriptor and signum is the signal number you want to use. Setting signum to 0 restores the default behaviour (send SIGIO). Setting it to a non-zero value causes the specified signal to be queued, unless it is a non-realtime signal which is already pending. If the signal cannot be queued, a SIGIO is sent in the traditional manner.
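Putting this together with O_ASYNC, a setup sketch might look like this (SIGRTMIN is an arbitrary choice of realtime signal; F_SETSIG requires _GNU_SOURCE):

   #define _GNU_SOURCE
   #include <fcntl.h>
   #include <signal.h>
   #include <unistd.h>

   /* Arrange for SIGRTMIN to be queued when fd becomes ready for I/O. */
   void setup_io_signal(int fd)
   {
       fcntl(fd, F_SETOWN, getpid());  /* deliver signals to this process */
       fcntl(fd, F_SETSIG, SIGRTMIN);  /* queue SIGRTMIN instead of SIGIO */
       fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
   }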

If a signal is successfully queued due to an I/O readiness event, additional signal handler information becomes available to advanced signal handlers (see the link on realtime signals above for more information). Specifically the handler will see si_code (in the siginfo_t structure) with one of the following values:

POLL_IN - data is available
POLL_OUT - output buffers are available (writing will not block)
POLL_MSG - system message available
POLL_ERR - input/output error at device level
POLL_PRI - high priority input available
POLL_HUP - device disconnected

Note these values are not necessarily distinct from other values used by the kernel in sending signals. So it is advisable to use a signal which is used for no other purpose. Assuming that the signal is generated to indicate an I/O event, the following two structure members will be available:

si_band - contains the event bits for the relevant fd, the same as would be seen using poll()
si_fd - contains the relevant fd.

Together with poll(), the signal technique can be used to reliably wait on a set of events including both I/O readiness events and signals. Initially, the signals should be blocked. Then a poll() is performed with a zero timeout, so that it does not block. The returned set of fds can be serviced immediately; then perform a sigwaitinfo() to wait for the SIGIO, the signal set using F_SETSIG (if any), and any other signals of interest.

From that point, signal data retrieved from sigwaitinfo() can be used to determine which fds require servicing. If a SIGIO signal occurs, signal buffer overflow may have occurred. In this case the chosen I/O signal(s) should be flushed by setting them to be ignored and unmasking them. Then, the signals should be masked and their handlers restored, and another poll() should be performed. Rinse and repeat.

There is a minor race condition which needs to be avoided, specifically, the case where signals are queued just before the poll() call is executed. This could mean that the fd will be seen as active both by poll() and by sigwaitinfo(). This can easily be circumvented by also using non-blocking I/O.
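A simplified sketch of this combined technique follows; on a plain SIGIO it simply re-polls rather than flushing the signal queue as described above, and service_fd() is a hypothetical handler:

   #define _GNU_SOURCE
   #include <signal.h>
   #include <sys/poll.h>

   void service_fd(int fd);  /* hypothetical handler */

   void event_loop(struct pollfd *ufds, int nfds)
   {
       sigset_t sigs;
       siginfo_t info;

       /* Block the I/O signals before doing anything else. */
       sigemptyset(&sigs);
       sigaddset(&sigs, SIGRTMIN);
       sigaddset(&sigs, SIGIO);
       sigprocmask(SIG_BLOCK, &sigs, NULL);

       for (;;) {
           /* Zero-timeout poll: service anything already ready. */
           if (poll(ufds, nfds, 0) > 0)
               for (int i = 0; i < nfds; i++)
                   if (ufds[i].revents)
                       service_fd(ufds[i].fd);

           /* Wait for a queued I/O signal (or a plain SIGIO). */
           if (sigwaitinfo(&sigs, &info) == SIGIO)
               continue;  /* possible overflow: re-poll everything */

           service_fd(info.si_fd);  /* si_fd identifies the ready fd */
       }
   }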

Note that SIGIO can itself be selected as the notification signal. This allows the associated extra data to be retrieved; however, multiple SIGIO signals will not be queued and there is no way to detect if signals have been lost, so it is necessary to treat each SIGIO as an overflow regardless. It's much better to use a realtime signal.

Epoll

On newer kernels - since 2.5.45 - a new set of syscalls known as the epoll interface (or just epoll) is available. The epoll interface works in essentially the same way as poll(), except that the array of fds is maintained in the kernel rather than in userspace. Syscalls are available to create a set, add and remove fds from the set, and retrieve events from the set. This is much more efficient than traditional poll(), as it avoids the linear scan of the set required at both the kernel and userspace level for each poll() call.

   #include <sys/epoll.h>
   int epoll_create(int size);
   int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
   int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

epoll_create() is used to create a poll set. The size argument is an indicator only; it does not limit the number of fds which can be put into the set. The return value is a file descriptor (used to identify the set) or -1 if an error occurs (the only possible error is ENOMEM, which indicates there is not enough memory or address space to create the set in kernel space). An epoll file descriptor is deleted by calling close() and otherwise acts as an I/O file descriptor which has input available if an event is active on the set.

epoll_ctl() is used to add, remove, or otherwise control the monitoring of an fd in the set denoted by the first argument, epfd. The op argument specifies the operation, which can be any of:

EPOLL_CTL_ADD
add a file descriptor to the set. The fd argument specifies the fd to add. The event argument points to a struct epoll_event structure with the following members:

uint32_t events
a bitmask of events to monitor on the fd. The values have the same meaning as for the poll() events, though they are named with an EPOLL prefix: EPOLLIN, EPOLLPRI, EPOLLOUT, EPOLLRDNORM, EPOLLRDBAND, EPOLLWRNORM, EPOLLWRBAND, EPOLLMSG, EPOLLERR, and EPOLLHUP.

Two additional flags are possible: EPOLLONESHOT, which sets "One shot" operation for this fd, and EPOLLET, which sets edge-triggered mode (see the description of epoll_wait for more information).

In one-shot mode, a file descriptor generates an event only once. After that, the bitmask for the file descriptor is cleared, meaning that no further events will be generated unless EPOLL_CTL_MOD is used to re-enable some events.

epoll_data_t data
this is a union type which can be used to specify additional data that will be associated with events on the file descriptor. It has the following members:

   void *ptr;
   int fd;
   uint32_t u32;
   uint64_t u64;

EPOLL_CTL_MOD
modify the settings for an existing descriptor in the set. The arguments are the same as for EPOLL_CTL_ADD.
EPOLL_CTL_DEL
remove a file descriptor from the set. The data argument is ignored.

The return is 0 on success or -1 on failure, in which case errno is set to one of the following:

EBADF - the epfd argument is not a valid file descriptor
EPERM - the target fd is not supported by the epoll interface
EINVAL - the epfd argument is not an epoll set descriptor, or the operation is not supported
ENOMEM - there is insufficient memory or address space to handle the request

The epoll_wait() call is used to read events from the fd set. The epfd argument identifies the epoll set to check. The events argument is a pointer to an array of struct epoll_event structures (format specified above) which contain both the user data associated with a file descriptor (as supplied with epoll_ctl()) and the events on the fd. The size of the array is given by the maxevents argument. The timeout argument specifies the time to wait for an event, in milliseconds; a value of -1 means to wait indefinitely.

In edge-triggered mode, an event is reported only once each time the readiness state changes from inactive to active, that is, when the situation goes from being absent to being present. If the situation is not removed once the event has been reported by epoll_wait(), it will not be reported again; this is as opposed to the default "level triggered" mode, where the situation will be reported each time epoll_wait() is called.

The return from epoll_wait() is the number of events stored in the events array, which may be 0 if the timeout expired before any event occurred. The return is -1 on failure, in which case errno is set to one of:

EBADF - the epfd argument is not a valid file descriptor
EINVAL - epfd is not an epoll set descriptor, or maxevents is less than 1
EFAULT - the memory area occupied by the specified array is not accessible with write permissions

Note that an epoll set descriptor can be used much like a regular file descriptor. That is, it can be made to generate SIGIO (or another signal) when input (i.e. events) is available on it; likewise it can be used with poll() and can even be stored inside another epoll set.
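To round the section off, a minimal usage sketch of the three calls together:

   #include <sys/epoll.h>
   #include <unistd.h>

   /* Create a set, watch one descriptor for input, wait one second. */
   void epoll_example(int fd)
   {
       struct epoll_event ev, events[16];

       int epfd = epoll_create(16);    /* size is only a hint */

       ev.events = EPOLLIN;
       ev.data.fd = fd;                /* returned with each event */
       epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

       int n = epoll_wait(epfd, events, 16, 1000);
       for (int i = 0; i < n; i++)
           if (events[i].events & EPOLLIN)
               ;  /* events[i].data.fd has data to read */

       close(epfd);
   }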

Threading

The use of multiple threads is in some ways an ideal solution to the problem of asynchronous I/O, as well as asynchronous event handling in general, though it has significant problems for practical application due to the fact that each thread requires a stack (and therefore consumes a certain amount of memory), and the number of threads in a process may be limited by this and other factors.

Thus, it is impractical to assign one thread to each fd of interest. However, threading can be combined with the other techniques for asynchronous I/O to reduce the cost of scanning through a large array of fds (assuming that epoll is not available). This comes at some cost in terms of complexity, however. Also, the SIGIO technique must be extended to determine which thread the signal should really be destined for (since it is not possible to use the F_SETOWN control to set the fd owner to a single thread), and to signal the appropriate thread.

Threading allows handling different events with different priority levels. However, the SIGIO technique complicates this, as the signal can be delivered to any thread which does not block the signal. Therefore, the highest priority thread (or threads, if there are multiple signals with the same highest priority) should be the only one to have SIGIO unblocked. It's possible to do various other things to alleviate this problem, such as using a selection of different realtime signals as I/O readiness notification signals (using the F_SETSIG control).

POSIX asynchronous I/O

The POSIX asynchronous I/O interface, which is documented in the GNU libc documentation, would seem to be almost ideal for performing asynchronous I/O. After all, that's what it was designed for. Unfortunately POSIX AIO on linux suffers from a major flaw: It's implemented at user level, using threads!

(Actually, there is an AIO implementation in the kernel. I believe it's been in there since sometime in the 2.5 series. BUT it only works for regular files, and only those which were opened with a special flag - O_DIRECT - and it has other issues such as fsync not working properly. Or something. So it's basically unusable...)

POSIX AIO read calls must be assigned an amount to read. Although I haven't found any documentation to support it (apart from the glibc source) it seems that partial reads (and writes) can occur. Otherwise the interface would be of severely limited use.

There is no way to use POSIX AIO to poll a socket on which you are listening for connections. It can only be used for actually reading or writing data.

The documentation in the GNU libc manual (v2.3.1) is not complete - it doesn't document the "struct sigevent" structure used to control how notification of completed requests is performed. The structure has the following members:

sigev_notify - the notification method: SIGEV_NONE (no notification), SIGEV_SIGNAL (send a signal) or SIGEV_THREAD (call a function in a new thread)
sigev_signo - the signal number to send, for SIGEV_SIGNAL notification
sigev_value - a value to be passed to the signal handler (as si_value) or to the notification function
sigev_notify_function - the function to call, for SIGEV_THREAD notification
sigev_notify_attributes - the attributes of the thread created for SIGEV_THREAD notification

Note that in particular, "sigev_value" and "sigev_notify_attributes" are not documented in the libc manual, and the types of the fields are not specified.

Of the notification methods, sending a signal would seem at the outset to be the only appropriate choice when large amounts of concurrent I/O are taking place. Although realtime signals could be used, there is a potential for signal buffer overflow, which means signals could be lost and the completion would never be noticed (the GNU libc implementation guarantees that the signal will be raise()d if it cannot be queued, so at least one signal will be seen for multiple events - but the ever-present possibility of overflow means that for every signal received, it must be assumed that overflow has occurred).
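For illustration, a sketch of issuing a read with signal notification (SIGRTMIN is an arbitrary choice of signal; the caller must keep buf and the aiocb alive until completion, and the program must be linked with -lrt):

   #include <aio.h>
   #include <signal.h>
   #include <string.h>

   /* Start an asynchronous read; SIGRTMIN is queued on completion,
      with the aiocb pointer as its payload. */
   int start_read(int fd, char *buf, size_t len, struct aiocb *cb)
   {
       memset(cb, 0, sizeof *cb);
       cb->aio_fildes = fd;
       cb->aio_buf = buf;
       cb->aio_nbytes = len;
       cb->aio_offset = 0;

       cb->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
       cb->aio_sigevent.sigev_signo = SIGRTMIN;
       cb->aio_sigevent.sigev_value.sival_ptr = cb;

       return aio_read(cb);
       /* Later: aio_error(cb) == 0 means complete, and aio_return(cb)
          gives the byte count - which may be a partial read, as noted. */
   }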

Note that the aio_suspend() call seems to have the same efficiency problems as the original poll() call.

The main advantage of AIO is that it is efficient for disk I/O, as the data can be stored directly at the destination rather than going through an intermediate buffer, particularly with modern disk controllers which do not require a contiguous buffer. However, the notification methods are far from ideal, and the lack of complete kernel support means that POSIX AIO is not presently a high performance option on linux.

The ideal solution

If I had my way...

The epoll interface by itself is nearly enough to completely solve the asynchronous I/O problem. The obvious gap is in the handling of regular file I/O. If that gap were filled, there would be little to wish for.

With that said, the so-called "zero-copy" possibility of POSIX AIO is also attractive, but the notification methods are almost useless.

What is really needed is a combination of the two. Consider that:

... All we need is an extra poll event type, "AIO completion", and then epoll in conjunction with POSIX AIO and SIGIO could solve the problem completely.