Linux Kernel & Device Driver Programming

Linux 2.6.25 Kernel Scheduling

 

Introduction

The following is a summary of the flow of control involved in waiting for an event and being awakened by an interrupt. In this scenario, task A calls the read method of a raw serial device and waits for input. Task B is scheduled while task A waits, and is then preempted when an interrupt wakes task A.

This is only one of several possible control paths through the scheduler. We assume the architecture-dependent code is for the 32-bit Intel x86 architecture.

The example is for kernel version 2.6.25, which includes two different schedulers: the "completely fair" scheduler and the "real time" scheduler.

I worked out this example by reading code. This is a skill you will need to develop, since there are no reliable up-to-date documents explaining the internals of the kernel. In fact, this is true of all large software systems. Whatever documentation exists is generally not up to date with the actual code; there will be gaps, and there will also be outright false statements. Therefore, reading code is an essential skill for any serious programmer. It would not hurt to work on developing that skill by following the process in this example.

I started at several known points, and tried to connect them. The two main known points were:

  1. wait_event_interruptible
  2. wake_up_interruptible

I read through the code of each of these, using LXR to drill down through the direct calls until I encountered an indirect call (a call through a virtual "method" of some object).

When I encountered indirect calls, I used "grep -R ... " in the Linux source tree to search for places in the code that assigned values to the named method pointer. This generally occurred within a few functions. I then used LXR to find calls to those functions, calls to the functions that called them, and so forth, reading code until I was able to discover what actual function would be called by the virtual method.

I happened to already know about entry.S and its variants, and so worked bottom-up from this to do_IRQ. However, if I had not known about this, I could have worked top-down from serio_raw_interrupt to setup_irq, from there to the ->handle_irq structure field, from there to all places where that field is used, from there to do_IRQ, from there to all places where do_IRQ is called, and that would eventually lead me to the various files entry_XXX.S.

In several cases I had to resort to grep over LXR, because LXR had not caught all the references to a given name. This was always true for field names, but also true for some function names, including references to function names from assembly code.

Along the way, in several cases, I found more than one definition of a directly or indirectly called function, because of architectural variations or alternate scheduling policies. In these cases I chose one as representative.

When I had finished tracing out the call structures for wait_event_interruptible and wake_up_interruptible, I used LXR to find calls to wait_event_interruptible in various device drivers, searched through several examples until I found one that used this mechanism in conjunction with a wakeup from an interrupt handler, and searched through those until I found one that is fairly simple.

Task A Waits

  1. Suppose task A decides to wait for some condition by calling wait_event_interruptible. Suppose this call comes from serio_raw_read. We will consider a scenario where this call blocks until task A is awakened by the arrival of data through execution of the interrupt handler serio_raw_interrupt. (A sketch of this read-side pattern appears after this list.)

  2. wait_event_interruptible expands to a call to the macro __wait_event_interruptible.

  3. __wait_event_interruptible calls prepare_to_wait with flag TASK_INTERRUPTIBLE.

  4. prepare_to_wait adds task A's wait-queue entry to the specified wait queue, and (still under protection of the wait queue's lock) sets the state of task A to TASK_INTERRUPTIBLE.

  5. Upon return from the call to prepare_to_wait, if the awaited condition still does not evaluate to true and there are no pending signals, task A calls schedule.

  6. schedule chooses the next task to run on the current processor and updates information about the currently running task (prev), which is a candidate for preemption. The choice of the next task to run is made by a call to pick_next_task. If the chosen task (next) is different from the current task (prev), context_switch is called to effect the transfer of control to the new task. All this is done with preemption disabled (implemented by a software counter, preempt_count, and the flag PREEMPT_ACTIVE), to prevent recursive calls to schedule caused by interrupts.

    Suppose the task chosen to run next is task B.

  7. context_switch transfers control to task B. Most of the work is done in a call to the assembly-language subprogram switch_to.

  8. The version of switch_to for the 32-bit Intel x86 architecture pushes some register values onto the stack of the current task and jumps to the label __switch_to to complete the context switch. Control will resume immediately at the same point, but using the stack of task B; the instructions at the point of return will pop the register values from the stack of task B (where they were saved when task B last called schedule). Control will not return to task A until the next time it is chosen by the scheduler, at which point the same code will restore the saved register values for task A.

  9. __switch_to completes the context switch, saving some more registers, and then restoring the state of task B from the saved values.

  10. Task B returns up through the nest of calls from wherever it last called schedule, and resumes execution.
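
To make step 1 above concrete, here is a minimal sketch of the read-side pattern, in the spirit of serio_raw_read but not the actual code. The names (my_dev, my_dev_read, buf, head, tail, wait) are hypothetical, the wait queue is assumed to be initialized elsewhere with init_waitqueue_head, and locking is omitted; the point is only to show where wait_event_interruptible sits.

        #include <linux/fs.h>
        #include <linux/wait.h>
        #include <linux/uaccess.h>

        /* Hypothetical device state, shared with the interrupt handler. */
        struct my_dev {
                wait_queue_head_t wait;   /* queue task A sleeps on            */
                unsigned char buf[64];    /* ring buffer filled by the handler */
                unsigned int head, tail;  /* producer / consumer indices       */
        };

        static ssize_t my_dev_read(struct file *file, char __user *ubuf,
                                   size_t count, loff_t *ppos)
        {
                struct my_dev *dev = file->private_data;
                unsigned char c;
                int error;

                /* Sleep in TASK_INTERRUPTIBLE until the condition
                 * "dev->head != dev->tail" becomes true or a signal arrives.
                 * This is the call traced in the steps above. */
                error = wait_event_interruptible(dev->wait,
                                                 dev->head != dev->tail);
                if (error)
                        return error;     /* -ERESTARTSYS: interrupted by a signal */

                c = dev->buf[dev->tail];
                dev->tail = (dev->tail + 1) % sizeof(dev->buf);

                if (copy_to_user(ubuf, &c, 1))
                        return -EFAULT;
                return 1;                 /* one byte delivered */
        }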

Task B is Interrupted

Suppose that, while task B is running, an interrupt occurs that causes serio_raw_interrupt to execute.

The following skips over the details of how the interrupt handler and pointers to other indirectly called subprograms are installed.

  1. entry_32.S contains the first-level interrupt and trap handlers for the system (Intel x86 architecture, 32-bit version).

    In this code you will find a label interrupt, which is the beginning of the common (generic) interrupt handler code. Reading down through this code you will see a call to do_IRQ.

  2. do_IRQ calls the actual handler, through the ->handle_irq method of the interrupt descriptor. This method will be an interrupt handler, attached somewhere earlier by a call to setup_irq. In general, the handler may detect that rescheduling is needed. In this case, we assume the handler is serio_raw_interrupt.

  3. serio_raw_interrupt calls macro wake_up_interruptible.
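
Below is a minimal sketch of the handler side, reusing the hypothetical my_dev structure from the read-side sketch above. The real serio_raw_interrupt is reached through the serio bus layer rather than being registered directly with request_irq, and the helper my_dev_read_hw_byte is invented, so treat this only as an illustration of where the wakeup call is made.

        #include <linux/interrupt.h>
        #include <linux/wait.h>

        unsigned char my_dev_read_hw_byte(struct my_dev *dev); /* hypothetical helper */

        static irqreturn_t my_dev_interrupt(int irq, void *dev_id)
        {
                struct my_dev *dev = dev_id;
                unsigned char data = my_dev_read_hw_byte(dev);

                /* Publish the new data first, so a reader that is not yet
                 * asleep will see the awaited condition as true ...          */
                dev->buf[dev->head] = data;
                dev->head = (dev->head + 1) % sizeof(dev->buf);

                /* ... then wake any task sleeping in wait_event_interruptible
                 * on dev->wait. This is the wake_up_interruptible call of
                 * step 3, traced in the next section.                        */
                wake_up_interruptible(&dev->wait);

                return IRQ_HANDLED;
        }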

Interrupt Handler Awakes Task A

  1. wake_up_interruptible expands to a call to __wake_up with parameter TASK_INTERRUPTIBLE.

  2. __wake_up calls __wake_up_common after acquiring the queue lock.

  3. __wake_up_common iterates over the queue entries and calls the ->func method of each wait-queue entry. The actual function called is autoremove_wake_function, which was set earlier in the raw serial I/O driver when it used the macro DEFINE_WAIT to initialize the wait-queue node. (A sketch of this binding appears after this list.)

  4. autoremove_wake_function calls default_wake_function, and if that succeeds it removes the wait queue node from the queue.

  5. default_wake_function calls try_to_wake_up.

  6. try_to_wake_up does the actual work of waking up the task, after checking that the task is not already running. There are a lot of complications, some to accommodate multiprocessor scheduling, but in the normal case control reaches the call to activate_task, under the label out_activate.

  7. activate_task calls enqueue_task.

  8. enqueue_task calls the ->sched_class->enqueue_task method of the run queue of the processor associated with the task.

    Note: It is not completely clear to me why the kernel (apparently) requires that every task -- even a task that is waiting for an event -- must have a unique processor associated with it.

    By "grepping" the full kernel source tree, we find that this method is assigned in only two places:

    1. In sched_fair.c, we find that the enqueue_task method is assigned enqueue_task_fair and the check_preempt_curr method is assigned check_preempt_wakeup.
    2. In sched_rt.c, we find that the enqueue_task method is assigned enqueue_task_rt and the check_preempt_curr method is assigned check_preempt_curr_rt.
    3. In sched_idletask.c, we find a third scheduling class object, which has no enqueue_task method. However, it has a check_preempt_curr method, which is assigned check_preempt_curr_idle.

    We continue with the case where the enqueue_task method is enqueue_task_rt. (An abridged view of these scheduling-class assignments appears after this list.)

  9. enqueue_task_rt is complicated by the introduction of the sched_rt_entity (schedulable entity) structure, which apparently may be associated with more than one task and permits tasks to be scheduled in groups. (Note: I have not yet read enough of the new scheduler code to say I understand the details of this mechanism and its purpose. I surmise the intent is to allow a task to have several different priorities and budgets associated with it.) After calling dequeue_rt_stack to remove all the associated schedulable entities from their queues, this function calls enqueue_rt_entity for each schedulable entity associated with the task.

  10. After return to try_to_wake_up, at the label out_running, there is a call to check_preempt_curr.

  11. check_preempt_curr calls the ->curr->sched_class->check_preempt_curr method of the run queue of the processor associated with the task. We continue with the case where the call is to check_preempt_curr_rt.

  12. check_preempt_curr_rt calls resched_task.

  13. resched_task calls __resched_task.

  14. __resched_task calls set_tsk_thread_flag to set the TIF_NEED_RESCHED bit of the task.

  15. set_tsk_thread_flag calls set_ti_thread_flag.

  16. set_ti_thread_flag calls in-line atomic machine code function set_bit to set the bit.

  17. Through a sequence of returns, control eventually returns to do_IRQ().
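
To connect steps 3 and 4: the binding of the wait-queue entry's ->func pointer to autoremove_wake_function is established by DEFINE_WAIT, which is used inside __wait_event_interruptible. The following is paraphrased from memory of include/linux/wait.h and kernel/wait.c in 2.6.25, not a verbatim quote, so check the tree for the exact text.

        /* DEFINE_WAIT builds a wait_queue_t whose ->func member already
         * points at autoremove_wake_function, roughly:
         *
         *   #define DEFINE_WAIT(name)                                      \
         *           wait_queue_t name = {                                  \
         *                   .private   = current,                          \
         *                   .func      = autoremove_wake_function,         \
         *                   .task_list = LIST_HEAD_INIT((name).task_list), \
         *           }
         *
         * so the indirect call made through ->func in __wake_up_common
         * lands in (again roughly):
         */
        int autoremove_wake_function(wait_queue_t *wait, unsigned mode,
                                     int sync, void *key)
        {
                int ret = default_wake_function(wait, mode, sync, key);

                if (ret)                                 /* actually woken ...  */
                        list_del_init(&wait->task_list); /* ... dequeue entry   */
                return ret;
        }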
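
Similarly, the assignments found by the grep in step 8 are struct initializers. Abridged and paraphrased from kernel/sched_rt.c in 2.6.25 (the omitted members and exact field order may differ; verify against the source), the real-time class looks roughly like this:

        const struct sched_class rt_sched_class = {
                .next               = &fair_sched_class,     /* next lower class        */
                .enqueue_task       = enqueue_task_rt,        /* used in step 8 above    */
                .dequeue_task       = dequeue_task_rt,
                .check_preempt_curr = check_preempt_curr_rt,  /* used in steps 11 and 12 */
                .pick_next_task     = pick_next_task_rt,
                /* ... remaining methods omitted ... */
        };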

Interrupt Handler calls Scheduler

  1. Upon return from the call to do_IRQ, return from the hardware interrupt is through the label ret_from_intr, which in the normal preemptive scheduling case flows through to label need_resched.

    Code at need_resched checks the value of the TIF_NEED_RESCHED flag of the current task, to decide whether to call the scheduler or not. If the TIF_NEED_RESCHED flag is set and preemption is not masked, a call is made to preempt_schedule_irq().

  2. preempt_schedule_irq calls schedule.
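
A much-simplified C rendering of what preempt_schedule_irq does is shown below; the actual 2.6.25 function also juggles the big kernel lock and contains debugging checks, so treat this as a sketch of the control flow rather than the real body.

        /* Sketch only: entered from the interrupt-return path with interrupts
         * disabled and TIF_NEED_RESCHED set (attributes and BKL handling
         * omitted). */
        void preempt_schedule_irq(void)
        {
                do {
                        add_preempt_count(PREEMPT_ACTIVE); /* forbid recursive preemption */
                        local_irq_enable();
                        schedule();                        /* may switch to task A        */
                        local_irq_disable();
                        sub_preempt_count(PREEMPT_ACTIVE);
                } while (test_thread_flag(TIF_NEED_RESCHED));
        }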

Task A Resumes

  1. As explained above when schedule was called by task A, schedule chooses the next task to run on the current processor and eventually calls context_switch to turn control of the processor over to the new task. In the scenario here, this transfer of control would be from task B (which was interrupted) to task A (which will wake up from its call to wait_event_interruptible).

  2. Task A resumes execution in schedule, which returns to __wait_event_interruptible, which returns to wait_event_interruptible, inside serio_raw_read.

Note on the Logic of wait_event_interruptible

In an earlier class, there were questions about what would happen to task A if it were executing a call to wait_event_interruptible while another task (or interrupt handler) attempted to wake task A concurrently, in the code below:

256         for (;;) {
257                 prepare_to_wait(&wq, &__wait, TASK_INTERRUPTIBLE);
258                 if (condition)
259                         break;
260                 if (!signal_pending(current)) {
261                         schedule();
262                         continue;
263                 }
264                 ret = -ERESTARTSYS;
265                 break;
266         }
267         finish_wait(&wq, &__wait);

In the first case of interest, the wakeup occurs before task A is on the event queue. The call to wake_up_interruptible will not find task A on the event queue and so will not wake it up. However, if the task that called wake_up_interruptible modified condition before that call, task A will detect the change in condition in line 258, break out of the loop, and remove itself from the wait queue at line 267.

In the second case of interest, the wakeup occurs after task A is on the event queue. The call to wake_up_interruptible will find task A on the event queue and will remove it, putting the task back onto the run queue if necessary.

A key to the correctness is that modifications to the event queue and run queue are protected by locks, which force serialization. The details of this locking, especially in the case of the run queues (which are per-processor) are rather subtle.
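
To make the first case concrete, the waker must follow the same publish-then-wake ordering shown in the handler sketch earlier (hypothetical names again); the comment spells out the interleaving that would lose a wakeup if the two steps were reversed.

        /* Hypothetical waker, reusing my_dev from the earlier sketches. */
        static void my_dev_deliver(struct my_dev *dev, unsigned char data)
        {
                dev->buf[dev->head] = data;
                dev->head = (dev->head + 1) % sizeof(dev->buf); /* (a) condition becomes true */
                wake_up_interruptible(&dev->wait);              /* (b) scan the wait queue    */
        }

        /* If (a) and (b) were reversed, this interleaving loses a wakeup:
         *
         *    waker:   wake_up_interruptible()  -- queue still empty, nothing woken
         *    task A:  prepare_to_wait(); condition false at line 258; schedule()
         *    waker:   head updated             -- but no further wake_up follows
         *
         * With the order shown, task A either sees the condition at line 258,
         * or it is already on the wait queue, where the call at (b) finds it. */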
