Real Time Systems: Notes

Benchmarks & Time Metrics

 

The Need for Time Measurement

Some time values must be known to build a real time system, e.g.


Notes:

Building a real time system requires that we be able to manage time. To manage time accurately we need to know the values of some parameters that depend on the hardware and operating system. We will look at the ways people measure these times.

The way that requires the least special tool support is to run experiments, using "benchmark" programs.

Benchmark Program

Basic Idea

Apply the "Scientific Method"

Running a benchmark program is an experiment.

  1. Observe
  2. Analyze the observations
  3. Hypothesize an explanation for what you observed
  4. Design an experiment to confirm or deny the hypothesis
  5. Repeat from the beginning, observing the results of the experiment

Notes:

Running a benchmark program is essentially an experiment that we do to extract information about a system. The design of a benchmark program follows the same principles as the design of any other scientific experiment. We can use it to directly measure some things; from the numbers we get we can deduce other things and make some guesses. We then run further experiments to confirm or deny our guesses.

When the results of a benchmark test run are different from what we expected, that means there is something we did not understand about the program or the system we are testing. We then need to use a combination of guesswork and deduction to come up with an explanation. Based on that we may want to revise our benchmark program, to verify the theory.

The following example illustrates how an unexpected result from an experiment leads one to rethink, and repeat, the experiment.

Inferring Clock Tick Size

theory:

questions:


Notes:

We try constructing a benchmark program to answer these questions.

An Example Benchmark Program

Look at the program clockbench0.c. The conceptual core of this program is the following loop:

  t1 = current clock value;
  for (i = 0; i<N; i++) {
      t2 = current clock value;
      d = t2 - t1;
      ...
      val[(int) d]++;
      t1 = t2;
  }

Notes:

The objective of this program is to measure the size of the clock tick. A secondary objective is to find out how long it takes to read the value of the clock. In general, how this can be done will depend on whether it takes more than one clock tick to read the clock (including the path around the loop).

The basic idea behind the program is to poll the clock value, and build a histogram of the observed changes in the clock value.
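
For concreteness, the following is a minimal sketch of this style of benchmark. It is not the actual clockbench0.c: it polls in a single loop rather than the nested loops of the actual program, and it assumes the tick clock is read with times() and the elapsed real time with gettimeofday(), as in the discussion that follows; the histogram bound and output format are arbitrary choices.

/* A simplified sketch of a clock-polling benchmark (not clockbench0.c).
   It polls the tick clock for about one second and builds a histogram
   of the observed differences between successive readings. */
#include <stdio.h>
#include <sys/times.h>
#include <sys/time.h>
#include <unistd.h>

#define MAXDIFF 16                       /* largest tick difference we histogram */

int main(void) {
  struct tms tbuf;
  struct timeval start, stop;
  long val[MAXDIFF + 1] = {0};           /* histogram of tick differences */
  clock_t t1, t2;
  long d, elapsed_ticks = 0;
  long ticks_per_sec = sysconf(_SC_CLK_TCK);

  gettimeofday(&start, NULL);
  t1 = times(&tbuf);
  while (elapsed_ticks < ticks_per_sec) { /* poll until one second of ticks has passed */
    t2 = times(&tbuf);
    d = (long)(t2 - t1);
    if (d >= 0 && d <= MAXDIFF) val[d]++;
    elapsed_ticks += d;
    t1 = t2;
  }
  gettimeofday(&stop, NULL);

  printf("elapsed time = %ld usec\n",
         (long)((stop.tv_sec - start.tv_sec) * 1000000L
                + (stop.tv_usec - start.tv_usec)));
  printf("elapsed ticks = %ld\n", elapsed_ticks);
  printf("ticks per second = %ld\n", ticks_per_sec);
  printf("ticks frequency\n----- --------\n");
  for (d = 0; d <= MAXDIFF; d++)
    if (val[d] > 0) printf("%5ld %8ld\n", d, val[d]);
  return 0;
}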

Clock Tick Size Less than Polling Interval


Notes:

If the time to poll the clock is less than one tick, we should be able to poll the clock several times in succession without the value changing. Eventually the value of the clock will change, indicating one "tick" event has passed. The histogram should show many instances of zero change in clock value.

Clock Tick Size Greater than Polling Interval


Notes:

If the time to poll the clock is more than one tick, we should see a change in the clock value every time we poll the clock. If there are no other conflicting activities in the system, it should take a constant amount of time to complete one polling cycle. This value will be between k and k+1 clock ticks, for some integer k. We can infer the value of k by looking for the two peaks in the histogram.

The next step is to relate clock ticks to real time. To do this, we wrap the inner loop in an outer loop, which repeats the inner loop until a rather large independently measurable time has elapsed. We measure the time using an independent device, and compute the average time per tick.

A primitive way to do this is to run for a long enough time that a person can time the execution on a clock or watch. There will be perhaps a fraction of a second of human error in observation, so the run must be long enough that this error is a small fraction of the total time measured. For example, if we have an error of one second in reading the watch, but run the experiment for two minutes, the relative error is 1/120, which is less than one percent.

If we trust the system's real time clock, we can use it, instead of an outside timekeeping device, to measure the tick size of the tick-counting clock. That is what the example does.

Example Output

Running clockbench0 on a 500 MHz Pentium III running Red Hat Linux 9.0, we get the following output:

elapsed ticks = 100
ticks per second =        100
ticks frequency
----- --------
    0 4124922
    1    78

Notes:

Since the number of ticks per second is 100, and the test is set up to run for one second, we would expect the test to count 100 ticks. However, the histogram shows a count of only 78 for the value one (the clock difference corresponding to one tick). What is happening?

Suppose we run the test a few more times.

Analysis of Variations on Repeated Experiments

elapsed time (usec)   observed ticks
-------------------   --------------
1000010                           78
1000002                           91
1000021                           93
1000008                           89
1000016                           60
1000022                           89
1000001                           80

Notes:

The average number of ticks per second does not vary, which is consistent with our hypothesis of a fixed clock tick size. However, the elapsed time, the observed number of ticks, and the number of times around the inner loop all vary somewhat.

The observed variation in elapsed time reported by gettimeofday() is about 22 microseconds. We do not know the granularity of gettimeofday(). We really should do another experiment to find that, but from the data we have we can make a reasonable initial guess. If the granularity were much larger than one microsecond, the microsecond values returned should cluster around multiples of the gettimeofday() tick size. What we see in our limited sample looks as if there might be clustering around multiples of ten, or perhaps gettimeofday() is accurate down to a microsecond. So, we guess we can rely on the values being accurate down to at least ten microseconds.

Note: The above reasoning is not always valid. Some Unix systems fake the microsecond part of the gettimeofday() value. They add a tiny increment to the clock every time a process reads it, to ensure the clock can be used to obtain unique timestamp values. There is also a timer that generates an interrupt periodically, say every 10 milliseconds. The timer interrupt handler increments the clock to the next multiple of 10 milliseconds. If one polls the clock, the value will appear to increase by a small amount every time it is polled, and then it will periodically jump to the next multiple of the timer interrupt period.

The variation is too small to be explained by preemption by another Linux process.

Another possible explanation is the loop structure. We only check the outer loop exit condition after the inner loop has gone around 100 times. Sometimes the 100th clock tick will occur near the beginning of the inner loop, and we will need to complete up to 100 iterations before we get a chance to terminate the outer loop.

If we suppose that is the only cause of variation, we would infer that the average time to execute 100 iterations of the inner loop is at least 22 microseconds, i.e., at least about 0.22 microseconds per iteration. We will verify this later, using a different line of reasoning.

The variation in the number of counted ticks is worrisome. There should be 100 ticks, but we are missing some of them. Why is that?

We made an error in the benchmark program. There is a window between the end of the inner loop and the top of the outer loop where we may miss a tick, because we reset variable t1 there.

To fix this, we modify the code of the benchmark.

Revised Benchmark and Output

See the revised benchmark program clockbench.c.

Running it, we have the following output:

elapsed time = 1000012 usec
elapsed ticks = 100
ticks per second =        100
ticks frequency
----- --------
    0 4167600
    1   100

Notes:

Repeated runs continue to show a variation of about 20 microseconds in the elapsed time, and some variation in the histogram value at zero, but the histogram value at one is always 100. We are no longer missing any ticks.

What Can We Infer from the Output?


Notes:

We have a second way to infer the execution time of one iteration of the inner loop, using the sum of the values in the histogram. In the few experiments we ran, that sum ranged between 4125000 and 4167800. That is the number of iterations per second, so the time per iteration is about 1/4125000 = 2.4x10^-7 seconds, i.e., roughly 0.24 microseconds per iteration, or about 24 microseconds per 100 iterations, consistent with the earlier estimate of at least 22 microseconds.

What Does it Do on Another Machine?

Running the program on a 450 MHz SPARC running SunOS 5.8, we get the following output:

elapsed time = 1000105 usec
elapsed ticks = 100
ticks per second =        100
ticks frequency
----- --------
    0 657700
    1   100

Notes:

The Pentium III program goes around the inner loop about 4,125,000 times per second, while the SPARC manages only about 657,800. This may be because the SPARC is slower overall, or because the times() operating system call takes longer on SunOS than on Linux.

Measuring the Execution Time of a Subprogram


Notes:

A prerequisite for predicting that a job can be executed to meet a deadline is to determine how long it takes to execute the corresponding code. Suppose the job is to execute a call to the function f(). How do we determine how long it takes to execute a call to f()?

We will look at several ways of measuring the execution time using the system clock. Of course none of these is completely reliable, since they are measuring only single executions. They do not take into account the several factors that could cause variations in execution time of different calls to the same function (e.g., caching, memory refresh, interrupts, page faults).

Basic Clock Reading

t1 = clock();
f();
t2 = clock();
d = t2 - t1;

Notes:

d represents the combined time spent executing calls to f() and clock().

t0 = clock();
t1 = clock();
f();
t2 = clock();
d = (t2 - t1) - (t1 - t0);

Notes:

d represents the time spent executing the call to f(), since we have subtracted the time it takes to do one call to clock().

This may result in an underestimate, though. The second call to clock() may go faster, because some of the code and data may still be in cache from the first call.

On a system with a coarse-grained clock, the difference in clock values may be zero ticks or so few ticks that the relative error (percent) in measurement is unsatisfactory. If so, then we can try to compensate for this problem by calling the function f() many times, and averaging.

Averaging

t1 = clock();
for (i = 0; i < N; i++) {
  f();
}
t2 = clock();
d = t2 - t1;

Notes:

The value of d represents the time spent executing N calls to f(), plus N trips through the loop control structure, plus one call to clock().

Dual Loop Averaging

t0 = clock();
for (i = 0; i < N; i++) {
  ;
}
t1 = clock();
for (i = 0; i < N; i++) {
  f();
}
t2 = clock();
d = (t2 - t1) - (t1 - t0);

Notes:

The intent here is that by subtracting out the loop overhead and the clock() call overhead, we are left with N times the execution time of f(). In theory, dividing by N should give us the approximate execution time of one call to f().

This is simplistic, though.

First, we have the cache effect. The first call to f() will bring the code into cache. Unless f() (and whatever functions it calls) are huge, the code will still be in cache the next time around the loop, and so the subsequent calls will run much faster, making the average optimistic.

Second, we have a problem with the first loop. Chances are, the compiler will notice that the body is null, and optimize away the whole loop. So, subtracting out its execution time will not help improve the accuracy of our execution timing benchmark.

One thing we can do to compensate for this is to put some code into the loop that the compiler cannot optimize away. In general, a compiler will try to eliminate all obviously "useless" code. That is usually a problem with benchmarks, since their only real use is to measure times. To defeat the compiler's optimizations we need to know something about what compilers can optimize out and what they cannot. There are several techniques one might use.

Note that we do not just want to turn off the compiler's optimizations. That will probably also increase the execution time of f().

Things Compilers Usually Can't Optimize Away

Examples:


Notes:

Make certain you understand what the qualifier volatile in a variable declaration of a C program means to the programmer and the compiler.
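
For instance, the following is a minimal sketch (with a hypothetical variable name, not taken from the course code) of a function whose body a compiler cannot delete:

/* A minimal sketch.  Because the compiler must assume a volatile object
   can be read or written outside the program's control, it cannot remove
   the access below, so a loop that calls p() cannot be optimized away
   entirely. */
volatile long sink;          /* hypothetical name */

void p(void) {
  sink = sink + 1;           /* a side effect the compiler must keep */
}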

Suppose p() is a function that contains code that the compiler cannot optimize away, and we make the following changes to our dual-loop benchmark.

t0 = clock();
for (i = 0; i < N; i++) {
  p();
}
t1 = clock();
for (i = 0; i < N; i++) {
  p();
  f();
}
t2 = clock();
d = (t2 - t1) - (t1 - t0);

Notes:

The value of t1-t0 is now more likely to be comparable to the overhead of the loop and call to p() in the second loop.

Of course we still have a potential problem with cache effects making the later calls to f() speed up.

p() & Cache Effects


Notes:

We may try to force f() to always have the maximum cache misses, by writing p() to reference a wide enough range of memory locations that it forces all the code and data of f() out of cache.
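
As a rough illustration, such a p() might be sketched as follows; the cache and line sizes are assumptions that would have to be chosen to match the actual hardware, and the buffer name is hypothetical.

/* A sketch of a cache-flushing p(), assuming a cache no larger than
   CACHE_BYTES with LINE_BYTES-sized lines.  Touching one byte in every
   line of a buffer at least as large as the cache tends to evict f()'s
   data, and, with a unified cache, its code as well.  The reference
   pattern is the same on every call. */
#define CACHE_BYTES (2 * 1024 * 1024)    /* assumed total cache size */
#define LINE_BYTES  64                   /* assumed cache line size  */

static volatile char flush_buf[CACHE_BYTES];

void p(void) {
  long i;
  for (i = 0; i < CACHE_BYTES; i += LINE_BYTES)
    flush_buf[i]++;                      /* write to each cache line */
}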

However, we then run into some new dangers.

Timing Variation Problems


Notes:

First, the execution time of p() may be so much longer than that of f() that the inevitable errors and variations in the execution time of p() dominate the results. That is, the execution time of f() may be lost in the "noise" of executing p(). If we write p() in a way that its memory reference pattern is always the same, we should minimize the variations in execution time due to the logic of p().

That still leaves other causes of variation in execution times, such as memory refresh, interrupts, page faults, and preemptive scheduling of another task.

If we average, the memory refresh effects will average out, giving us a lower estimate than the worst-case execution time.

If we have permission to disable interrupts, that may be a good idea. On a hard real time OS there would be a way to do this. On ordinary Linux or Unix, it may be possible with root permission, but the overhead of trapping to the kernel to do this may again introduce noise (variability) in the time measurement.

Note: Do you understand the difference between "masking", "blocking", and "disabling" interrupts? Is there any difference? Is there any standard terminology?

Hard real time tasks generally cannot tolerate page faults (can't wait for disk I/O to bring in page). Therefore, we should lock the pages used by the benchmark into real memory, just as we would for a hard real time task.

All of these examples rely on the total execution time of the benchmark between clock calls being short enough that the test can run to completion without being preempted. With the addition of a function p() that reads or writes enough memory to invalidate the whole cache, we have a greater chance of running long enough to be preempted. On a hard real time operating system there will be a way to raise the priority of the benchmark high enough that it cannot be preempted. On ordinary Linux or Unix, one just has to keep the measured code short enough to fit in one time quantum, and then take the minimum of the times observed over many repeated runs.
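
For illustration, here is a rough sketch of how a benchmark might lock its pages and raise its priority on a system that supports the POSIX realtime extensions. The calls are standard POSIX, but they normally require special (root) privileges, and how completely they prevent page faults and preemption depends on the operating system.

/* A sketch of preparing a benchmark for timing: lock all pages into
   memory and request the highest SCHED_FIFO priority.  Error handling
   is omitted for brevity. */
#include <sys/mman.h>
#include <sched.h>

void prepare_for_timing(void) {
  struct sched_param sp;
  mlockall(MCL_CURRENT | MCL_FUTURE);         /* lock pages: no page faults */
  sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
  sched_setscheduler(0, SCHED_FIFO, &sp);     /* reduce the chance of preemption */
}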

Generally, if you have a clock with fine enough granularity, you want to avoid the averaging techniques. For example, the Pentium CPU cycle counter should be accurate enough to do simple point-to-point time measurements.
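
As a rough sketch, on a GCC-style compiler for x86 (which provides the __rdtsc() intrinsic in <x86intrin.h>, a convenience not assumed elsewhere in these notes), a point-to-point measurement of a hypothetical function f() might look like this:

/* A sketch of point-to-point measurement with the time-stamp counter.
   The result is in CPU cycles; dividing by the clock rate gives seconds.
   This still measures only a single execution. */
#include <x86intrin.h>

extern void f(void);                   /* hypothetical function under test */

unsigned long long cycles_for_f(void) {
  unsigned long long t1 = __rdtsc();
  f();
  unsigned long long t2 = __rdtsc();
  return t2 - t1;                      /* elapsed cycles, including call overhead */
}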

Indirect (Subtractive) Measurement Methods

Subtractive Measurement of Task Scheduling Overhead

(Figure: Gantt chart showing subtractive measurement.)

Subtractive Measurement of Task Scheduling Overhead

The idea is that, over a measurement interval of length t, the background task receives c2 units of processor time, so the remaining t - c2 is consumed by the roughly t/T1 jobs of the periodic foreground task, each of which costs its execution time c1 plus the scheduling overhead x. That gives the equation t - c2 = (t/T1)(c1 + x), which one then solves for x.


Notes:

The above equation is not quite exact, since the period T1 will not necessarily divide t exactly, nor will the foreground task necessarily be exactly in phase with the start of the background task. It is possible to derive upper and lower bound expressions for this error, and so determine how large t needs to be.

To see how well you understand this, you might try working out the error estimate yourself. However, the algebra and case analysis are not pretty, and so, for the purpose of discussion here, we will assume t is chosen large enough to make the amount of error acceptable.

Solving the Equation

t - c2 = (t/T1)(c1 + x)

t - c2 = t c1/T1 + t x/T1

T1 (t - c2 - t c1/T1) / t = x

T1 (1 - c2/t) - c1 = x


Notes:

The above formula will not compute an accurate answer using integer arithmetic, since the ratio c2/t will be less than one, and so it will always truncate down to zero. To achieve more accuracy with integer arithmetic, we can reorder the arithmetic as follows:

x = T1 - (T1 c2)/t - c1
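
In C, the reordered computation might be sketched as follows; the variable names are hypothetical, and a wide integer type is used to guard against overflow in the product.

/* A sketch, assuming t, T1, c1, and c2 are measured tick counts.  Forming
   the product T1*c2 before dividing by t avoids truncating c2/t to zero. */
long long scheduling_overhead(long long t, long long T1,
                              long long c1, long long c2) {
  return T1 - (T1 * c2) / t - c1;     /* x = T1 - (T1 c2)/t - c1 */
}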

Subtractive Measurement of Interrupt Handling Overhead


Notes:

This is the same technique as before, but with the interference being generated by an interrupt handler instead of a periodic task.

Bisection Technique to Find Breakdown Utilization

Breakdown Utilization

Convergence by Bisection

(Figure: convergence by bisection.)

Notes:

We will look at this technique several more times: soon, when we look at the POSIX threads program bisection.c; later, when we look at scheduling; and possibly also in a programming assignment with RTLinux.
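
As a preview of the idea, a bisection search for the breakdown utilization might be sketched as follows, assuming a hypothetical test meets_deadlines(u) that runs the task set at utilization u and reports whether every deadline was met.

/* A sketch of convergence by bisection.  The interval [lo, hi] is
   repeatedly halved around the breakdown utilization. */
double breakdown_utilization(int (*meets_deadlines)(double)) {
  double lo = 0.0, hi = 1.0;
  while (hi - lo > 0.001) {            /* stop when the interval is narrow */
    double mid = (lo + hi) / 2.0;
    if (meets_deadlines(mid))
      lo = mid;                        /* schedulable: move the lower bound up */
    else
      hi = mid;                        /* deadline missed: move the upper bound down */
  }
  return lo;
}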

More Direct Time Measure Techniques

Use of External Time Measurement Device

Remaining Inherent Problems with Measurement

Static Code Analysis

Problems with Static Code Analysis

Summary of Factors that Cause Timing Variability


Notes:

Do you understand how each of these can cause variation in execution timing?

Do you understand how some of these can interact to make timing even more variable? For example, several of the above interact with cache performance.

Can you roughly rank them in terms of the magnitude of the timing variation they are likely to cause?

For which ones can you think of a way to prevent or limit the effect?

Conclusions About Use of WCET in Scheduling

must balance:

T. P. Baker. ($Id$)