Real Time Systems: Notes

Benchmarks & Time Metrics

 

The Need for Time Measurement

Some time values must be known to build a real time system, e.g.


Notes:

Building a real time system requires that we be able to manage time. To manage time accurately we need to know the values of some parameters that depend on the hardware and operating system. We will look at the ways people measure these times.

The way that requires the least special tool support is to run experiments, using "benchmark" programs.

Benchmark Program

Basic Idea

Apply the "Scientific Method"

Running a benchmark program is an experiment.

  1. Observe
  2. Analyze the observations
  3. Hypothesize an explanation for what you observed
  4. Design an experiment to confirm or deny the hypothesis
  5. Repeat from the beginning, observing the results of the experiment

Notes:

Running a benchmark program is essentially an experiment that we do to extract information about a system. The design of a benchmark program follows the same principles as the design of any other scientific experiment. We can use it to directly measure some things; from the numbers we get we can deduce other things and make some guesses. We then run further experiments to confirm or deny our guesses.

When the results of a benchmark test run are different from what we expected, that means there is something we did not understand about the program or the system we are testing. We then need to use a combination of guesswork and deduction to come up with an explanation. Based on that we may want to revise our benchmark program, to verify the theory.

The following example illustrates how an unexpected result from an experiment leads one to rethink, and repeat, the experiment.

Inferring Clock Tick Size

theory:

questions:


Notes:

We try constructing a benchmark program to answer these questions.

An Example Benchmark Program

Look at the program clockbench0.c. The conceptual core of this program is the following loop:

  t1 = current clock value;
  for (i = 0; i<N; i++) {
      t2 = current clock value;
      d = t2 - t1;
      ...
      val[(int) d]++;
      t1 = t2;
  }

Notes:

The objective of this program is to measure the size of the clock tick. A secondary objective is to find out how long it takes to read the value of the clock. In general, how this can be done will depend on whether it takes more than one clock tick to read the clock (including the path around the loop).

The basic idea behind the program is to poll the clock value, and build a histogram of the observed changes in the clock value.
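
For concreteness, the following is a minimal sketch of this style of benchmark. It is not the actual clockbench0.c: it polls in a single loop rather than the nested loops of the actual program, and it assumes the tick clock is read with times() and the elapsed real time with gettimeofday(), as in the discussion that follows; the histogram bound and output format are arbitrary choices.

/* A simplified sketch of a clock-polling benchmark (not clockbench0.c).
   It polls the tick clock for about one second and builds a histogram
   of the observed differences between successive readings. */
#include <stdio.h>
#include <sys/times.h>
#include <sys/time.h>
#include <unistd.h>

#define MAXDIFF 16                       /* largest tick difference we histogram */

int main(void) {
  struct tms tbuf;
  struct timeval start, stop;
  long val[MAXDIFF + 1] = {0};           /* histogram of tick differences */
  clock_t t1, t2;
  long d, elapsed_ticks = 0;
  long ticks_per_sec = sysconf(_SC_CLK_TCK);

  gettimeofday(&start, NULL);
  t1 = times(&tbuf);
  while (elapsed_ticks < ticks_per_sec) { /* poll until one second of ticks has passed */
    t2 = times(&tbuf);
    d = (long)(t2 - t1);
    if (d >= 0 && d <= MAXDIFF) val[d]++;
    elapsed_ticks += d;
    t1 = t2;
  }
  gettimeofday(&stop, NULL);

  printf("elapsed time = %ld usec\n",
         (long)((stop.tv_sec - start.tv_sec) * 1000000L
                + (stop.tv_usec - start.tv_usec)));
  printf("elapsed ticks = %ld\n", elapsed_ticks);
  printf("ticks per second = %ld\n", ticks_per_sec);
  printf("ticks frequency\n----- --------\n");
  for (d = 0; d <= MAXDIFF; d++)
    if (val[d] > 0) printf("%5ld %8ld\n", d, val[d]);
  return 0;
}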

Clock Tick Size Less than Polling Interval


Notes:

If the time to poll the clock is less than one tick, we should be able to poll the clock several times in succession without the value changing. Eventually the value of the clock will change, indicating one "tick" event has passed. The histogram should show many instances of zero change in clock value.

Clock Tick Size Greater than Polling Interval


Notes:

If the time to poll the clock is more than one tick, we should see a change in the clock value every time we poll the clock. If there are no other conflicting activities in the system, it should take a constant amount of time to complete one polling cycle. This value will be between k and k+1 clock ticks, for some integer k. We can infer the value of k by looking for the two peaks in the histogram.

The next step is to relate clock ticks to real time. To do this, we wrap the inner loop in an outer loop, which repeats the inner loop until a rather large independently measurable time has elapsed. We measure the time using an independent device, and compute the average time per tick.

A primitive way to do this is to run for a long enough time that a person can time the execution on a clock or watch. There will be perhaps a fraction of a second of human error in observation, so the run must be long enough that this error is a small fraction of the total time measured. For example, if we have an error of one second in reading the watch, but run the experiment for two minutes, the relative error is 1/120, which is less than one percent.

If we trust the system's real time clock, we can use it, instead of an outside timekeeping device, to measure the tick size of the tick-counting clock. That is what the example does.

Example Output

Running clockbench0 on a 500 MHz Pentium III running Red Hat Linux 9.0, we get the following output:

elapsed ticks = 100
ticks per second =        100
ticks frequency
----- --------
    0 4124922
    1    78

Notes:

Since the number of ticks per second is 100, and the test is set up to run for one second, we would expect the test to count 100 ticks. However, the histogram shows a count of only 78 for the value one (the clock difference corresponding to one tick). What is happening?

Suppose we run the test a few more times.

Analysis of Variations on Repeated Experiments

elapsed time (usec)   observed ticks
-------------------   --------------
1000010                           78
1000002                           91
1000021                           93
1000008                           89
1000016                           60
1000022                           89
1000001                           80

Notes:

The average number of ticks per second does not vary, which is consistent with our hypothesis of a fixed clock tick size. However, the elapsed time, the observed number of ticks, and the number of times around the inner loop all vary somewhat.

The observed variation in elapsed time reported by gettimeofday() is about 22 microseconds. We do not know the granularity of gettimeofday(). We really should do another experiment to find that, but from the data we have we can make a reasonable initial guess. If the granularity were much larger than one microsecond, the microsecond values returned should cluster around multiples of the gettimeofday() tick size. What we see in our limited sample looks as if there might be clustering around multiples of ten, or perhaps gettimeofday() is accurate down to a microsecond. So, we guess we can rely on the values being accurate down to at least ten microseconds.

Note: The above reasoning is not always valid. Some Unix systems fake the microsecond part of the gettimeofday() value. They add a tiny increment to the clock every time a process reads it, to ensure the clock can be used to obtain unique timestamp values. There is also a timer that generates an interrupt periodically, say every 10 milliseconds. The timer interrupt handler increments the clock to the next multiple of 10 milliseconds. If one polls the clock, the value will appear to increase by a small amount every time it is polled, and then it will periodically jump to the next multiple of the timer interrupt period.

The variation is too small to be explained by preemption by another Linux process.

Another possible explanation is the loop structure. We only check the outer loop exit condition after the inner loop has gone around 100 times. Sometimes the 100th clock tick will occur near the beginning of the inner loop, and we will need to complete up to 100 iterations before we get a chance to terminate the outer loop.

If we suppose that is the only cause of variation, we would infer that the average time to execute 100 iterations of the inner loop is at least 22 microseconds, i.e., at least about 0.22 microseconds per iteration. We will verify this later, using a different line of reasoning.

The variation in the number of counted ticks is worrisome. There should be 100 ticks, but we are missing some of them. Why is that?

We made an error in the benchmark program. There is a window between the end of the inner loop and the top of the outer loop where we may miss a tick, because we reset variable t1 there.

To fix this, we modify the code of the benchmark.

Revised Benchmark and Output

See the revised benchmark program clockbench.c.

Running it, we have the following output:

elapsed time = 1000012 usec
elapsed ticks = 100
ticks per second =        100
ticks frequency
----- --------
    0 4167600
    1   100

Notes:

Repeated runs continue to show a variation of about 20 microseconds in the elapsed time, and some variation in the histogram value at zero, but the histogram value at one is always 100. We are no longer missing any ticks.

What Can We Infer from the Output?


Notes:

We have a second way to infer the execution time of one iteration of the inner loop, using the sum of the values in the histogram. In the few experiments we ran, that sum ranged between 4125000 and 4167800. That is the number of iterations per second, so the time per iteration is about 1/4125000 = 2.4x10^-7 seconds, i.e., roughly 0.24 microseconds per iteration, or about 24 microseconds per 100 iterations, consistent with the earlier estimate of at least 22 microseconds.

What Does it Do on Another Machine?

Running the program on a 450 MHz SPARC running SunOS 5.8, we get the following output:

elapsed time = 1000105 usec
elapsed ticks = 100
ticks per second =        100
ticks frequency
----- --------
    0 657700
    1   100

Notes:

The Pentium III program goes around the inner loop about 4,125,000 times per second, while the SPARC manages only about 657,800. This may be because the SPARC is slower overall, or because the times() operating system call takes longer on SunOS than on Linux.

Measuring the Execution Time of a Subprogram


Notes:

A prerequisite for predicting that a job can be executed to meet a deadline is to determine how long it takes to execute the corresponding code. Suppose the job is to execute a call to the function f(). How do we determine how long it takes to execute a call to f()?

We will look at several ways of measuring the execution time using the system clock. Of course none of these is completely reliable, since they are measuring only single executions. They do not take into account the several factors that could cause variations in execution time of different calls to the same function (e.g., caching, memory refresh, interrupts, page faults).

Basic Clock Reading

t1 = clock();
f();
t2 = clock();
d = t2 - t1;

Notes:

d represents the combined time spent executing calls to f() and clock().

t0 = clock();
t1 = clock();
f();
t2 = clock();
d = (t2 - t1) - (t1 - t0);

Notes:

d represents the time spent executing the call to f(), since we have subtracted the time it takes to do one call to clock().

This may result in an underestimate, though. The second call to clock() may go faster, because some of the code and data may still be in cache from the first call.

On a system with a coarse-grained clock, the difference in clock values may be zero ticks or so few ticks that the relative error (percent) in measurement is unsatisfactory. If so, then we can try to compensate for this problem by calling the function f() many times, and averaging.

Averaging

t1 = clock();
for (i = 0; i < N; i++) {
  f();
}
t2 = clock();
d = t2 - t1;

Notes:

The value of d represents the time spent executing N calls to f(), plus N trips through the loop control structure, plus one call to clock().

Dual Loop Averaging

t0 = clock();
for (i = 0; i < N; i++) {
  ;
}
t1 = clock();
for (i = 0; i < N; i++) {
  f();
}
t2 = clock();
d = (t2 - t1) - (t1 - t0);

Notes:

The intent here is that by subtracting out the loop overhead and the clock() call overhead, we are left with N times the execution time of f(). In theory, dividing by N should give us the approximate execution time of one call to f().

This is simplistic, though.

First, we have the cache effect. The first call to f() will bring the code into cache. Unless f() (and whatever functions it calls) are huge, the code will still be in cache the next time around the loop, and so the subsequent calls will run much faster, making the average optimistic.

Second, we have a problem with the first loop. Chances are, the compiler will notice that the body is null, and optimize away the whole loop. So, subtracting out its execution time will not help improve the accuracy of our execution timing benchmark.

One thing we can do to compensate for this is to put some code into the loop that the compiler cannot optimize away. In general, a compiler will try to eliminate all obviously "useless" code. That is usually a problem with benchmarks, since their only real use is to measure times. To defeat the compiler's optimizations we need to know something about what compilers can optimize out and what they cannot. There are several techniques one might use.

Note that we do not just want to turn off the compiler's optimizations. That will probably also increase the execution time of f().

Things Compilers Usually Can't Optimize Away

Examples:


Notes:

Make certain you understand what the qualifier volatile in a variable declaration of a C program means to the programmer and the compiler.
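
For instance, the following is a minimal sketch (with a hypothetical variable name, not taken from the course code) of a function whose body a compiler cannot delete:

/* A minimal sketch.  Because the compiler must assume a volatile object
   can be read or written outside the program's control, it cannot remove
   the access below, so a loop that calls p() cannot be optimized away
   entirely. */
volatile long sink;          /* hypothetical name */

void p(void) {
  sink = sink + 1;           /* a side effect the compiler must keep */
}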

Suppose p() is a function that contains code that the compiler cannot optimize away, and we make the following changes to our dual-loop benchmark.

t0 = clock();
for (i = 0; i < N; i++) {
  p();
}
t1 = clock();
for (i = 0; i < N; i++) {
  p();
  f();
}
t2 = clock();
d = (t2 - t1) - (t1 - t0);

Notes:

The value of t1-t0 is now more likely to be comparable to the overhead of the loop and call to p() in the second loop.

Of course we still have a potential problem with cache effects making the later calls to f() speed up.

p() & Cache Effects


Notes:

We may try to force f() to always have the maximum cache misses, by writing p() to reference a wide enough range of memory locations that it forces all the code and data of f() out of cache.
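
As a rough illustration, such a p() might be sketched as follows; the cache and line sizes are assumptions that would have to be chosen to match the actual hardware, and the buffer name is hypothetical.

/* A sketch of a cache-flushing p(), assuming a cache no larger than
   CACHE_BYTES with LINE_BYTES-sized lines.  Touching one byte in every
   line of a buffer at least as large as the cache tends to evict f()'s
   data, and, with a unified cache, its code as well.  The reference
   pattern is the same on every call. */
#define CACHE_BYTES (2 * 1024 * 1024)    /* assumed total cache size */
#define LINE_BYTES  64                   /* assumed cache line size  */

static volatile char flush_buf[CACHE_BYTES];

void p(void) {
  long i;
  for (i = 0; i < CACHE_BYTES; i += LINE_BYTES)
    flush_buf[i]++;                      /* write to each cache line */
}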

However, we then run into some new dangers.

Timing Variation Problems


Notes:

First, the execution time of p() may be so much longer than that of f() that the inevitable errors and variations in the execution time of p() dominate the results. That is, the execution time of f() may be lost in the "noise" of executing p(). If we write p() in a way that its memory reference pattern is always the same, we should minimize the variations in execution time due to the logic of p().

That still leaves other causes of variation in execution times, such as memory refresh, interrupts, page faults, and preemptive scheduling of another task.

If we average, the memory refresh effects will average out, giving us a lower estimate than the worst-case execution time.

If we have permission to disable interrupts, that may be a good idea. On a hard real time OS there would be a way to do this. On ordinary Linux or Unix, it may be possible with root permission, but the overhead of trapping to the kernel to do this may again introduce noise (variability) in the time measurement.

Note: Do you understand the difference between "masking", "blocking", and "disabling" interrupts? Is there any difference? Is there any standard terminology?

Hard real time tasks generally cannot tolerate page faults (can't wait for disk I/O to bring in page). Therefore, we should lock the pages used by the benchmark into real memory, just as we would for a hard real time task.

All of these examples rely on the total execution time of the benchmark between clock calls being short enough that the test can run to completion without being preempted. With the addition of a function p() that reads or writes enough memory to invalidate the whole cache, we have a greater chance of running long enough to be preempted. On a hard real time operating system there will be a way to raise the priority of the benchmark high enough that it cannot be preempted. On ordinary Linux or Unix, one just has to keep the measured code short enough to fit in one time quantum, and then take the minimum of the times observed over many repeated runs.
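
For illustration, here is a rough sketch of how a benchmark might lock its pages and raise its priority on a system that supports the POSIX realtime extensions. The calls are standard POSIX, but they normally require special (root) privileges, and how completely they prevent page faults and preemption depends on the operating system.

/* A sketch of preparing a benchmark for timing: lock all pages into
   memory and request the highest SCHED_FIFO priority.  Error handling
   is omitted for brevity. */
#include <sys/mman.h>
#include <sched.h>

void prepare_for_timing(void) {
  struct sched_param sp;
  mlockall(MCL_CURRENT | MCL_FUTURE);         /* lock pages: no page faults */
  sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
  sched_setscheduler(0, SCHED_FIFO, &sp);     /* reduce the chance of preemption */
}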

Generally, if you have a clock with fine enough granularity, you want to avoid the averaging techniques. For example, the Pentium CPU cycle counter should be accurate enough to do simple point-to-point time measurements.
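
As a rough sketch, on a GCC-style compiler for x86 (which provides the __rdtsc() intrinsic in <x86intrin.h>, a convenience not assumed elsewhere in these notes), a point-to-point measurement of a hypothetical function f() might look like this:

/* A sketch of point-to-point measurement with the time-stamp counter.
   The result is in CPU cycles; dividing by the clock rate gives seconds.
   This still measures only a single execution. */
#include <x86intrin.h>

extern void f(void);                   /* hypothetical function under test */

unsigned long long cycles_for_f(void) {
  unsigned long long t1 = __rdtsc();
  f();
  unsigned long long t2 = __rdtsc();
  return t2 - t1;                      /* elapsed cycles, including call overhead */
}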

Indirect (Subtractive) Measurement Methods

Subtractive Measurement of Task Scheduling Overhead

(Figure: Gantt chart showing subtractive measurement.)

Subtractive Measurement of Task Scheduling Overhead

The idea is that, over a measurement interval of length t, the background task receives c2 units of processor time, so the remaining t - c2 is consumed by the roughly t/T1 jobs of the periodic foreground task, each of which costs its execution time c1 plus the scheduling overhead x. That gives the equation t - c2 = (t/T1)(c1 + x), which one then solves for x.


Notes:

The above equation is not quite exact, since the period T1 will not necessarily divide t exactly, nor will the foreground task necessarily be exactly in phase with the start of the background task. It is possible to derive upper and lower bound expressions for this error, and so determine how large t needs to be.

To see how well you understand this, you might try working out the error estimate yourself. However, the algebra and case analysis are not pretty, and so, for the purpose of discussion here, we will assume t is chosen large enough to make the amount of error acceptable.

Solving the Equation

t - c2 = (t/T1)(c1 + x)

t - c2 = t c1/T1 + t x/T1

T1 (t - c2 - t c1/T1) / t = x

T1 (1 - c2/t) - c1 = x


Notes:

The above formula will not compute an accurate answer using integer arithmetic, since the ratio c2/t will be less than one, and so it will always truncate down to zero. To achieve more accuracy with integer arithmetic, we can reorder the arithmetic as follows:

x = T1 - (T1 c2)/t - c1
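
In C, the reordered computation might be sketched as follows; the variable names are hypothetical, and a wide integer type is used to guard against overflow in the product.

/* A sketch, assuming t, T1, c1, and c2 are measured tick counts.  Forming
   the product T1*c2 before dividing by t avoids truncating c2/t to zero. */
long long scheduling_overhead(long long t, long long T1,
                              long long c1, long long c2) {
  return T1 - (T1 * c2) / t - c1;     /* x = T1 - (T1 c2)/t - c1 */
}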

Subtractive Measurement of Interrupt Handling Overhead


Notes:

This is the same technique as before, but with the interference being generated by an interrupt handler instead of a periodic task.

Bisection Technique to Find Breakdown Utilization

Breakdown Utilization

Convergence by Bisection

(Figure: convergence by bisection.)

Notes:

We will look at this technique several more times: soon, when we look at the POSIX threads program bisection.c; later, when we look at scheduling; and possibly also in a programming assignment with RTLinux.
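
As a preview of the idea, a bisection search for the breakdown utilization might be sketched as follows, assuming a hypothetical test meets_deadlines(u) that runs the task set at utilization u and reports whether every deadline was met.

/* A sketch of convergence by bisection.  The interval [lo, hi] is
   repeatedly halved around the breakdown utilization. */
double breakdown_utilization(int (*meets_deadlines)(double)) {
  double lo = 0.0, hi = 1.0;
  while (hi - lo > 0.001) {            /* stop when the interval is narrow */
    double mid = (lo + hi) / 2.0;
    if (meets_deadlines(mid))
      lo = mid;                        /* schedulable: move the lower bound up */
    else
      hi = mid;                        /* deadline missed: move the upper bound down */
  }
  return lo;
}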

More Direct Time Measure Techniques

Use of External Time Measurement Device

Remaining Inherent Problems with Measurement

Static Code Analysis

Problems with Static Code Analysis

Summary of Factors that Cause Timing Variability


Notes:

Do you understand how each of these can cause variation in execution timing?

Do you understand how some of these can interact to make timing even more variable? For example, several of the above interact with cache performance.

Can you roughly rank them in terms of the magnitude of the timing variation they are likely to cause?

For which ones can you think of a way to prevent or limit the effect?

Conclusions About Use of WCET in Scheduling

must balance:

T. P. Baker. ($Id$)