Regression

Using the method of least squares to find best-fit curves to approximate data

By a scalability curve for an algorithm implementation we shall mean an equation whose form is determined by known asymptotic properties of the algorithm and whose coefficients are determined by a least squares fit to actual timing data for the algorithm as a function of input size. For example, the merge sort algorithm, implemented as a generic algorithm named g_merge_sort(), is known to have asymptotic runtime Θ(n log n), so we will use the following form for its scalability curve:

R = A + B n log n

where R is the predicted run time on input of size n. To obtain the concrete scalability curve, we need to obtain actual timing data for the sort and use that data to find optimal values for the coefficients A and B. Note this curve will depend on the implementation all the way from source code to hardware, so it is important to keep the compiler and testing platform the same in order to compare efficiencies of different sorts using their concrete scalability curves.
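
To make this concrete, here is a minimal timing sketch for collecting (input size, runtime) samples. It uses std::stable_sort (typically merge-sort based) as a stand-in, since the exact interface of g_merge_sort() depends on the course code; the fixed seed makes runs repeatable, and averaging several repetitions per size would further reduce timing noise.

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        std::mt19937 gen(12345);                  // fixed seed for repeatable inputs
        std::uniform_int_distribution<int> dist(0, 1000000);
        for (int n : {1000, 2000, 4000, 8000, 16000}) {
            std::vector<int> v(n);
            for (int& x : v) x = dist(gen);
            auto start = std::chrono::steady_clock::now();
            std::stable_sort(v.begin(), v.end()); // stand-in; call g_merge_sort here
            auto stop = std::chrono::steady_clock::now();
            std::chrono::duration<double> elapsed = stop - start;
            std::cout << n << " " << elapsed.count() << "\n";  // one (n, t) sample
        }
    }

Each line of output is one (ni, ti) sample for the fit developed below.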

The method for finding the coefficients A and B is the method of least squares. Assume that we have sample runtime data as follows:

Input size:         n1    n2    ...    nk
Measured runtime:   t1    t2    ...    tk

and the scalability form is given by

f(n) = A + B g(n)
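
For the merge sort example, g(n) = n log n. As a minimal sketch of the model in C++ (the base of the logarithm is immaterial here, since changing it only rescales the fitted coefficient B):

    #include <cmath>

    double g(double n) { return n * std::log2(n); }   // g(n) = n log n

    double f(double n, double A, double B) {          // f(n) = A + B g(n)
        return A + B * g(n);
    }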

Define the total square error of the approximation to be the sum of squares of errors at each data point:

E = Σ [ti  -  f(ni)]²

where the sum is taken from i = 1 to i = k, k being the number of data points. (Note that the "unknown" coefficients A, B are the variables on which E depends, everything else in the formula being fixed.) The key observation in the method of least squares is that the total square error E is minimized where the gradient of E is zero, that is, where the partial derivatives of E with respect to the variables are both zero: D_A E = 0 and D_B E = 0, where D_X denotes the partial derivative with respect to X. Calculating these partial derivatives gives:

D_X E   =   -2 Σ [ti  -  f(ni)] D_X f
        =   -2 Σ [ti  -  (A  +  B g(ni))] D_X f

(where X is A or B). This gives the partial derivatives of E in terms of those of f, which may be calculated to be:

D_A f = 1
D_B f = g(n)

(because n and g(n) are constant with respect to A and B). Substituting these into the previous formula, setting the results equal to zero, and dividing through by -2 yields the following equations:

A Σ 1   +   B Σ g(ni)    =    Σ ti
A Σ g(ni)   +   B Σ (g(ni))²    =    Σ ti g(ni)

Rearranging and using Σ 1 = k yields:

k A   +   [Σ g(ni)] B    =    Σ ti
[Σ g(ni)] A   +   [Σ (g(ni))²] B    =    Σ ti g(ni)

These are two linear equations in the unknowns A and B. They have a unique solution whenever the determinant k Σ (g(ni))² - (Σ g(ni))² is nonzero, which holds as long as the g(ni) are not all equal; since timing data is collected at several distinct input sizes and any reasonable g is increasing in n, this is automatic, and thus optimal values of A and B are determined.
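
For reference, solving the system by Cramer's rule gives, with D = k Σ (g(ni))² - (Σ g(ni))² the determinant just mentioned:

A   =   [ (Σ ti) (Σ (g(ni))²)  -  (Σ g(ni)) (Σ ti g(ni)) ] / D
B   =   [ k Σ ti g(ni)  -  (Σ g(ni)) (Σ ti) ] / D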

Note that all of the coefficients in these equations may be calculated from the original data table and knowledge of the function g(n), in a spreadsheet or in a simple stand-alone program. The solution to the system of equations itself is probably easiest to find by hand by row-reducing the 2x3 augmented matrix of the system to upper triangular form and then back-substituting.
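
As an illustration of the stand-alone-program route, here is a minimal sketch (C++17) that accumulates the four sums from the data table and solves the 2x2 system in closed form (Cramer's rule, as above) rather than by hand; the names fit_scalability_curve, total_square_error, and nlogn are ours, not from any course code.

    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <stdexcept>
    #include <utility>
    #include <vector>

    // Least squares fit of f(n) = A + B g(n) to data (ni, ti); returns {A, B}.
    std::pair<double, double>
    fit_scalability_curve(const std::vector<double>& n,
                          const std::vector<double>& t,
                          double (*g)(double)) {
        const std::size_t k = n.size();
        double Sg = 0, Sgg = 0, St = 0, Stg = 0;
        for (std::size_t i = 0; i < k; ++i) {
            double gi = g(n[i]);
            Sg  += gi;         // Σ g(ni)
            Sgg += gi * gi;    // Σ (g(ni))²
            St  += t[i];       // Σ ti
            Stg += t[i] * gi;  // Σ ti g(ni)
        }
        const double D = k * Sgg - Sg * Sg;    // determinant of the 2x2 system
        if (D == 0.0)
            throw std::runtime_error("degenerate data: all g(ni) equal");
        return { (St * Sgg - Sg * Stg) / D,    // A, by Cramer's rule
                 (k * Stg - Sg * St) / D };    // B
    }

    // Total square error E of the fit, per the definition above.
    double total_square_error(const std::vector<double>& n,
                              const std::vector<double>& t,
                              double (*g)(double), double A, double B) {
        double E = 0.0;
        for (std::size_t i = 0; i < n.size(); ++i) {
            double r = t[i] - (A + B * g(n[i]));   // residual ti - f(ni)
            E += r * r;
        }
        return E;
    }

    double nlogn(double n) { return n * std::log2(n); }   // g for merge sort

    int main() {
        std::vector<double> n, t;
        for (double ni, ti; std::cin >> ni >> ti; ) {     // read "n t" pairs
            n.push_back(ni);
            t.push_back(ti);
        }
        auto [A, B] = fit_scalability_curve(n, t, nlogn);
        std::cout << "A = " << A << ", B = " << B
                  << ", E = " << total_square_error(n, t, nlogn, A, B) << "\n";
    }

Piping the (n, t) lines produced by the timing sketch at the top of this section into this program yields the concrete scalability curve, together with its total square error E.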