Note: This assignment is used to assess the required outcomes for the course, as outlined in the course syllabus. These outcomes are:
These will be assessed using the following rubric:
In order to earn a course grade of C- or better, the assessment must result in Effective or Highly Effective for each outcome.
Educational Objectives: On successful completion of this assignment, the student should be able to
Background Knowledge Required: Be sure that you have mastered the material in these chapters before beginning the project: Sequential Containers, Function Classes and Objects, Iterators, Generic Algorithms, Generic Set Algorithms, Heap Algorithms, and Sorting Algorithms
Operational Objectives: Implement various comparison sorts as generic algorithms, with the minimal practical constraints on iterator types. Each generic comparison sort should be provided in two forms: (1) default order and (2) order supplied by a predicate class template parameter.
Also implement some numerical sorts as template functions, with the minimal practical constraints on template parameters. Again there should be two versions, one for default order and one for order determined by a function object whose class is passed as a template parameter.
The sorts to be developed and tested are selection sort, insertion sort, heap sort, merge sort, quick sort, counting sort, and bit sort.
Deliverables: Two files:
gsort.h  # contains the generic algorithm implementations of comparison sorts
nsort.h  # contains the numerical sorts and class Bit
Develop and fully test all of the sort algorithms listed under requirements below. Make certain that your testing includes "boundary" cases, such as empty ranges, ranges that have the same element at each location, and ranges that are in correct or reverse order before sorting. Place all of the generic sort algorithms in the file gsort.h and all of the numerical sort algorithms in the file nsort.h. Your test programs should have filename suffix .cpp and your test data files should have suffix .dat.
Turn in gsort.h and nsort.h using the script LIB/proj1/proj11submit.sh.
Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.
Note that Parts 1 and 2 have different due dates.
The two sort algorithm files are expected to operate using the supplied test harnesses: fgsort.cpp (tests gsort.h) and fnsort.cpp (tests nsort.h). Note that this means, among other things, that:
Issuing the single command "make" should build executables for all of your test functions.
The comparison sorts should be implemented as generic algorithms with template parameters that are iterator types.
Each comparison sort should have two versions, one that uses default order (operator < on I::ValueType) and one that uses a predicate object whose type is an extra template parameter.
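For concreteness, here is a minimal sketch of what the two forms might look like, using selection sort as the example. The function names follow the handout's g_ convention, but this is an illustration only, not the course library's implementation; std::iter_swap is used for the exchange, which the tcpp library may handle differently.

```cpp
#include <algorithm>  // std::iter_swap
#include <cassert>

// Default order: requires only forward iterators and operator< on the
// element type.
template <class ForwardIterator>
void g_selection_sort(ForwardIterator beg, ForwardIterator end)
{
  for (ForwardIterator i = beg; i != end; ++i)
  {
    ForwardIterator smallest = i;
    for (ForwardIterator j = i; j != end; ++j)
      if (*j < *smallest)
        smallest = j;
    std::iter_swap(i, smallest);
  }
}

// Predicate order: the ordering is supplied by a function object whose
// class P is an extra template parameter.
template <class ForwardIterator, class P>
void g_selection_sort(ForwardIterator beg, ForwardIterator end, P pred)
{
  for (ForwardIterator i = beg; i != end; ++i)
  {
    ForwardIterator smallest = i;
    for (ForwardIterator j = i; j != end; ++j)
      if (pred(*j, *smallest))
        smallest = j;
    std::iter_swap(i, smallest);
  }
}
```

Note that the empty-range boundary case falls out naturally here: when beg == end the outer loop body never runs.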
Re-use as many components as possible, especially existing generic algorithms such as g_copy (in genalg.h), g_set_merge (in gset.h), and the generic heap algorithms (in gheap.h).
Two versions of counting_sort should be implemented: the classic 4-parameter version, plus one that takes a function object as an argument. Here is a prototype for the 5-parameter version:
template < class F >
void counting_sort(const int * A, int * B, size_t n, size_t k, F f);
// Pre:  A, B are arrays of type unsigned int
//       A, B are defined in the range [0,n)
//       f is defined for all elements of A and has values in the range [0,k)
// Post: A is unchanged
//       B is a stable f-sorted permutation of A,
//       i.e., i < j ==> f(B[i]) <= f(B[j])
Test and submit both versions of counting_sort.
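As a sketch only (not the course's reference implementation), the 5-parameter version can be realized as a standard stable counting sort; the std::vector counter is an implementation choice, not a requirement of the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the 5-parameter counting sort matching the prototype above.
// Stability comes from the final right-to-left pass: after the prefix-sum
// step, count[key] is one past the last slot for that key, so elements
// with equal keys keep their relative order.
template <class F>
void counting_sort(const int* A, int* B, size_t n, size_t k, F f)
{
  std::vector<size_t> count(k, 0);
  for (size_t i = 0; i < n; ++i)      // tally each key value
    ++count[f(A[i])];
  for (size_t j = 1; j < k; ++j)      // prefix sums: count[j] = #keys <= j
    count[j] += count[j - 1];
  for (size_t i = n; i > 0; --i)      // place right-to-left for stability
    B[--count[f(A[i - 1])]] = A[i - 1];
}
```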
Also test and submit a specific instantiation of radix sort called bit_sort, implemented using a call to counting_sort with an object of type Bit:
class Bit
{
public:
  size_t operator () (unsigned long n) { return (size_t)(0 != (mask_ & n)); }
  Bit() : mask_(0x00000001) {}
  void SetBit(unsigned char i) { mask_ = (0x00000001 << i); }
private:
  unsigned long mask_;
};
Test and submit bit_sort (in file nsort.h).
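The following sketch shows one way bit_sort could be assembled from Bit and counting_sort. Compact copies of both are repeated so the example stands alone; the 32-bit pass count and nonnegative int element type are assumptions to adapt to your data.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Compact copies of Bit and counting_sort (as described above) so this
// sketch stands alone; in the project they live in nsort.h.
class Bit
{
public:
  size_t operator () (unsigned long n) { return (size_t)(0 != (mask_ & n)); }
  Bit() : mask_(0x00000001) {}
  void SetBit(unsigned char i) { mask_ = (0x00000001UL << i); }
private:
  unsigned long mask_;
};

template <class F>
void counting_sort(const int* A, int* B, size_t n, size_t k, F f)
{
  std::vector<size_t> count(k, 0);
  for (size_t i = 0; i < n; ++i) ++count[f(A[i])];
  for (size_t j = 1; j < k; ++j) count[j] += count[j - 1];
  for (size_t i = n; i > 0; --i) B[--count[f(A[i - 1])]] = A[i - 1];
}

// bit_sort as LSD radix sort, base 2: one stable counting_sort pass per
// bit, low bit first.
void bit_sort(int* A, size_t n)
{
  std::vector<int> buf(n);
  int* src = A;
  int* dst = buf.data();
  Bit bit;
  for (unsigned char i = 0; i < 32; ++i)
  {
    bit.SetBit(i);
    counting_sort(src, dst, n, 2, bit);   // stable pass on bit i
    std::swap(src, dst);                  // output feeds the next pass
  }
  // An even number of passes (32) leaves the final result back in A.
}
```

Because each pass is stable, elements already ordered on lower bits stay ordered, which is exactly why radix sort works low bit first.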
Heapsort is already done and distributed in LIB/tcpp/gheap.h. The prototypes for the two versions of heapsort should be useful as models for the other generic comparison sorts. Don't re-invent this wheel.
You will need specializations for some generic sort algorithms (g_insertion_sort and g_merge_sort) so that they work with arrays (raw pointers), because the generic versions use the iterator feature ValueType that pointers do not have.
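In practice the usual mechanism is a plain overload for T* rather than a template specialization, since overload resolution prefers the more specialized pointer version whenever arrays are sorted. A sketch using insertion sort (the ValueType member name is an assumption based on this handout's description of the tcpp iterators):

```cpp
#include <cassert>

// Iterator version: the temporary uses the course convention
// typename I::ValueType (raw pointers have no such member, hence the
// overload below).
template <class I>
void g_insertion_sort(I beg, I end)
{
  for (I i = beg; i != end; ++i)
  {
    typename I::ValueType t = *i;
    I j = i;
    while (j != beg)
    {
      I k = j;
      --k;
      if (t < *k) { *j = *k; j = k; }
      else break;
    }
    *j = t;
  }
}

// Raw-pointer overload: same algorithm, but the element type comes
// straight from the template parameter T, so no ValueType is needed.
template <typename T>
void g_insertion_sort(T* beg, T* end)
{
  for (T* i = beg; i != end; ++i)
  {
    T t = *i;
    T* j = i;
    while (j != beg && t < *(j - 1)) { *j = *(j - 1); --j; }
    *j = t;
  }
}
```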
The following is a summary of the code files that are supplied in LIB/proj1 needed for Part I (testing):
fgsort.cpp  # functionality test for all of the generic sorts in gsort.h
fnsort.cpp  # functionality test for all of the numeric sorts in nsort.h
TAKE NOTES! Use either an engineer's lab book or (thoughtfully named) text files to keep careful notes on what you do and what the results are. Date your entries. This will be of immense assistance when you are preparing your report. In real life, these could be whipped out when that argumentative know-it-all starts to question the validity of your report.
Operational Objectives: All of the sorts coded in Part I should be analyzed, first finding the asymptotic runtime and then specific runtime according to collected timing data on the linprog machines. The results should be collected in a written report.
Deliverables: At least six files:
gsort.h          # contains the generic algorithm implementations of comparison sorts [parts I & II]
nsort.h          # contains the numerical sorts and class Bit [parts I & II]
*.cpp            # source code for your testing - at least the ones supplied [part II]
makefile         # builds all testing executables [part II]
sort_report.pdf  # your report [part II]
By a scalability curve for an algorithm implementation we shall mean an equation whose form is determined by known asymptotic properties of the algorithm and whose coefficients are determined by a least squares fit to actual timing data for the algorithm as a function of input size. For example, the selection sort algorithm, implemented as a generic algorithm named g_selection_sort(), is known to have asymptotic runtime Θ(n²), so the form of the best fit curve is taken to be

R = A + B n + C n²
where R is the predicted run time on input of size n. To obtain the concrete scalability curve, we need to obtain actual timing data for the sort and use that data to find optimal values for the coefficients A, B, and C. Note this curve will depend on the implementation all the way from source code to hardware, so it is important to keep the compiler and testing platform the same in order to compare efficiencies of different sorts using their concrete scalability curves.
The following table shows the code name and forms for scalability curves of each algorithm to be examined:
Sort                 Code Name            Scalability Curve Form     A  B  C
Insertion Sort       g_insertion_sort()   R = A + B n + C n²
Selection Sort       g_selection_sort()   R = A + B n + C n²
Heap Sort            g_heap_sort()        R = A + B n + C n log n
Merge Sort           g_merge_sort()       R = A + B n + C n log n
Quick Sort           g_quick_sort()       R = A + B n + C n log n
Counting Sort        counting_sort()      R = A + B n
Radix Sort (base 2)  bit_sort()           R = A + B n
The last three columns of the table are where the coefficients for the form would be given, thus determining the concrete scalability curve.
The method for finding the coefficients A, B, and C is the method of least squares. Assume that we have sample runtime data as follows:
Input size:        n_1  n_2  ...  n_k
Measured runtime:  t_1  t_2  ...  t_k
and the scalability form is given by
f(n) = A + B n + C g(n)
Define the total square error of the approximation to be the sum of squares of errors at each data point:
E = Σ [t_i - f(n_i)]²
where the sum is taken from i = 1 to i = k, k being the number of data points. The key observation in the method of least squares is that the total square error E is minimized when the gradient of E is zero, that is, where all three partial derivatives ∂E/∂A, ∂E/∂B, and ∂E/∂C are zero. Calculating these partial derivatives gives:

∂E/∂X = -2 Σ [t_i - f(n_i)] ∂f/∂X = -2 Σ [t_i - (A + B n_i + C g(n_i))] ∂f/∂X

(where X is A, B, or C). This expresses the partial derivatives of E in terms of those of f, which are easily calculated:

∂f/∂A = 1
∂f/∂B = n
∂f/∂C = g(n)

(because n and g(n) are constant with respect to A, B, and C). Substituting these into the previous formula and setting the results equal to zero yields the following three equations:
A Σ 1 + B Σ n_i + C Σ g(n_i) = Σ t_i
A Σ n_i + B Σ n_i² + C Σ n_i g(n_i) = Σ t_i n_i
A Σ g(n_i) + B Σ n_i g(n_i) + C Σ (g(n_i))² = Σ t_i g(n_i)
Rearranging and using Σ 1 = k yields:
k A + [Σ n_i] B + [Σ g(n_i)] C = Σ t_i
[Σ n_i] A + [Σ n_i²] B + [Σ n_i g(n_i)] C = Σ t_i n_i
[Σ g(n_i)] A + [Σ n_i g(n_i)] B + [Σ (g(n_i))²] C = Σ t_i g(n_i)
These are three linear equations in the unknowns A, B, and C. With even a small amount of luck, they have a unique solution, and thus optimal values of A, B, and C are determined. (Here is a link to a more detailed derivation in the quadratic case g(n) = n².)
Note that all of the coefficients in these equations may be calculated from the original data table and knowledge of the function g(n), in a spreadsheet or in a simple stand-alone program. The solution to the system of equations itself is probably easiest to find by hand by row-reducing the 3x4 matrix of coefficients to upper triangular form and then back-substitution.
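As a sketch of such a simple stand-alone program, the following builds the 3x4 augmented matrix from the formulas above and solves it by row reduction with back-substitution; the function and parameter names are invented for this illustration, not part of the supplied tools.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Build the 3x4 augmented matrix of the normal equations for
// f(n) = A + B n + C g(n) from timing samples (n_i, t_i), then solve by
// Gaussian elimination with partial pivoting plus back-substitution.
// Returns {A, B, C}.
std::vector<double> fit_scalability(const std::vector<double>& n,
                                    const std::vector<double>& t,
                                    double (*g)(double))
{
  double m[3][4] = {{0.0}};
  for (size_t i = 0; i < n.size(); ++i)
  {
    double basis[3] = {1.0, n[i], g(n[i])};    // (1, n_i, g(n_i))
    for (int r = 0; r < 3; ++r)
    {
      for (int c = 0; c < 3; ++c) m[r][c] += basis[r] * basis[c];
      m[r][3] += basis[r] * t[i];              // right-hand side Σ t_i * basis
    }
  }
  for (int p = 0; p < 3; ++p)                  // forward elimination
  {
    int best = p;                              // partial pivoting
    for (int r = p + 1; r < 3; ++r)
      if (std::fabs(m[r][p]) > std::fabs(m[best][p])) best = r;
    for (int c = 0; c < 4; ++c) std::swap(m[p][c], m[best][c]);
    for (int r = p + 1; r < 3; ++r)
    {
      double factor = m[r][p] / m[p][p];
      for (int c = p; c < 4; ++c) m[r][c] -= factor * m[p][c];
    }
  }
  std::vector<double> x(3);                    // back substitution
  for (int r = 2; r >= 0; --r)
  {
    double s = m[r][3];
    for (int c = r + 1; c < 3; ++c) s -= m[r][c] * x[c];
    x[r] = s / m[r][r];
  }
  return x;
}
```

The matrix assembled in the first loop is exactly the system displayed above: row r, column c holds the sum of products of the r-th and c-th basis functions evaluated at the sample sizes.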
Use a CPU timing system to obtain appropriate timing data for each sort. Input sizes should range from small to substantially large. Any timing programs used should have suffix .cpp. Again all data files should have suffix .dat, including output data. Make certain that any file containing items relevant to (or mentioned in) your report has suffix .cpp or .dat and is preserved for project submission.
Use the method of least squares (sometimes called regression) to calculate the coefficients of a scalability curve for each sort.
Write a report on your results. (See below for detailed advice on report structure and content.) Your report should be in PDF format, in the file sort_report.pdf.
Turn in gsort.h, nsort.h, *.cpp, and sort_report.pdf using the script LIB/proj1/proj12submit.sh.
Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.
Also submit your report sort_report.pdf to the Blackboard course web site.
Be sure that you use and document good investigation habits by keeping careful records of your analysis and data collection activities.
Before beginning any data collection, think through which versions of the sorts you are going to test. These should ideally be versions that are most comparable across all of the sorts and for which the "container overhead" is as low as practical. Usually, this means using the array cases for all sorts. If, however, you do not have all sorts working for arrays, you will have to devise a plan and defend the choices made.
You also need to plan what kinds and sizes of data sets you will use. It is generally advisable to create these in advance of collecting runtime data and to use the same data sets for all of the sorts, to reduce the effects of randomness in the comparisons. On the other hand, the data sets themselves should "look random" so as not to hit any particular weakness or strength of any particular algorithm. For example: if a data set has size 100,000 but consists of integers in the range 0 .. 1000, then there will be many repeats in the data set, which could be bad for quicksort.
Generally, it is best to use unsigned integer data for the data sets, so that they can be consumed by all of the sorts, including the numerical sorts.
If you use a multi-user machine to collect data, there will be the possibility that your timings are exaggerated by periods when your process is idled by the OS. One way to compensate for this is to do several runs and use the lowest time among all of the runs in your analysis. Most likely you will need to collect your data using linprog, because the random number generator needs 64 bit words.
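The lowest-of-several-runs idea can be sketched with the standard <chrono> clock; the course's timer.* framework may differ in interface and units.

```cpp
#include <algorithm>  // std::sort, for the usage example
#include <cassert>
#include <chrono>
#include <vector>

// Run `sort` on a fresh copy of `data` `runs` times and return the
// minimum elapsed time in milliseconds. Taking the minimum filters out
// runs inflated by the OS idling the process on a shared machine.
template <class Sort>
double min_time_ms(Sort sort, const std::vector<unsigned int>& data, int runs)
{
  double best = -1.0;
  for (int r = 0; r < runs; ++r)
  {
    std::vector<unsigned int> work = data;   // identical input each run
    auto start = std::chrono::steady_clock::now();
    sort(work);
    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    if (best < 0.0 || ms < best) best = ms;
  }
  return best;
}
```

For example, min_time_ms([](std::vector<unsigned int>& v){ std::sort(v.begin(), v.end()); }, data, 5) reports the best of five timed runs of std::sort on the same input.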
The framework of (pseudo) random object generators in LIB/cpp/xran.* has been upgraded to provide 32 bit integers. To simplify your tasks (and ensure some uniformity in the raw data) we supply a random unsigned int generator proj1/ranuint.cpp. This should compile and run on linprog (but not program, due to word size restrictions).
The CPU timing framework in LIB/cpp/timer.* can be used to collect timing data. Again with the goals of simplifying your work load and ensuring more uniformity, a timing program is supplied in proj1/sorttimer.cpp. Like ranuint.cpp, this program requires 64-bit architecture.
The supplied sort timer program outputs time in milliseconds (1ms = sec/1000). You can change your scale for your report if you like. Whatever units are chosen, you will need to deal with large differences of elapsed time, ranging over several orders of magnitude. Some displays may need log scaling in the vertical axis.
Analysis should be done in two senses. First, provide an asymptotic analysis that results in the scalability curve forms shown above. This step is of course independent of platform and programming language. Second, collect data on actual runtimes of the sorts and use that data to find a best-fit concrete scalability curve using the form derived above. This curve will depend on almost any choice made, so it is important to use the same choices across the sorts being analyzed and to eliminate irrelevant overhead costs as much as possible: same input data, same machine, simplest data structures.
The supplied tools should leave you more time to think about the data: what kind of test data to generate, what timing data to collect, and how to accomplish that. Be sure to address these issues in your report as well.
The following procedure should be followed, and documented in complete detail, to find the concrete scalability curve coefficients. A sample calculation is illustrated as we step through the procedure. (Note that the timing data is fake - don't believe in the results, just the process.)
Your report should be structured something like the following outline. You are free to change the titles, organization, and section numbering scheme.
Reading your report should make it clear how to use the test functions and how data was collected from them.
The following is a summary of the various code files that are supplied in LIB/proj1:
fgsort.cpp      # functionality test for all of the generic sorts in gsort.h
fnsort.cpp      # functionality test for all of the numeric sorts in nsort.h
ranuint.cpp     # generator of files of pseudo-random unsigned int
makefile.files  # sample makefile that creates random unsigned int files using ranuint.cpp
sorttimer.cpp   # runs all sorts on a data file and times results
makefile.times  # sample makefile that creates timing data using sorttimer.cpp
regmat.cpp      # computes the 3x4 regression matrix from collected timing data
matrixRR.cpp    # row-reduces a 3x4 matrix
The two supplied makefiles illustrate how easily data sets can be generated, named, and combined by calling Linux commands inside the makefile. All that is missing is your decision on what input data to generate and what timing data to collect. Be sure to include good explanations of how and why you made these choices.
Once the various data decisions are made and data collected, you need to process the results to obtain the coefficients A, B, C of the concrete scalability curve. Here are the steps that process timing data through a sequence of files all the way to coefficients for the concrete scalability curves:
Organize your timing data in separate files, one for each sort. I have distributed an example for selection sort that is based on the timing data generated using the two example makefiles. (NOTE: because the example makefiles do not generate enough data, the resulting coefficients will be off.) The example file is "times.selection", with contents:
16
10 50 100 200 300 400 500 600 700 800 900 1000 2000 3000 4000 5000
0.001 0.01 0.03 0.103 0.224 0.385 0.597 0.846 1.137 1.475 1.866 2.272 8.862 19.734 34.869 54.325

The file begins with the number of timing pairs (16), followed by the size sequence (16 sizes), followed by the runtime sequence for these sizes (16 runtimes). Your files will have more data, but should look similar to this.
Run regmat on this data set to compute the regression matrix:
regmat.x times.selection matrix.selection

The result is the regression matrix stored in matrix.selection, whose content looks like:
16           19560        5.78526e+07  126.736
19560        5.78526e+07  2.27025e+11  495006
5.78526e+07  2.27025e+11  9.80533e+14  2.13491e+09

This is the 3x4 matrix of coefficients (the fourth column is the right-hand side).
Finally you need to solve the linear equations captured by this matrix: 3 equations in 3 unknowns, the solution being the coefficients you are seeking. This can be done in several ways, including by hand and using publicly available matrix algebra libraries. The simplest solution is the row reduction method, which uses the rules of manipulation of linear equations to simplify them. There are many software libraries (e.g., boost) offering matrix algebra, and you are free to use one of these. Alternatively, I have a very modest matrix row-reduction program matrixRR.cpp that should work in case you want to use it, as follows:
matrixRR.x 3 4 matrix.selection solution.selection

which writes the reduced matrix to the output file solution.selection, whose contents look like:
1 0 0 -0.00899951
0 1 0 0.000143883
0 0 1 2.14451e-06

This translates to A = -0.00899951, B = 0.000143883, and C = 2.14451e-06.
Personal experience notes:
Your report will be used to assess the required outcomes discussed in the box at the top of this document.
The National Institute for Standards and Technology (NIST) has a good reference for statistics: NIST Engineering Statistics Handbook.