Note: This assignment is used to assess the required outcomes for the course, as outlined in the course syllabus. These outcomes are:
These will be assessed using the following rubric:
In order to earn a course grade of C- or better, the assessment must result in Effective or Highly Effective for each outcome.
Educational Objectives: On successful completion of this assignment, the student should be able to
Background Knowledge Required: Be sure that you have mastered the material in these chapters before beginning the project: Sequential Containers, Function Classes and Objects, Iterators, Generic Algorithms, Generic Set Algorithms, Heap Algorithms, and Sorting Algorithms
Operational Objectives: Implement various comparison sorts as generic algorithms, with the minimal practical constraints on iterator types. Each generic comparison sort should be provided in two forms: (1) default order and (2) order supplied by a predicate class template parameter.
Also implement some numerical sorts as template functions, with the minimal practical constraints on template parameters. Again there should be two versions, one for default order and one for order determined by a function object whose class is passed as a template parameter.
The sorts to be developed and tested are selection sort, insertion sort, heap sort, merge sort, quick sort, counting sort, and bit sort.
Deliverables: Two files:
gsort.h  # contains the generic algorithm implementations of comparison sorts
nsort.h  # contains the numerical sorts and class Bit
Develop and fully test all of the sort algorithms listed under requirements below. Make certain that your testing includes "boundary" cases, such as empty ranges, ranges that have the same element at each location, and ranges that are in correct or reverse order before sorting. Place all of the generic sort algorithms in the file gsort.h and all of the numerical sort algorithms in the file nsort.h. Your test programs should have filename suffix .cpp and your test data files should have suffix .dat.
Turn in gsort.h and nsort.h using the script LIB/proj1/proj11submit.sh.
Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.
Note that Parts 1 and 2 have different due dates.
The two sort algorithm files are expected to operate using the supplied test harnesses: fgsort.cpp (tests gsort.h) and fnsort.cpp (tests nsort.h). Note that this means, among other things, that:
Issuing the single command "make" should build executables for all of your test functions.
The comparison sorts should be implemented as generic algorithms with template parameters that are iterator types.
Each comparison sort should have two versions, one that uses default order (operator < on I::ValueType) and one that uses a predicate object whose type is an extra template parameter.
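For concreteness, here is a minimal sketch of what the two forms might look like, using selection sort as the example. The function names follow the handout's g_ convention, but this is an illustration only, not the course library's implementation; std::iter_swap is used for the exchange, which the tcpp library may handle differently.

```cpp
#include <algorithm>  // std::iter_swap
#include <cassert>

// Default order: requires only forward iterators and operator< on the
// element type.
template <class ForwardIterator>
void g_selection_sort(ForwardIterator beg, ForwardIterator end)
{
  for (ForwardIterator i = beg; i != end; ++i)
  {
    ForwardIterator smallest = i;
    for (ForwardIterator j = i; j != end; ++j)
      if (*j < *smallest)
        smallest = j;
    std::iter_swap(i, smallest);
  }
}

// Predicate order: the ordering is supplied by a function object whose
// class P is an extra template parameter.
template <class ForwardIterator, class P>
void g_selection_sort(ForwardIterator beg, ForwardIterator end, P pred)
{
  for (ForwardIterator i = beg; i != end; ++i)
  {
    ForwardIterator smallest = i;
    for (ForwardIterator j = i; j != end; ++j)
      if (pred(*j, *smallest))
        smallest = j;
    std::iter_swap(i, smallest);
  }
}
```

Note that the empty-range boundary case falls out naturally here: when beg == end the outer loop body never runs.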
Re-use as many components as possible, especially existing generic algorithms such as g_copy (in genalg.h), g_set_merge (in gset.h), and the generic heap algorithms (in gheap.h).
Two versions of counting_sort should be implemented: the classic 4-parameter version, plus one that takes a function object as an argument. Here is a prototype for the 5-parameter version:
template < class F >
void counting_sort(const int * A, int * B, size_t n, size_t k, F f);
// Pre:  A, B are arrays of type unsigned int
//       A, B are defined in the range [0,n)
//       f is defined for all elements of A and has values in the range [0,k)
// Post: A is unchanged
//       B is a stable f-sorted permutation of A,
//       i.e., i < j ==> f(B[i]) <= f(B[j])
Test and submit both versions of counting_sort.
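As a sketch only (not the course's reference implementation), the 5-parameter version can be realized as a standard stable counting sort; the std::vector counter is an implementation choice, not a requirement of the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the 5-parameter counting sort matching the prototype above.
// Stability comes from the final right-to-left pass: after the prefix-sum
// step, count[key] is one past the last slot for that key, so elements
// with equal keys keep their relative order.
template <class F>
void counting_sort(const int* A, int* B, size_t n, size_t k, F f)
{
  std::vector<size_t> count(k, 0);
  for (size_t i = 0; i < n; ++i)      // tally each key value
    ++count[f(A[i])];
  for (size_t j = 1; j < k; ++j)      // prefix sums: count[j] = #keys <= j
    count[j] += count[j - 1];
  for (size_t i = n; i > 0; --i)      // place right-to-left for stability
    B[--count[f(A[i - 1])]] = A[i - 1];
}
```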
Also test and submit a specific instantiation of radix sort called bit_sort, implemented using a call to counting_sort with an object of type Bit:
class Bit
{
public:
  size_t operator () (unsigned long n) { return (size_t)(0 != (mask_ & n)); }
  Bit() : mask_(0x00000001) {}
  void SetBit(unsigned char i) { mask_ = (0x00000001 << i); }
private:
  unsigned long mask_;
};
Test and submit bit_sort (in file nsort.h).
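The following sketch shows one way bit_sort could be assembled from Bit and counting_sort. Compact copies of both are repeated so the example stands alone; the 32-bit pass count and nonnegative int element type are assumptions to adapt to your data.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Compact copies of Bit and counting_sort (as described above) so this
// sketch stands alone; in the project they live in nsort.h.
class Bit
{
public:
  size_t operator () (unsigned long n) { return (size_t)(0 != (mask_ & n)); }
  Bit() : mask_(0x00000001) {}
  void SetBit(unsigned char i) { mask_ = (0x00000001UL << i); }
private:
  unsigned long mask_;
};

template <class F>
void counting_sort(const int* A, int* B, size_t n, size_t k, F f)
{
  std::vector<size_t> count(k, 0);
  for (size_t i = 0; i < n; ++i) ++count[f(A[i])];
  for (size_t j = 1; j < k; ++j) count[j] += count[j - 1];
  for (size_t i = n; i > 0; --i) B[--count[f(A[i - 1])]] = A[i - 1];
}

// bit_sort as LSD radix sort, base 2: one stable counting_sort pass per
// bit, low bit first.
void bit_sort(int* A, size_t n)
{
  std::vector<int> buf(n);
  int* src = A;
  int* dst = buf.data();
  Bit bit;
  for (unsigned char i = 0; i < 32; ++i)
  {
    bit.SetBit(i);
    counting_sort(src, dst, n, 2, bit);   // stable pass on bit i
    std::swap(src, dst);                  // output feeds the next pass
  }
  // An even number of passes (32) leaves the final result back in A.
}
```

Because each pass is stable, elements already ordered on lower bits stay ordered, which is exactly why radix sort works low bit first.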
Heapsort is already done and distributed in LIB/tcpp/gheap.h. The prototypes for the two versions of heapsort should be useful as models for the other generic comparison sorts. Don't re-invent this wheel.
You will need specializations for some generic sort algorithms (g_insertion_sort and g_merge_sort) so that they work with arrays (raw pointers), because the generic versions use the iterator feature ValueType that pointers do not have.
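In practice the usual mechanism is a plain overload for T* rather than a template specialization, since overload resolution prefers the more specialized pointer version whenever arrays are sorted. A sketch using insertion sort (the ValueType member name is an assumption based on this handout's description of the tcpp iterators):

```cpp
#include <cassert>

// Iterator version: the temporary uses the course convention
// typename I::ValueType (raw pointers have no such member, hence the
// overload below).
template <class I>
void g_insertion_sort(I beg, I end)
{
  for (I i = beg; i != end; ++i)
  {
    typename I::ValueType t = *i;
    I j = i;
    while (j != beg)
    {
      I k = j;
      --k;
      if (t < *k) { *j = *k; j = k; }
      else break;
    }
    *j = t;
  }
}

// Raw-pointer overload: same algorithm, but the element type comes
// straight from the template parameter T, so no ValueType is needed.
template <typename T>
void g_insertion_sort(T* beg, T* end)
{
  for (T* i = beg; i != end; ++i)
  {
    T t = *i;
    T* j = i;
    while (j != beg && t < *(j - 1)) { *j = *(j - 1); --j; }
    *j = t;
  }
}
```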
The following is a summary of the code files that are supplied in LIB/proj1 needed for Part I (testing):
fgsort.cpp  # functionality test for all of the generic sorts in gsort.h
fnsort.cpp  # functionality test for all of the numeric sorts in nsort.h
TAKE NOTES! Use either an engineer's lab book or (thoughtfully named) text files to keep careful notes on what you do and what the results are. Date your entries. This will be of immense assistance when you are preparing your report. In real life, these could be whipped out when that argumentative know-it-all starts to question the validity of your report.
Operational Objectives: All of the sorts coded in Part I should be analyzed, first finding the asymptotic runtime and then specific runtime according to collected timing data on the linprog machines. The results should be collected in a written report.
Deliverables: At least six files:
gsort.h          # contains the generic algorithm implementations of comparison sorts [parts I & II]
nsort.h          # contains the numerical sorts and class Bit [parts I & II]
*.cpp            # source code for your testing - at least the ones supplied [part II]
makefile         # builds all testing executables [part II]
sort_report.pdf  # your report [part II]
By a scalability curve for an algorithm implementation we shall mean an equation whose form is determined by known asymptotic properties of the algorithm and whose coefficients are determined by a least squares fit to actual timing data for the algorithm as a function of input size. For example, the selection sort algorithm, implemented as a generic algorithm named g_selection_sort(), is known to have asymptotic runtime Θ(n²), so the form of the best fit curve is taken to be

R = A + B n + C n²
where R is the predicted run time on input of size n. To obtain the concrete scalability curve, we need to obtain actual timing data for the sort and use that data to find optimal values for the coefficients A, B, and C. Note this curve will depend on the implementation all the way from source code to hardware, so it is important to keep the compiler and testing platform the same in order to compare efficiencies of different sorts using their concrete scalability curves.
The following table shows the code name and forms for scalability curves of each algorithm to be examined:
Sort                 Code Name            Scalability Curve Form     A  B  C
Insertion Sort       g_insertion_sort()   R = A + B n + C n²
Selection Sort       g_selection_sort()   R = A + B n + C n²
Heap Sort            g_heap_sort()        R = A + B n + C n log n
Merge Sort           g_merge_sort()       R = A + B n + C n log n
Quick Sort           g_quick_sort()       R = A + B n + C n log n
Counting Sort        counting_sort()      R = A + B n
Radix Sort (base 2)  bit_sort()           R = A + B n
The last three columns of the table are where the coefficients for the form would be given, thus determining the concrete scalability curve.
The method for finding the coefficients A, B, and C is the method of least squares. Assume that we have sample runtime data as follows:
Input size:        n_1  n_2  ...  n_k
Measured runtime:  t_1  t_2  ...  t_k
and the scalability form is given by
f(n) = A + B n + C g(n)
Define the total square error of the approximation to be the sum of squares of errors at each data point:
E = Σ [t_i - f(n_i)]²
where the sum is taken from i = 1 to i = k, k being the number of data points. The key observation in the method of least squares is that the total square error E is minimized when the gradient of E is zero, that is, where all three partial derivatives ∂E/∂A, ∂E/∂B, and ∂E/∂C are zero. Calculating these partial derivatives gives:

∂E/∂X = -2 Σ [t_i - f(n_i)] ∂f/∂X = -2 Σ [t_i - (A + B n_i + C g(n_i))] ∂f/∂X

(where X is A, B, or C). This expresses the partial derivatives of E in terms of those of f, which are easily calculated:

∂f/∂A = 1
∂f/∂B = n
∂f/∂C = g(n)

(because n and g(n) are constant with respect to A, B, and C). Substituting these into the previous formula and setting the results equal to zero yields the following three equations:
A Σ 1 + B Σ n_i + C Σ g(n_i) = Σ t_i
A Σ n_i + B Σ n_i² + C Σ n_i g(n_i) = Σ t_i n_i
A Σ g(n_i) + B Σ n_i g(n_i) + C Σ (g(n_i))² = Σ t_i g(n_i)
Rearranging and using Σ 1 = k yields:
k A + [Σ n_i] B + [Σ g(n_i)] C = Σ t_i
[Σ n_i] A + [Σ n_i²] B + [Σ n_i g(n_i)] C = Σ t_i n_i
[Σ g(n_i)] A + [Σ n_i g(n_i)] B + [Σ (g(n_i))²] C = Σ t_i g(n_i)
These are three linear equations in the unknowns A, B, and C. With even a small amount of luck, they have a unique solution, and thus optimal values of A, B, and C are determined. (Here is a link to a more detailed derivation in the quadratic case g(n) = n².)
Note that all of the coefficients in these equations may be calculated from the original data table and knowledge of the function g(n), in a spreadsheet or in a simple stand-alone program. The solution to the system of equations itself is probably easiest to find by hand by row-reducing the 3x4 matrix of coefficients to upper triangular form and then back-substitution.
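As a sketch of such a simple stand-alone program, the following builds the 3x4 augmented matrix from the formulas above and solves it by row reduction with back-substitution; the function and parameter names are invented for this illustration, not part of the supplied tools.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Build the 3x4 augmented matrix of the normal equations for
// f(n) = A + B n + C g(n) from timing samples (n_i, t_i), then solve by
// Gaussian elimination with partial pivoting plus back-substitution.
// Returns {A, B, C}.
std::vector<double> fit_scalability(const std::vector<double>& n,
                                    const std::vector<double>& t,
                                    double (*g)(double))
{
  double m[3][4] = {{0.0}};
  for (size_t i = 0; i < n.size(); ++i)
  {
    double basis[3] = {1.0, n[i], g(n[i])};    // (1, n_i, g(n_i))
    for (int r = 0; r < 3; ++r)
    {
      for (int c = 0; c < 3; ++c) m[r][c] += basis[r] * basis[c];
      m[r][3] += basis[r] * t[i];              // right-hand side Σ t_i * basis
    }
  }
  for (int p = 0; p < 3; ++p)                  // forward elimination
  {
    int best = p;                              // partial pivoting
    for (int r = p + 1; r < 3; ++r)
      if (std::fabs(m[r][p]) > std::fabs(m[best][p])) best = r;
    for (int c = 0; c < 4; ++c) std::swap(m[p][c], m[best][c]);
    for (int r = p + 1; r < 3; ++r)
    {
      double factor = m[r][p] / m[p][p];
      for (int c = p; c < 4; ++c) m[r][c] -= factor * m[p][c];
    }
  }
  std::vector<double> x(3);                    // back substitution
  for (int r = 2; r >= 0; --r)
  {
    double s = m[r][3];
    for (int c = r + 1; c < 3; ++c) s -= m[r][c] * x[c];
    x[r] = s / m[r][r];
  }
  return x;
}
```

The matrix assembled in the first loop is exactly the system displayed above: row r, column c holds the sum of products of the r-th and c-th basis functions evaluated at the sample sizes.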
Use a CPU timing system to obtain appropriate timing data for each sort. Input sizes should range from small to substantially large. Any timing programs used should have suffix .cpp. Again all data files should have suffix .dat, including output data. Make certain that any file containing items relevant to (or mentioned in) your report has suffix .cpp or .dat and is preserved for project submission.
Use the method of least squares (sometimes called regression) to calculate the coefficients of a scalability curve for each sort.
Write a report on your results. (See below for detailed advice on report structure and content.) Your report should be in PDF format, in the file sort_report.pdf.
Turn in gsort.h, nsort.h, *.cpp, and sort_report.pdf using the script LIB/proj1/proj12submit.sh.
Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.
Also submit your report sort_report.pdf to the Blackboard course web site.
Be sure that you use and document good investigation habits by keeping careful records of your analysis and data collection activities.
Before beginning any data collection, think through which versions of the sorts you are going to test. These should ideally be versions that are most comparable across all of the sorts and for which the "container overhead" is as low as practical. Usually, this means using the array cases for all sorts. If, however, you do not have all sorts working for arrays, you will have to devise a plan and defend the choices made.
You also need to plan what kinds and sizes of data sets you will use. It is generally advisable to create these in advance of collecting runtime data and to use the same data sets for all of the sorts, to reduce the effects of randomness in the comparisons. On the other hand, the data sets themselves should "look random" so as not to hit any particular weakness or strength of any particular algorithm. For example: if a data set has size 100,000 but consists of integers in the range 0 .. 1000, then there will be many repeats in the data set, which could be bad for quicksort.
Generally, it is best to use unsigned integer data for the data sets, so that they can be consumed by all of the sorts, including the numerical sorts.
If you use a multi-user machine to collect data, there will be the possibility that your timings are exaggerated by periods when your process is idled by the OS. One way to compensate for this is to do several runs and use the lowest time among all of the runs in your analysis. Most likely you will need to collect your data using linprog, because the random number generator needs 64 bit words.
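The lowest-of-several-runs idea can be sketched with the standard <chrono> clock; the course's timer.* framework may differ in interface and units.

```cpp
#include <algorithm>  // std::sort, for the usage example
#include <cassert>
#include <chrono>
#include <vector>

// Run `sort` on a fresh copy of `data` `runs` times and return the
// minimum elapsed time in milliseconds. Taking the minimum filters out
// runs inflated by the OS idling the process on a shared machine.
template <class Sort>
double min_time_ms(Sort sort, const std::vector<unsigned int>& data, int runs)
{
  double best = -1.0;
  for (int r = 0; r < runs; ++r)
  {
    std::vector<unsigned int> work = data;   // identical input each run
    auto start = std::chrono::steady_clock::now();
    sort(work);
    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    if (best < 0.0 || ms < best) best = ms;
  }
  return best;
}
```

For example, min_time_ms([](std::vector<unsigned int>& v){ std::sort(v.begin(), v.end()); }, data, 5) reports the best of five timed runs of std::sort on the same input.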
The framework of (pseudo) random object generators in LIB/cpp/xran.* has been upgraded to provide 32 bit integers. To simplify your tasks (and ensure some uniformity in the raw data) we supply a random unsigned int generator proj1/ranuint.cpp. This should compile and run on linprog (but not program, due to word size restrictions).
The CPU timing framework in LIB/cpp/timer.* can be used to collect timing data. Again with the goals of simplifying your work load and ensuring more uniformity, a timing program is supplied in proj1/sorttimer.cpp. Like ranuint.cpp, this program requires 64-bit architecture.
The supplied sort timer program outputs time in milliseconds (1ms = sec/1000). You can change your scale for your report if you like. Whatever units are chosen, you will need to deal with large differences of elapsed time, ranging over several orders of magnitude. Some displays may need log scaling in the vertical axis.
Analysis should be done in two senses. First, provide an asymptotic analysis that results in the scalability curve forms shown above. This step is of course independent of platform and programming language. Second, collect data on actual runtimes of the sorts and use that data to find a best-fit concrete scalability curve using the form derived above. This curve will depend on almost any choice made, so it is important to use the same choices across the sorts being analyzed and to eliminate irrelevant overhead costs as much as possible: same input data, same machine, simplest data structures.
The supplied tools should leave you more time to think about the data: what kind of test data to generate, what timing data to collect, and how to accomplish that. Be sure to address these issues in your report as well.
The following procedure should be followed, and documented in complete detail, to find the concrete scalability curve coefficients. A sample calculation is illustrated as we step through the procedure. (Note that the timing data is fake - don't believe in the results, just the process.)
Your report should be structured something like the following outline. You are free to change the titles, organization, and section numbering scheme.
Reading your report should make it clear how to use the test functions and how data was collected from them.
The following is a summary of the various code files that are supplied in LIB/proj1:
fgsort.cpp      # functionality test for all of the generic sorts in gsort.h
fnsort.cpp      # functionality test for all of the numeric sorts in nsort.h
ranuint.cpp     # generator of files of pseudo-random unsigned int
makefile.files  # sample makefile that creates random unsigned int files using ranuint.cpp
sorttimer.cpp   # runs all sorts on a data file and times results
makefile.times  # sample makefile that creates timing data using sorttimer.cpp
regmat.cpp      # computes the 3x4 regression matrix from collected timing data
matrixRR.cpp    # row-reduces a 3x4 matrix
The two supplied makefiles illustrate how easily data sets can be generated, named, and combined by calling Linux commands inside the makefile. All that is missing is your decision on what input data to generate and what timing data to collect. Be sure to include good explanations of how and why you made these choices.
Once the various data decisions are made and data collected, you need to process the results to obtain the coefficients A, B, C of the concrete scalability curve. Here are the steps that process timing data through a sequence of files all the way to coefficients for the concrete scalability curves:
Organize your timing data in separate files, one for each sort. I have distributed an example for selection sort that is based on the timing data generated using the two example makefiles. (NOTE: because the example makefiles do not generate enough data, the resulting coefficients will be off.) The example file is "times.selection", with contents:
16
10 50 100 200 300 400 500 600 700 800 900 1000 2000 3000 4000 5000
0.001 0.01 0.03 0.103 0.224 0.385 0.597 0.846 1.137 1.475 1.866 2.272 8.862 19.734 34.869 54.325

The file begins with the number of timing pairs (16), followed by the size sequence (16 sizes), followed by the runtime sequence for these sizes (16 runtimes). Your files will have more data, but should look similar to this.
Run regmat on this data set to compute the regression matrix:
regmat.x times.selection matrix.selection

The result is the regression matrix stored in matrix.selection, whose content looks like:
16           19560        5.78526e+07  126.736
19560        5.78526e+07  2.27025e+11  495006
5.78526e+07  2.27025e+11  9.80533e+14  2.13491e+09

This is the 3x4 matrix of coefficients (the fourth column is the right-hand side).
Finally you need to solve the linear equations captured by this matrix: 3 equations in 3 unknowns, the solution being the coefficients you are seeking. This can be done in several ways, including by hand and using publicly available matrix algebra libraries. The simplest solution is the row reduction method, which uses the rules of manipulation of linear equations to simplify them. There are many software libraries (e.g., boost) offering matrix algebra, and you are free to use one of these. Alternatively, I have a very modest matrix row-reduction program matrixRR.cpp that should work in case you want to use it, as follows:
matrixRR.x 3 4 matrix.selection solution.selection

which writes the reduced matrix to the output file solution.selection, whose contents look like:
1 0 0 -0.00899951
0 1 0 0.000143883
0 0 1 2.14451e-06

This translates to A = -0.00899951, B = 0.000143883, and C = 2.14451e-06.
Personal experience notes:
Your report will be used to assess the required outcomes discussed in the box at the top of this document.
The National Institute for Standards and Technology (NIST) has a good reference for statistics: NIST Engineering Statistics Handbook.