Project 2: Sort Analysis

Note: This assignment will be used to assess the required outcomes for the course, as outlined in the course syllabus. These outcomes are:

  1. analyze the computational complexity of algorithms used in the solution of a programming problem
  2. evaluate the performance trade-offs of alternative data structures and algorithms

These will be assessed using the following rubric:

                          I   E   H
  Performance Analysis
    Runtime Analysis      -   -   -
    Runspace Analysis     -   -   -
  Tradeoff Analysis
    Comparison Sorts      -   -   -
    Numerical Sorts       -   -   -

  Key: I = ineffective, E = effective, H = highly effective

In order to earn a course grade of C- or better, the assessment must result in Effective or Highly Effective for each outcome.

Educational Objectives: On successful completion of this assignment, the student should be able to analyze the computational complexity of sort algorithms and evaluate the performance trade-offs among alternative sort algorithms, both comparison-based and numerical.

Background Knowledge Required: Be sure that you have mastered the material in these chapters before beginning the project: Sequential Containers, Function Classes and Objects, Iterators, Generic Algorithms, Generic Set Algorithms, Heap Algorithms, and Sorting Algorithms

Part I: Generic Sort Algorithms

Operational Objectives: Implement various comparison sorts as generic algorithms, with the minimal practical constraints on iterator types. Each generic comparison sort should be provided in two forms: (1) default order and (2) order supplied by a predicate class template parameter.

Also implement some numerical sorts as template functions, with the minimal practical constraints on template parameters. Again there should be two versions, one for default order and one for order determined by a function object whose class is passed as a template parameter.

The sorts to be developed and tested are selection sort, insertion sort, heap sort, merge sort, quick sort, counting sort, bit sort, byte sort, and word sort.

Deliverables: Two files:

gsort.h         # contains the generic algorithm implementations of comparison sorts
nsort.h         # contains the numerical sorts and classes Bit, Byte, and Word

Procedural Requirements

  1. The official development, testing, and assessment environment is g++47 -std=c++11 -Wall -Wextra on the linprog machines. Code should compile without error or warning.

  2. Develop and fully test all of the sort algorithms listed under requirements below. Make certain that your testing includes "boundary" cases, such as empty ranges, ranges that have the same element at each location, and ranges that are in correct or reverse order before sorting. Place all of the generic sort algorithms in the file gsort.h and all of the numerical sort algorithms in the file nsort.h. Your test data files should have descriptive names explaining their content.

  3. Turn in gsort.h and nsort.h using the script LIB/proj2/proj21submit.sh.

    Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

    Note that Parts 1 and 2 have different due dates.

Code Requirements and Specifications

  1. The two sort algorithm files are expected to operate using the supplied test harnesses: fgsort.cpp (tests gsort.h) and fnsort.cpp (tests nsort.h). Note that this means, among other things, that:

    1. All generic sorts in gsort.h should work with ordinary arrays as well as iterators of the appropriate category
    2. Both classic (4-argument) counting_sort and the 5-argument version should work
    3. bit_sort should work with the class Bit defined in nsort.h
    4. byte_sort should work with the class Byte defined in nsort.h
    5. word_sort should work with the class Word defined in nsort.h

  2. The comparison sorts should be implemented as generic algorithms with template parameters that are iterator types.

  3. Each comparison sort should have two versions, one that uses default order (operator < on I::ValueType) and one that uses a predicate object whose type is an extra template parameter.

  4. Some of the comparison sorts will require specializations (for both the default and predicate versions) to handle the case of arrays and pointers, for which I::ValueType is not defined.

  5. Re-use as many components as possible, especially existing generic algorithms such as g_copy (in genalg.h), g_set_merge (in gset.h), and the generic heap algorithms (in gheap.h).

  6. Two versions of counting_sort should be implemented: the classic 4-parameter version, plus one that takes a function object as an argument. Here is a prototype for the 5-parameter version:

    template < class F >
    void counting_sort(const unsigned int * A, unsigned int * B, size_t n, size_t k, F f)
    // Pre:  A,B are arrays of type unsigned int
    //       A,B are defined in the range [0,n)
    //       f is defined for all elements of A and has values in the range [0,k)
    // Post: A is unchanged
    //       B is a stable f-sorted permutation of A
    //       I.e., i < j ==> f(B[i]) <= f(B[j])
    

    Test and submit both versions of counting_sort.

  7. Also test and submit specific instantiations of radix sort called bit_sort, byte_sort, and word_sort.

    1. bit_sort is implemented using a call to counting_sort with an object of type Bit:

        template <typename N>
        class Bit
        {
        public:
          size_t operator () (N n)
          {
            return (0 != (mask_ & n)); // 1 if the masked bit of n is set, else 0
          }
          Bit() : mask_(static_cast<N>(0x00)) {}
          void SetBit(unsigned char i)
          {
            mask_ = (static_cast<N>(0x01) << i);  // select the ith bit
          }
        private:
          N mask_;
        };
      

      The template parameter represents an integer type. bit_sort is implemented as a loop of calls to counting_sort at each bit (increasing in significance). Note that the size of N can be calculated and used to limit the length of the loop.

    2. byte_sort is implemented using a call to counting_sort with an object of type Byte:

        template <typename N>
        class Byte
        {
        public:
          size_t operator () (N n)
          {
            return ((n >> offset_) & 0xFF); // the byte at the offset location
          }
          Byte() : offset_(static_cast<N>(0x00)) {}
          void SetByte(unsigned char i)
          {
            offset_ = static_cast<N>(i << 3); // the ith byte begins at bit i*8
          }
        private:
          N offset_;
        };
      

      Again the template parameter represents an integer type. byte_sort is implemented as a loop of calls to counting_sort at each byte (increasing in significance). Again, the size of N can be calculated and used to limit the length of the loop.

    3. word_sort is implemented using a call to counting_sort with an object of type Word. Developing this class and the word_sort algorithm is left to your creativity.

    Test and submit bit_sort, byte_sort, and word_sort (in file nsort.h).

Hints

Part II: Sort Algorithm Data Collection

Operational Objectives:

Step 1: Problem Selection. Begin by selecting one of the analysis problems for your work:

  1. Curve-Fitting. Use a theoretical review to assign a "form" to each sort algorithm, and then use the method of least squares (aka regression) on actual timing data to find coefficients for a best-fit curve in the form. See curve_fitting for more details.

  2. Optimal Cutoff for Recursive Sorts. Recursive sort algorithms tend to make many recursive calls on small or empty ranges. There is usually a point where these calls make the recursive algorithm less effective than a simple non-recursive sort such as insertion_sort. Use a combination of runtime theory and practical experiment to find the "optimal cutoff size" for switching from the recursive algorithm to a call to insertion_sort, for: merge_sort and quick_sort. Submit revised code for g_merge_sort and g_quick_sort that implements this cutoff.

  3. Sorting Almost Sorted Data. When data is "almost sorted", with only a few (say k) items out of place, discuss the pros and cons of the various sort algorithms. In particular, devise an analysis of insertion_sort for almost sorted data in terms of n (the size of the data set) and k (the number of items not already in order). If you prefer, you could re-phrase the analysis in terms of the average number of places each element is "out of position" from sorted data.

  4. Key-Comp v Numerical Sorting. Given that key comparison sorts cannot run faster than Ω(n log n) and that the numerical sorts have runtime O(n), eventually the numerical sorts must be faster for sufficiently large n. Use actual timing data to estimate the value of n where this change takes place, and also discuss the tradeoffs involved, including memory use. Which of the numerical sorts are practical for these very large data sets?

  5. String Sorts. Discuss the pros and cons of sort algorithms designed specifically for strings, compared to the general-purpose sort algorithms. Consider at least two string sorts: LSD and MSD.

Step 2: Data Collection Plan. Create a plan to collect data for analysis for your chosen analysis problem. This will involve creation of data files, timing data, and/or comp_count data, appropriate for analysis of all of the sorts. The plan should be outlined in data_collection_plan.txt, and makefiles for creating input data and output results should be created that support the plan. The plan should support the analyses you have chosen to do.

Deliverables: Five files:

data_collection_plan.txt # text file describing the data that will be collected and
                         # the rationale for the choices
                         # included in sort_analysis as an Appendix
makefile.files.*   # create input data files used for your analysis 
makefile.times.*   # create output timing data used in your analysis
makefile.counts.*  # create output comp_count data used in your analysis

Note that you may have several suffixes for the makefiles. (See Hints below.)

Procedural Requirements

  1. Choose your topic, either from the list in step 1 above or another topic (cleared with the instructor).

  2. Devise a plan to collect data using a CPU timing system (and optionally comparison counters) to obtain appropriate timing / comp_count data to support your analysis. Input sizes should range from small to substantially large. The qualitative aspects of the data may also vary: for example, data with many repeats, data that is almost sorted, data with bounded values, and completely random data. Be sure that you have specific questions you want to research and answer by analyzing the collected data. Outline the data collection plan, including the questions to be researched, in the text file named "data_collection_plan.txt".

  3. Create makefiles that create the data described in data_collection_plan.txt. Name these makefiles makefile.files, makefile.times, and (optionally) makefile.counts.

  4. Turn in data_collection_plan.txt, makefile.files, makefile.times, and (optionally) makefile.counts using the script LIB/proj2/proj22submit.sh.

    Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

  5. Note: Your plan and collection makefiles may change as you get deeper into the project; just resubmit whenever changes occur.

Hints

Part III: Sort Analysis

Operational Objectives: Perform your analyses of the sort algorithms and write the report. The data collection plan submitted in Part II should be followed [or revised, resubmitted, and followed], the various analyses completed, and a paper written on your findings. The paper should be named sort_analysis.pdf. Guidelines for the structure of the paper are given below and should be followed.

Deliverables: One file:

sort_analysis.pdf   # your Assignment 5 report 

Procedural Requirements

  1. Read the analysis and report guidelines below.

  2. Collect data according to your data collection plan, perform your analyses, and write your paper.

  3. Turn in sort_analysis.pdf using the script LIB/proj2/proj23submit.sh.

    Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

  4. Also submit your report sort_analysis.pdf to the Blackboard course web site.

Analysis and Report Requirements

  1. Be sure that you use and document good investigation habits by keeping careful records of your analysis and data collection activities.

  2. Before beginning any data collection, think through which versions of sorts you are going to test. These should ideally be versions that are most comparable across all of the sorts and for which the "container overhead" is as low as practical. This means using the array case for all sorts.

  3. You also need to plan what kinds and sizes of data sets you will use. It is generally advisable to create these in advance of collecting runtime data and to use the same data sets for all of the sorts, to reduce the effects of randomness in the comparisons. On the other hand, the data sets themselves should "look random" so as not to hit any particular weakness or strength of any particular algorithm. For example: if a data set has size 100,000 but consists of integers in the range 0 .. 1000, then there will be many repeats in the data set, which could be bad for quicksort.

    Generally, it is best to use unsigned integer data for the data sets, so that they can be consumed by all of the sorts, including the numerical sorts.

  4. If you use a multi-user machine to collect data, your timings may be inflated by periods when your process is idled by the OS. One way to compensate is to do several runs and use the lowest time among them in your analysis. Most likely you will need to collect your data using linprog, because the random number generator needs 64-bit words.

  5. The framework of (pseudo) random object generators in LIB/cpp/xran.* has been upgraded to provide 32 bit integers. To simplify your tasks (and ensure some uniformity in the raw data) we supply a random unsigned int generator proj2/ranuint.cpp. This should compile and run on linprog (but not program, due to word size restrictions).

  6. The CPU timing framework in LIB/cpp/timer.* can be used to collect timing data. Again with the goals of simplifying your work load and ensuring more uniformity, a timing program is supplied in proj2/sorttimer.cpp. Like ranuint.cpp, this program requires 64-bit architecture.

  7. The supplied sort timer program outputs time in milliseconds (1ms = sec/1000). You can change your scale for your report if you like. Whatever units are chosen, you will need to deal with large differences of elapsed time, ranging over several orders of magnitude. Some displays may need log scaling in the vertical axis.

  8. Analysis should be done in two senses. First, provide an asymptotic analysis that results in the scalability curve forms shown above. This step is of course independent of platform and programming language. Formal analysis is not required, but an informed and informative discussion is expected. Second, collect data on actual runs of the sorts and use that data to support your findings. If you are doing the curve-fitting project, find a best-fit concrete scalability curve using the form derived above. This curve will depend on almost any choice made, so it is important to use the same choices across the sorts being analyzed and to eliminate irrelevant overhead costs as much as possible: same input data, same machine, simplest data structures. Other projects would follow similar guidelines, as appropriate for the problem at hand.

  9. The supplied tools should give you more time to think about the data: what kind of test data to generate, what data to collect, and how to plan to accomplish both. Be sure to address these issues in your report as well.

  10. Your report should be structured something like the following outline. You are free to change the titles, organization, and section numbering scheme.

    1. Abstract or Executive Summary
      [brief, logical, concise overview of what the report is about and what its results and conclusions/recommendations are; this is for the Admiral or online library]
    2. Introduction
      [A narrative overview "story" of what the paper is about: what, why, how, and conclusions, including how the paper is organized]
    3. Background
      [what knowledge underpins the paper, such as theory, in this case the known asymptotic runtimes of the sorts, with references, and the statistical procedure to be used, with references]
    4. Theoretical Analysis
      [asymptotic runtime analysis of each sort, concluding with what form to use in fitting a curve to runtime data.]
    5. Data Analysis Process or Procedure
      [details on what data decisions are made and why; how input data is created; how timing data is collected; and how analysis is accomplished, including references for any software packages that are used and detailed documentation on any software used for regression]
    6. Analysis Results
      [give results in both tabular and graphical form; illustrate all major conclusions and recommendations graphically where appropriate, for example with a single figure comparing concrete scalability curves superimposed in the same display]
    7. Conclusions
      [use results, including comparative tabular and/or graphical displays, to back up your conclusions]
    8. Appendix 1
      Give complete details on calculation of all coefficients.
    9. Appendix 2
      Give tables of all collected data.
    10. Appendix 3
      Give detailed descriptions of all input files, including how they were built, their size, and constraints on content. (Do not put the actual input files in the report.)
  11. Reading your report should make it clear how to use the test functions and how data was collected from them.