COP 4531 - Programming Homework II

Sorting Large Files a.k.a External Memory Sorting
Due Date: 25th April


Introduction

External sorting refers to the sorting of a file that is on disk. Internal sorting refers to the sorting of an array of data that is in RAM. The main concern with external sorting is to minimize disk access since reading a disk block takes about a million times longer than accessing an item in RAM. External memory sorting is called for usually when the data is so huge that it wont fit in main memory, for instance, huge databases, file systems, graphics applications where the 3D models are huge, pattern recognition and machine learning applications, etc.

Perhaps the simplest form of external sorting is to use a fast internal sort with good locality of reference (which means that it tends to reference nearby items, not widely scattered items) and hope that your operating system's virtual memory can handle it. (Quicksort is one sort algorithm that is generally very fast and has good locality of reference.) If the file is too huge, however, even virtual memory might be unable to fit it. Also, the performance may not be too great due to the large amount of time it takes to access data on disk.

Disks with huge capacities are often designed such that peak performance is achieved when we access the blocks in sequential manner. Most external sorting methods use the following general strategy. Make a first pass through the file to be sorted, breaking it into blocks about the size of the memory and sort these blocks. Then, merge the sorted blocks together, if necessary by making several passes through the file, creating successively larger sorted blocks until the whole file is sorted. This approach is called external memory merge-sort and has been in use since 1950s. Note that this method does not use the fact that the keys are integers. For the least we could sort the blocks in internal memory using radix sort to make this implementation faster (since in this assignment the keys are integers). Here is some more material on external memory algorithms for sorting from Jeff Erickson. (1.1.1 B-Trees are a vairant of a-b trees) and from Peter Sanders. You might also want to look the chapter "Special purpose sorting methods" in the book "Algorithms in C++"/"Algorithms in Java" by Robert Sedgewick

In this assignment you will sort a file of fixed size records, stored in a file on disk, using a given amount of main memory. Note that when you are working with massive data sets, you need to take care of how your disk works. It is a good idea to read chunks of data at a time from the disk than compared to randomly accessing the disk. The input data to your code will run into Gigabytes, so you should make sure your program can handle data sizes of more than 4GB. The files that I will run the program on will be huge. The benchmark machine will have dual processors, so in case you need more CPU power you should think of using posix threads.

Where to Get Help?

Come to my "programming homework" office hours. I will have extra office hours for the programming assignment on Wednesdays after class. I will help you design and analyse the algorithms that you can come up with for this task and tell you the tricks and traps of algorithms that need to handle massive data sets. I will point you to material that you could read for this task.

What to implement?

You have to implement the following classes:



// All File IO must be performed using stdio.h

using namespace std;

template < unsigned SizeOfRec >
class Record {
  private:
       char m_chdata[SizeOfRec];  // Data for record contained here

  public:
       Record();
       ~Record();
       long size();
/* 
   Other comments for the class
   ----------------------------
   The class must implement a copy constructor.
   The class must implement a operator[] for the data in the class
   The class must implement an assignment operator.

*/
};

template < unsigned SizeOfRec >
bool operator<(const Record &  r1,const Record &  r2){
  //  Put your comparison function here.
  return true;
};


// Total amount of hard drive space used should not exceed twice the
// size of the input file.

// Total amount of memory used should not exceed the amount input by
// the user. An error should occur if you can not sort in the amount
// of memory the user inputs.

template < typename Rec > 
class FileSorter {
       FILE *m_h_inpfile;   // handle to input file
       FILE *m_h_outfile;   // handle to output file
       long m_lnrecords;    // Number of records in file.
       int  m_i_amt_of_mem;

  public:
       FileSorter(
          string& inFile,   // Name of unsorted heapfile.
          string& outFile,  // Name of sorted heapfile.
          int amt_of_mem    // Amount of memory in MB available for sorting
                            // Your program should not allocate a total of more than amt_of_mem MB memory at any
                            // given time. In my benchmarks, the input file size will always exceed amt_of_mem MB.
       );
       ~FileSorter();       // destructor 

       int TwoPassMergeSort(long i, long j); // Only sorts records i to j.
       int TwoPassMergeSort();               // Sorts the whole file.
       // Return value for sorters should correspond to :
       // Status of call: Should return 1 if sorted, -x for an error. You should say what -x is, when you code.
       // For instance -1 = No disk space left
       //              -2 = File IO error
       //              ...

       // The following function should print a verbose output of what
       // went wrong when sorting was attempted
       void perror(int x);

};


// The size of main should not exceed 10 lines
// (or 800 characters, whichever is less)
int main(int argc, char **argv){

  return 0;
}

Feel free to add your own classes (especially for Buffers of records which will handle input / output). Reading and writing is done using the read and write system calls. The IO streams need to maintain a size B buffer inside them so that they can read or write B size blocks on disk at a time. Note that you are given M = size of memory = amt_of_mem and the choice of B should be such that all the buffers in your code do not make your code allocate more than size amt_of_mem. Also note that for streams, its better to have larger B sizes compared to smaller ones, why?

Coding Requirement: You are not allowed to use any static variables in your program. Use Hungarian naming of variables. Please use underscores to separate words in a name. for example

	/* Ignore changes in horizontal whitespace (-b).  */
	int ignore_space_change_flag;
Space Requirement: Your program should not use more than twice the amount of space than the size of the input file at any given time. For instance if the input file size is 100GB, the code is not allowed to create temporary files of size 300GB. The most amount of disk space the code is allowed to use is 200GB at any given time. You are allowed to destroy the input file while creating the output file if that helps you.

Grading

I should be able to run it on Linux (Redhat) + Intel Xeon Dual processor systems (with hyperthreading on) (In case you decide to use something low level in the code for speed). Since the system has more than 1 cpu (actually 4 with hyperthreading on), if you want you can also use posix threads to sort more internal memory data using threads. The disk that your program will be benchmarked on will consist of 4 Western Digital 74, 10000RPM, disks parallely emulated to behave as 1 disk (using RAID 0 with 3ware cards). So sequential access is blazingly fast on this disk.

Make sure that your program does not take more memory than it is allowed. You can monitor this using the "top" command in linux (or using the windows task manager) and you should keep this in mind when you are designing your algorithm and code. If your code does not sort correctly, you might end up getting a zero in the entire homework even if you are very fast and have good documentation. So correctness should be your first priority. Make sure that you document your code and algororithm well. The algorithm's documentation is expected to be in HTML and on the homework webpage of yours. In the end I will link up all the homework of the class on this page.

Important: Your implementation should beat the most naive way to solve this problem, i.e. using randomized quicksort on the data. This makes (log n) passes on the file using a random pivot to sort the file. Your algorithm should do less than O(log n) passes on the file and hence should be faster than quicksort. If you loose to quicksort, you risk not getting any points for the homework.

What to turn in and when?

Tar all the files into a single file. (No data files in the tar file). Make a homework home page where you will have your documentation in HTML, downloadable code(the tar file, gzipped/bzipped/zipped). Describe your algorithm in the homework home page. Send me the link to your homework home page when you are done.

Your tar file should contain the makefile to make the executable that I need to run. It should have a README file that tells me how to run the code on my input file. For instance if my input file is "input.dat" with 5000,0000 records, each record is of size 100 bytes, key size is 8 bytes, and i want it sorted in increasing order, and the maximum amount of memory your code could use at any given time during the run is 32 MB, I should be able to run it like this
tar -xvf project.tar
make
extsort input.dat output.dat 100 8 32

(here 1 stands for increasing order). I will announce in class the mode of submission. All you will need to submit is the project.tar file which should consist of a makefile that should compile your extsort program and your documentation using doxygen. The doxygen documentation should reside in the doc/ directory.

The deadline for homework II submission is 10:00am, April 25th.

A Small FAQ?

  1. I want to do the homework, what do I need to do?
    You need to first read this webpage carefully. Then send the TA an email with your ID (and your partner's id). Make sure you attend the recitations this wednesday.
  2. Can we do the homework in groups?
    Yes! But the group size can not exceed 2. If you want to do it in a group, send me an email with the name and IDs of you and your partner.
  3. Can we use do it on Windows?
    Yes! But only as long as your code compiles on linux without problems. You can use cygwin on windows to try to make sure that your code is compatible with linux. Preferably this project should be done on a Unix/Linux system.
  4. What programming languages are allowed?
    (C/C++/Java/Python)! As long as I can use the above commands to run your code on my machine. If you want me to use another language to implement the project, send me an email and I will add it to the list. Remember that speed will be a criterion for evaluation (so you really should keep away from crawling languages like Java and Python for this project).
  5. Can I output compressed data?
    No. The input is uncompressed, and the output should be just a permutation of the input.
  6. Can we use assembler?
    Yes, but it is a bit surprising if you need to. We want you to be good at algorithms, not compilers. :)
  7. Can the sort output destroy (reuse) the sort input file?
    Yes. The sort must create a new output file and allocate space for it. But, it can delete the old file and reclaim that space if that is needed to run the task.

  8. On what machine should I experiment?
    Any machine should be fine (with a reasonable size disk, at least a few Gigabytes). Make sure that you delete the files that you generate while experimentation.
  9. What is the minimum amount of memory my program will have?
    You can assume that the minimum amount of memory your program will have is 32MB.
  10. How do I know how much disk space do I have?
    On windows, right click on the drive and see properties, On Linux/Unix use "df"
  11. On what size data set should I test my program before submitting?
    Sort a file greater than 4GB size with your implementation. For some operating systems you might have to use special commands to handle files larger than 4GB. Make sure you use them. For instance in "C", if you need "fopen", you should use "fopen64".
  12. On what data will my code be benchmarked?
    Data from different sources. It will not be taken only from SortGen.cpp. So if you want your code to be fast, experiment with integer keys from different sources.
    I will put up some benchmarking data in a week or so. The data I will benchmark on will not have keys more than 128 bit in length. Usual key sizes will be 8 bytes or 4 bytes.
  13. What if my partner does not work?
    Let me know and I'll try to hook you up with another person if one is available. At least I will split the group.
  14. Is there a trophy for the winner?
    Almost Yes! You get 25% extra credit for the course (Which almost gets you an A!). This will be awarded to the best overall program.
  15. I am working on a project, should I bother?
    People who have started working on their projects are exempted from this assignment. If you have not started working on the project, you can still opt to do the assignment.

Some Helpful code

There is no guarantee that the following code is not buggy! Let me know if you find any problems or bugs in it. All questions about the programming assignment should be clarified with me after class either on Monday or Wednesday. Specially, I have not yet checked ChkSort, I will do it as soon as I get time. In the meanwhile if you find a bug, lemme know. The original code came from Microsoft Research for the following code.

More Help with homework

This wednesday morning (22nd March), I will talk about this homework in recitations. Make sure you attend. I also plan to link/put your doxygen documentation on the web on this page after 25th.

Useful Links

Here are some links that will help you shamelessly copy the code for the assignment. I did the searching for you already. (But I have modified the specs of the input output, and I did not find anything on the web that has exactly the same specs) I have not benchmarked this code or tried to compile it. You are most welcome to use it to begin your project. (But I would not advise so). It is a great idea to try to look into this code and understand why they did what they did... If you find other such code on the web, please do let me know so that I can add it here.

Links to additional resources

Recommended Documentation Software: DoxyGen

Mmap/DMA tutorial [ mmap manpage ] [ A simple mmap example code ]

An interesting paper (Dynamic memory adjustment for external memory merge sort)
Some more external memory sorting references
New AIO Efforts on the kernel

Java Vs C++ Performance issues
Parallel Integer Sorting: You can get some tips on how to sort integers from here. Note that this work was for massively parallel machines and does not apply to us. Just shows you an example of how people attack integer sorting for large data.

My collection of programming links.
Back to course homepage
Piyush Kumar (piyush at cs dot fsu dot edu)