Perhaps the simplest form of external sorting is to use a fast internal sort with good locality of reference (which means that it tends to reference nearby items, not widely scattered items) and hope that your operating system's virtual memory can handle it. (Quicksort is one sort algorithm that is generally very fast and has good locality of reference.) If the file is too huge, however, even virtual memory might be unable to fit it. Also, the performance may not be too great due to the large amount of time it takes to access data on disk.
Disks with huge capacities are often designed such that peak performance is achieved when we access the blocks in sequential manner. Most external sorting methods use the following general strategy. Make a first pass through the file to be sorted, breaking it into blocks about the size of the memory and sort these blocks. Then, merge the sorted blocks together, if necessary by making several passes through the file, creating successively larger sorted blocks until the whole file is sorted. This approach is called external memory merge-sort and has been in use since 1950s. Note that this method does not use the fact that the keys are integers. For the least we could sort the blocks in internal memory using radix sort to make this implementation faster (since in this assignment the keys are integers). Here is some more material on external memory algorithms for sorting from Jeff Erickson. (1.1.1 B-Trees are a vairant of a-b trees) and from Peter Sanders. You might also want to look the chapter "Special purpose sorting methods" in the book "Algorithms in C++"/"Algorithms in Java" by Robert Sedgewick
In this assignment you will sort a file of fixed size records, stored in a file on disk, using a given amount of main memory. Note that when you are working with massive data sets, you need to take care of how your disk works. It is a good idea to read chunks of data at a time from the disk than compared to randomly accessing the disk. The input data to your code will run into Gigabytes, so you should make sure your program can handle data sizes of more than 4GB. The files that I will run the program on will be huge. The benchmark machine will have dual processors, so in case you need more CPU power you should think of using posix threads.
Come to my "programming homework" office hours. I will have extra office hours for the programming assignment on Wednesdays after class. I will help you design and analyse the algorithms that you can come up with for this task and tell you the tricks and traps of algorithms that need to handle massive data sets. I will point you to material that you could read for this task.
// All File IO must be performed using stdio.h using namespace std; template < unsigned SizeOfRec > class Record { private: char m_chdata[SizeOfRec]; // Data for record contained here public: Record(); ~Record(); long size(); /* Other comments for the class ---------------------------- The class must implement a copy constructor. The class must implement a operator[] for the data in the class The class must implement an assignment operator. */ }; template < unsigned SizeOfRec > bool operator<(const RecordFeel free to add your own classes (especially for Buffers of records which will handle input / output). Reading and writing is done using the read and write system calls. The IO streams need to maintain a size B buffer inside them so that they can read or write B size blocks on disk at a time. Note that you are given M = size of memory = amt_of_mem and the choice of B should be such that all the buffers in your code do not make your code allocate more than size amt_of_mem. Also note that for streams, its better to have larger B sizes compared to smaller ones, why?& r1,const Record & r2){ // Put your comparison function here. return true; }; // Total amount of hard drive space used should not exceed twice the // size of the input file. // Total amount of memory used should not exceed the amount input by // the user. An error should occur if you can not sort in the amount // of memory the user inputs. template < typename Rec > class FileSorter { FILE *m_h_inpfile; // handle to input file FILE *m_h_outfile; // handle to output file long m_lnrecords; // Number of records in file. int m_i_amt_of_mem; public: FileSorter( string& inFile, // Name of unsorted heapfile. string& outFile, // Name of sorted heapfile. int amt_of_mem // Amount of memory in MB available for sorting // Your program should not allocate a total of more than amt_of_mem MB memory at any // given time. In my benchmarks, the input file size will always exceed amt_of_mem MB. ); ~FileSorter(); // destructor int TwoPassMergeSort(long i, long j); // Only sorts records i to j. int TwoPassMergeSort(); // Sorts the whole file. // Return value for sorters should correspond to : // Status of call: Should return 1 if sorted, -x for an error. You should say what -x is, when you code. // For instance -1 = No disk space left // -2 = File IO error // ... // The following function should print a verbose output of what // went wrong when sorting was attempted void perror(int x); }; // The size of main should not exceed 10 lines // (or 800 characters, whichever is less) int main(int argc, char **argv){ return 0; }
Coding Requirement: You are not allowed to use any static variables in your program. Use Hungarian naming of variables. Please use underscores to separate words in a name. for example
/* Ignore changes in horizontal whitespace (-b). */ int ignore_space_change_flag;Space Requirement: Your program should not use more than twice the amount of space than the size of the input file at any given time. For instance if the input file size is 100GB, the code is not allowed to create temporary files of size 300GB. The most amount of disk space the code is allowed to use is 200GB at any given time. You are allowed to destroy the input file while creating the output file if that helps you.
Make sure that your program does not take more memory than it is allowed. You can monitor this using the "top" command in linux (or using the windows task manager) and you should keep this in mind when you are designing your algorithm and code. If your code does not sort correctly, you might end up getting a zero in the entire homework even if you are very fast and have good documentation. So correctness should be your first priority. Make sure that you document your code and algororithm well. The algorithm's documentation is expected to be in HTML and on the homework webpage of yours. In the end I will link up all the homework of the class on this page.
Important: Your implementation should beat the most naive way to solve this problem, i.e. using randomized quicksort on the data. This makes (log n) passes on the file using a random pivot to sort the file. Your algorithm should do less than O(log n) passes on the file and hence should be faster than quicksort. If you loose to quicksort, you risk not getting any points for the homework.
Tar all the files into a single file. (No data files in the tar file). Make a homework home page where you will have your documentation in HTML, downloadable code(the tar file, gzipped/bzipped/zipped). Describe your algorithm in the homework home page. Send me the link to your homework home page when you are done.
Your tar file should contain the makefile to make the executable that I need to run. It should have a README file that tells me how to run the code on my input file. For instance if my input file is "input.dat" with 5000,0000 records, each record is of size 100 bytes, key size is 8 bytes, and i want it sorted in increasing order, and the maximum amount of memory your code could use at any given time during the run is 32 MB, I should be able to run it like thistar -xvf project.tar make extsort input.dat output.dat 100 8 32
(here 1 stands for increasing order). I will announce in class the mode of submission. All you will need to submit is the project.tar file which should consist of a makefile that should compile your extsort program and your documentation using doxygen. The doxygen documentation should reside in the doc/ directory.
The deadline for homework II submission is 10:00am, April 25th.
There is no guarantee that the following code is not buggy! Let me know if you find any problems or bugs in it. All questions about the programming assignment should be clarified with me after class either on Monday or Wednesday. Specially, I have not yet checked ChkSort, I will do it as soon as I get time. In the meanwhile if you find a bug, lemme know. The original code came from Microsoft Research for the following code.
tar -xvf SortGen.tar make SortGen 100 input.dat
tar -xvf ChkSort.tar make ChkSort input.dat output.dat
Recommended Documentation Software: DoxyGen
Mmap/DMA tutorial [ mmap manpage ] [ A simple mmap example code ]
An interesting paper (Dynamic memory adjustment for external memory merge sort)
Some more external memory sorting references
New AIO Efforts on the kernel
Java Vs C++ Performance issues
Parallel Integer Sorting: You can get some tips on how to sort integers from here.
Note that this work was for massively parallel machines and does not apply to us. Just shows you an example of how people attack integer sorting for large data.