Project 1: Hash Analysis

Analysis Methodology for Hash Tables

Version 08/19/17

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

Describe and explain in detail the concept of a hash table
Implement hash tables as a vector of lists
Define and implement the ADT Table using a private hash table structure
Define and implement a bidirectional iterator class for this implementation of Table
Calculate the theoretical bucket size distribution for Simple Uniform Hashing and a given table size
Calculate the actual bucket size distribution for a given instance of a hash table
Add methods to fsu::HashTable performing these calculations

Operational Objectives: Implement two methods in the class template HashTable<K,T,H>:

size_t  HashTable<K,T,H>::MaxBucketSize () const;
void    HashTable<K,T,H>::Analysis      (std::ostream& os) const;

conforming to the requirements and specifications given below.

Background Knowledge Requirements: Before starting software development you should study and be familiar with the following:

The lecture notes on Hash Tables
The distributed code implementing fsu::HashTable in LIB/tcpp/hashtbl.h
The supplemental notes on Hash Table Analysis

Deliverables: Three files:

hashtbl.cpp  # contains implementations of MaxBucketSize and Analysis
makefile     # builds 10 executables (4 hasheval*.x, 4 fhtbl*.x, plus rantable.x and hashcalc.x)
log.txt      # your experience log

Note that hashtbl.cpp is a slave file for hashtbl.h. Your log.txt should contain the date/time of each work session and a brief description of the activity during that session. The log should end with a brief discussion of your experience and the knowledge gained from testing various hash functions and load factors.

Procedural Requirements

The official development/testing/assessment environment is specified in the Course Organizer. Code should compile without warnings or errors.
In order not to confuse the submit system, create and work within a separate subdirectory cop4531/proj1.
Maintain your work log in the text file log.txt as documentation of effort, testing results, and development history. This file may also be used to report on any relevant issues encountered during project development.

Begin by copying all of the files in the course project directory into yours, along with a few others that will be helpful:


LIB/tcpp/hashtbl.h              # HashTable<> and HashTableIterator<>, except Analysis and MaxBucketSize
LIB/tcpp/hashtbl.cpp            # stub file to be completed
LIB/tests/fhtbl.cpp             # test harness for hash tables
LIB/tests/hashcalc.cpp          # calculates hash values interactively
LIB/tests/hasheval.cpp          # test focusing specifically on Analysis
LIB/tests/rantable.cpp          # creates random table data
LIB/area51/fhtblKISS_i.x        # linprog/Intel/Linux executables
LIB/area51/fhtblModP_i.x        # ...
LIB/area51/fhtblMM_i.x
LIB/area51/fhtblSimple_i.x
LIB/area51/rantable_i.x
LIB/area51/hashcalc_i.x
LIB/area51/hashevalKISS_i.x
LIB/area51/hashevalModP_i.x
LIB/area51/hashevalMM_i.x
LIB/area51/hashevalSimple_i.x

The executables in area51 are distributed only for your information and experimentation. You have the source code for these (except for hashtbl.cpp) and can build these to test your own code.

The file hashtbl.h is copied ONLY FOR YOUR CONVENIENCE. Note that this file is NOT submitted to your portfolio, so any code you write must work with the file as it exists in the course library at the time of assessment. It is a good idea to rename your copy to "hashtbl.header" to ensure that your code is reading the file from the library.

Your file hashtbl.cpp should contain implementations of Analysis and MaxBucketSize.
Be sure that you have established the submit script LIB/scripts/submit.sh as a command in your ~/.bin directory.

Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu or quake.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

Code Requirements and Specifications - MaxBucketSize and Analysis

MaxBucketSize should return the size of the largest bucket in the hash table instance.
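
The idea behind MaxBucketSize can be illustrated on a bare vector of lists. This is only a minimal sketch: the free function LargestBucket and its parameter are hypothetical names chosen for illustration and are not part of fsu::HashTable, whose private data layout is defined in hashtbl.h.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical helper (not part of the course library): returns the size of
// the largest bucket in a vector-of-lists bucket array, or 0 if there are
// no buckets at all.
template <typename Entry>
std::size_t LargestBucket(const std::vector<std::list<Entry>>& buckets)
{
  std::size_t max = 0;
  for (const auto& bucket : buckets)
  {
    if (bucket.size() > max)
      max = bucket.size();  // remember the largest size seen so far
  }
  return max;
}
```

A single pass over the bucket vector suffices, so the cost is proportional to the number of buckets, provided that the list's size() runs in constant time (as it does for std::list since C++11).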

Analysis should result in a display (to the std::ostream passed in) as illustrated here:

      table size:           9997
      number of buckets:    9973
        nonempty buckets:   6326
        max bucket size:    7
      expected search time: 2.00
        actual search time: 2.58

bucket size distributions
-------------------------
      size       actual         theory (uniform random distribution)
      ----       ------         ------
         0         3647         3659.9
         1         3685         3669.0
         2         1846         1838.9
         3          608          614.4
         4          145          153.9
         5           37           30.9
         6            4            5.2
         7            1            0.7
         8                         0.1

This display shows the size of the table, the number of buckets, the number of non-empty buckets, the max bucket size, the expected search time [1 + (table size)/(number of buckets)], and the actual average search time [1 + (table size)/(number of non-empty buckets)]. A tabular printout of the bucket size distribution follows, showing the bucket size, the actual number of buckets of that size, and the expected number for simple uniform hashing. The table print terminates for bucket size n when there are no buckets of size > n and the theoretical count for size n is < 0.05. Display the theoretical counts to the nearest tenth as depicted above.

Algorithm for MaxBucketSize and Analysis. Use the algorithms developed in notes (see course organizer).
Thoroughly test your implementation for correct functionality with the provided test clients fhtbl.cpp and hasheval.cpp, using a variety of tables you create with rantable.cpp. Be sure to test the following variations:
1. Tables of various sizes, small to very large (at least 1,000,000)
2. Varieties of hash functions (four are provided: ModP, KISS, MM, and Simple)
3. Load factor = n/b = ratio of table size to (approximate) number of buckets (0.1, 1.0, 10.0, 100.0 are suggested)
4. Prime / nonprime number of buckets
The test harnesses fhtbl.cpp and hasheval.cpp are easily changed via comment/uncomment of typedefs to accommodate the variations in hash functions. The prime/non-prime number of buckets is a constructor argument (default value "true", meaning a prime number of buckets).
Write a short summary giving your experience and lessons learned during the testing of variations as above. Turn this in as an addendum to log.txt.

Hints

Use Hash Analysis Proposition 3 as a check for internal consistency during Analysis.
In calculating the theoretical distribution, you can restrict the size of the vector to be the same as that storing the actual distribution. There may be a few extra entries that need calculating for the display, but these can be done iteratively. This will save a huge amount of storage space, most of which would otherwise hold very small numbers (or zero).
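
One way to compute the theoretical entries iteratively is sketched below. It rests on an assumption that should be checked against the course notes: under simple uniform hashing each of the n keys lands in any of the b buckets with probability 1/b, so the expected number of buckets of size k is b * C(n,k) * (1/b)^k * (1 - 1/b)^(n-k). Each term then follows from the previous one by multiplying by (n - k) / ((k + 1)(b - 1)). The function name, its parameters, and the reading of the termination rule (stop at the first size beyond the largest actual bucket whose expected count is below 0.05) are all hypothetical choices made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the course's algorithm): expected number of
// buckets of size 0, 1, 2, ... for n keys in b buckets, assuming bucket
// sizes follow Binomial(n, 1/b). Entries are produced through maxActual,
// then continue only while the expected count is at least 0.05.
std::vector<double> TheoreticalDistribution(std::size_t n, std::size_t b,
                                            std::size_t maxActual)
{
  std::vector<double> expected;
  if (b < 2) return expected;  // sketch ignores the degenerate 0/1-bucket cases

  const double dn = static_cast<double>(n);
  const double db = static_cast<double>(b);

  // k = 0 term: b * (1 - 1/b)^n, evaluated via logarithms.
  double e = db * std::exp(dn * std::log1p(-1.0 / db));

  for (std::size_t k = 0; k <= n; ++k)
  {
    if (k > maxActual && e < 0.05) break;  // nothing more worth displaying
    expected.push_back(e);
    // Ratio of consecutive binomial terms: (n - k) / ((k + 1) * (b - 1)).
    const double dk = static_cast<double>(k);
    e *= (dn - dk) / ((dk + 1.0) * (db - 1.0));
  }
  return expected;
}
```

For n = 9997, b = 9973 and a largest actual bucket of 7, this yields nine entries beginning with about 3659.9, 3669.0, and 1838.9, in line with the sample display. One known limitation of the sketch: at very high load factors the k = 0 term can underflow to zero, after which every later term is also zero, so a more careful version would need to start the recurrence elsewhere or work in logarithms.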

These bits of code show the formatting of the table columns:

  int width0 = 10, width1 = 13, width2=15;

...

  // details header
  os << std::setw(width0) << "bucket size distributions" << '\n'
     << std::setw(width0) << "-------------------------" << '\n';
  os << std::setw(width0) << "size" 
     << std::setw(width1) << "actual"
     << std::setw(width2) << "theory" << " (uniform random distribution)" << '\n';
  os << std::setw(width0) << "----" 
     << std::setw(width1) << "------"
     << std::setw(width2) << "------" << '\n';
...
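
A data row can be formatted with the same column widths. The following is a hedged sketch rather than the reference implementation: WriteRow and its parameters are hypothetical, and leaving the actual column blank when showActual is false is inferred from the last row of the sample display.

```cpp
#include <cstddef>
#include <iomanip>
#include <ios>
#include <ostream>

// Hypothetical helper: writes one row of the distribution table using the
// widths from the header code (10, 13, 15). The theory value is shown in
// fixed notation to the nearest tenth.
void WriteRow(std::ostream& os, std::size_t size, std::size_t actual,
              bool showActual, double theory)
{
  const int width0 = 10, width1 = 13, width2 = 15;

  // Save the stream's formatting state so the caller's settings survive.
  const std::ios_base::fmtflags flags = os.flags();
  const std::streamsize precision = os.precision();

  os << std::setw(width0) << size;
  if (showActual)
    os << std::setw(width1) << actual;
  else
    os << std::setw(width1) << ' ';  // blank column past the largest bucket
  os << std::fixed << std::setprecision(1)
     << std::setw(width2) << theory << '\n';

  os.flags(flags);
  os.precision(precision);
}
```

Because std::setw applies only to the next item written, it has to be repeated for every column, whereas std::fixed and std::setprecision persist on the stream, which is why the sketch restores the original state before returning.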