Version 09/03/2018

2 String Sorts

The order operator< defined for strings implements "dictionary" or lexicographical order, based on the numerical index order of the characters. In the worst case, evaluating x < y entails examining all of the characters in the two strings x and y, a Θ(W) operation (where W is the length of the strings). One of the general inquiries in this chapter is to investigate whether efficiencies can be gained in sorting and searching collections of string keys by using the constant-time order operator on the character set itself. We look at three sort algorithms adapted specifically to strings.

2.1 LSD String Sort

LSD (Least-Significant-Digit) string sort: assume that the N keys to be sorted are strings, all of the same length W, and that the alphabet has size R. (The default we are accustomed to is R = 256: strings of extended ASCII characters.) Note that we can sort a collection of characters using counting sort.

The general idea for LSD string sort is to apply counting sort on the characters at each position, beginning with the "least significant" (highest index = W-1) character and working up to the "most significant" (smallest index = 0) character. Because counting sort is stable, the final result is a correct sort of all of the keys.

This is exactly what we did implementing byte_sort, except that we were sorting actual numbers, which consisted of at most 8 bytes, so our byte_sort looked like this:

mask = 0x00000000000000FF;   // = 255, isolates the right-most byte
for i = 0 ... 7
  apply counting sort to keys, keyed on (key & mask) >> 8*i;
  mask = mask << 8;          // move the mask one byte to the left

In other words we just apply counting sort to the right-most byte, then shift over 8 bits, and repeat until we have moved the mask all the way to the left-most byte.

If we think of a string as a "base R" number, LSD sort does the same thing to string keys, except that we have much bigger "numbers" that cannot be represented numerically in the machine. Therefore we must maintain the symbolic representation as strings of "digits".

LSD can be adapted with almost no extra effort when the String class used to house strings has a built-in null terminator, a la C-strings. In fact, all that is needed is an Element(index) method that returns the null character whenever there is no character at that index.
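
For instance, such an access method might look like the following minimal sketch (a hypothetical wrapper class of our own, not the actual fsu::String implementation):

#include <cstddef>  // size_t

class String
{
public:
  // character access that never runs off the end: returns the character
  // at position i, or the null character if i is at or past the end
  char Element (size_t i) const
  {
    return (i < length_) ? data_[i] : '\0';
  }
  size_t Length () const { return length_; }
  // ... constructors, assignment, etc. ...
private:
  char*  data_;    // null-terminated character storage
  size_t length_;  // number of characters before the terminator
};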

2.1.1 LSD Pitfalls

  1. Variable Length Strings

    The algorithm is not well suited to collections of strings of varying length (although these are readily handled by the fsu::String class). It can be adapted, but a lot of energy may be wasted comparing low-significance characters to padding or dummy extensions of shorter strings.

  2. Low-significance characters may never be needed

    Considering strings from the least significant character first means that in many cases energy will be spent on ultimately irrelevant comparisons between characters that won't affect the final sorted order. For example, the 3 strings

    
        BFBOVOJQWBVFQOQBOQBVOJQBFVJQBVOQ
        ANQONVOWNOWNBJWNBONVOQWNVOWNVJNO
        CNEOBNONWJONJONVOENVJTNBWNBTOWNJ
        

    can be placed in sorted order by looking only at the leading (most significant) character. Applying LSD string sort would permute the three strings once for each character position, starting at the right-most character and working all the way to the left, and only the very last permutation, taking place on the left-most character, is necessary.

    A general conclusion is that LSD string sort is best suited to situations where the strings tend to have long prefixes in common, so that the less significant characters become relevant in the sort.

  3. Unconditional execution of loops

    Much like its numerical "little brother" byte_sort, so much energy is used running fixed-length loops that the Θ(N) algorithm often takes second place to optimized generic Θ(N log N) comparison sorts.

2.1.2 Bit, Byte, and Word Sorts

You can refresh your perspective by experimenting with notes_support/sortspy.x. Note that bit_sort is beaten by several generic sorts for most data. Note also that counting_sort is very fast, but it becomes impractical when the maximum "spread" of individual number values in the data is too large. This is due to the locally declared array of size k = 1 + max_spread in the implementation of counting_sort (which has k as a parameter). Bit_sort, byte_sort, word_sort, and LSD string sort all get around this limitation by considering the data one component [bit, byte, word, or character] at a time and looping through the components from least to most significant. Byte_sort is exactly LSD string sort on 8-character extended ASCII strings. Word_sort is exactly LSD string sort on 4-character UNICODE16 strings.

Note that the runtimes for byte_sort are doubled when data is processed using variables of type uint64_t, even though the input data is restricted to be bounded by UINT32_MAX. For strings, interpret this as having to sort at all character positions even when all of the strings have the same 4-character prefix.

2.1.3 LSD Cost Estimates

LSD string sort thus runs in time proportional to the number of characters W in the strings times the cost of one counting sort pass, which is Θ(N+R) for N strings over an alphabet of size R: altogether Θ((N+R)*W). Note also that the space overhead is proportional to the number of keys plus the size of the alphabet: Θ(N+R).

2.1.4 A Generic Hollerith Mapping

LSD string sort relies on counting_sort as discussed earlier. The version g_counting_sort below takes counting_sort all the way to a generic algorithm. It's a post-modern version of the original card sorting machine invented by Herman Hollerith, whose tabulating machine company was a forerunner of IBM. There is no explicit assumption needed on the element types being processed, and the counting_sort algorithm is re-phrased as a kind of permutation mapping an input range to an output range.

The place where numbers enter the picture is via the function object f. This may seem abstract/esoteric, but it is very useful. We will illustrate with ByteSort (for integers) and LSD (for strings).


template < class I , class J , class F >
void g_counting_sort(I source_beg, I source_end, J dest_beg, size_t R, F f)
// Pre:  I,J are iterator types with the same ElementType
//       destination range is at least as large as source range
//       f maps ElementType to int values in the range [0,R)
// Post: source range is unchanged
//       dest range is a stable f-sorted permutation of source range
//       I.e., i < j ==> f(B[i]) <= f(B[j])
//         and relative order of f-equal elements is preserved
{
  size_t * c = new size_t[R+1];    // declare counter array
  for (size_t r = 0; r <= R; ++r)  // initialize counters to 0
    c[r] = 0;
  for (I i = source_beg; i != source_end; ++i)   // count instances of f(t) == r offset by one
    ++c[1+f(*i)];                                //   c[r+1] = number of a's that map to r
  for (size_t r = 1; r <= R; ++r)  // accumulate instance counts
    c[r] += c[r-1];                //   c[r+1] = number of a's that map to 0 .. r
                                   //   c[r]   = number of a's that map to < r
  for (I i = source_beg; i != source_end; ++i)   // map a -> b
  {
    dest_beg[c[f(*i)]] = *i;   // place *i at the next open slot of its bucket
    ++c[f(*i)];
  }
  delete [] c;                     // release counter array
}

(We have migrated from the Cormen-like implementation to a Sedgewick-like implementation to get all the loops running in the same direction.) Note that the implementing code implicitly requires ranges determined by random access iterators or pointers.
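
As a quick illustration of the calling pattern (a toy example of our own, not taken from the library test code), the following stably sorts a handful of integers by their last decimal digit, so R = 10 and f(n) = n % 10:

struct LastDigit
{
  // maps a non-negative int to its last decimal digit, a value in [0,10)
  size_t operator () (int n) const { return (size_t)(n % 10); }
};

void last_digit_demo ()
{
  int source [] = { 21, 13, 40, 11, 32 };
  int dest   [5];
  fsu::g_counting_sort(source, source + 5, dest, 10, LastDigit());
  // dest is now { 40, 21, 11, 32, 13 }: f-sorted by last digit, with the
  // original relative order of 21 and 11 (both mapping to 1) preserved
}

Note that the elements themselves are never compared; the only property of the element type used is that f maps it into the range [0,R).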

2.1.5 Byte Sort

Notice that counting_sort permutes the input range by stably ordering the elements according to the function object f. The trick in making practical use of counting_sort is in finding a family of "mask-like" function objects that serve to isolate small/manageable components of the input data. Consider for example the following function class:


template <typename N>
class Byte
{
public:
  N operator () (N n)
  {
    return ((n >> offset_) & 0xFF); // the byte at the offset location
  }
  Byte() : offset_(static_cast<N>(0x00)) {}
  void SetByte(unsigned char i)
  {
    offset_ = static_cast<N>(i << 3); // bit offset of byte i = 8*i
  }
private:
  N offset_;
};

If b is a Byte<unsigned long> object, b(n) returns the ith byte of n embedded as the right-most byte in an N object. The offset i is set by the method SetByte.
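
For example (a small sanity check with an arbitrary hexadecimal value):

void byte_demo ()
{
  fsu::Byte<uint64_t> b;
  uint64_t x = 0x0000000000A1B2C3;
  b.SetByte(0);   // now b(x) == 0xC3, the low-order byte
  b.SetByte(1);   // now b(x) == ((x >> 8) & 0xFF) == 0xB2
  b.SetByte(2);   // now b(x) == 0xA1
}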

With these two helpers we can now write complete code for ByteSort:


template <typename N>
void byte_sort (N* A, size_t n)
{
  N* B = new N [n];
  fsu::Byte<N> b;
  size_t numBytes = sizeof(N);
  for (size_t i = 0; i < numBytes; ++i)
  {
    b.SetByte(i);   // byte i will be isolated with a mask
    fsu::g_counting_sort(A,A+n,B,256,b); // call the Hollerith mapping
    fsu::Swap(A,B); // swap pointers
  }
  delete [] B;
}

Note that we are swapping addresses of memory blocks, which is much more efficient than copying data (Θ(1) vs Θ(n)). There is no problem letting the local pointer A take over B's original memory allocation, since dynamically allocated blocks are not tied to the pointer name that first received them. Note, however, that the sorted data ends up back in the caller's array only because sizeof(N), and hence the number of passes, is even for the integer types we use; with an odd number of passes the sorted data would be left in the workspace block and delete [] B would delete the caller's original array, so a final copy back (or pointer exchange) would be required.

2.1.6 LSD String Sort

Quite analogous to ByteSort, the following function class combines with g_counting_sort to implement LSD string sort:

class IndexValue
{
public:
  size_t operator() ( const fsu::String& s )
  {
     return (size_t)s.Element(index_);
  }
  IndexValue () : index_(0) {}
  void SetIndex (size_t i)
  {
    index_ = i;
  }
private:
  size_t index_;
};

void LSD (fsu::Vector<fsu::String>& a, size_t L, size_t R)
// applies counting sort at character positions L, L-1, ..., 0; R = alphabet size
{
  fsu::Vector<fsu::String> b(a);
  IndexValue iv;
  for (size_t d = L + 1; d > 0; )
  {
    --d;
    iv.SetIndex(d);
    g_counting_sort(a.Begin(), a.End(), b.Begin(), R, iv);
    a.Swap(b);
  }
}

LSD string sort applies to strings of varying length without any fuss, with the help of the IndexValue function class and the fact that the null character '\0' comes before any other character in the character set. (Recall that the fsu::String method s.Element(i) returns the character at i if i is in range and '\0' otherwise, so that IndexValue(s) returns 0 whenever the index is at or beyond s.Length().)
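
For instance (assuming fsu::String can be constructed from a quoted C-string literal):

IndexValue iv;
fsu::String s("cab");   // length 3
iv.SetIndex(1);         // iv(s) == (size_t)'a'
iv.SetIndex(7);         // iv(s) == 0, because s.Element(7) returns '\0'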

2.2 MSD String Sort

Given the "right" notion of string object, the LSD approach adapts to string keys of varying length, but wide variation in string length can lead to inefficiencies. For example, suppose we have many keys of length 6 and one of length 100. Then the main loop in LSD would run 100 times and produce no meaningful change in the array on the first 94 iterations. Like ByteSort, LSD is a strictly "run-to-completion" process. It can be fast, but it cannot be sped up.

MSD (most significant digit first) uses a recursive approach: an application of counting_sort to the first (left-most = most significant) character organizes the array of keys into subarrays, one for each value of the leading character; then a recursive call on each of these subarrays completes the sort of the array.

Note that after each application of counting_sort there are R recursive calls, where R is the size of the alphabet. The (maximum possible) depth of the recursions is the string length W.
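
No MSD code is given in these notes (implementing it is part of the project at the end of this section), but the recursive structure can be sketched as follows. The sketch uses std::string and std::vector rather than the fsu classes, and follows the Sedgewick convention of mapping "string has ended" to 0 and character c to c+1, so that shorter strings precede longer strings with the same prefix; the bucket of exhausted (hence equal) strings is not recursed on. Treat this as an outline under those assumptions, not as the required implementation.

#include <string>
#include <vector>

// map position d of s into [0,R]: 0 means "past the end of s", c+1 otherwise
size_t CharAt (const std::string& s, size_t d)
{
  return (d < s.length()) ? (size_t)(unsigned char)s[d] + 1 : 0;
}

// sort a[lo,hi) on character positions d, d+1, ...; aux is scratch space the size of a
void MSDSort (std::vector<std::string>& a, std::vector<std::string>& aux,
              size_t lo, size_t hi, size_t d, size_t R)
{
  if (hi - lo <= 1) return;                  // cutoff to insertion sort belongs here
  std::vector<size_t> count (R + 2, 0);
  for (size_t i = lo; i < hi; ++i)           // count frequencies, offset by one
    ++count[CharAt(a[i],d) + 1];
  for (size_t r = 1; r <= R + 1; ++r)        // convert counts to bucket start positions
    count[r] += count[r-1];
  for (size_t i = lo; i < hi; ++i)           // distribute (stably) into aux
    aux[count[CharAt(a[i],d)]++] = a[i];
  for (size_t i = lo; i < hi; ++i)           // copy back
    a[i] = aux[i - lo];
  for (size_t r = 1; r <= R; ++r)            // recurse on each bucket of strings that
    MSDSort(a, aux, lo + count[r-1], lo + count[r], d + 1, R);  // still have characters
}

A top-level call would be MSDSort(a, aux, 0, a.size(), 0, 256), with aux pre-sized to a.size().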

2.2.1 MSD Pitfalls

  1. Small Subarrays

    Sorting an array of strings (objects with string keys) starting with the left-most character can quickly sort most of the elements after just a few characters. Think of, say, 100,000 ASCII strings to be sorted. After the first character is sorted, we have 128 distinct subarrays of size (on average) 100,000/128 ~ 800. After these subarrays are sorted using the second character we have approximately 128 * 128 = 16,384 subarrays of average size 800/128 ~ 6. Thus we are very quickly thrashing in the weeds of recursive calls on tiny subarrays.

    We saw in our general study of sort algorithms that, for the recursive sorts (merge_sort, quick_sort) having a size cut-off to insertion_sort results in significant improvement in runtime, compared to the pure recursive algorithm. That effect is even more dramatic for MSD string sort, due to the rapid decrease in size and increase in number of subarrays. Empirical studies have shown a 10-fold improvement using the cut-off to insertion_sort, optimizing at around subarray size = 10.

  2. Equal keys

    If the same key (or a long common prefix) occurs many times in the set of keys - too many for the small-subarray cutoff to apply - then a recursive call is needed for every character of those equal keys. Moreover, counting_sort is not an efficient way to discover that the characters are all equal - the count array must be created and the values counted, only to find at the end that all counts are for one value. The worst case is when all keys are equal, but a good approximation to the worst case occurs when large numbers of keys have a long common prefix.

2.2.2 MSD Cost Estimates

The time & space costs for MSD are not as simple to calculate as those of LSD, due to the recursive nature of the algorithm and to the variability due to characteristics of the set of strings being sorted. For random strings, the following can be proved [from Sedgewick/Wayne].

Proposition. Let N be the number of strings to be sorted, R the number of characters in the alphabet, W the maximum length of the strings, and w the average length of the strings. Then:

  1. To sort N random strings from an R-character alphabet, MSD string sort examines about N log_R N characters, on average.
  2. MSD string sort uses between 8N + 3R and ~7wN + 3WR array accesses to sort N strings from an R-character alphabet.
  3. To sort N strings taken from an R-character alphabet, the amount of space needed by MSD string sort is proportional to RW + N in the worst case.

2.3 Three-Way String QuickSort

This version of string sort is modelled on 3-way quick sort. The idea is to adapt quick_sort_3w to apply to the leading (left-most, index 0) character in the vector of strings, using the Alphabet::operator<. This will then re-organize the strings into three ranges: those with leading character less than the pivot character, those with leading character equal to the pivot character, and those with leading character greater than the pivot character. Then apply the same algorithm recursively to each of these three sub-ranges, with the middle range considering the second character instead of the first.

The following example illustrates the process:

NEON     BNJWOW     ABNGRW   ---> 
NVNP     ABNGRW     BNJWOW   --->
BNJWOW   GOBJNO     GOBJNO   DNIW    --->
NBKPN    MJYR       MJYR     GOBJNO  --->
ABNGRW   DNIW       DNIW     MJYR    MER       MER    --->
GOBJNO   MER        MER      MER     MJYR      MJYR   --->
MJYR     NEON       NBKPN    --->
DNIW     NVNP       NEON     --->
NO       NBKPN      NVNP     NO      --->
WNGO     NO         NO       NVNP    --->
SNTP     WNGO       SNTP     --->
MER      SNTP       WNGO     --->
ABNGRW
BNJWOW
DNIW
GOBJNO
MER
MJYR
NBKPN
NEON
NO
NVNP
SNTP
WNGO

The three ranges are color coded blue, red, and green, with the red color omitted from the first letter in the middle range. The above illustrates a run to completion, but of course the algorithm does not proceed left to right uniformly as in the illustration; rather, recursive calls are made first on the blue range, then the red range, then the green range. The pivot element in each range is underscored. The illustration terminates the process when the range size is <= 1; in an actual implementation a cutoff to insertion sort should happen when the range size is small, but larger than 1. The illustration also ignores possible permutations within the elements making up the three ranges. A sketch of the partitioning recursion is given below.
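
As with MSD, the implementation is left to the project, but the partitioning just illustrated can be sketched as follows, reusing the CharAt helper from the MSD sketch above (so the "string has ended" value 0 is treated as smaller than every real character). Again, this is an outline under our own naming assumptions, not the required implementation.

#include <string>
#include <vector>
#include <utility>   // std::swap

// 3-way partition a[lo,hi) on the character at position d, then recurse
void Quick3String (std::vector<std::string>& a, size_t lo, size_t hi, size_t d)
{
  if (hi - lo <= 1) return;             // cutoff to insertion sort belongs here
  size_t lt = lo, gt = hi - 1;          // invariant: a[lo..lt-1] < pivot char,
  size_t v = CharAt(a[lo], d);          //            a[gt+1..hi-1] > pivot char
  size_t i = lo + 1;
  while (i <= gt)
  {
    size_t t = CharAt(a[i], d);
    if      (t < v) std::swap(a[lt++], a[i++]);
    else if (t > v) std::swap(a[i], a[gt--]);
    else            ++i;
  }
  Quick3String (a, lo, lt, d);          // "less than" range
  if (v > 0)                            // "equal" range: advance to the next character,
    Quick3String (a, lt, gt + 1, d+1);  //   unless these strings have all ended
  Quick3String (a, gt + 1, hi, d);      // "greater than" range
}

The top-level call is Quick3String(a, 0, a.size(), 0); in practice a cutoff to insertion sort replaces the size-1 test.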

String Sort Project

Implement the three string sorts discussed above: LSD, MSD, and SQS3w, applying the optimizations discussed. Using collections of strings with various data characteristics, test these algorithms against the optimized generic sorts. The goal is to develop recommendations for which sort to use, by data characteristic.