Given an alphabet of symbols, a string over that alphabet is defined by this grammar:
s -> xs, for any x in the alphabet
s -> ε
Generally we denote by R the size of an alphabet. This comes from "radix", an alternative term for "base" as in "base 10 numbers". Both R and lgR, the number of bits needed to represent an index (the base 2 logarithm of R, rounded up to an integer), are important characteristics of strings.
These are alphabets that are likely to be in use in various IT contexts:
name            R      lgR  characters
BINARY          2      1    01
DNA             4      2    ACTG
OCTAL           8      3    01234567
DECIMAL         10     4    0123456789
HEXADECIMAL     16     4    0123456789ABCDEF
PROTEIN         20     5    ACDEFGHIKLMNPQRSTVWY
LOWERCASE       26     5    abcdefghijklmnopqrstuvwxyz
UPPERCASE       26     5    ABCDEFGHIJKLMNOPQRSTUVWXYZ
ALPHANUM        62     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
BASE64          64     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
ASCII           128    7    ASCII characters
EXTENDED_ASCII  256    8    extended ASCII characters (upper range = "control characters")
UNICODE16       65536  16   Unicode characters { examples: a (Latin), Ä (Germanic), ǩ (Skolt Sami), Ӝ (Cyrillic), ∮ (math), ݗ (Arabic), ઊ (Gujarati), ᗖ (unified Canadian Aboriginal), 㗷 (CJK unified ideographs), € (Euro currency), ❤ (heart), ⛅ (sun behind cloud) }
Note that UNICODE16 will likely take over as the alphabet of choice in many applications. It represents the painted characters of ancient oriental languages as well as the postmodern digital characters that are becoming commonplace in human communications, such as emoticons, "hearts", "smiley faces", and the "euro" currency symbol. In many ways the digital revolution is enabling an expansion of alphabets toward the ancient painted ones, coming full circle.
Q: when will the first child be given a name spelled in post-modern UNICODE16? It's coming! If one of you does that, please let me know.
To use alphabets (and strings) in computing we need some attributes, such as those captured by this pseudo-class:
class Alphabet
{
public:
         Alphabet  (String s)        // build alphabet from the characters in s
  char   toChar    (uint i)          // index to character
  uint   toIndx    (char c)          // character to index
  bool   contains  (char c)          // true iff c is in alphabet
  uint   R         ()                // number of characters in alphabet (radix)
  uint   lgR       ()                // bits required to represent an index
  uint[] toIndices (String s)        // converts string s to base R integer
  String toChars   (uint[] indices)  // converts base R uint to string in the alphabet
};
As an example, consider the C-string "cop4531". This is represented in C/C++ as the character array
[c,o,p,4,5,3,1,\0]
The toIndices array of "cop4531" is:
[99,111,112,52,53,51,49]
using the standard mapping between integers and ASCII symbols.
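The pseudo-class and the "cop4531" example can be made concrete with a minimal C++ sketch. This is only one possible realization, under assumptions that are not in the notes: characters are single bytes, indices are assigned in order of first appearance in the constructor string, lgR is computed as the smallest b with 2^b >= R, and makeAscii is a hypothetical helper that builds the 128-character ASCII alphabet so that each index equals the character code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch of the Alphabet pseudo-class (single-byte characters assumed).
class Alphabet {
public:
    // Build the alphabet from the distinct characters of s, in order of first appearance.
    explicit Alphabet(const std::string& s) : index_(256, -1) {
        for (unsigned char c : s) {
            if (index_[c] == -1) {
                index_[c] = static_cast<int>(chars_.size());
                chars_.push_back(static_cast<char>(c));
            }
        }
    }
    char toChar(std::size_t i) const { assert(i < chars_.size()); return chars_[i]; }
    std::size_t toIndx(char c) const {
        assert(contains(c));
        return static_cast<std::size_t>(index_[static_cast<unsigned char>(c)]);
    }
    bool contains(char c) const { return index_[static_cast<unsigned char>(c)] != -1; }
    std::size_t R() const { return chars_.size(); }
    // Bits required to represent an index: smallest b with 2^b >= R.
    std::size_t lgR() const {
        std::size_t bits = 0;
        while ((std::size_t{1} << bits) < R()) ++bits;
        return bits;
    }
    std::vector<std::size_t> toIndices(const std::string& s) const {
        std::vector<std::size_t> out;
        for (char c : s) out.push_back(toIndx(c));
        return out;
    }
    std::string toChars(const std::vector<std::size_t>& indices) const {
        std::string out;
        for (std::size_t i : indices) out.push_back(toChar(i));
        return out;
    }
private:
    std::string chars_;       // index -> character
    std::vector<int> index_;  // character code -> index, or -1 if absent
};

// Hypothetical helper: the ASCII alphabet, where index == character code.
inline Alphabet makeAscii() {
    std::string s;
    for (int c = 0; c < 128; ++c) s.push_back(static_cast<char>(c));
    return Alphabet(s);
}
```

With this sketch, makeAscii().toIndices("cop4531") yields the array shown above, and Alphabet("ACTG") reports R() == 4 and lgR() == 2, matching the DNA row of the table.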
Using the same notation as above, R = number of characters in the alphabet, a string can be thought to represent a "base R" number. For example, let n be an unsigned long, that is, a 64-bit number. n is stored in a 64-bit register, which has 8 bytes. Let's say these 8 bytes are b0, b1, b2, b3, b4, b5, b6, b7 (from right to left). Then using bitwise arithmetic, we have
n = (b7 << 56) | (b6 << 48) | (b5 << 40) | (b4 << 32) | (b3 << 24) | (b2 << 16) | (b1 << 8) | (b0 << 0)
  = b7*2^56 + b6*2^48 + b5*2^40 + b4*2^32 + b3*2^24 + b2*2^16 + b1*2^8 + b0*2^0
  = b7*2^(8*7) + b6*2^(8*6) + b5*2^(8*5) + b4*2^(8*4) + b3*2^(8*3) + b2*2^(8*2) + b1*2^(8*1) + b0*2^(8*0)
Substituting R = 2^8 = 256 we have this "base 256" representation:
n = b7*R^7 + b6*R^6 + b5*R^5 + b4*R^4 + b3*R^3 + b2*R^2 + b1*R^1 + b0*R^0
which can be represented by the string with 8 characters from the extended ASCII alphabet b7 b6 b5 b4 b3 b2 b1 b0. So we have a 1-1 correspondence between 8-character EXTENDED_ASCII strings and uint64_t integers. In a completely analogous manner we can produce a 1-1 correspondence between 4-character UNICODE16 strings and uint64_t integers:
n = w3*R^3 + w2*R^2 + w1*R^1 + w0*R^0
where R = 2^16 and w0, w1, w2, w3 are 16-bit words (corresponding to UNICODE16 "digits").
One reminder here: The index values for characters in a string run in the opposite direction from significance. The string representation of the decimal number 952 is s[0] = '9', s[1] = '5', s[2] = '2', so that (reading each character as its digit value, with R = 10) 952 = s[0]*R^2 + s[1]*R^1 + s[2]*R^0
Similarly, our EXTENDED_ASCII representation of the 64-bit number above would have the string representation b7 b6 b5 b4 b3 b2 b1 b0, or s[i] = b(7-i).
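The 1-1 correspondence between 8-character EXTENDED_ASCII strings and 64-bit integers can be sketched as follows. The function names packBase256 and unpackBase256 are hypothetical, chosen for illustration; the sketch assumes the convention s[i] = b(7-i), so s[0] is the most significant "digit".

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Pack an 8-character string into a 64-bit integer, treating s[0] as the
// most significant base-256 digit (s[i] = b(7-i)).
std::uint64_t packBase256(const std::string& s) {
    assert(s.size() == 8);
    std::uint64_t n = 0;
    for (unsigned char c : s) n = (n << 8) | c;  // Horner's rule: n = n*256 + digit
    return n;
}

// Inverse mapping: recover the 8 characters from the integer.
std::string unpackBase256(std::uint64_t n) {
    std::string s(8, '\0');
    for (int i = 7; i >= 0; --i) {
        s[i] = static_cast<char>(n & 0xFF);  // least significant byte goes to s[7]
        n >>= 8;
    }
    return s;
}
```

Because s[0] is the most significant digit, comparing two packed integers with the numerical < gives the same answer as comparing the two 8-character strings in dictionary order by character code.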
Looking at the possibilities for arbitrary length strings, the immensity of what can be represented symbolically (rather than numerically) is eye-popping.
The order operator< defined for strings implements "dictionary" or lexicographical order, based on the numerical index order of the characters. In the worst case, x < y will entail examining all of the characters in the two strings x and y, an Ω(W) operation (where W is the length of the strings). One of the general inquiries in this chapter is to investigate whether efficiencies can be gained in sorting and searching collections of string keys by using the constant-time order operator on the character set itself. We look at three sort algorithms adapted specifically to strings.
LSD string sort: assume that the N keys to be sorted are strings, all of the same length W, and that the alphabet has size R. (The default we are accustomed to is R = 256, where strings consist of extended ASCII characters.) Note that we can sort a collection of characters using counting sort.
The general idea for LSD string sort is to apply counting sort on the characters at each position, beginning with the "least significant" (highest index = W-1) character and working up to the "most significant" (smallest index = 0) character. Because counting sort is stable, the final result will be a sort of all of the keys.
This is exactly what we did implementing byte_sort, except that we were sorting actual numbers, which consisted of at most 8 bytes, so our byte_sort looked like this:
mask = 255;
for i = 0 ... 7
    apply counting sort to (keys & mask);
    mask = mask << 8;
In other words we just apply counting sort to the right-most byte, then shift over 8 bits, and repeat until we have moved the mask all the way to the left-most byte.
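A runnable sketch of this loop is given here. It is not the course's byte_sort implementation; the function name byteSort is hypothetical, and instead of moving a mask to the left it shifts each key to the right by 8*i bits and masks with 255, which selects the same byte. Each pass is a stable counting sort on one byte.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a byte-at-a-time LSD sort on 64-bit keys (8 passes, R = 256).
void byteSort(std::vector<std::uint64_t>& keys) {
    std::vector<std::uint64_t> aux(keys.size());
    for (unsigned shift = 0; shift < 64; shift += 8) {  // right-most byte first
        std::array<std::size_t, 257> count{};           // zero-initialized
        // Count how often each byte value occurs at this position.
        for (std::uint64_t k : keys) ++count[((k >> shift) & 0xFF) + 1];
        // Turn frequencies into starting positions.
        for (std::size_t r = 0; r < 256; ++r) count[r + 1] += count[r];
        // Distribute in input order, which keeps the pass stable.
        for (std::uint64_t k : keys) aux[count[(k >> shift) & 0xFF]++] = k;
        keys.swap(aux);
    }
}
```

Note that all 8 passes run regardless of the data, which is the "unconditional execution of loops" cost of this family of sorts.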
If we think of a string as a "base R" number, LSD sort does the same thing to string keys, except that we have much bigger "numbers" that cannot be represented numerically in the machine, so we must maintain the symbolic representation as strings of "digits".
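For string keys, the same idea can be sketched as below. This is a minimal illustration rather than a tuned implementation: the name lsdSort is hypothetical, R is fixed at 256, and every key is assumed to have exactly W characters (checked with an assertion).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// LSD string sort sketch for keys that all have length W, with R = 256.
void lsdSort(std::vector<std::string>& keys, std::size_t W) {
    constexpr std::size_t R = 256;
    std::vector<std::string> aux(keys.size());
    for (std::size_t d = W; d-- > 0; ) {             // d = W-1, ..., 1, 0
        std::array<std::size_t, R + 1> count{};      // zero-initialized
        for (const auto& s : keys) {
            assert(s.size() == W);
            ++count[static_cast<unsigned char>(s[d]) + 1];  // frequencies
        }
        for (std::size_t r = 0; r < R; ++r) count[r + 1] += count[r];  // start positions
        for (auto& s : keys) {                       // stable distribution
            aux[count[static_cast<unsigned char>(s[d])]++] = std::move(s);
        }
        keys.swap(aux);
    }
}
```

The loop body is one counting sort pass costing Θ(N+R), and it runs exactly W times, whatever the keys look like.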
Variable Length Strings
The algorithm is not well suited to collections of strings of varying length. It can be adapted, but a lot of energy may be wasted comparing low-significance characters to padding or dummy extensions of shorter strings.
Low-significance characters may never be needed
Considering strings from the least significant character first means that
in many cases energy will be spent on ultimately irrelevant comparisons
between characters that won't affect the final sorted order. For example,
the 3 strings
BFBOVOJQWBVFQOQBOQBVOJQBFVJQBVOQ
ANQONVOWNOWNBJWNBONVOQWNVOWNVJNO
CNEOBNONWJONJONVOENVJTNBWNBTOWNJ
can be placed in sorted order by looking only at the leading (most significant)
character. Applying LSD string sort would permute the three strings for each
character starting at the right character and going all the way to the left, and only the
very last permutation, taking place on the left-most character, is meaningful.
A general conclusion is that LSD string sort is best suited to situations where the strings tend to have long prefixes in common, so that the less significant characters become relevant in the sort.
Unconditional execution of loops
Much like the numerical "little brother" algorithm byte_sort, so much energy is used running fixed-length loops that the Θ(N) algorithm often takes second place to optimized generic Θ(N log N) comparison sorts.
You can refresh your perspective by experimenting with area51/sortspy.x. Note that bit_sort is beaten by several generic sorts for most data. Note also that counting_sort is very fast, but it becomes impractical when the maximum "spread" of individual number values in the data is too large. This is due to the locally declared array of size k = 1 + max_spread in the implementation of counting_sort (which has k as a parameter). Bit_sort, byte_sort, word_sort, and LSD string sort all get around this limitation by considering the data one component [bit, byte, word, or character] at a time and looping through the components from least to most significant. Byte_sort is exactly LSD string sort on 8-character extended ASCII strings. Word_sort is exactly LSD string sort on 4-character UNICODE16 strings.
Note that the runtimes for byte_sort are doubled when data is processed using variables of type uint64_t, even though the input data is restricted to be bounded by UINT32_MAX. For strings, interpret this as having to sort at all character positions even when all of the strings have the same 4-character prefix.
LSD string sort thus runs in time proportional to (the number N of strings plus the size R of the alphabet) times the number W of characters in each string, because each counting_sort pass is Θ(N+R): Θ((N+R)*W). Note also that the space overhead is the number of keys plus the size of the alphabet: Θ(N+R).
The LSD approach can be adapted to string keys of varying length, but it is awkward and may be inefficient. For example, suppose we had many keys of length 6 and one of length 100. Then we would end up with 100 applications of counting_sort, most of which are unnecessary.
MSD (most significant digit first) uses a recursive approach: an application of counting_sort to the first (left-most = most significant) character organizes the array of keys into subarrays, one for each value of the leading character; then a recursive call on each of these subarrays completes the sort of the array.
Note that after each application of counting_sort there are R recursive calls, where R is the size of the alphabet. The (maximum possible) depth of the recursions is the string length W.
Sorting an array of strings (objects with string keys) starting with the left-most character can quickly sort most of the elements after just a few characters. Think of, say, 100,000 elements to be sorted. After the first character is sorted, we have 256 distinct subarrays of size (on average) 100,000/256 ~ 400. After these subarrays are sorted using the second character we have approximately 250 * 250 = 62,500 subarrays of average size 400/256 ~ 1.6. Thus we are very quickly thrashing in the weeds of recursive calls.
We saw in our general study of sort algorithms that, for the recursive sorts (merge_sort, quick_sort) having a size cut-off to insertion_sort results in significant improvement in runtime, compared to the pure recursive algorithm. That effect is even more dramatic for MSD string sort, due to the rapid decrease in size and increase in number of subarrays. Empirical studies have shown a 10-fold improvement using the cut-off to insertion_sort, optimizing at around subarray size = 10.
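The recursive structure and the cut-off can be sketched as follows. This is an illustration in the style of the Sedgewick/Wayne presentation, not a definitive implementation. The names msdSort, msdSortRange, msdInsertionSort, and msdCharAt are hypothetical, R is fixed at 256, the cut-off is set to 10 as suggested above, and a string that has no character at position d is treated as having the value -1 there, so that shorter strings come before longer strings with the same prefix.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kR = 256;      // alphabet size (extended ASCII)
constexpr std::size_t kCutoff = 10;  // at or below this size, use insertion sort

// Character code at position d, or -1 when s has no character there.
inline int msdCharAt(const std::string& s, std::size_t d) {
    return d < s.size() ? static_cast<unsigned char>(s[d]) : -1;
}

// Insertion sort on a[lo, hi), comparing only from position d onward.
// Precondition: every key in the range has length >= d.
inline void msdInsertionSort(std::vector<std::string>& a, std::size_t lo,
                             std::size_t hi, std::size_t d) {
    for (std::size_t i = lo + 1; i < hi; ++i)
        for (std::size_t j = i;
             j > lo && a[j].compare(d, std::string::npos, a[j - 1], d, std::string::npos) < 0;
             --j)
            std::swap(a[j], a[j - 1]);
}

// Sort a[lo, hi), assuming all keys in the range agree on their first d characters.
void msdSortRange(std::vector<std::string>& a, std::vector<std::string>& aux,
                  std::size_t lo, std::size_t hi, std::size_t d) {
    if (hi - lo <= kCutoff) { msdInsertionSort(a, lo, hi, d); return; }
    std::array<std::size_t, kR + 2> count{};  // slot c+2 counts character value c
    for (std::size_t i = lo; i < hi; ++i) ++count[msdCharAt(a[i], d) + 2];
    for (std::size_t r = 0; r <= kR; ++r) count[r + 1] += count[r];
    for (std::size_t i = lo; i < hi; ++i) {   // stable distribution into aux
        const int c = msdCharAt(a[i], d);
        aux[count[c + 1]++] = std::move(a[i]);
    }
    for (std::size_t i = lo; i < hi; ++i) a[i] = std::move(aux[i - lo]);
    // Now count[r] .. count[r+1] bounds the keys whose character at d is r.
    // Keys that ended at d sit before count[0] and need no further work.
    for (std::size_t r = 0; r < kR; ++r)
        msdSortRange(a, aux, lo + count[r], lo + count[r + 1], d + 1);
}

void msdSort(std::vector<std::string>& a) {
    std::vector<std::string> aux(a.size());
    msdSortRange(a, aux, 0, a.size(), 0);
}
```

Without the first line of msdSortRange, every one of the R recursive calls would allocate and scan its own count array even for subarrays of size 0 or 1, which is the "thrashing" described above.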
Equal keys
If a substring occurs in the set of keys, long enough so that the cutoff for small subarrays does not apply, then a recursive call is needed for every character in all of the equal keys. Also, counting_sort is not an efficient way to determine that the characters are all equal - the index count array must be created and values counted, only to discover at the end that all counts are for one value. The worst case is when all keys are equal, but a good approximation to the worst case occurs when large numbers of keys have a long common prefix.
The time & space costs for MSD are not as simple to calculate as those of LSD, due to the recursive nature of the algorithm and to the variability due to characteristics of the set of strings being sorted. The following propositions can be proved [from Sedgewick/Wayne]. Here, N is the number of strings to be sorted, R is the number of characters in the alphabet, W is the maximum length of the strings, and w is the average length of the strings.
Proposition. To sort N random strings from an R-character
alphabet, MSD string sort examines about N log_R N characters,
on average.
Proposition. MSD string sort uses between 8N + 3R and
~7wN + 3WR array accesses to sort N strings from
an R-character alphabet, where w is the average string
length.
Proposition. To sort N strings taken from
an R-character alphabet, the amount of space needed by MSD string sort
is proportional to RW + N in the worst case.
This version of string sort is modelled on 3-way quick sort. The idea is to adapt quick_sort_3w to apply to the leading (left-most, index 0) character in the vector of strings, using the Alphabet::operator<. This will then re-organize the strings into three ranges: those with leading character less than the pivot character, those with leading character equal to the pivot character, and those with leading character greater than the pivot character. Then apply the same algorithm recursively to each of these three sub-ranges, with the middle range considering the second character instead of the first.
The following example illustrates the process:
NEON BNJWOW ABNGRW NVNP ABNGRW BNJWOW BNJWOW GOBJNO GOBJNO DNIW NBKPN MJYR MJYR GOBJNO ABNGRW DNIW DNIW MJYR MER GOBJNO MER MER MER MJYR MJYR NEONNEON NBKPN DNIW NVNP NEONNEON NO NBKPN NVNP NO WNGO NO NO NVNP SNTP WNGO SNTP MER SNTP WNGO
The three ranges are color coded blue, red, and green, with the red color omitted from the first letter in the middle range. The above illustrates a run to completion. But of course the algorithm does not proceed left to right uniformly in the illustration; rather, recursive calls are made first on the blue range, then the red range, then the green range. The illustration terminates the process when the range size is <= 1. In the actual implementation a cutoff to insertion sort should happen when the range size is small, but higher than 1. The illustration also ignores possible permutations within elements making up the three ranges.
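The three-way partitioning on a single character position can be sketched as follows. Like the illustration, this sketch stops when the range size is <= 1 and omits the cutoff to insertion sort. The names quick3String and q3CharAt are hypothetical, the first key of the range is used as the pivot (an assumption; other pivot choices are possible), and a string with no character at position d is treated as having value -1 there.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Character code at position d, or -1 when s has no character there.
inline int q3CharAt(const std::string& s, std::size_t d) {
    return d < s.size() ? static_cast<unsigned char>(s[d]) : -1;
}

// Sort a[lo, hi), assuming all keys in the range agree on their first d characters.
void quick3String(std::vector<std::string>& a, std::size_t lo, std::size_t hi, std::size_t d) {
    if (hi - lo <= 1) return;
    const int pivot = q3CharAt(a[lo], d);
    // Invariant: a[lo, lt) < pivot, a[lt, i) == pivot, a[gt, hi) > pivot (at position d).
    std::size_t lt = lo, i = lo + 1, gt = hi;
    while (i < gt) {
        const int c = q3CharAt(a[i], d);
        if (c < pivot)      std::swap(a[lt++], a[i++]);
        else if (c > pivot) std::swap(a[i], a[--gt]);
        else                ++i;
    }
    quick3String(a, lo, lt, d);                      // "less" range: same character position
    if (pivot >= 0) quick3String(a, lt, gt, d + 1);  // "equal" range: move to next character
    quick3String(a, gt, hi, d);                      // "greater" range: same character position
}
```

A call such as quick3String(keys, 0, keys.size(), 0) sorts the whole vector. Only the middle range advances to the next character, so keys with long common prefixes are handled without re-examining the shared characters in the other two ranges.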
Implement the three string sorts discussed above: LSD, MSD, and QS3w, applying the optimizations discussed. Using collections of strings of various data characteristics, test these algorithms against the optimized generic sorts. The goal is to find recommendations of sorts to use, by data characteristic.