Strings

1 Strings

Given an alphabet of symbols, a string over that alphabet is defined by this grammar:

s -> xs, for any x in the alphabet
s -> ε

Generally we denote by R the size of an alphabet. This comes from "radix", an alternative term for "base" as in "base 10 numbers". Both R and its base 2 logarithm rounded up to an integer, lgR = ⌈log₂R⌉ (the number of bits needed to represent one character), are important characteristics of strings.

1.1 Alphabets

These are alphabets that are likely to be in use in various IT contexts:

name            R      lgR  characters
BINARY          2      1    01
DNA             4      2    ACTG
OCTAL           8      3    01234567
DECIMAL         10     4    0123456789
HEXADECIMAL     16     4    0123456789ABCDEF
PROTEIN         20     5    ACDEFGHIKLMNPQRSTVWY
LOWERCASE       26     5    abcdefghijklmnopqrstuvwxyz
UPPERCASE       26     5    ABCDEFGHIJKLMNOPQRSTUVWXYZ
ALPHANUM        62     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
BASE64          64     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
ASCII           128    7    ASCII characters
EXTENDED_ASCII  256    8    extended ASCII characters (upper range = "control characters")
UNICODE16       65536  16   Unicode characters { examples: a (Latin), Ä (Germanic), ǩ (Skolt Sami), Ӝ (Cyrillic), ∮ (math), ݗ (Arabic), ઊ (Gujarati), ᗖ (unified Canadian Aboriginal), 㗷 (CJK unified ideographs), € (Euro currency), ❤ (heart), ⛅ (sun behind cloud) }

Note that UNICODE16 will likely take over as the alphabet of choice in many applications. It represents the painted characters of ancient oriental languages as well as the postmodern digital characters that are becoming commonplace in human communications, such as emoticons, "hearts", "smiley faces", and the "euro" currency symbol. In many ways the digital revolution is enabling an expansion of alphabets toward the ancient painted ones, coming full circle.

Q: when will the first child be given a name spelled in post-modern UNICODE16? It's coming! If one of you does that, please let me know.

To use alphabets (and strings) in computing we need some attributes, such as those captured by this pseudo-class:

class Alphabet
{
  public:
           Alphabet  (String s)       // build alphabet from the characters in s
    char   toChar    (uint i)         // index to character
    uint   toIndx    (char c)         // character to index
    bool   contains  (char c)         // true iff c is in alphabet
    uint   R         ()               // number of characters in alphabet (radix)
    uint   lgR       ()               // bits required to represent an index
    uint[] toIndices (String s)       // converts string s to its array of base R digits (indices)
    String toChars   (uint[] indices) // converts an array of base R digits (indices) to a string in the alphabet
};

As an example, consider the C-string "cop4531". This is represented in C/C++ as the character array

[c,o,p,4,5,3,1,\0]

The toIndices array of "cop4531" is:

[99,111,112,52,53,51,49]

using the standard mapping between integers and ASCII symbols.
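
The pseudo-class above can be made concrete in several ways. The following is a minimal C++ sketch, not a reference implementation. It assumes 8-bit characters, assumes the constructor argument contains no repeated characters, and assigns indices in the order the characters appear in that argument; the private members chars_ and index_ are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch of the Alphabet pseudo-class for 8-bit characters.
// Assumption: index i is assigned to the i-th character of the constructor argument.
class Alphabet
{
public:
  explicit Alphabet(const std::string& s) : chars_(s), index_(256, -1)
  {
    for (std::size_t i = 0; i < s.size(); ++i)
      index_[static_cast<unsigned char>(s[i])] = static_cast<int>(i);
  }

  char toChar(unsigned i) const { return chars_.at(i); }       // index to character

  bool contains(char c) const                                  // true iff c is in alphabet
  { return index_[static_cast<unsigned char>(c)] >= 0; }

  unsigned toIndx(char c) const                                // character to index
  {
    assert(contains(c));
    return static_cast<unsigned>(index_[static_cast<unsigned char>(c)]);
  }

  unsigned R() const { return static_cast<unsigned>(chars_.size()); }

  unsigned lgR() const                                         // smallest b with 2^b >= R
  {
    unsigned bits = 0;
    for (unsigned long t = 1; t < R(); t *= 2) ++bits;
    return bits;
  }

  std::vector<unsigned> toIndices(const std::string& s) const  // string to base R digits
  {
    std::vector<unsigned> v;
    v.reserve(s.size());
    for (char c : s) v.push_back(toIndx(c));
    return v;
  }

  std::string toChars(const std::vector<unsigned>& indices) const  // base R digits to string
  {
    std::string s;
    for (unsigned i : indices) s.push_back(toChar(i));
    return s;
  }

private:
  std::string      chars_;  // chars_[i] is the character with index i
  std::vector<int> index_;  // index_[c] is the index of character c, or -1 if absent
};
```

With this sketch, Alphabet("ACTG") has R() == 4 and lgR() == 2, and an alphabet built from all 256 byte values in numerical order maps "cop4531" to the index array shown above.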

1.2 Strings as base R numbers

Using the same notation as above, R = number of characters in the alphabet, a string can be thought to represent a "base R" number. For example, let n be an unsigned long, that is, a 64-bit number. n is stored in a 64-bit register, which has 8 bytes. Let's say these 8 bytes are b0, b1, b2, b3, b4, b5, b6, b7 (from right to left). Then using bitwise arithmetic, we have

n = (b7 << 56) | (b6 << 48) | (b5 << 40) | (b4 << 32) | (b3 << 24) | (b2 << 16) | (b1 << 8) | (b0 << 0)
  =  b7*2⁵⁶ + b6*2⁴⁸ + b5*2⁴⁰ + b4*2³² + b3*2²⁴ + b2*2¹⁶ + b1*2⁸ + b0*2⁰
  =  b7*2^(8*7) + b6*2^(8*6) + b5*2^(8*5) + b4*2^(8*4) + b3*2^(8*3) + b2*2^(8*2) + b1*2^(8*1) + b0*2^(8*0)

Substituting R = 2⁸ = 256 we have this "base 256" representation:

n = b7*R⁷ + b6*R⁶ + b5*R⁵ + b4*R⁴ + b3*R³ + b2*R² + b1*R¹ + b0*R⁰

which can be represented by the string with 8 characters from the extended ASCII alphabet b7 b6 b5 b4 b3 b2 b1 b0. So we have a 1-1 correspondence between 8-character EXTENDED_ASCII strings and uint64_t integers. In a completely analogous manner we can produce a 1-1 correspondence between 4-character UNICODE16 strings and uint64_t integers:

n = w3*R³ + w2*R² + w1*R¹ + w0*R⁰

where R = 2¹⁶ and w0, w1, w2, w3 are 16-bit words (corresponding to UNICODE16 "digits").

One reminder here: The index values for characters in a string run in the opposite direction from significance. The string representation of the decimal number 952 is s[0] = '9', s[1] = '5', s[2] = '2' so that, with R = 10 and each s[i] read as its digit value,

952 = s[0]*R² + s[1]*R¹ + s[2]*R⁰

Similarly, our EXTENDED_ASCII representation of the 64-bit number above would have the string representation b7 b6 b5 b4 b3 b2 b1 b0, or s[i] = b(7-i).
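
The correspondence between 8-character EXTENDED_ASCII strings and 64-bit integers can be sketched in C++ as follows. The function names pack8 and unpack8 are invented for this illustration, and the sketch assumes the string has at least 8 characters (only the first 8 are used). Note how s[0] ends up as the most significant byte b7.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Pack the first 8 characters of s into a 64-bit number, treating each
// character as one base-256 digit; s[0] is the most significant digit (b7).
std::uint64_t pack8(const std::string& s)
{
  std::uint64_t n = 0;
  for (std::size_t i = 0; i < 8; ++i)
    n = (n << 8) | static_cast<unsigned char>(s.at(i));  // Horner's rule: n = n*R + digit
  return n;
}

// Inverse mapping: s[i] = b(7-i), the byte located 8*(7-i) bits from the right.
std::string unpack8(std::uint64_t n)
{
  std::string s(8, '\0');
  for (std::size_t i = 0; i < 8; ++i)
    s[i] = static_cast<char>((n >> (8 * (7 - i))) & 0xFF);
  return s;
}
```

Because s[0] is the most significant digit, comparing two packed values numerically gives the same answer as comparing the two 8-character strings lexicographically by unsigned character value.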

Looking at the possibilities for arbitrary length strings, the immensity of what can be represented symbolically (rather than numerically) is eye-popping.

2 String Sorts

The order operator< defined for strings implements "dictionary" or lexicographical order, based on the numerical Index order of the characters. In the worst case, x < y will entail examining all of the characters in the two strings x and y, an Ω(W) operation (where W is the length of the strings). One of the general inquiries in this chapter is to investigate whether efficiencies can be gained in sorting and searching collections of string keys by using the constant-time order operator on the character set itself. We look at three sort algorithms adapted specifically to strings.

2.1 LSD String Sort

LSD string sort: assume that the N keys to be sorted are strings, all of the same length W, and that the alphabet has size R. (The default we are accustomed to is R = 256, where strings consist of extended ASCII characters.) Note that we can sort a collection of characters using counting sort.

The general idea for LSD string sort is to apply counting sort on the characters at each position, beginning with the "least significant" (highest index = W-1) character and working up to the "most significant" (smallest index = 0) character. Because counting sort is stable, the final result will be a sort of all of the keys.

This is exactly what we did implementing byte_sort, except that we were sorting actual numbers, which consisted of at most 8 bytes, so our byte_sort looked like this:

mask = 255;
for i = 0 ... 7
  apply counting sort to (keys & mask);
  mask = mask << 8;

In other words we just apply counting sort to the right-most byte, then shift over 8 bits, and repeat until we have moved the mask all the way to the left-most byte.
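
As a reminder of how the pieces fit together, here is one possible C++ rendering of byte_sort for 64-bit keys. It is a sketch rather than the course implementation: it shifts each key to the right instead of shifting the mask to the left (the two are equivalent for selecting a byte), and it inlines the counting sort so that the stable distribution step is visible.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD byte sort for 64-bit keys: one stable counting sort per byte,
// from the right-most (least significant) byte to the left-most.
void byte_sort(std::vector<std::uint64_t>& keys)
{
  std::vector<std::uint64_t> aux(keys.size());
  for (unsigned shift = 0; shift < 64; shift += 8)
  {
    std::array<std::size_t, 257> count{};       // zero-initialized counts
    for (std::uint64_t k : keys)                // count occurrences of each byte value
      ++count[((k >> shift) & 0xFF) + 1];
    for (std::size_t r = 0; r < 256; ++r)       // convert counts to start positions
      count[r + 1] += count[r];
    for (std::uint64_t k : keys)                // distribute, preserving relative order
      aux[count[(k >> shift) & 0xFF]++] = k;
    keys.swap(aux);                             // keys now holds the result of this pass
  }
}
```

The loop always runs 8 passes, whatever the data, which is the "unconditional execution" cost discussed later for LSD string sort.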

If we think of a string as a "base R" number, LSD sort does the same thing to string keys, except that we have much bigger "numbers" that cannot be represented numerically in the machine. Instead, we must maintain the symbolic representation as strings of "digits".
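
A minimal C++ sketch of LSD string sort along these lines is given below. It assumes R = 256 and that every key has exactly W characters (checked with an assert); the name lsd_sort and its signature are chosen for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// LSD string sort for keys of equal length W over R = 256 (extended ASCII).
void lsd_sort(std::vector<std::string>& keys, std::size_t W)
{
  const std::size_t R = 256;
  std::vector<std::string> aux(keys.size());
  for (std::size_t d = W; d-- > 0; )            // d = W-1, W-2, ..., 0
  {
    std::vector<std::size_t> count(R + 1, 0);
    for (const std::string& s : keys)           // count occurrences of each character
    {
      assert(s.size() == W);                    // precondition: fixed-length keys
      ++count[static_cast<unsigned char>(s[d]) + 1];
    }
    for (std::size_t r = 0; r < R; ++r)         // convert counts to start positions
      count[r + 1] += count[r];
    for (std::string& s : keys)                 // stable distribution on character d
      aux[count[static_cast<unsigned char>(s[d])]++] = std::move(s);
    keys.swap(aux);
  }
}
```

Each pass is a counting sort on one character position, so the structure is the same as byte_sort with the byte mask replaced by the string index d.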

2.1.1 LSD Pitfalls

Variable Length Strings

The algorithm is not well suited to collections of strings of varying length. It can be adapted, but a lot of energy may be wasted comparing low-significance characters to padding or dummy extensions of shorter strings.

Low-significance characters may never be needed

Considering strings from the least significant character first means that in many cases energy will be spent on ultimately irrelevant comparisons between characters that won't affect the final sorted order. For example, the 3 strings
BFBOVOJQWBVFQOQBOQBVOJQBFVJQBVOQ
ANQONVOWNOWNBJWNBONVOQWNVOWNVJNO
CNEOBNONWJONJONVOENVJTNBWNBTOWNJ
can be placed in sorted order by looking only at the leading (most significant) character. Applying LSD string sort would permute the three strings for each character starting at the right character and going all the way to the left, and only the very last permutation, taking place on the left-most character, is meaningful.

A general conclusion from this can be that LSD string sort is best suited to situations where the strings tend to have long prefixes in common so that the less significant characters become relevant in the sort.

Unconditional execution of loops

Much like the numerical "little brother" algorithm byte_sort, so much energy is used running fixed-length loops that the Θ(N) algorithm often takes second place to optimized generic Θ(N log N) comparison sorts.

2.1.2 Bit, Byte, and Word Sorts

You can refresh your perspective by experimenting with area51/sortspy.x. Note that bit_sort is beaten by several generic sorts for most data. Note also that counting_sort is very fast, but it becomes impractical when the maximum "spread" of individual number values in the data is too large. This is due to the locally declared array of size k = 1 + max_spread in the implementation of counting_sort (which has k as a parameter). Bit_sort, byte_sort, word_sort, and LSD string sort all get around this limitation by considering the data one component [bit, byte, word, or character] at a time and looping through the components from least to most significant. Byte_sort is exactly LSD string sort on 8-character extended ASCII strings. Word_sort is exactly LSD string sort on 4-character UNICODE16 strings.

Note that the runtimes for byte_sort are doubled when data is processed using variables of type uint64_t, even though the input data is restricted to be bounded by UINT32_MAX. For strings, interpret this as having to sort at all character positions even when all of the strings have the same 4-character prefix.

2.1.3 LSD Cost Estimates

LSD string sort thus runs in time proportional to the number W of characters in the strings times the sum of the number N of strings and the size R of the alphabet (because counting_sort is Θ(N+R)): Θ((N+R)*W). Note also that the space overhead is proportional to the number of keys plus the size of the alphabet: Θ(N+R).

2.2 MSD String Sort

The LSD approach can be adapted to string keys of varying length, but it is awkward and may be inefficient. For example, suppose we had many keys of length 6 and one of length 100. Then we would end up with 100 applications of counting_sort, most of which are unnecessary.

MSD (most significant digit first) uses a recursive approach: an application of counting_sort to the first (left-most = most significant) character organizes the array of keys into subarrays, one for each value of the leading character; then a recursive call on each of these subarrays completes the sort of the array.

Note that after each application of counting_sort there are R recursive calls, where R is the size of the alphabet. The (maximum possible) depth of the recursions is the string length W.
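
The recursive structure can be sketched in C++ as follows. This is a bare version without the optimizations discussed in the pitfalls section, written under these assumptions: R = 256, keys may have different lengths, and a key that has run out of characters is treated as having the character value -1 at that position, so it sorts ahead of longer keys with the same prefix. The names msd_sort, msd_detail, char_at, and sort_range are invented for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace msd_detail
{
  const int R = 256;

  // Character value at position d, or -1 if the key has no character there.
  inline int char_at(const std::string& s, std::size_t d)
  {
    return d < s.size() ? static_cast<unsigned char>(s[d]) : -1;
  }

  // Sort keys[lo, hi), assuming all keys in the range agree on their first d characters.
  void sort_range(std::vector<std::string>& keys, std::vector<std::string>& aux,
                  std::size_t lo, std::size_t hi, std::size_t d)
  {
    if (hi - lo <= 1) return;
    std::vector<std::size_t> count(R + 2, 0);   // one extra slot for the value -1
    for (std::size_t i = lo; i < hi; ++i)       // count occurrences
      ++count[char_at(keys[i], d) + 2];
    for (int r = 0; r <= R; ++r)                // convert counts to start positions
      count[r + 1] += count[r];
    for (std::size_t i = lo; i < hi; ++i)       // stable distribution into aux
      aux[count[char_at(keys[i], d) + 1]++] = std::move(keys[i]);
    for (std::size_t i = lo; i < hi; ++i)       // copy back
      keys[i] = std::move(aux[i - lo]);
    // Now count[r] is the start of the subarray for character value r.
    // Keys that ended at position d (value -1) are all equal and need no recursion.
    for (int r = 0; r < R; ++r)                 // R recursive calls
      sort_range(keys, aux, lo + count[r], lo + count[r + 1], d + 1);
  }
}

void msd_sort(std::vector<std::string>& keys)
{
  std::vector<std::string> aux(keys.size());
  msd_detail::sort_range(keys, aux, 0, keys.size(), 0);
}
```

Note that every call allocates and scans a count array of size R + 2, even for a subarray of two keys; this is the overhead that motivates a cutoff for small subarrays.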

2.2.1 MSD Pitfalls

Small Subarrays

Sorting an array of strings (objects with string keys) starting with the left-most character can quickly sort most of the elements after just a few characters. Think of, say, 100,000 elements to be sorted. After the first character is sorted, we have 256 distinct subarrays of size (on average, assuming the characters are roughly uniformly distributed) 100,000/256 ~ 400. After these subarrays are sorted using the second character we have approximately 256 * 256 ~ 65,000 subarrays of average size 400/256 ~ 1.6. Thus we are very quickly thrashing in the weeds of recursive calls.

We saw in our general study of sort algorithms that, for the recursive sorts (merge_sort, quick_sort), having a size cut-off to insertion_sort results in significant improvement in runtime compared to the pure recursive algorithm. That effect is even more dramatic for MSD string sort, due to the rapid decrease in size and increase in number of subarrays. Empirical studies have shown a 10-fold improvement using the cut-off to insertion_sort, optimizing at around subarray size = 10.

Equal keys

If a substring occurs in the set of keys, long enough so that the cutoff for small subarrays does not apply, then a recursive call is needed for every character in all of the equal keys. Also, counting_sort is not an efficient way to determine that the characters are all equal - the index count array must be created and values counted, only to discover at the end that all counts are for one value. The worst case is when all keys are equal, but a good approximation to the worst case occurs when large numbers of keys have a long common prefix.
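
For the small-subarray cutoff, the insertion sort can skip the characters already known to be equal. A possible C++ sketch is shown below; the names less_from and insertion_sort_from are invented here, and the sketch assumes every key in the range has at least d characters, which holds for the subarrays on which a recursive MSD sort is called at depth d.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// True iff a < b when both strings are compared from character position d onward.
// Assumes d <= a.size() and d <= b.size().
inline bool less_from(const std::string& a, const std::string& b, std::size_t d)
{
  return a.compare(d, std::string::npos, b, d, std::string::npos) < 0;
}

// Insertion sort of keys[lo, hi), ignoring the first d characters of each key,
// which the caller guarantees are equal across the range.
void insertion_sort_from(std::vector<std::string>& keys,
                         std::size_t lo, std::size_t hi, std::size_t d)
{
  for (std::size_t i = lo + 1; i < hi; ++i)
    for (std::size_t j = i; j > lo && less_from(keys[j], keys[j - 1], d); --j)
      std::swap(keys[j], keys[j - 1]);
}
```

A recursive string sort would call insertion_sort_from(keys, lo, hi, d) and return whenever hi - lo falls at or below the chosen cutoff (around 10, per the empirical figure quoted above), instead of building another count array.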

2.2.2 MSD Cost Estimates

The time & space costs for MSD are not as simple to calculate as those of LSD, due to the recursive nature of the algorithm and to the variability due to characteristics of the set of strings being sorted. The following propositions can be proved [from Sedgewick/Wayne]. Here, N is the number of strings to be sorted, R is the number of characters in the alphabet, W is the maximum length of the strings, and w is the average length of the strings.

Proposition. To sort N random strings from an R-character alphabet, MSD string sort examines about N log_R N characters, on average.
Proposition. MSD string sort uses between 8N + 3R and ~7wN + 3WR array accesses to sort N strings from an R-character alphabet, where w is the average string length.
Proposition. To sort N strings taken from an R-character alphabet, the amount of space needed by MSD string sort is proportional to RW + N in the worst case.

2.3 Three-Way String QuickSort

This version of string sort is modelled on 3-way quick sort. The idea is to adapt quick_sort_3w to apply to the leading (left-most, index 0) character in the vector of strings, using the Alphabet::operator<. This will then re-organize the strings into three ranges: those with leading character less than the pivot character, those with leading character equal to the pivot character, and those with leading character greater than the pivot character. Then apply the same algorithm recursively to each of these three sub-ranges, with the middle range considering the second character instead of the first.

The following example illustrates the process:

NEON     BNJWOW     ABNGRW
NVNP     ABNGRW     BNJWOW
BNJWOW   GOBJNO     GOBJNO   DNIW  
NBKPN    MJYR       MJYR     GOBJNO 
ABNGRW   DNIW       DNIW     MJYR    MER   
GOBJNO   MER        MER      MER     MJYR  
MJYR     NEON       NBKPN
DNIW     NVNP       NEON
NO       NBKPN      NVNP     NO
WNGO     NO         NO       NVNP
SNTP     WNGO       SNTP
MER      SNTP       WNGO

The three ranges are color coded blue, red, and green, with the red color omitted from the first letter in the middle range. The above illustrates a run to completion. But of course the algorithm does not proceed left to right uniformly as in the illustration; rather, recursive calls are made first on the blue range, then the red range, then the green range. The illustration terminates the process when the range size is <= 1. In the actual implementation a cutoff to insertion sort should happen when the range size is small, but higher than 1. The illustration also ignores possible permutations within elements making up the three ranges.
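
The process can be sketched in C++ as below. This is a plain version with no cutoff to insertion sort, and it compares raw unsigned character values rather than going through an Alphabet object; both are simplifications made for this illustration. The names string_quick_sort_3w, qs3w_detail, char_at, and sort_range are invented here. As in the usual treatment of variable-length keys, a key with no character at position d is given the value -1.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace qs3w_detail
{
  // Character value at position d, or -1 if the key has no character there.
  inline int char_at(const std::string& s, std::size_t d)
  {
    return d < s.size() ? static_cast<unsigned char>(s[d]) : -1;
  }

  // Sort keys[lo, hi), assuming all keys in the range agree on their first d characters.
  void sort_range(std::vector<std::string>& keys,
                  std::size_t lo, std::size_t hi, std::size_t d)
  {
    if (hi - lo <= 1) return;
    const int pivot = char_at(keys[lo], d);
    std::size_t lt = lo, gt = hi, i = lo + 1;
    // Invariant: [lo,lt) < pivot, [lt,i) == pivot, [gt,hi) > pivot (at position d).
    while (i < gt)
    {
      const int c = char_at(keys[i], d);
      if      (c < pivot) std::swap(keys[lt++], keys[i++]);
      else if (c > pivot) std::swap(keys[i], keys[--gt]);
      else                ++i;
    }
    sort_range(keys, lo, lt, d);                    // "less" range, same character
    if (pivot >= 0) sort_range(keys, lt, gt, d + 1); // "equal" range, next character
    sort_range(keys, gt, hi, d);                    // "greater" range, same character
  }
}

void string_quick_sort_3w(std::vector<std::string>& keys)
{
  qs3w_detail::sort_range(keys, 0, keys.size(), 0);
}
```

Unlike MSD string sort, no count array is allocated per call, and a range of keys sharing a long prefix is handled by the middle recursion advancing one character at a time.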

String Sort Project

Implement the three string sorts discussed above: LSD, MSD, and QS3w, applying the optimizations discussed. Using collections of strings of various data characteristics, test these algorithms against the optimized generic sorts. The goal is to find recommendations of sorts to use, by data characteristic.
