Strings and String Algorithms

1 Strings

Given an alphabet of symbols, a string over that alphabet is defined by this grammar:
Generally we denote by R the size of an alphabet. R comes from "radix", an alternative term for "base" as in "base 10 numbers". Both R and its base 2 logarithm, (int) $\log_2 R$, are important characteristics of strings.

1.1 Alphabets

These are alphabets that are likely to be in use in various IT contexts:
Note that UNICODE16 will likely take over as the alphabet of choice in many applications. It represents the painted characters of ancient oriental languages as well as the postmodern digital characters that are becoming commonplace in human communications, such as emoticons, "hearts", "smiley faces", and the "euro" currency symbol. In many ways the digital revolution is enabling an expansion of alphabets toward the ancient painted ones, coming full circle. Here is a good video on the history and status of Unicode in C++: James McNellis's talk from C++Now 2014.

Q: When will the first child be given a name spelled in post-modern UNICODE16? It's coming! If one of you does that, please let me know.

To use alphabets (and strings) in computing we need some attributes, such as those captured by this pseudo-class:

    class Alphabet
    {
    public:
      Alphabet (String s)                 // build alphabet from the characters in s
      char   toChar    (uint i)           // index to character
      uint   toIndx    (char c)           // character to index
      bool   contains  (char c)           // true iff c is in alphabet
      uint   R         ()                 // number of characters in alphabet (radix)
      uint   lgR       ()                 // bits required to represent an index
      uint[] toIndices (String s)         // converts string s to base R integer
      String toChars   (uint[] indices)   // converts base R uint to string in the alphabet
    };

As an example, consider the C-string "cop4531". This is represented in C/C++ as the character array

    [c,o,p,4,5,3,1,\0]

and the toIndices array of "cop4531" is

    [99,111,112,52,53,51,49]

using the standard mapping between integers and ASCII symbols.

1.2 Strings as base R numbers

Using the same notation as above (R = number of characters in the alphabet), a string can be thought of as representing a "base R" number. For example, let n be an unsigned long, that is, a 64-bit number (on platforms where unsigned long is 64 bits wide; uint64_t makes the width explicit). n is stored in a 64-bit register, which has 8 bytes. Let's say these 8 bytes are b0, b1, b2, b3, b4, b5, b6, b7 (from right to left).
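To make the byte labeling concrete, here is a minimal C++ sketch that splits n into b0 through b7. The function name toBytes is hypothetical (it does not come from these notes), and std::uint64_t is assumed in place of unsigned long so that the 64-bit width is guaranteed.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: split a 64-bit value into its 8 bytes.
// bytes[k] holds b_k, where b_0 is the least significant (rightmost) byte.
std::array<std::uint8_t, 8> toBytes(std::uint64_t n)
{
    std::array<std::uint8_t, 8> bytes{};
    for (int k = 0; k < 8; ++k)
    {
        // Shift byte k down to the low 8 bits, then mask everything else off.
        bytes[k] = static_cast<std::uint8_t>((n >> (8 * k)) & 0xFF);
    }
    return bytes;
}
```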
Then using bitwise arithmetic, we have

    n = (b7 << 56) | (b6 << 48) | (b5 << 40) | (b4 << 32) | (b3 << 24) | (b2 << 16) | (b1 << 8) | (b0 << 0)

$$
\begin{aligned}
n &= b_7 \cdot 2^{56} + b_6 \cdot 2^{48} + b_5 \cdot 2^{40} + b_4 \cdot 2^{32} + b_3 \cdot 2^{24} + b_2 \cdot 2^{16} + b_1 \cdot 2^{8} + b_0 \cdot 2^{0} \\
  &= b_7 \cdot 2^{8 \cdot 7} + b_6 \cdot 2^{8 \cdot 6} + b_5 \cdot 2^{8 \cdot 5} + b_4 \cdot 2^{8 \cdot 4} + b_3 \cdot 2^{8 \cdot 3} + b_2 \cdot 2^{8 \cdot 2} + b_1 \cdot 2^{8 \cdot 1} + b_0 \cdot 2^{8 \cdot 0}
\end{aligned}
$$

Substituting $R = 2^8 = 256$ we have this "base 256" representation:

$$
n = b_7 R^7 + b_6 R^6 + b_5 R^5 + b_4 R^4 + b_3 R^3 + b_2 R^2 + b_1 R^1 + b_0 R^0
$$

which can be represented by the string with 8 characters from the extended ASCII alphabet b7 b6 b5 b4 b3 b2 b1 b0. So we have a 1-1 correspondence between 8-character EXTENDED_ASCII strings and uint64_t integers.

In a completely analogous manner we can produce a 1-1 correspondence between 4-character UNICODE16 strings and uint64_t integers:

$$
n = w_3 R^3 + w_2 R^2 + w_1 R^1 + w_0 R^0
$$

where $R = 2^{16}$ and w0, w1, w2, w3 are 16-bit words (corresponding to UNICODE16 "digits").

One reminder here: the index values for characters in a string run in the opposite direction from significance. The string representation of the decimal number 952 is

    s[0] = '9', s[1] = '5', s[2] = '2'

so that, reading each character as its digit value with R = 10,

$$
952 = s[0] \cdot R^2 + s[1] \cdot R^1 + s[2] \cdot R^0
$$

Similarly, our EXTENDED_ASCII representation of the 64-bit number above would have the string representation b7 b6 b5 b4 b3 b2 b1 b0, or $s[i] = b_{7-i}$.

Looking at the possibilities for arbitrary length strings, the immensity of what can be represented symbolically (rather than numerically) is eye-popping.
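The 1-1 correspondence between 8-character strings and 64-bit integers can be sketched in C++ as follows. This is an illustration under stated assumptions, not part of the Alphabet pseudo-class: the names toNumber and toString are hypothetical, characters are read as unsigned 8-bit values (R = 256), and s[0] is taken as the most significant digit, matching s[i] = b(7-i).

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: interpret an 8-character string as a base-256 number.
// s[0] is the most significant digit (b7), s[7] the least significant (b0).
std::uint64_t toNumber(const std::string& s)
{
    if (s.size() != 8)
    {
        throw std::invalid_argument("expected exactly 8 characters");
    }
    std::uint64_t n = 0;
    for (unsigned char c : s)   // unsigned char avoids sign extension
    {
        n = (n << 8) | c;       // Horner's rule: n = n*R + digit, with R = 256
    }
    return n;
}

// Hypothetical inverse: recover the 8-character string, s[i] = b(7-i).
std::string toString(std::uint64_t n)
{
    std::string s(8, '\0');
    for (int i = 0; i < 8; ++i)
    {
        // Byte b(7-i) sits 8*(7-i) bits above the low end of n.
        s[i] = static_cast<char>((n >> (8 * (7 - i))) & 0xFF);
    }
    return s;
}
```

For instance, toNumber("cop4531!") packs the ASCII codes 99, 111, 112, 52, 53, 51, 49, 33 into one integer, and toString applied to that integer should give back the original string. Accumulating with a shift per character is one reasonable design choice; summing explicit powers of R would work equally well.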