Version 09/03/2018

Strings and String Algorithms

1 Strings

Given an alphabet of symbols, a string over that alphabet is defined by this grammar:

s -> xs, for any x in the alphabet
s -> ε

Generally we denote by R the size of an alphabet. R comes from "radix", an alternative term for "base" as in "base 10 numbers". Both R and lgR, its base 2 logarithm rounded up to an integer (the number of bits needed to represent the index of a character), are important characteristics of strings.

1.1 Alphabets

These are alphabets that are likely to be in use in various IT contexts:

name            R      lgR  characters
BINARY          2      1    01
DNA             4      2    ACTG
OCTAL           8      3    01234567
DECIMAL         10     4    0123456789
HEXADECIMAL     16     4    0123456789ABCDEF
PROTEIN         20     5    ACDEFGHIKLMNPQRSTVWY
LOWERCASE       26     5    abcdefghijklmnopqrstuvwxyz
UPPERCASE       26     5    ABCDEFGHIJKLMNOPQRSTUVWXYZ
ALPHANUM        62     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
BASE64          64     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
ASCII           128    7    ASCII characters
EXTENDED_ASCII  256    8    extended ASCII characters (upper range = "control characters")
UNICODE16       65536  16   Unicode characters { examples: a (Latin), Ä (Germanic), ǩ (Polish), Ӝ (Cyrillic), (math), ݗ (Arabic), (Gujarati), (unified Canadian Aboriginal), (CJK unified ideographs), (Euro currency), (heart), 😉 (winking face), (mountain sunset) }
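
The lgR column can be reproduced from R by counting how many bits are needed to cover the indices 0 through R-1. The following is a minimal sketch, not part of the original notes; the free function name lgR is chosen here for illustration, and it assumes R >= 1.

```cpp
#include <cstdint>

// Number of bits needed to represent any index in [0, R), i.e. ceil(log2(R)).
// Assumption: R >= 1. A one-symbol alphabet needs 0 bits under this definition.
unsigned lgR(std::uint64_t R)
{
  unsigned bits = 0;
  // Grow the bit count until 2^bits is at least R.
  while (bits < 64 && (std::uint64_t{1} << bits) < R)
    ++bits;
  return bits;
}
```

For instance, lgR(10) evaluates to 4 and lgR(26) to 5, matching the DECIMAL and LOWERCASE rows of the table.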

Note that UNICODE16 will likely take over as the alphabet of choice in many applications. It represents the painted characters of ancient oriental languages as well as the postmodern digital characters that are becoming commonplace in human communications, such as emoticons, "hearts", "smiley faces", and the "euro" currency symbol. In many ways the digital revolution is enabling an expansion of alphabets toward the ancient painted ones, coming full circle.

Here is a good video on the history and status of Unicode in C++: James McNellis's talk from C++Now 2014.

Q: when will the first child be given a name spelled in post-modern UNICODE16? It's coming! If one of you does that, please let me know.

To use alphabets (and strings) in computing we need some attributes, such as those captured by this pseudo-class:

class Alphabet
{
  public:
           Alphabet  (String s)       // build alphabet from the characters in s
    char   toChar    (uint i)         // index to character
    uint   toIndx    (char c)         // character to index
    bool   contains  (char c)         // true iff c is in alphabet
    uint   R         ()               // number of characters in alphabet (radix)
    uint   lgR       ()               // bits required to represent an index
    uint[] toIndices (String s)       // converts string s to its array of indices (base R digits)
    String toChars   (uint[] indices) // converts an array of base R digits to a string in the alphabet
};
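
One way the pseudo-class might be realized in C++ is sketched below. This is an illustration under stated assumptions, not the course's reference implementation: it handles only single-byte characters (so it covers alphabets up to EXTENDED_ASCII, not UNICODE16), it gives a repeated character the index of its first occurrence, and the private members chars_ and index_ are names invented here.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Sketch of the Alphabet pseudo-class for 8-bit characters.
class Alphabet
{
public:
  // Build the alphabet from the characters in s, in order of first appearance.
  explicit Alphabet(const std::string& s) : chars_(), index_(256, -1)
  {
    for (char c : s)
    {
      auto u = static_cast<unsigned char>(c);
      if (index_[u] < 0)  // skip characters already seen
      {
        index_[u] = static_cast<int>(chars_.size());
        chars_.push_back(c);
      }
    }
  }

  // Index to character; throws std::out_of_range if i >= R().
  char toChar(unsigned i) const { return chars_.at(i); }

  // Character to index; throws if c is not in the alphabet.
  unsigned toIndx(char c) const
  {
    int i = index_[static_cast<unsigned char>(c)];
    if (i < 0) throw std::invalid_argument("character not in alphabet");
    return static_cast<unsigned>(i);
  }

  bool contains(char c) const { return index_[static_cast<unsigned char>(c)] >= 0; }

  unsigned R() const { return static_cast<unsigned>(chars_.size()); }

  // Bits required to represent an index: smallest b with 2^b >= R.
  unsigned lgR() const
  {
    unsigned bits = 0;
    while ((1ULL << bits) < chars_.size()) ++bits;
    return bits;
  }

  // String to sequence of base R digits.
  std::vector<unsigned> toIndices(const std::string& s) const
  {
    std::vector<unsigned> out;
    out.reserve(s.size());
    for (char c : s) out.push_back(toIndx(c));
    return out;
  }

  // Sequence of base R digits back to a string.
  std::string toChars(const std::vector<unsigned>& indices) const
  {
    std::string out;
    out.reserve(indices.size());
    for (unsigned i : indices) out.push_back(toChar(i));
    return out;
  }

private:
  std::string      chars_;  // index -> character
  std::vector<int> index_;  // character code -> index, or -1 if absent
};
```

With Alphabet dna("ACTG"), dna.R() is 4, dna.lgR() is 2, and dna.toChars(dna.toIndices("GATTACA")) gives back "GATTACA". The 256-entry lookup table trades a little memory for constant-time toIndx and contains.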

As an example, consider the C-string "cop4531". This is represented in C/C++ as the character array

[c,o,p,4,5,3,1,\0]

The toIndices array of "cop4531" (which does not include the terminating \0) is:

[99,111,112,52,53,51,49]

using the standard mapping between integers and ASCII symbols.
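
For the ASCII case this mapping is just the numeric value of each character. A small sketch follows, assuming the platform's execution character set is ASCII-compatible; the function name asciiIndices is hypothetical.

```cpp
#include <string>
#include <vector>

// Map each character of s to its numeric code (its index in EXTENDED_ASCII).
// The cast through unsigned char keeps codes in 0..255 even where char is signed.
// The terminating '\0' is not part of std::string's size, so it is not emitted.
std::vector<unsigned> asciiIndices(const std::string& s)
{
  std::vector<unsigned> out;
  out.reserve(s.size());
  for (char c : s)
    out.push_back(static_cast<unsigned char>(c));
  return out;
}
```

Calling asciiIndices("cop4531") yields the seven values listed above.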

1.2 Strings as base R numbers

Using the same notation as above, R = number of characters in the alphabet, a string can be thought to represent a "base R" number. For example, let n be an unsigned long, that is, a 64-bit number. n is stored in a 64-bit register, which has 8 bytes. Let's say these 8 bytes are b0, b1, b2, b3, b4, b5, b6, b7 (from right to left). Then using bitwise arithmetic, we have

n = (b7 << 56) | (b6 << 48) | (b5 << 40) | (b4 << 32) | (b3 << 24) | (b2 << 16) | (b1 << 8) | (b0 << 0)
  =  b7*2^56 + b6*2^48 + b5*2^40 + b4*2^32 + b3*2^24 + b2*2^16 + b1*2^8 + b0*2^0
  =  b7*2^(8*7) + b6*2^(8*6) + b5*2^(8*5) + b4*2^(8*4) + b3*2^(8*3) + b2*2^(8*2) + b1*2^(8*1) + b0*2^(8*0)

Substituting R = 2^8 = 256 we have this "base 256" representation:

n = b7*R^7 + b6*R^6 + b5*R^5 + b4*R^4 + b3*R^3 + b2*R^2 + b1*R^1 + b0*R^0

which can be represented by the string of 8 characters from the EXTENDED_ASCII alphabet b7 b6 b5 b4 b3 b2 b1 b0. So we have a 1-1 correspondence between 8-character EXTENDED_ASCII strings and uint64_t integers. In a completely analogous manner we can produce a 1-1 correspondence between 4-character UNICODE16 strings and uint64_t integers:
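
The byte-level correspondence can be checked in code. The sketch below packs an 8-character string into a std::uint64_t with the first character as the most significant byte, and unpacks it again. The names pack8 and unpack8 are hypothetical, and the sketch rejects strings that are not exactly 8 characters long.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Pack an 8-character string into a 64-bit integer, s[0] most significant.
std::uint64_t pack8(const std::string& s)
{
  if (s.size() != 8) throw std::invalid_argument("need exactly 8 characters");
  std::uint64_t n = 0;
  for (char c : s)
    n = (n << 8) | static_cast<unsigned char>(c);  // n = n*256 + next digit
  return n;
}

// Inverse of pack8: recover the 8 characters from the integer.
std::string unpack8(std::uint64_t n)
{
  std::string s(8, '\0');
  for (int i = 7; i >= 0; --i)
  {
    // The lowest byte of n is the last (least significant) character.
    s[i] = static_cast<char>(static_cast<unsigned char>(n & 0xFF));
    n >>= 8;
  }
  return s;
}
```

Because unpack8(pack8(s)) returns s for every 8-character s, and pack8(unpack8(n)) returns n for every n, the two functions witness the 1-1 correspondence.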

n = w3*R^3 + w2*R^2 + w1*R^1 + w0*R^0

where R = 2^16 and w0, w1, w2, w3 are 16-bit words (corresponding to UNICODE16 "digits").
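
The 16-bit version looks the same with a wider shift. This sketch assumes each UNICODE16 "digit" is stored as one char16_t code unit of a std::u16string; the name pack4 is hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Pack four 16-bit code units into a 64-bit integer, w[0] most significant.
std::uint64_t pack4(const std::u16string& w)
{
  if (w.size() != 4) throw std::invalid_argument("need exactly 4 code units");
  std::uint64_t n = 0;
  for (char16_t unit : w)
    n = (n << 16) | unit;  // n = n*65536 + next digit
  return n;
}
```

For example, the four code units 'A', 'B', 'C', 'D' (0x0041 through 0x0044) pack to 0x0041004200430044.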

One reminder here: The index values for characters in a string run in the opposite direction from significance. The string representation of the decimal number 952 is s[0] = '9', s[1] = '5', s[2] = '2' so that, with R = 10 and each s[i] read as its digit value,

952 = s[0]*R^2 + s[1]*R^1 + s[2]*R^0

Similarly, our EXTENDED_ASCII representation of the 64-bit number above would have the string representation b7 b6 b5 b4 b3 b2 b1 b0, or s[i] = b(7-i).
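
Since s[0] is the most significant digit, a string can be evaluated left to right by repeatedly multiplying the running value by R and adding the next digit (Horner's rule, which computes the same sum as the positional formula). The sketch below does this for decimal strings; the name decimalValue is hypothetical, and it assumes the value fits in 64 bits, with no overflow check.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Evaluate a decimal digit string; s[0] is the most significant digit.
// Assumption: the result fits in 64 bits (overflow is not detected).
std::uint64_t decimalValue(const std::string& s)
{
  const std::uint64_t R = 10;
  std::uint64_t n = 0;
  for (char c : s)
  {
    if (c < '0' || c > '9') throw std::invalid_argument("not a decimal digit");
    n = n * R + static_cast<std::uint64_t>(c - '0');  // shift left one digit, add next
  }
  return n;
}
```

Here decimalValue("952") steps through 9, 95, 952, so the character at the lowest index ends up multiplied by the highest power of R.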

Looking at the possibilities for arbitrary length strings, the immensity of what can be represented symbolically (rather than numerically) is eye-popping.