Version 09/03/2018

Strings and String Algorithms

1 Strings

Given an alphabet of symbols, a string over that alphabet is defined by this grammar:

s -> xs, for any x in the alphabet
s -> ε

Generally we denote by R the size of an alphabet. R comes from "radix", an alternative term for "base" as in "base 10 numbers". Both R and lgR = ⌈log₂R⌉, the base 2 logarithm of R rounded up to an integer (the number of bits needed to represent an index into the alphabet), are important characteristics of strings.

1.1 Alphabets

These are alphabets that are likely to be in use in various IT contexts:

name            R      lgR  characters
BINARY          2      1    01
DNA             4      2    ACTG
OCTAL           8      3    01234567
DECIMAL         10     4    0123456789
HEXADECIMAL     16     4    0123456789ABCDEF
PROTEIN         20     5    ACDEFGHIKLMNPQRSTVWY
LOWERCASE       26     5    abcdefghijklmnopqrstuvwxyz
UPPERCASE       26     5    ABCDEFGHIJKLMNOPQRSTUVWXYZ
ALPHANUM        62     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
BASE64          64     6    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
ASCII           128    7    ASCII characters
EXTENDED_ASCII  256    8    extended ASCII characters (upper range = "control characters")
UNICODE16       65536  16   Unicode characters { examples: a (Latin), Ä (Germanic), ǩ (Skolt Sami), Ӝ (Cyrillic), ∮ (math), ݗ (Arabic), ઊ (Gujarati), ᗖ (unified Canadian Aboriginal), 㗷 (CJK unified ideographs), € (Euro currency), ❤ (heart), 😉 (winking face), ⛅ (sun behind cloud) }
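The lgR column of the table can be reproduced with a few lines of C++. This is a minimal sketch, assuming lgR means the smallest number of bits b with 2^b >= R; the free function name lgR is chosen here for illustration and is not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Number of bits needed to index an alphabet of R symbols: ceil(log2(R)).
// Integer shifts are used to avoid floating-point rounding.
inline unsigned lgR(std::uint64_t R)
{
  assert(R >= 1);  // assumption: an alphabet has at least one symbol
  unsigned bits = 0;
  // find the smallest bits with 2^bits >= R (stop at 64 to avoid an invalid shift)
  while (bits < 64 && (std::uint64_t{1} << bits) < R)
    ++bits;
  return bits;
}
```

For example, lgR(10) is 4 because 2³ = 8 < 10 <= 16 = 2⁴, which matches the DECIMAL row.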

Note that UNICODE16 will likely take over as the alphabet of choice in many applications. It represents the painted characters of ancient oriental languages as well as the postmodern digital characters that are becoming commonplace in human communications, such as emoticons, "hearts", "smiley faces", and the "euro" currency symbol. In many ways the digital revolution is enabling an expansion of alphabets toward the ancient painted ones, coming full circle.

Here is a good video on history and status of Unicode in C++: James McNellis's talk from C++Now 2014

Q: when will the first child be given a name spelled in post-modern UNICODE16? It's coming! If one of you does that, please let me know.

To use alphabets (and strings) in computing we need some attributes, such as those captured by this pseudo-class:

class Alphabet
{
  public:
           Alphabet  (String s)       // build alphabet from the characters in s
    char   toChar    (uint i)         // index to character
    uint   toIndx    (char c)         // character to index
    bool   contains  (char c)         // true iff c is in alphabet
    uint   R         ()               // number of characters in alphabet (radix)
    uint   lgR       ()               // bits required to represent an index
    uint[] toIndices (String s)       // converts string s to its array of indices (base R digits)
    String toChars   (uint[] indices) // converts an array of indices (base R digits) to a string over the alphabet
};
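One way the pseudo-class might be realized in C++ is sketched below. It is only an illustration under stated assumptions: characters are 8-bit, the index of a character is its position in the constructor string, and that string contains no repeated characters. The private members chars_ and index_ are names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the Alphabet pseudo-class for alphabets of 8-bit characters.
class Alphabet
{
public:
  // Build the alphabet from the characters in s; position in s is the index.
  explicit Alphabet(const std::string& s) : chars_(s), index_(256, -1)
  {
    for (std::size_t i = 0; i < s.size(); ++i)
    {
      unsigned char c = static_cast<unsigned char>(s[i]);
      assert(index_[c] == -1);  // assumption: no repeated characters
      index_[c] = static_cast<int>(i);
    }
  }

  // index to character
  char toChar(unsigned i) const
  {
    assert(i < chars_.size());
    return chars_[i];
  }

  // true iff c is in the alphabet
  bool contains(char c) const
  {
    return index_[static_cast<unsigned char>(c)] != -1;
  }

  // character to index
  unsigned toIndx(char c) const
  {
    assert(contains(c));
    return static_cast<unsigned>(index_[static_cast<unsigned char>(c)]);
  }

  // number of characters in the alphabet (radix)
  unsigned R() const { return static_cast<unsigned>(chars_.size()); }

  // bits required to represent an index: smallest b with 2^b >= R
  unsigned lgR() const
  {
    unsigned bits = 0;
    while ((1u << bits) < R())
      ++bits;
    return bits;
  }

  // string to array of indices (base R digits)
  std::vector<unsigned> toIndices(const std::string& s) const
  {
    std::vector<unsigned> out;
    out.reserve(s.size());
    for (char c : s)
      out.push_back(toIndx(c));
    return out;
  }

  // array of indices (base R digits) to string
  std::string toChars(const std::vector<unsigned>& indices) const
  {
    std::string out;
    out.reserve(indices.size());
    for (unsigned i : indices)
      out.push_back(toChar(i));
    return out;
  }

private:
  std::string      chars_;  // index -> character
  std::vector<int> index_;  // character code -> index, or -1 if absent
};
```

With Alphabet dna("ACTG"), dna.R() is 4, dna.lgR() is 2, and dna.toIndices("GATTACA") yields the indices 3, 0, 2, 2, 0, 1, 0. The lookup table costs 256 entries per alphabet in exchange for constant-time toIndx and contains.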

As an example, consider the C-string "cop4531". This is represented in C/C++ as the character array

[c,o,p,4,5,3,1,\0]

The toIndices array of "cop4531" is:

[99,111,112,52,53,51,49]

using the standard mapping between integers and ASCII symbols.
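The index array above can be checked with a short C++ sketch. It assumes the ASCII alphabet, where the index of a character is simply its character code; the function name asciiIndices is made up for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Map each character of s to its index in the ASCII alphabet,
// which is just the character's code. The terminating \0 is not included.
inline std::vector<unsigned> asciiIndices(const std::string& s)
{
  std::vector<unsigned> out;
  out.reserve(s.size());
  for (char c : s)
  {
    unsigned code = static_cast<unsigned char>(c);
    assert(code < 128);  // assumption: input is plain 7-bit ASCII
    out.push_back(code);
  }
  return out;
}
```

Calling asciiIndices("cop4531") gives 99, 111, 112, 52, 53, 51, 49.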

1.2 Strings as base R numbers

Using the same notation as above, R = number of characters in the alphabet, a string can be thought to represent a "base R" number. For example, let n be an unsigned long, that is, a 64-bit number. n is stored in a 64-bit register, which has 8 bytes. Let's say these 8 bytes are b0, b1, b2, b3, b4, b5, b6, b7 (from right to left). Then using bitwise arithmetic, we have

n = (b7 << 56) | (b6 << 48) | (b5 << 40) | (b4 << 32) | (b3 << 24) | (b2 << 16) | (b1 << 8) | (b0 << 0)
  =  b7*2⁵⁶ + b6*2⁴⁸ + b5*2⁴⁰ + b4*2³² + b3*2²⁴ + b2*2¹⁶ + b1*2⁸ + b0*2⁰
  =  b7*2^(8*7) + b6*2^(8*6) + b5*2^(8*5) + b4*2^(8*4) + b3*2^(8*3) + b2*2^(8*2) + b1*2^(8*1) + b0*2^(8*0)

Substituting R = 2⁸ = 256 we have this "base 256" representation:

n = b7*R⁷ + b6*R⁶ + b5*R⁵ + b4*R⁴ + b3*R³ + b2*R² + b1*R¹ + b0*R⁰

which can be represented by the string with 8 characters from the EXTENDED_ASCII alphabet b7 b6 b5 b4 b3 b2 b1 b0. So we have a 1-1 correspondence between 8-character EXTENDED_ASCII strings and uint64_t integers. In a completely analogous manner we can produce a 1-1 correspondence between 4-character UNICODE16 strings and uint64_t integers:

n = w3*R³ + w2*R² + w1*R¹ + w0*R⁰

where R = 2¹⁶ and w0, w1, w2, w3 are 16-bit words (corresponding to UNICODE16 "digits").
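Both correspondences can be sketched in C++ as follows. This is an illustration only: the function names pack8, unpack8, pack4 and unpack4 are invented here, and the strings are taken most significant digit first (b7 or w3 at position 0).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 8 EXTENDED_ASCII characters b7 b6 ... b0  ->  64-bit integer, base R = 256.
inline std::uint64_t pack8(const std::string& s)
{
  assert(s.size() == 8);
  std::uint64_t n = 0;
  for (char c : s)
    n = (n << 8) | static_cast<unsigned char>(c);  // n = n*256 + next digit
  return n;
}

// 64-bit integer  ->  8 EXTENDED_ASCII characters b7 b6 ... b0.
inline std::string unpack8(std::uint64_t n)
{
  std::string s(8, '\0');
  for (int i = 7; i >= 0; --i)
  {
    s[i] = static_cast<char>(n & 0xFF);  // lowest byte goes to the last position
    n >>= 8;
  }
  return s;
}

// 4 UNICODE16 words w3 w2 w1 w0  ->  64-bit integer, base R = 65536.
inline std::uint64_t pack4(const std::u16string& w)
{
  assert(w.size() == 4);
  std::uint64_t n = 0;
  for (char16_t x : w)
    n = (n << 16) | static_cast<std::uint64_t>(x);  // n = n*65536 + next digit
  return n;
}

// 64-bit integer  ->  4 UNICODE16 words w3 w2 w1 w0.
inline std::u16string unpack4(std::uint64_t n)
{
  std::u16string w(4, u'\0');
  for (int i = 3; i >= 0; --i)
  {
    w[i] = static_cast<char16_t>(n & 0xFFFF);  // lowest word goes to the last position
    n >>= 16;
  }
  return w;
}
```

For instance, pack8("ABCDEFGH") is 0x4142434445464748, and unpack8 applied to that value returns "ABCDEFGH". Since each direction inverts the other, the mapping is 1-1.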

One reminder here: the index values for characters in a string run in the opposite direction from significance. The string representation of the decimal number 952 is s[0] = '9', s[1] = '5', s[2] = '2', so that, with R = 10 and each s[i] read as its digit value,

952 = s[0]*R² + s[1]*R¹ + s[2]*R⁰

Similarly, our EXTENDED_ASCII representation of the 64-bit number above would have the string representation b7 b6 b5 b4 b3 b2 b1 b0, or s[i] = b(7-i).
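Because s[0] is the most significant digit, a left-to-right scan of the string evaluates the number by Horner's rule. The sketch below illustrates this for decimal strings and also gives a helper for checking s[i] = b(7-i). The names decimalValue and byteOf are assumptions made for this example, and decimalValue assumes the input is short enough (at most 19 digits) not to overflow 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Value of a decimal digit string with s[0] most significant (Horner's rule):
// "952" -> ((9)*10 + 5)*10 + 2.
inline std::uint64_t decimalValue(const std::string& s)
{
  std::uint64_t n = 0;
  for (char c : s)
  {
    assert(c >= '0' && c <= '9');  // assumption: digits only
    n = n * 10 + static_cast<std::uint64_t>(c - '0');
  }
  return n;
}

// Byte b_k of n, where b0 is the least significant byte.
inline unsigned byteOf(std::uint64_t n, unsigned k)
{
  assert(k < 8);
  return static_cast<unsigned>((n >> (8 * k)) & 0xFF);
}
```

Here decimalValue("952") is 952. For n = 0x4142434445464748 and s = "ABCDEFGH", the code of s[i] equals byteOf(n, 7 - i) for every i from 0 to 7.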

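Looking at the possibilities for arbitrary length strings, the immensity of what can be represented symbolically (rather than numerically) is eye-popping: there are Rⁿ distinct strings of length n over an alphabet of size R, and no fixed-width integer can hold them all once n grows.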