Hash Functions

Good hash functions should aim at the assumption of simple uniform hashing: each key is  equally likely to hash into any of the m slots. Draw keys , k, from U with probability P (k), so SUH:

 

 
å
k :h (k) = j
P (k) = 1/m , j = 0, ..., m-1
thus all hash values are equally likely

This is hard to use as P (k)'s are not usually known (even approximately.) If the k's are U[0,1) and independent then h (k) =  ë k m û (m fixed) satisfies SUH.

Hash functions are usually  chosen to be as "independent" of patterns that may exist in the k's. Note: SUH is powerful, but often hash functions need to do "better than" just SUH. Sometimes "mixing" and "separation" are required, so that a close k1 k2 should be far apart in their hash value.

Aside: Why such "separating" hash function? Cryptographic verification of signatures.

My Will  file ® hash ® digest ® sign ® certificate

What if my brother could change my will slightly (add the word "not") and not change the digest? Bad news for me! Even though keys may not be integers, lets consider them to be integers.

Hash Functions

h (k) = k ( mod m )

What are good choices for m?

  1. m ¹ 2p
  2. m ¹ 2p ± c (close to a power of 2)
  3. m ¹ 10p (especially with decimal keys)
  4. m = prime is a good choice.

Note: If your U is well known, you could try to experimentally optimize m. This is called the division method since:

k ( mod m ) = k - m ( ë k/m  û )

Multiplication Method

h (k)  =  ë m ( k A (mod 1) )  û

A is a real number 0 < A < 1

m is usually an integer ( 2p )

k A (mod 1) = k A -  ë k A  û

Example: m = 2p ,

k x  ë A  2w û = 2w r1 + r0

2p ( 2w r1 + r0 ) = 2w+p r1 + r0 2p , the p   m.s.b.s here are h (k)
w   k
x    ë A  2w û

r1 p r0
multiplying by 2p shifts  r0 by p bits (creates an integer our of the p m.s.b.s of  r0 ) What choices of A re best? A: A should be irrational. What irrationals are the most irrational?

 A: A = a + 1    
                   b + 1  
                          c + 1  
                                 c + ...

with repeating continued fractions (solutions to quadratic equations).

Universal Hashing:

With a fixed hash function there are keys that will hash poorly. We previously solved bad worst case behavior by randomizing into average cases: choose your hash functions randomly! This is called Universal Hashing, to do this we must construct a family of hash functions to choose from. H is such a family.

h Î H h: U ® { 0,..., m-1 }

if for each pair x, y Π U # h ' h (x) = h (y) is | H | / m . This means that h ΠH randomly chosen will give h (x) = h (y) (collision) with probability 1/m. This means that on the average (with regard to the functions in H) we get SUH.

Theorem: Let h Î H. We hash n keys into a table of size m, n £ m. Then the number of expected collisions for a key x is less than one.

Proof: cyz = { 1 if h (y) = h (z), 0 otherwise }

E [ cyz ] = 1/m (because h was chosen randomly).

cx = total number of collisions with x in T of size m with n keys.

E [ ( cx) ]  =
å
y Î T
E ( ( cxy) ] = ( n-1 ) ( 1/m )
assumptions: y ¹ x and ( n- 1 ) ( 1/m ) < 1 since n £ m.

How can we design H a universal class?

.

Example: |T| = m, m prime. x = <x0, x1, x2, ...,xr> Bytes, Max value Byte < m

<a0, a1, a2, ...,ar> ai randomly chosen from {0, 1, 2, ..., m-1}

h (x) =
r
å
i = 0 
ai xi ( mod m )

H = Ua {ha}, has mr+1 members.

Theorem: The class H is a universal class.

Proof: Consider x, y, can assume x0 ¹ y0. With {a0,  ...,ar} given:

 a0  (x0 - y0 ) º
    r
--   å
i = 1
a(xi - yi )   ( mod m )
Has only one a0 that solves it. (write down h (x) = h (y) for a0 ).   Since m is prime:

 

 a0  º
   r
-å
i = 1 
a(xi - yi ) (x0 - y0 )-1  (mod m)
This means  a0 can be found to cause a collision each time. There are thus mr different collisions here, one for each of the mr choices of <a1, a2, a3, ...,ar>.

Since there are mr+1 , <a0, a1, a2, ...,ar>'s, x and y collide with probability mr/mr+1 = 1/m Þ H is Universal.

An aside on modular inversion: a-1( mod m ) is the integer that solves:

a a-1 = 1 ( mod m )

For a to have an inverse ( mod m ) it must be that : gcd ( a, m ) = 1, i.e. a and m have no common factor.

One computes gcd ( a, m ) via the Euclidean algorithm ( will analyze this later this term.) A variant called the Extended Euclidean Algorithm: given a, m produces gcd ( a, m ) = xa + ym. If gcd ( a, m ) = 1 the x = a-1 !

Method 2: If m is prime, then am-1 º 1( mod m ) for any a. (This fact is the basis for probabilistic primality testing.) thus a am-2 º 1 ( mod m ) and a-1 º am-2 ( mod m ). If modular multiplication is Q (1) , then what is the cost of modular exponentiation?

a3    =   a11  =   (a2 ) a      ( s - m )

a4    =   a100  =  (a2 )2       ss

a5    =   a101  =   (a2 )2 a    s ( s - m )

a10110   =  a22   = (((a2 )2 a)2 a)2

So starting from the next to the m.s. bit and working towards the l.s. bit of  the exponent, when you come to a '0' Þ square, and when you come to a '1' Þ square-multiply. Why does this work?

Induction:

a1   = a

a10 = a2

a11 = a3

Assume ap is correct

q  =  2 p + 1 ® square-multiply

    =  2 p   ® square

Cost:  Q ( lg p ) operations.

Note: this is the "giant-step" algorithm.

Hash Tables - 4 of 5