7.5 Hash Tables

In their quest for ever faster searching methods, several IBM engineers and scientists discovered hashing in the 1950's. The idea is to transform the information itself into a subscript in a linear array. Let the array that holds the data be T(0), ..., T(n-1). The array T is called a hash table. A hash function h transforms the information x into an integer h(x) such that:

The information x is then stored at T(h(x)), together with any additional information fields associated with x. If the record T(h(x)) is already in use, then a collision occurs, and x must be stored elsewhere. A good hashing scheme minimizes the frequency of collisions by scattering information into random locations in the hash table. The choice of hash functions and the resolution of collisions are discussed below.

The choice of n affects the performance of a hashing scheme. If n is much larger than the number of records to be stored, then collisions are infrequent, but space is wasted.

7.5.1 Hash Functions

Perhaps the simplest hash function maps x, interpreted as a positive integer, to its remainder when divided by n:

If x consists of more than one word, then one can compress x into one word by forming the exclusive-or of the words x1, ..., xk that constitute x:

This method of compression generally works well in practice, for every bit of x participates in the computation. For example, suppose the names Alice, Bobby, Cindy, David, and Ellen are inserted into a hash table with n = 7. On the PC with this compression method we obtain the following table Names:

Names(0)  
Names(1) Bobby
Names(2) Ellen
Names(3) Cindy
Names(4) Alice
Names(5)  
Names(6) David

7.5.2 Collision Resolution

The two fundamental collision resolution schemes are linear probing and separate chaining. Many other collision resolution schemes have been devised. They are described in the references.

In linear probing, when a collision occurs during insertion at T(i), one probes sequentially among T(i+1), T(i+1), ..., T(n-1), T(0), T(1), etc. For example, in the hash table Names above, an attempt to insert Frank causes a collision with David at Names(6). With linear probing, Frank would be stored at Names(0). An attempt to insert Garth into Names causes a collision with Ellen at Names(2). With linear probing, since Names(3) and Names(4) are already occupied, Garth must be stored in Names(5). Although linear probing permits very rapid insertions, deletions are difficult.

In separate chaining, each table entry T(i) is a pointer to the head of a separate linked list. All records that are mapped to the same subscript are stored in the same list. With separate chaining both insertions and deletions are straightforward. In essence, separate chaining partitions a list into n pieces. Thus an insertion into a hash table with separate chaining could be n times faster than insertion into a sorted linked list.