15.6. Closed hashing

In closed hashing, the hash array contains individual elements rather than a collection of elements. When a key we want to insert collides with a key already in the table, we resolve the collision by searching for another open slot within the table where we can place the new key.

Each slot in the hash table contains a hash_entry, composed of one data element and a status field indicating whether that slot is occupied, empty, or deleted.

enum class hash_status { OCCUPIED, EMPTY, DELETED };

template <class T>
struct hash_entry
{
  T data;
  hash_status info;

  hash_entry(): info(hash_status::EMPTY)  {}
  hash_entry(const T& value, hash_status status)
    : data{value}
    , info(status)
    {}
};

The hash_set backing store is an array of hash_entry objects.

template <class Key,
          size_t N,
          class Comparator=std::equal_to<Key>>
class hash_set
{
  public:
    using value_type = Key;
    using key_type   = Key;

    hash_set() = default;
  private:
    array<hash_entry<Key>, N> table;
    Comparator compare;
    size_t sz = 0;
};

Collisions are resolved by trying a series of locations, \(p_0, p_1, p_2, \ldots p_{N-1}\) until we find what we are looking for. Each position is calculated as:

\[pos_i = hash(value) + f(i)\]

where:

  • hash(value) is the home slot

  • f() is a function that takes an integer number of tries and returns an integer offset.

How are new positions used in the hash table?

Searching:

Try positions \(p_0, p_1, p_2, \ldots\) until we find the requested value or an EMPTY slot.

Inserting:

Try positions \(p_0, p_1, p_2, \ldots\) until we find the same value, an EMPTY slot, or a DELETED slot. Put the new value in the position found and mark the position as OCCUPIED.

Erasing:

Try positions \(p_0, p_1, p_2, \ldots\) until we find the requested value or an EMPTY slot. If we find the value, then mark the position as DELETED.

The find function takes a value of the set's key type as a parameter and returns the position of the value in the table. It returns N if the value is not in the table.

size_t find (const Key& value) const
{
  size_t hash = std::hash<Key>()(value);
  size_t pos = hash % N;
  size_t count = 0;
  while ((table[pos].info == hash_status::DELETED ||
         (table[pos].info == hash_status::OCCUPIED
         && (!compare(table[pos].data, value))))
      && count < N)
  {
    ++count;
    pos = (hash + next_slot(count)) % N;
  }
  if (count >= N || table[pos].info == hash_status::EMPTY) {
    return N;
  }
  return pos;
}

The loop condition is fairly complicated and needs discussion. There are three ways to exit this loop:

  • We hit an EMPTY space (not DELETED, and not OCCUPIED)

  • We hit an OCCUPIED space that has the value we want

  • We have tried N different positions. (No place left to look!)

With find in place, other search operations are easy. Simply call find and evaluate the results.

constexpr
  bool contains (const Key& value) const noexcept
{
  return  find(value) != N;
}

int count (const Key& value) const
{
  size_t pos = find(value);
  return  (pos == N) ? 0 : 1;
}

Note that since our set is still forcing a uniqueness constraint, count will return only 0 or 1.

The code to remove elements is just as simple; it is actually easier than the erase we implemented for open hashing.

We try to find that element.

  • If found, we mark that slot DELETED and decrement the size.

  • Otherwise, do nothing.

void erase (const Key& value)
{
  size_t pos = find(value);
  if (pos != N) {
    table[pos].info = hash_status::DELETED;
    --sz;
  }
}

Inserts are a bit more work, because they involve potentially looking for an open slot to store a value.

Because this is a set (and not a multiset) we first call find to see if the value is already there.

bool insert (const Key& value)
{
  size_t hash = std::hash<Key>()(value);
  size_t pos = find(value);
  if (pos == N) {
    size_t count = 0;
    pos = hash % N;
    while (table[pos].info == hash_status::OCCUPIED && count < N)
    {
      ++count;
      pos = (hash + next_slot(count)) % N;
    }
    if (count >= N) {
      return false;  // could not add, table is full
    }
    table[pos].info = hash_status::OCCUPIED;
    table[pos].data = value;
    ++sz;
    return true;
  }
  // else replace existing value
  table[pos].data = value;
  return true;
}

If not found (pos == N), then we need to find a slot. The loop that does this is similar to the find loop, but unlike find, we stop at the first DELETED or EMPTY slot.

In the other searches, we kept going past DELETED slots, because the element we wanted might have been stored after an element that was later erased. But now we are only looking for an unoccupied slot in which to put the new value, so either a slot that has never been occupied (EMPTY) or a slot that used to be occupied but is no longer (DELETED) works.

The example contains #define statements you can use to change how the next slot is found.

Try it with different hash table sizes to see how clumping changes with the different probing strategies.
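
Putting the pieces together, here is a minimal usage sketch. It assumes the member functions shown in this section are collected into the hash_set class from the start of the section (gathered here into a hypothetical header, hash_set.h) and that next_slot uses linear probing:

#include <iostream>
#include <string>

#include "hash_set.h"   // hypothetical header containing the class above

int main()
{
  // a small table with 11 slots holding strings
  hash_set<std::string, 11> names;

  names.insert("ada");
  names.insert("grace");
  names.insert("ada");                            // already present

  std::cout << names.contains("ada") << '\n';     // prints 1
  std::cout << names.count("grace") << '\n';      // prints 1

  names.erase("grace");
  std::cout << names.contains("grace") << '\n';   // prints 0
}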

15.6.1. Choosing the next slot

The function next_slot(n) in the find and insert functions controls the sequence of positions that will be checked. It is the implementation of the function \(f(n)\) mentioned earlier. Recall the find function:

size_t find (const Key& value) const
{
  size_t hash = std::hash<Key>()(value);
  size_t pos = hash % N;
  size_t count = 0;
  while ((table[pos].info == hash_status::DELETED ||
         (table[pos].info == hash_status::OCCUPIED
         && (!compare(table[pos].data, value))))
      && count < N)
  {
    ++count;
    pos = (hash + next_slot(count)) % N;
  }
  if (count >= N || table[pos].info == hash_status::EMPTY) {
    return N;
  }
  return pos;
}

On our \(n^{th}\) try, we examine the position

\[pos_n = hash(value) + f(n)\]

where \(hash(value)\) always returns the home slot for any hashed value. This is the location where the value would be stored if that slot is currently unoccupied. The f function computes an offset from that home slot. The most common schemes for choosing the next slot are linear probing, quadratic probing, and double hashing.

Linear probing
\[f(n) = n\]

If a collision occurs at location pos, we next check locations \(pos+1 \pmod N, pos+2 \pmod N, pos+3 \pmod N, \ldots\) and so on.

Because collisions get stored in a location originally intended for another hash code, values have a tendency to clump together in the hash table.
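
The text leaves next_slot up to the example's #define settings; as an assumed concrete implementation, linear probing needs nothing more than a member function like this (matching the next_slot(count) call used in find and insert):

// Linear probing: on the nth retry, probe n slots past the home slot.
size_t next_slot (size_t n) const
{
  return n;
}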

Quadratic probing
\[f(n) = n^2\]

If a collision occurs at location pos, we next check locations \(pos+1 \pmod N, pos+4 \pmod N, pos+9 \pmod N, \ldots\) and so on.

Because the jumps between slots increase as the number of tries increases, this function tends to reduce clumping (and results in shorter searches). But it is not guaranteed to find an available empty slot if the table is more than half full or if N is not a prime number.

Note

Again, prime numbers!

Remember the earlier discussion about how \(\% N\) tends to improve the key distribution when N is prime? You can see why it’s part of programming “folklore” that hash tables should be prime-number sized, even if most programmers can’t say why that’s supposed to be good.
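
Under the same assumed next_slot interface, quadratic probing simply squares the try count:

// Quadratic probing: offsets of 1, 4, 9, 16, ... from the home slot.
size_t next_slot (size_t n) const
{
  return n * n;
}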

Double hashing
\[f(n) = n * h_2(value)\]

where \(h_2\) is an alternate hash code function.

If a collision occurs at location pos, we next check locations \((pos+1*h_2(value)) \pmod N, (pos+2*h_2(value)) \pmod N, (pos+3*h_2(value)) \pmod N, \ldots\) and so on.

This also tends to reduce clumping, but, as with quadratic probing, it is possible to get unlucky and miss open slots when trying to find a place to insert a new key.
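
Double hashing does not quite fit the one-argument next_slot(count) call used above, because the offset also depends on the key. A hypothetical two-argument variant, using one common choice for \(h_2\) (derived here from the primary hash code), might look like:

// Hypothetical variant: find and insert would need to pass the key's
// primary hash code along with the try count.
size_t next_slot (size_t n, size_t hash) const
{
  size_t h2 = 1 + hash % (N - 1);   // secondary hash; never zero
  return n * h2;
}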

15.6.2. Analysis of closed hashing

We define \(\lambda\), the load factor of a hash table, as the number of items contained in the table divided by the table size. In other words, the load factor measures what fraction of the table is full. By definition, \(0 \le \lambda \le 1\).

  • Given an ideal collision strategy, the probability of an arbitrary cell being full is \(\lambda\).

  • Therefore, the probability of an arbitrary cell being empty is \(1 - \lambda\).

  • The average number of table elements we expect to examine before finding an open position is therefore \(\frac {1}{1-\lambda}\) (a short derivation follows this list).
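
One way to see the last bullet, assuming each probe lands on an independent, uniformly random cell: the number of probes until the first empty cell is geometrically distributed with success probability \(1 - \lambda\), so

\[E[\text{probes}] = \sum_{k=1}^{\infty} k \, \lambda^{k-1} (1-\lambda) = \frac {1}{1-\lambda}\]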

Since we never look at more than N positions, given an ideal collision strategy, finds and inserts are on average

\[O \left ( \min \left ( \frac {1}{1-\lambda}, N \right ) \right )\]

The graph shows how \(\frac {1}{1-\lambda}\) changes as \(\lambda\) increases.

[Figure: plot of \(\frac {1}{1-\lambda}\) as the load factor \(\lambda\) increases toward 1]

If the table is less than half full (\(\lambda \lt 0.5\)) then we expect to try on average no more than 2 slots during a search or insert. Not too bad. But as \(\lambda\) gets larger, the average number of slots examined grows toward N. As the table fills and sz approaches N, the performance degenerates toward \(O(N)\) behavior.
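
Plugging a few load factors into \(\frac {1}{1-\lambda}\) makes that growth concrete:

\[\lambda = 0.5 \Rightarrow 2 \text{ probes}, \qquad \lambda = 0.75 \Rightarrow 4, \qquad \lambda = 0.9 \Rightarrow 10, \qquad \lambda = 0.99 \Rightarrow 100\]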

Because of this, a general rule of thumb for hash tables is to keep them no more than half full. At that load factor, we can treat searches and inserts as \(O(1)\) operations. If we let the load factor get much higher, we start seeing \(O(N)\) performance.

No collision resolution scheme is truly ideal, so keeping the load factor low enough is even more important in practice than this idealized analysis indicates.
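
Keeping the load factor in check means being able to measure it. The class shown earlier does not include such a helper, but a minimal sketch is straightforward:

// Fraction of the table currently occupied: sz / N.
double load_factor () const noexcept
{
  return static_cast<double>(sz) / N;
}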


More to Explore

  • The content on this page was adapted from Resolving Collisions, by Steven J. Zeil for his data structures course CS361.
