Algorithms from zero
Collisions and the real cost of hashing
You have learned that a hash set or map answers “is X in the set?” in O(1) average time. But “average” is the key word. The O(1) promise depends on two things: (1) the hash function spreads keys evenly across the buckets, and (2) each bucket is small. When two different keys map to the same bucket—a collision—they must be stored together, and lookup time grows with the chain length. If every key collides into one bucket, the structure becomes a linked list and time becomes O(n). This lesson explores why collisions happen, how they are resolved, how to keep them rare, and when the O(1) average guarantee actually holds.
After this lesson you can explain why two different keys can hash to the same index (a collision), understand how separate chaining resolves collisions by storing multiple key–value pairs in a linked list at each bucket, recognize that the load factor (items ÷ buckets) determines chain length and thus lookup cost, explain why rehashing maintains O(1) average time even as the table grows, reason about amortized analysis to understand why the occasional O(n) rehash is spread over many cheap operations, identify when hash table performance degrades to O(n) (bad hash function or adversarial input), and distinguish between average-case and worst-case performance in hash table operations.
The problem: what happens when two keys collide?
Imagine a hash function for a table of size 5: it hashes keys into indices 0–4.
function hash(key) {
return key.charCodeAt(0) % 5;
}
hash('Alice') // 'A' (65) % 5 = 0
hash('April') // 'A' (65) % 5 = 0 ← collision!
hash('Bob') // 'B' (66) % 5 = 1
hash('Ben') // 'B' (66) % 5 = 1 ← collision!
hash('Charlie') // 'C' (67) % 5 = 2Alice and April both hash to index 0. Ben and Bob both hash to index 1. Where do we store the second key if the bucket is already occupied?
Separate chaining: store a linked list at each bucket.
The simplest solution is to make each bucket a linked list. When a collision happens, append the key–value pair to the list at that bucket.
// Conceptually, the hash table is an array of linked lists:
// table[0] -> (Alice, 555-1234) -> (April, 555-9999) -> null
// table[1] -> (Bob, 555-5678) -> (Ben, 555-4321) -> null
// table[2] -> (Charlie, 555-0000) -> null
// table[3] -> null
// table[4] -> nullTo insert Alice at index 0: place the (Alice, phone) pair at the head of the list at table[0]. To insert April: hash April, get index 0, append (April, phone) to the list at table[0].
To lookup Alice: hash Alice, get index 0, scan the list at table[0] until you find “Alice”. Average time: O(chain length).
Chain length depends on the load factor.
The load factor is the ratio of items to buckets: α = items ÷ buckets.
If you have 10 items and 20 buckets, α = 0.5. Items are spread across many buckets, so chains are short. If you have 10 items and 2 buckets, α = 5. Items are crowded into few buckets, so chains are long.
With separate chaining:
- Average chain length = α (the load factor).
- Average lookup time = O(1 + α).
- If α is kept small (say, ≤ 0.75), then lookup is O(1 + constant) = O(1) average.
Resizing and rehashing: maintain a low load factor.
As you insert more items, the load factor grows. Chains lengthen, and lookup slows. To fix this, when the load factor exceeds a threshold (e.g., 0.75), the table allocates a new, larger array (usually double the size), and rehashes every key into the new array.
// Before rehashing: 10 items, 5 buckets, α = 2.0
// Allocate new table with 10 buckets
// For each (key, value) in old table:
// newIndex = hash(key, 10) // Re-hash with new table size
// insert into new table[newIndex]
// Result: 10 items, 10 buckets, α = 1.0
// Chains are shorter again; lookup stays O(1) averageRehashing is expensive—it scans every item and re-hashes it, costing O(n) for n items. But it happens rarely. Each item is rehashed only a few times over its lifetime. Amortized over many cheap inserts, the occasional O(n) rehash becomes O(1) amortized per insert. This is why hash table insertion is O(1) amortized, even though individual inserts rarely trigger an O(n) resize.
Worst case: when collisions are not rare.
With a good hash function and low load factor, collisions are rare and chains are short. But if the hash function is poor—or if an attacker deliberately crafts keys that collide—many keys can land in one bucket. In the absolute worst case, every key hashes to the same index, creating a chain of n items. Lookup becomes O(n).
Example: a bad hash function using only the first character will cause all words starting with ‘A’ to collide.
The real O(1) guarantee:
Hash tables achieve O(1) average lookup when:
- The hash function spreads keys uniformly (random-looking indices).
- The load factor is kept small (via rehashing).
With these conditions, the probability that a lookup encounters a long chain is negligible, and O(1) average holds in practice.
Building a chained hash table (simulation):
class ChainedHashTable {
constructor(size = 5) {
this.buckets = Array.from({ length: size }, () => []); // Array of lists
this.size = size;
this.count = 0;
}
hash(key) {
let h = 0;
for (let char of key) {
h = (h * 31 + char.charCodeAt(0)) % this.size;
}
return h;
}
set(key, value) {
const index = this.hash(key);
const bucket = this.buckets[index];
// Check if key already exists
for (let [k, v] of bucket) {
if (k === key) {
v = value; // Update existing
return;
}
}
// Add new key–value pair
bucket.push([key, value]);
this.count++;
}
get(key) {
const index = this.hash(key);
const bucket = this.buckets[index];
// Scan the chain for the key
for (let [k, v] of bucket) {
if (k === key) return v;
}
return undefined;
}
has(key) {
return this.get(key) !== undefined;
}
get loadFactor() {
return this.count / this.size;
}
}
// Example usage
const table = new ChainedHashTable(5);
table.set("Alice", "555-1234");
table.set("April", "555-9999"); // Collision at index 0
table.set("Bob", "555-5678");
console.log(table.get("Alice")); // "555-1234" (scans chain of length 2)
console.log(table.get("April")); // "555-9999" (found after 1 scan)
console.log(table.get("Bob")); // "555-5678" (chain of length 1)
console.log(table.loadFactor); // 0.6 (3 items, 5 buckets)Try the simulation:
Let us trace a lookup in a chained table. We search for “April” in a table with a collision at index 0:
1
function lookup(key) {
2
const index = hash(key);
3
const bucket = table[index];
4
for (let [k, v] of bucket) {
5
if (k === key) return v;
6
}
7
return undefined;
8
}
The lookup had to scan two pairs in the chain before finding April. If the chain were longer, the scan would take more time.
Hash table operations with chaining:
| Operation | Time (average) | Time (worst) | Notes |
|---|---|---|---|
| Lookup / Has | O(1 + α) | O(n) | α = load factor; O(1) if α kept small |
| Insert | O(1 + α) amortized | O(n) | Occasionally O(n) for rehash, but amortized O(1) |
| Delete | O(1 + α) | O(n) | Same as lookup + removal |
Where α = items / buckets and n = total items.
Amortized analysis of resizing:
When the load factor exceeds a threshold (e.g., 0.75), the table doubles in size and every item is rehashed. Rehashing one insertion triggers an O(n) cost. But this is rare: you cannot double the table size more than O(log n) times. Over a sequence of n insertions starting from a small table, the total rehash work is O(n) + O(n/2) + O(n/4) + … = O(2n), or O(1) per insertion on average.
Worst case: O(n) lookup.
If all n keys hash to the same index (poor hash function, adversarial input, or bad luck), the entire table becomes one long chain, and lookup becomes O(n).
Example: if you hash only the first character, all words starting with ‘A’ collide.
In practice:
- With a good hash function (e.g., multiply by a large prime, fold bits), collisions are rare and chains are short.
- Load factor ≤ 0.75 is typical; chains average 0.75 items.
- O(1) average lookup is reliable for typical (non-adversarial) inputs.
In a chained hash table, a collision occurs when two different keys hash to the same index. How is the collision resolved?
A hash table has 100 items and 20 buckets. What is the load factor?
With separate chaining, the average chain length is equal to the load factor. If the load factor is kept ≤ 0.75, what is the average chain length?
When a hash table's load factor exceeds a threshold (e.g., 0.75), what happens during rehashing?
A hash table has n items all hashing to the same bucket due to a poor hash function. What is the lookup time for a key in this table?
Edge cases
Empty buckets: In a hash table with separate chaining, some buckets may remain empty if the hash function does not distribute keys evenly. This is normal and not a sign of failure; it simply means those indices were unlucky.
Very high load factor: If you insert without rehashing and the load factor grows unbounded (e.g., α = 100), chains become very long and average lookup time O(1 + α) becomes O(101), effectively a slow linear scan. This is why real hash tables rehash.
Deleting and shrinking: When items are deleted, the load factor decreases. Some implementations shrink the table (e.g., when α < 0.25) to avoid wasting memory. Shrinking also requires an O(n) rehash.
lesson.inset.misconception
“All keys will eventually collide if the table is full.” Not quite. Even if the load factor is high (say, 2.0), not all keys collide; they are distributed across buckets, with some buckets containing 0, 1, 2, or more keys. The load factor is the average chain length, not a guarantee that all collide. However, a high load factor does mean longer chains on average, so rehashing to lower the load factor is important.
“Rehashing is wasted work.” Rehashing is O(n) per resize, but sizes double, so you resize O(log n) times in total. The amortized cost is O(1) per insert. This small constant overhead is worth the benefit of keeping chains short.
Why is hash table lookup O(1) average but O(n) worst case?
A collision occurs when two different keys hash to the same index. Separate chaining resolves collisions by storing a linked list at each bucket; when a collision happens, append the new key–value pair to the list. To lookup a key, hash it, then scan the short list at that bucket.
Key facts:
- Load factor α = items / buckets. It equals the average chain length. Keep it small (≤ 0.75) to maintain short chains.
- Average lookup time = O(1 + α). With a good hash function and low α, this is O(1).
- Rehashing: When α exceeds a threshold, allocate a larger table and re-hash every item. Cost: O(n) per rehash, but amortized O(1) per insert over many operations.
- Amortized analysis: A single O(n) rehash is expensive, but it happens rarely. Over a sequence of n insertions, the total resize work is O(n), so O(1) per insertion on average.
- Worst case O(n): If all keys collide into one bucket (bad hash function or adversarial input), lookup becomes a linear scan. This is rare with a good hash function.
- Good hash function: Spreads keys uniformly, making collisions rare and chains short. Protects against worst-case degradation.
Hash tables are O(1) average and O(n) worst case. With proper design (good hash function, load factor management, resizing), they are reliable and fast in practice.