Hash Maps: Lesson 1

How Hashing Works

What a hash function does, how it converts keys to array indices, and why it enables O(1) lookups.

What Is It?

Imagine you have a list of a million records and you need to find one by name. Going through them one by one is slow — on average you'd check half a million records. Even a smarter approach like binary search still takes about 20 comparisons.

What if you could jump directly to the right record in one step?

That is what hashing does. Instead of searching for where a key lives, you calculate where it lives. A hash function is a formula that takes your key — a name, a number, anything — and turns it into an array index. You go straight to that index. No searching.

The whole idea in one line: key → hash function → index → look there.

Analogy

The Locker Room

Imagine a school with 100 lockers, numbered 0 to 99. Every student gets assigned a locker using a simple formula: take the digits in their student ID, add them up, and take the last two digits. That is their locker number.

When a student wants to find their locker, they run the same formula on their ID — and they walk straight to the right locker. No wandering around. No asking for help.

But here is the catch: two students might end up with the same locker number. Maybe student 312 and student 201 both get locker 3. That is called a collision — two different keys landing on the same slot. When that happens, both students share the locker (we hang both items on the same hook). Handling collisions is the main challenge in building a hash map.


How It Works

Why We Need Hashing at All

Suppose you store user records in a plain array:

pseudocode
users = [
  { key: "alice", email: "alice@mail.com" },
  { key: "bob", email: "bob@mail.com" },
  { key: "carol", email: "carol@mail.com" },
  ...and 999,997 more...
]

To find Carol, you start at index 0 and check each key. On average, 500,000 comparisons. That is way too slow for a system with thousands of lookups per second.

How a Hash Function Works

For string keys, one simple approach is to add up the character codes of each letter:

pseudocode
function simpleHash(key, tableSize):
  total = 0
  for each character c in key:
    total = total + charCode(c)
  return total mod tableSize

Let's trace this for the key "cat" with table size 10:

'c' = 99
'a' = 97
't' = 116
total = 99 + 97 + 116 = 312
index = 312 mod 10 = 2

So "cat" goes in slot 2. Next time someone looks up "cat", we run the same formula, get 2, and go straight to slot 2.
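The same trace can be run in Python. This is a sketch of the pseudocode above, with `ord()` playing the role of `charCode`:

```python
def simple_hash(key: str, table_size: int) -> int:
    """Character-sum hash: add up character codes, then wrap with modulo."""
    total = 0
    for c in key:
        total += ord(c)  # ord() returns the character code, e.g. ord('c') == 99
    return total % table_size

print(simple_hash("cat", 10))  # 99 + 97 + 116 = 312, and 312 mod 10 = 2
```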

Shrinking the Number Down with Modulo

The hash formula might produce a big number — say 4,738,291. But if our table only has 100 slots, we use the modulo operator to bring it into range:

pseudocode
4738291 mod 100 = 91

Modulo gives us the remainder after dividing. The result is always between 0 and tableSize - 1.

What Makes a Good Hash Function

  1. Deterministic — Same key always gives the same index. If hash("alice") gives 7 today, it must give 7 every time.
  2. Spreads values evenly — Keys should land in different slots as much as possible. If everything piles into the same 10 slots, those slots get slow to search through.
  3. Fast to compute — The whole point is to avoid slow searching. If computing the hash is slow, you've gained nothing.
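To see the second property in action, hash a batch of keys and count how many land in each slot. This sketch uses the character-sum hash on 1,000 made-up keys (the key names are arbitrary):

```python
from collections import Counter

def simple_hash(key: str, table_size: int) -> int:
    return sum(ord(c) for c in key) % table_size

# Hash 1,000 made-up keys into 10 slots and count each slot's size.
keys = [f"user{i}" for i in range(1000)]
counts = Counter(simple_hash(k, 10) for k in keys)
for slot in range(10):
    print(slot, counts[slot])
```

A roughly even spread across the ten slots means lookups stay short; a lopsided count would warn you that the function is clumping keys together.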

A Problem with the Simple Sum

The simple character-sum approach has one serious flaw: anagrams like "cat", "act", and "tac" all have the same character sum (99 + 97 + 116 = 312), so they all land in the same slot — even though they are different words.

A better approach multiplies each character by a different number based on its position:

pseudocode
function betterHash(key, tableSize):
  total = 0
  for i = 0 to length(key) - 1:
    total = total + charCode(key[i]) * (31 to the power of i)
  return total mod tableSize

Now "cat" and "act" produce different totals because position matters.
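Here is a Python sketch of this positional scheme. It uses table size 13 rather than 10: the raw totals for "cat" and "act" differ by 60, so with a size of 10 these two particular anagrams would, by coincidence, still collide after the modulo (one reason prime table sizes are a common choice):

```python
def better_hash(key: str, table_size: int) -> int:
    total = 0
    for i, c in enumerate(key):
        total += ord(c) * (31 ** i)  # weight each character by its position
    return total % table_size

print(better_hash("cat", 13))  # raw total 114582
print(better_hash("act", 13))  # raw total 114642: a different slot
```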

Collisions Are Unavoidable

No matter how good your hash function is, collisions will happen. If you have 100 slots and you insert 101 keys, at least two must share a slot — there simply are not enough slots for every key to have its own.

In practice collisions start happening sooner than you'd expect. With 100 slots, you'll likely see your first collision around your 13th or 14th insertion. It's similar to the birthday problem — in a room of 23 people, there's a 50% chance two share a birthday, even though there are 365 days.
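You can check that "13th or 14th insertion" figure with a quick simulation. This is a sketch assuming keys hash to uniformly random slots; the function name is illustrative:

```python
import random

def average_first_collision(table_size: int, trials: int = 2000, seed: int = 1) -> float:
    """Insert into random slots until one repeats; average that insertion count."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        inserts = 0
        while True:
            inserts += 1
            slot = rng.randrange(table_size)
            if slot in seen:
                break  # this insertion caused the first collision
            seen.add(slot)
        total += inserts
    return total / trials

print(average_first_collision(100))  # typically lands near 13
```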

[Diagram: a hash table with 10 slots, numbered 0 to 9. "bob", "dog", and "ant" each occupy their own slot, while "cat" and "act" both hash to index 2 — a collision.]

There are two main ways to handle collisions:

  1. Chaining — Each slot holds a small list. When two keys land on the same slot, both go into that slot's list. Looking up a key means hashing to the slot, then checking the list.
  2. Open addressing — When a slot is taken, you look for the next empty slot nearby. Looking up a key means hashing to the slot, then checking nearby slots if the first one does not match.

Chaining is simpler and more common for beginners. We'll build a full hash map using chaining in the next lesson.
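As a quick preview, chaining fits in a few lines of Python: each slot holds a list of key/value pairs. The character-sum hash and the `put`/`get` names are just for illustration:

```python
# Each slot is a Python list; colliding keys share the slot's list.
table_size = 10
table = [[] for _ in range(table_size)]

def put(key: str, value):
    slot = sum(ord(c) for c in key) % table_size
    for pair in table[slot]:
        if pair[0] == key:            # key already present: update it
            pair[1] = value
            return
    table[slot].append([key, value])  # new key: append to this slot's list

def get(key: str):
    slot = sum(ord(c) for c in key) % table_size
    for k, v in table[slot]:
        if k == key:
            return v
    return None

put("cat", "meow")
put("act", "a deed")  # anagram of "cat": both land in slot 2
print(get("cat"), get("act"))
```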

The Load Factor

The load factor measures how full your table is:

pseudocode
load factor = number of entries / table size

A load factor of 0.75 means the table is 75% full. As the table fills up, collisions happen more often and lookups slow down. Most hash maps automatically resize — double the table size and move everything over — when the load factor gets too high. The typical threshold is 0.75.
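The resize check itself is one line. A sketch in Python, using the 0.75 threshold from the text (the function name is illustrative):

```python
def needs_resize(num_entries: int, table_size: int, threshold: float = 0.75) -> bool:
    """True when the load factor exceeds the resize threshold."""
    return num_entries / table_size > threshold

print(needs_resize(70, 100))  # load factor 0.70 -> False
print(needs_resize(80, 100))  # load factor 0.80 -> True
```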

Examples

Example 1: Hashing integers

pseudocode
tableSize = 10
hash(42) = 42 mod 10 = 2
hash(17) = 17 mod 10 = 7
hash(102) = 102 mod 10 = 2   // collision with 42!
hash(35) = 35 mod 10 = 5

Example 2: Collision pile-up with a small table

pseudocode
Insert: apple, banana, cherry, date, fig, grape
Table size: 5

Suppose the hash values work out to:
apple  -> 2
banana -> 0
cherry -> 2 (collision with apple)
date   -> 3
fig    -> 2 (collision with apple and cherry)
grape  -> 4

Using chaining:
Index 0: [banana]
Index 1: (empty)
Index 2: [apple] -> [cherry] -> [fig]
Index 3: [date]
Index 4: [grape]

Looking up "fig" means: hash to slot 2, then walk 3 nodes.
This is why we resize when the table gets too full.

Common Mistakes

1. Assuming hash maps are always instant. The average case is fast (one step), but if many keys land in the same slot, that slot becomes a slow list. A bad hash function or a table that is too full can cause this.

2. Using a hash function that is not consistent. If your function uses something that changes — like random numbers or the current time — you'll store a key at one index and look for it at a completely different one. Same key must always produce the same index.

3. Confusing hashing with encryption. A hash function for a hash map does not need to be secure or hard to reverse. It just needs to be fast and spread values evenly. Secure cryptographic hashes (like SHA-256) are much slower and are not what you want here.

4. Forgetting that equal objects must have the same hash. If your keys are custom objects, two objects that are considered equal must produce the same hash value. Otherwise, your map will store duplicates and miss lookups.
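In Python, for example, this contract is the pairing of `__eq__` and `__hash__`; a frozen dataclass generates both consistently from the fields (the `Point` class is just an illustration):

```python
from dataclasses import dataclass

# frozen=True makes instances immutable and auto-generates matching
# __eq__ and __hash__ from the fields.
@dataclass(frozen=True)
class Point:
    x: int
    y: int

a = Point(1, 2)
b = Point(1, 2)
print(a == b)              # True: same field values
print(hash(a) == hash(b))  # True: equal objects hash alike
positions = {a: "home"}
print(positions[b])        # lookup via b succeeds because eq and hash agree
```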

Best Practices

Use a well-tested hash function. Don't invent your own for real code — use what your language's standard library provides.

Keep the load factor below 0.75. When the table gets too full, resize it. A 75% full table is the standard threshold.

Think about distribution, not uniqueness. Collisions are normal and handled. Your hash function just needs to spread keys around — not assign every key a unique slot.