Home Engineering

How HashMap Works Internally

10 September 2025 · 14 min read
Table of contents

You use HashMap every day. But when a cache returns null despite the correct key, when a service slows down unexpectedly after data grows, when a bug appears only under concurrent load: most of those problems have roots in not understanding how HashMap works internally. This post explains each mechanism from a design perspective, not from definitions.

Why HashMap Exists

Before HashMap, finding a product by ID in a list meant iterating through every element:

List<Product> catalog = productRepository.findAll(); // 100,000 products

Product found = null;
for (Product p : catalog) {
    if (p.getId().equals(targetId)) {
        found = p;
        break;
    }
}

With 100k products and 1,000 requests per second, the average lookup traverses 50k elements: 50 million comparisons per second. When the catalog grows to 1 million products, that number multiplies by 10. This is the problem HashMap solves: instead of O(n) linear search, it provides O(1) average lookup by mapping a key directly to a storage location.

Trade-off: O(1) is the average case, not the worst case. The worst case is still O(n) when too many collisions occur. Understanding HashMap internals means understanding when O(1) breaks down.

Internal Structure: An Array of Buckets

HashMap is an array of “buckets.” Each bucket is a linked list (or a tree from Java 8 onward) of entries. This array is called table in the Java source code.

// Simplified from java.util.HashMap source
transient Node<K,V>[] table;

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next; // points to the next entry in the same bucket
}

The default initial capacity is 16 buckets. To determine which bucket a key belongs to, HashMap computes:

// bucket index = hash(key) & (capacity - 1)
int index = (n - 1) & hash;

With capacity 16 (binary: 10000), 16 - 1 = 15 (binary: 01111). ANDing with 15 keeps the last 4 bits of the hash, producing a value from 0 to 15. This is why HashMap capacity is always a power of two: the & (capacity - 1) operation replaces the modulo % capacity, and is significantly faster.

Why this matters: If you initialize new HashMap<>(15), Java rounds up to 16 (the nearest power of two). new HashMap<>(17) creates an array of 32 buckets, not 17. If you set the initial capacity based on expected entry count without accounting for the load factor, the HashMap resizes earlier than you expect.

Below is a picture of what a HashMap with 8 buckets actually looks like. Light-grey buckets are empty. Dark buckets are occupied. Teal entries are linked by next pointers when there is a collision:

Drag · Scroll to zoom

hashCode: Which Bucket Does the Key Go To?

hashCode() maps a key object to an integer. HashMap uses this integer to find the bucket. Java adds a scrambling step before using it:

// HashMap.hash() - mixes high bits into low bits
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

The XOR with h >>> 16 handles the case where two keys have hash codes that differ only in the high bits. Without this mixing step, with a small capacity (16 or 32), those keys would land in the same bucket because & (capacity-1) only keeps the low bits.

What happens with a poor hashCode: If all keys return the same hash code, every entry lands in bucket 0. The HashMap becomes a linked list with O(n) for every operation.

// The worst possible hashCode
public class ProductId {
    private final String value;

    @Override
    public int hashCode() {
        return 42; // all keys go into the same bucket
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ProductId)) return false;
        return value.equals(((ProductId) o).value);
    }
}

With 1,000 products using this ProductId as a key, every get() traverses a linked list of 1,000 elements. You have just created a data structure worse than a List.

A real production bug: A custom domain object used as a key that does not override hashCode().

// OrderKey overrides equals but forgets hashCode
class OrderKey {
    String orderId;
    String customerId;

    @Override
    public boolean equals(Object o) {
        OrderKey other = (OrderKey) o;
        return orderId.equals(other.orderId) && customerId.equals(other.customerId);
    }
    // hashCode NOT overridden -> uses Object.hashCode() (memory address)
}

Map<OrderKey, OrderDetails> orderIndex = new HashMap<>();
OrderKey k1 = new OrderKey("ORD-001", "CUST-42");
orderIndex.put(k1, details);

// Later in a different service
OrderKey k2 = new OrderKey("ORD-001", "CUST-42"); // same values, different object
orderIndex.get(k2); // null - k1 and k2 have different hashCodes (different memory addresses)

k1.equals(k2) returns true, but k1.hashCode() != k2.hashCode(). HashMap looks for the bucket using k2’s hash code, finds no key there, and returns null. The cache is bypassed entirely.

Trade-off: String.hashCode() iterates over every character on each call, which is O(n) in string length. Java’s String caches the result in a hash field because String is immutable. For mutable objects, you should not cache the hash code if the fields used in the computation can change.

Collision: When Two Keys Land in the Same Bucket

No matter how good the hash code is, two different keys can still land in the same bucket. This is called a collision and is unavoidable in principle (the pigeonhole principle). Java resolves collisions using chaining.

Before Java 8: Each bucket is a linked list. Worst case with n collisions in the same bucket: O(n) per lookup.

Java 8+: When the number of entries in a bucket reaches TREEIFY_THRESHOLD = 8, the linked list converts to a Red-Black tree. Worst case drops from O(n) to O(log n).

// From HashMap source
static final int TREEIFY_THRESHOLD = 8;   // linked list -> tree
static final int UNTREEIFY_THRESHOLD = 6; // tree -> linked list when entry count drops
static final int MIN_TREEIFY_CAPACITY = 64; // only treeify if the table is large enough

When the bucket has not yet reached the threshold, all 8 entries must be traversed sequentially to find the last key:

Drag · Scroll to zoom

After reaching the threshold of 8, that same bucket converts to a Red-Black tree. Looking up prod-411 now requires only 3 steps instead of 8:

Drag · Scroll to zoom

Why treeification exists: Before Java 8, it was possible to DoS a web application by sending a POST request with thousands of parameters crafted to hash into the same bucket. The server would spend O(n²) processing a single request and CPU would hit 100%. Java 8 addressed this with tree conversion plus hash randomization for String keys.

Production implication: If you use a custom class as a key, that class must implement Comparable for treeification to work correctly (a Red-Black tree needs an ordering for comparisons). If Comparable is not implemented, Java falls back to identity hash as a tiebreaker. It still works, but O(log n) is not guaranteed in all cases.

Trade-off: TreeNode consumes more memory than Node (needs parent, left, right, and red pointers). Converting a linked list to a tree has an upfront cost. Treeification only pays off when there are genuinely many collisions. With a good hash code, buckets rarely reach the threshold of 8 and you never pay this cost.

Load Factor and Resize

Load factor = current number of entries / capacity. When it exceeds the threshold (default 0.75), HashMap allocates a new array twice the size and rehashes every entry.

Why 0.75: This is an empirically determined balance between speed and memory. A lower load factor (0.5) means more empty buckets, fewer collisions, and faster lookups, but double the memory. A higher load factor (0.9) saves memory but causes more collisions and slower lookups. At 0.75, collision probability is at an acceptable level with reasonable memory consumption.

What breaks when you ignore resize cost:

// Cache initialized without an initial capacity
Map<String, Product> productCache = new HashMap<>();

// Load 50,000 products
products.forEach(p -> productCache.put(p.getId(), p));
// Resizes occur at: 12, 24, 48, 96, 192, 384, 768, 1536, 3072, 6144, 12288, 24576, 49152
// 13 resizes, each an O(n) rehash - total overhead is significant

Each resize: allocates a new array, recalculates the index for every entry, and copies each one over. With 50k entries, the final resize rehashes 25k entries. Total resize work: roughly 100k unnecessary operations.

Production fix: Pre-size when the entry count is known:

// Formula: initialCapacity = expectedSize / loadFactor + 1
int expectedEntries = 50_000;
Map<String, Product> productCache = new HashMap<>((int) (expectedEntries / 0.75) + 1);
// Or use Guava
Map<String, Product> productCache = Maps.newHashMapWithExpectedSize(expectedEntries);

Trade-off: Pre-sizing allocates memory upfront. If you create many small HashMaps per request (for example, in a service layer to group data), pre-sizing to 10,000 for a map that will hold only 10 entries creates serious GC pressure. Pre-sizing is most valuable for long-lived caches with a predictable size. For short-lived local maps, use the default.

hashCode + equals Contract: The Rule You Cannot Break

The contract: if a.equals(b) is true, then a.hashCode() == b.hashCode() must also be true.

Why this contract exists: HashMap finds a key in two steps. Step 1: use hashCode to find the bucket. Step 2: use equals to find the correct entry within the bucket. If two objects are “equal” according to equals but have different hash codes, step 1 finds the wrong bucket and step 2 is never reached.

The reverse is not required: a.hashCode() == b.hashCode() does not require a.equals(b). That is a normal collision: multiple objects can share the same hash code.

A production bug: mutable object used as a key

class CacheKey {
    private String region;
    private String userId;

    // Correct contract: equals and hashCode are consistent
    @Override
    public boolean equals(Object o) {
        CacheKey other = (CacheKey) o;
        return region.equals(other.region) && userId.equals(other.userId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(region, userId);
    }

    // Setter - this is the problem
    public void setRegion(String region) { this.region = region; }
}

Map<CacheKey, UserSession> sessionCache = new HashMap<>();
CacheKey key = new CacheKey("VN", "user-42");
sessionCache.put(key, session);

// Later, someone modifies the field used in hashCode
key.setRegion("SG");

// Now key.hashCode() has changed - the key lives in the old bucket
// but HashMap searches the new bucket
sessionCache.get(key);    // null - wrong bucket
sessionCache.remove(key); // no-op - entry not found, cannot be removed
// Entry is "orphaned" in the map: cannot be retrieved, cannot be removed
// -> memory leak because the map holds a strong reference to the session object

That entry will live in the map forever: inaccessible and never garbage collected because the map still holds a reference to it. This is a genuine memory leak.

Rule: Use immutable objects as HashMap keys whenever possible. String, Integer, Long, and UUID are all good choices. If a custom class is necessary, ensure the fields that participate in hashCode and equals cannot change after the object is created.

Thread Safety: HashMap Is Not Thread-Safe

HashMap is not designed for concurrent access. When multiple threads read and write simultaneously:

  • Lost update: two threads both call put() into the same bucket; one update is overwritten.
  • Infinite loop (Java 7): during a resize, a linked list can form a cycle due to a race condition. A thread traversing next never terminates and CPU hits 100%.
  • Inconsistent read: a thread reads while another is resizing, and sees a partially initialized array.

Java 8 fixed the infinite loop bug, but HashMap is still not safe for concurrent writes.

A common production bug:

@Service
public class ProductCacheService {

    // WRONG: HashMap shared across multiple threads
    private final Map<String, Product> cache = new HashMap<>();

    public Product getProduct(String id) {
        if (!cache.containsKey(id)) {
            Product product = productRepository.findById(id)
                .orElseThrow(() -> new ProductNotFoundException(id));
            cache.put(id, product); // race condition with other threads
        }
        return cache.get(id);
    }
}

Two threads concurrently call getProduct("prod-42"). Both see containsKey return false. Both query the database. Both call put(). Meanwhile, a third thread is reading a different bucket being affected by the ongoing resize. Result: inconsistent state, potential NullPointerException, and double database calls.

Fix:

// Option 1: ConcurrentHashMap - the right choice in most cases
// Java 8+: uses CAS + per-bin synchronization, does not lock the entire map
private final Map<String, Product> cache = new ConcurrentHashMap<>();

// Important: use computeIfAbsent instead of containsKey + put
// computeIfAbsent is atomic in ConcurrentHashMap
public Product getProduct(String id) {
    return cache.computeIfAbsent(id, key ->
        productRepository.findById(key)
            .orElseThrow(() -> new ProductNotFoundException(key))
    );
}

// Option 2: Collections.synchronizedMap - locks the entire map for each operation
// Simpler but serializes all access; a bottleneck under high concurrency
private final Map<String, Product> cache = Collections.synchronizedMap(new HashMap<>());

// Option 3: HashMap per thread (not shared)
// No contention; suitable for local processing
public Map<String, List<Order>> groupOrdersByCustomer(List<Order> orders) {
    Map<String, List<Order>> grouped = new HashMap<>(); // local, not shared
    orders.forEach(o -> grouped
        .computeIfAbsent(o.getCustomerId(), k -> new ArrayList<>())
        .add(o));
    return grouped;
}

Trade-off: ConcurrentHashMap uses more memory than HashMap due to its internal metadata. For read-heavy workloads (caches), the overhead is negligible. For write-heavy workloads where many threads update the same key, contention still exists, just at the per-bin level rather than the whole-map level. computeIfAbsent on a ConcurrentHashMap is atomic, but the mapping function passed to it should not have heavy side effects (such as a database call while holding a lock), because it can block that bin for an extended period.

Summary

ConceptWhy it existsWhat breaks if violated
hashCode + bucket arrayO(1) lookup instead of O(n)Poor hashCode gives O(n)
Hash scramble ^ h>>>16Even distribution with small capacitiesMany collisions on low bits
Chaining + treeifyHandles collisionsNo tree -> O(n) under attack
Load factor 0.75Balances speed and memoryNo pre-sizing -> unnecessary resizes
hashCode+equals contractTwo-step key lookupViolated -> cache misses, memory leaks
Thread safetyHashMap is not synchronizedShared HashMap -> data corruption

Final mental model: HashMap is an array where the index is computed from the key. The hard parts are: ensuring every key always reaches the correct index (hashCode + equals contract), handling the case where multiple keys reach the same index (collision), keeping the array from getting too full (resize), and ensuring multiple threads do not interfere with each other (thread safety). Understand these four things and you can predict and explain every behavior of HashMap.