Hash (mis)usage: tales from prod
Performance bottlenecks: why it is important to understand how things work under the hood – to avoid expensive re-work when it is time to scale.
There are different hash functions tailored for different use cases – quick pattern matching in the Rabin-Karp algorithm, prefix search with geohash, HMAC in JWT token generation, or CRC checks:
  • sometimes you want similar inputs grouped together by producing similar hash values – Locality-Sensitive Hashing
  • sometimes you worry about collisions or want the original values to be unrecoverable – cryptographic hash functions
  • in other cases your focus is to compute a unique key for input values as fast as possible – simple hashing
  • maybe you need to check the integrity or authenticity of a payload of arbitrary size and type – specialized hash functions

Let's say you have to manage some records in a read-heavy environment with occasional writes, which sometimes arrive in big batches. A few dimensions of that data fall under PII/GDPR – hence they have to be anonymised. At the same time we want to search entries by those anonymised fields. The initial implementation computed an MD5 fingerprint of each such field – simple and fast.
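A minimal sketch of that initial approach, assuming the field is fingerprinted with Java's built-in MessageDigest (the helper name and sample value are illustrative, not the actual production code):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HexFormat;

    final class Fingerprints {
        /** Replaces a PII value with a deterministic MD5 hex digest,
            so equal inputs stay searchable while the raw value is not stored. */
        static String md5Fingerprint(String piiValue) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(piiValue.getBytes(StandardCharsets.UTF_8));
                return HexFormat.of().formatHex(digest);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }

        public static void main(String[] args) {
            System.out.println(md5Fingerprint("jane.doe@example.com"));
        }
    }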

When the number of entries exceeded 7 billion, one of the asserts related to duplicates was triggered. A collision had occurred. 🙀

For a typical hash function the probability of collision depends on the range of possible keys. E.g. if keys are 4-byte unsigned ints, the number of possible values is 2^32. In an ideal world there would be a guaranteed number of unique elements such a function could hash before the first collision; in reality a perfect hash function is easy to build only for a known, fixed set of keys. Gperf and similar minimal perfect hash libraries can help with that – otherwise hash functions merely aim to minimize collisions while focusing on other aspects of operation: performance and handling arbitrary input.
And as a consequence, probability theory strikes back with the birthday paradox – every newly generated key increases the probability that the next key collides with an existing one. For the example above, after only about 2^16 keys – just 65k(!) – the chance that at least one collision has already happened is close to 40%, and it crosses 50% at roughly 77k keys.
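You can sanity-check those numbers with the usual birthday-bound approximation p ≈ 1 - e^(-n(n-1)/2d), where n is the number of keys generated and d the number of possible hash values – a quick sketch:

    /** Birthday-bound approximation: probability of at least one collision
        after n keys drawn uniformly from d possible hash values. */
    final class BirthdayBound {
        static double collisionProbability(double n, double d) {
            return 1.0 - Math.exp(-n * (n - 1) / (2.0 * d));
        }

        public static void main(String[] args) {
            double d = Math.pow(2, 32);                                                // 32-bit hash space
            System.out.printf("n=65k  -> p=%.2f%n", collisionProbability(65_536, d));  // ~0.39
            System.out.printf("n=77k  -> p=%.2f%n", collisionProbability(77_164, d));  // ~0.50
            System.out.printf("n=200k -> p=%.2f%n", collisionProbability(200_000, d)); // ~0.99
        }
    }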

MD5 is a cryptographic hash function with a key range of 2^128. Cryptographic hash functions aim at:
  • making it hard to recover an input that produces a given hash
  • making it even harder to find a different input that resolves to the same hash – i.e. decreasing the probability of collision
  • under the hood the state is mixed round after round – hence a single-letter difference _should_ produce an avalanche effect, where similar inputs yield hashes that have nothing in common (see the sketch below)
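A quick way to observe the avalanche effect – a minimal sketch hashing two inputs that differ in a single character (MD5 chosen only to match the story):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    final class AvalancheDemo {
        public static void main(String[] args) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Two inputs differing in one character...
            String a = "jane.doe@example.com";
            String b = "jane.dpe@example.com";
            // ...produce digests that share no visible structure.
            System.out.println(HexFormat.of().formatHex(md5.digest(a.getBytes(StandardCharsets.UTF_8))));
            System.out.println(HexFormat.of().formatHex(md5.digest(b.getBytes(StandardCharsets.UTF_8))));
        }
    }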
The first MD5 collision was reported almost 20(!) years ago and was considered a vulnerability with respect to its use in cryptographic applications. And it is still widely used! I recently stumbled upon a publication about its use by internet providers as part of the RADIUS protocol.

SHA-1 was reported to have collisions back in 2017.

Currently SHA-2 or SHA-3 are the recommended cryptographic hash functions – picking them from the start avoids the headache of backfilling datasets and updating clients' code. A dead letter queue is a convenient mechanism for handling collisions – it persists messages that can't be processed further by the existing app logic.
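Switching the fingerprint to SHA-256 is essentially a one-line change with MessageDigest – a sketch mirroring the hypothetical helper above:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    final class Sha256Fingerprints {
        /** Same fingerprinting idea, but with a 256-bit digest: the birthday bound
            moves to around 2^128 keys, far beyond any realistic dataset size. */
        static String sha256Fingerprint(String piiValue) throws Exception {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(sha256.digest(piiValue.getBytes(StandardCharsets.UTF_8)));
        }
    }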
Another part of the story is the app cache. When you have a read-only dataset it is straightforward – fill a hash table in your favorite language. What if the data can't fit into memory? The rule of thumb is to cache 5–20% of the data. Depending on access patterns you need to choose when to add new entries and when to get rid of existing ones. There are various cache eviction strategies apart from the classical least recently used (LRU, sketched after the list below) or least frequently used (LFU):
  • random replacement
  • expiry-based eviction – TTL
  • a basic mix of LRU and LFU – segmented LRU or adaptive replacement cache (ARC)
  • taking other factors into account – access frequency within a time window, value sizes, value weights when some entries are more important than others, etc.
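For reference, the classical LRU part is only a few lines in Java – a minimal sketch on top of LinkedHashMap (the capacity bound is up to you):

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Minimal LRU cache: an access-ordered LinkedHashMap that drops
        the least recently used entry once maxEntries is exceeded. */
    final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        LruCache(int maxEntries) {
            super(16, 0.75f, true);     // accessOrder = true: get() moves an entry to the tail
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries; // evict the head (least recently used) on overflow
        }
    }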
And then you start gradually filling the hash table, the load grows, and suddenly you hit noticeable delays when inserting new entries. How come?! The average insert time is O(1), but what about the worst case? And what the heck does amortized cost mean?

Many languages implement hash tables as dynamic arrays – the hash function computes an index, and the key and value are stored in the corresponding cell. When the number of entries exceeds the load factor threshold, a new, larger array is allocated and all existing keys are copied and rehashed into it. So if we already hold a few hundred million entries, there is a lot of work to do. In some languages it also means existing iterators or references become invalid. To be fair, it is worth noting that some implementations spread re-hashing over time to smooth out such drastic performance drops.
To tackle it we tune:
  • capacity – to reserve a big enough, contiguous chunk of memory from the start
  • load factor – to control when (if ever) the table gets reallocated
  • note, however, that in some implementations these values serve only as hints, and the implementation may do what it thinks is right (a sketch follows below)
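In Java, for example, both knobs are constructor arguments of HashMap – a sketch of pre-sizing for a hypothetical hundred-million-entry table (the numbers are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    final class PreSizedTable {
        public static void main(String[] args) {
            int expectedEntries = 100_000_000;
            float loadFactor = 0.75f;
            // Reserve enough buckets up front so the table does not have to
            // resize and rehash while we bulk-load entries. Note that HashMap
            // still rounds the capacity up to the next power of two – the value
            // we pass is effectively a hint.
            int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
            Map<String, String> fingerprints = new HashMap<>(initialCapacity, loadFactor);
            // ... bulk insert without intermediate rehashes ...
        }
    }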

If you know your data, it might be handy to specify your own hash function (see the sketch after this list):
  • murmur hashing – for the sake of speed
  • cuckoo hashing – to keep collisions from degrading lookups
  • let me reiterate – perfect hashing can be a great option if the key set is fixed and known in advance
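In Java the hook for supplying your own hash function is hashCode() on the key type – a sketch for a hypothetical composite key where we know something about the fields' distribution:

    import java.util.Objects;

    /** Hypothetical composite key: countryCode has low cardinality, so we fold it
        into the high bits and keep the mixing cheap on the hot path. */
    final class RecordKey {
        private final String accountId;
        private final short countryCode;

        RecordKey(String accountId, short countryCode) {
            this.accountId = accountId;
            this.countryCode = countryCode;
        }

        @Override
        public int hashCode() {
            // Custom mix instead of Objects.hash(...) to avoid varargs boxing per call.
            return (countryCode << 24) ^ accountId.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof RecordKey other
                    && countryCode == other.countryCode
                    && Objects.equals(accountId, other.accountId);
        }
    }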

There is also the option of using more specialized data structures like a slot map – it is cache-friendly and guarantees worst-case constant time for inserts and lookups.
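For illustration, a minimal slot map sketch – handles carry an index plus a generation, so a lookup is a bounds check and an array read, with no hashing involved (a production slot map would additionally keep values densely packed):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    /** Minimal slot map sketch: O(1) insert, lookup and remove with stable handles. */
    final class SlotMap<T> {
        /** A handle is a slot index plus a generation; stale handles are detected
            by a generation mismatch after the slot has been reused. */
        record Key(int index, int generation) {}

        private final List<T> values = new ArrayList<>();
        private final List<Integer> generations = new ArrayList<>();
        private final Deque<Integer> freeSlots = new ArrayDeque<>();

        Key insert(T value) {
            if (freeSlots.isEmpty()) {                // no free slot: append a new one
                values.add(value);
                generations.add(0);
                return new Key(values.size() - 1, 0);
            }
            int index = freeSlots.pop();              // reuse a previously freed slot
            values.set(index, value);
            return new Key(index, generations.get(index));
        }

        T get(Key key) {
            if (key.index() >= generations.size()
                    || generations.get(key.index()) != key.generation()) {
                return null;                          // stale or unknown handle
            }
            return values.get(key.index());
        }

        void remove(Key key) {
            if (get(key) == null) return;             // ignore stale handles
            values.set(key.index(), null);            // drop the value
            generations.set(key.index(), key.generation() + 1); // invalidate old handles
            freeSlots.push(key.index());
        }
    }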
Need help troubleshooting performance issues?
Technology slowing you down? We’re language-agnostic, full-stack experts who identify and fix the root causes that limit your business growth.