Let's say you have to manage some records in a read-heavy environment with occasional writes, which sometimes arrive in big batches. A few fields of this data fall under PII/GDPR and therefore have to be anonymised, yet we still want to search entries by those anonymised fields. The initial implementation simply computed an MD5 fingerprint of each sensitive value – simple and fast.
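Roughly, the idea looks like the sketch below (the field name and the normalisation step are illustrative assumptions, not the actual implementation): compute a deterministic MD5 digest of the sensitive value and store/index the digest instead of the raw value, so exact-match lookups still work.

```python
import hashlib

def fingerprint(value: str) -> str:
    # Equal inputs produce equal digests, so the anonymised column
    # can still be used for exact-match searches.
    return hashlib.md5(value.strip().lower().encode("utf-8")).hexdigest()

# The stored record keeps the digest, not the raw PII value.
record = {"email_md5": fingerprint("Jane.Doe@example.com")}
```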
When the number of entries exceeded 7 billion, one of the asserts related to duplicate detection was triggered. A collision had occurred. 🙀
For an ordinary hash function, the probability of a collision is determined by the size of the value space it maps into: if the hash values are 4-byte unsigned ints, there are only 2^32 of them. In an ideal world a function could hash some guaranteed number of unique elements before the first collision; in reality a perfect hash function is easy to build only for a known set of keys.
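To put numbers on that intuition, here is a back-of-the-envelope birthday-bound estimate (an approximation for uniformly random inputs, not a property of any particular hash function):

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: P(>=1 collision) ≈ 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2.0 ** hash_bits
    return -math.expm1(-(n_items ** 2) / (2 * space))

# 4-byte hash values: a collision is already more likely than not
# after roughly 77k random keys.
print(collision_probability(77_163, 32))           # ≈ 0.5

# MD5's full 128-bit space: at 7 billion entries the estimate is on the
# order of 1e-20, so a purely random collision should essentially never
# happen at this scale.
print(collision_probability(7_000_000_000, 128))   # ≈ 7.2e-20
```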
Gperf and similar minimal perfect hash generators can help with building such a function for a fixed key set; otherwise, hash functions only aim to minimise collisions while optimising for other aspects – performance and handling of arbitrary input.
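As a toy illustration of why a fixed key set makes this tractable (this is my own sketch, not gperf's algorithm), you can simply brute-force a seed until a keyed hash happens to map every known key to a distinct slot:

```python
import hashlib

def find_perfect_seed(keys: list[str], table_size: int, max_tries: int = 100_000) -> int:
    """Try seeds until the seeded hash maps every known key to a distinct slot,
    which is only feasible because the key set is fixed up front."""
    for seed in range(max_tries):
        salt = seed.to_bytes(16, "little")
        slots = {hashlib.blake2b(k.encode(), salt=salt).digest()[0] % table_size
                 for k in keys}
        if len(slots) == len(keys):
            return seed
    raise RuntimeError("no collision-free seed found; grow the table or try more seeds")

keys = ["alice@example.com", "bob@example.com", "carol@example.com"]
seed = find_perfect_seed(keys, table_size=4)
```

Gperf is smarter about it (it picks key character positions and per-character values instead of blindly trying seeds), but the core idea is the same: because the keys are known up front, you can search for a function that is collision-free on exactly that set.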