How Bloom Filters Work (and When to Use Them)
A few days ago, a friend of mine — Zangarmash — told me about Bloom filters.
Up until that moment, I had no idea they even existed, but as he explained how they work, I started to grasp their potential. And to me, that could only mean one thing: sooner or later, I’d have to use them.
I’m not usually the type to work on side projects, but lately, I’ve been revisiting my “go-to project” — the one I build every time I study a new programming language and want to go beyond basic tutorials to really challenge myself whit basic stuff.
This time, the language is Go, and during development, Bloom filters came to mind again. I had finally found the perfect use case: using one to check whether a file is possibly present in the file system before attempting access, avoiding unnecessary disk reads.
What Is a Bloom Filter?
A Bloom filter is a data structure that allows you to determine whether an element is definitely not present or possibly present in a set, using a probabilistic lookup mechanism.
This distinction is crucial: a Bloom filter can produce false positives (i.e., it may say an element is present when it’s actually not), but it never returns false negatives.
In other words: if the filter says “not present,” then it’s definitely not there; if it says “maybe present,” it could be wrong.
Understanding this behavior is essential, because there are situations where using a Bloom filter may be pointless — or even counterproductive — compared to other solutions.
Technically, a Bloom filter consists of:
- n, the size of the bit array representing the filter;
- k, the number of distinct hash functions used for each inserted element.
When we add a value to the filter, we apply the k
hash functions, which give us k
positions in the bit array of length n
. At each of these positions, we set the bit to 1.
An Intuitive Example: Mailboxes
Imagine a row of mailboxes numbered from 0 to n
.
Every time you want to “register” a name (say, “Alice”), you use k
different keys (the hash functions) that tell you which mailboxes to open. For each indicated mailbox, you drop in a slip of paper.
When you want to check if “Bob” has already been registered, you repeat the process with the same keys. If all the mailboxes you’re supposed to check already contain slips, then Bob is possibly already registered.
But if even one of those mailboxes is empty, you can say with certainty that Bob has never been registered.
How Does It Work?
For each element we want to add to a Bloom filter, we use k
deterministic hash functions. This is important: given the same input, the hash result must always be the same. That’s essential to determine whether a value is “possibly present” or not.
⚠️ If we used cryptographic hash functions (like SHA-256), we could end up with different results for the same input — due to salting or other non-deterministic mechanisms — which would break the entire logic of the filter.
The Two Core Operations
A Bloom filter supports only two operations:
- Add: adds an element by applying the k hash functions and setting the resulting positions to 1.
- Check: checks if an element is possibly present by reapplying the k hashes and verifying whether all corresponding positions are set to 1.
Removing an element from a Bloom filter is not possible, because we can’t know whether the bit positions we want to clear are also used by other elements.
For this reason, the only safe way to “delete” is to rebuild the filter from scratch — or better yet, avoid deletions altogether.
Adding an Element
Let’s say we want to add the word "Hi" to the filter. We apply the k
hash functions to this string, getting values like 42, 87, 109, and so on.
These values are then normalized to fit within the filter size n
(for example using value % n
), and for each resulting position, we set the corresponding bit to 1.
Checking an element
If even one of the positions is 0, we can be certain that the element is not present in the filter.
If all the bits are 1, then the element might be present — or it might be a false positive caused by other elements previously added.
Why It’s Called “Probabilistic”
This is where the probabilistic nature of a Bloom filter becomes clear. We can’t be 100% sure an element is present, because different inputs might produce the same hash results (collisions).
But we can be completely sure that an element is not present if even one of its corresponding bits is 0.
Optimizing the Number of Hash Functions (k)
An obvious way to reduce false positives is to increase the number of hash functions k
. But be careful: doing this excessively can have the opposite effect, saturating the filter and degrading performance.
There’s actually a sweet spot that balances:
-m
: the number of elements we expect to insert;
-n
: the size of the filter’s bit array;
-k
: the number of hash functions to use.
To find the optimal number of hash functions k, we can use the formula:
k = (n / m) * ln(2)
Suppose we want to create a filter with:
n
= 10,000 total bits,m
= 1,000 elements to insert.
We compute:
k = (10,000 / 1,000) * 0.693 = 10 * 0.693 = 6.93
So the optimal number of hash functions to use is around 7.
Using this value of k helps balance the filter’s accuracy and performance, minimizing the probability of false positives.
How I Implemented a Bloom Filter in My Go Project
In my "go-to project", I decided to use a Bloom filter to optimize checks for the presence of files in the file system.
The logic is simple:
- If the file is not present in the filter, I generate and save it.
- If the filter says the file might be present, I check if the file actually exists. If it does, I do nothing; if it doesn’t, I proceed to generate it.
This use case, though simple, is a perfect fit for a Bloom filter — it helps reduce unnecessary disk accesses and saves precious time.
My Current Implementation
So far, I’ve built a basic version of the filter with fixed values for the size n and the number of hash functions k:
n
= 32 (a very small filter, 32 bits);k
= 3 hash.
To generate the k
hash functions, I used the 64-bit hash from cespare/xxhash, a fast and reliable Go library.
From the 64-bit hash, I derived two values:
hash
: the result ofxxhash64
applied to the input;h1
: the lower 32 bits of hash (hash & 0xFFFFFFFF
);h2
: the upper 32 bits ((hash >> 32) & 0xFFFFFFFF
).
Double Hashing to Get Distinct Hashes
To generate the k
distinct hashes (i.e., filter positions), I use the double hashing technique, which is quite common in Bloom filter implementations:
for i := range b.k {
generated := h1 + h2*i
hashes[i] = uint32(generated % 32)
}
How I Add and Check Elements
In my filter, I use a single uint32 as a bit array (32 bits total).
To add an element:
func (b *bloomFilter) Add(value []byte) {
hashes := b.getHashes(value)
for _, hash := range hashes {
b.Filter |= 1 << hash
}
}
To check if an element is present:
func (b *bloomFilter) Contains(value []byte) bool {
hashes := b.getHashes(value)
for _, hash := range hashes {
if (b.Filter & (1 << hash)) == 0 {
return false
}
}
return true
}
Possible Improvements
This is a simple version that works well for a limited number of elements.
In the future, I’d like to:
- Dynamically calculate the filter size n and the number of hash functions k based on the number of files, to reduce false positives;
- Replace the single uint32 with a larger bit array to handle more data;
- Consider using more advanced or optimized hashing functions.
In conclusion, Bloom filters are powerful and fascinating tools, especially when you want to optimize presence checks without consuming too much memory or computational resources. Of course, they’re not the perfect solution for every use case, but in specific contexts — like my project — they can be incredibly effective.
Had you heard of them before? Have you ever used or implemented one in your own work?
If you have tips, questions, or just want to chat about the topic, feel free to reach out — you can email me or contact me on social media.
I always enjoy exchanging ideas!