In the right use case, Bloom filters seem like magic. That’s a bold statement, but in this tutorial we’ll explore the curious data structure, how best to use it, and a few practical examples using Redis and Node.js.
Bloom filters are a probabilistic, one-way data structure. The word ‘filter’ can be confusing in this context; filter implies that it is an active thing, a verb, but it might be easier to think of it as storage, a noun. With a simple Bloom filter you can do two things:
Add an item.
Check if an item has not been added previously.
These are important limitations to understand—you can’t remove an item nor can you list the items in a Bloom filter. Also, you can’t tell, with certainty, if an item has been added to the filter in the past. This is where the probabilistic nature of a Bloom filter comes in—false positives are possible, but false negatives are not. If the filter is set up correctly, false positives can be extremely rare.
Variants of Bloom filters exist, and they add other abilities, like removal or scaling, but they also add complexity and limitations. It’s important to first understand simple Bloom filters before moving on to the variants. This article will only cover the simple Bloom filters.
With these limitations you have a number of benefits: fixed size, hash-based encryption, and quick lookups.
When you set up a Bloom filter, you give it a size. This size is fixed, so if you have one item or one billion items in the filter, it will never grow beyond the specified size. As you add more items to your filter, the chance of a false positive increases. If you specified a smaller filter, this false positive rate will increase more quickly than if you have a larger size.
Bloom filters are built on the concept of one-way hashing. Much like correctly storing passwords, Bloom filters use a hash algorithm to determine a unique identifier for the items passed into it. Hashes, by nature, cannot be reversed and are represented by a seemingly random string of characters. So, if someone gains access to a Bloom filter, it will not directly reveal any of the contents.
Finally, Bloom filters are quick. The operation involves far fewer comparisons than other methods, and it can easily be stored in-memory, preventing performance-robbing database hits.
Now that you know the limits and advantages of Bloom filters, let’s take a look at some situations where you can use them.
Setup
We’ll be using Redis and Node.js to illustrate Bloom filters. Redis is a storage medium for your Bloom filter; it’s quick, in-memory, and has a few specific commands (GETBIT, SETBIT) that make implementation efficient. I’ll assume that you have Node.js, npm, and Redis installed on your system. Your Redis server should be running on localhost at the default port for our examples to work.
In this tutorial, we won’t be implementing a filter from the ground up; instead, we’ll focus on practical uses with a pre-built module in npm: bloom-redis. bloom-redis has a very concise set of methods: add, contains and clear.
As mentioned earlier, Bloom filters need a hashing algorithm to generated unique identifiers for an item. bloom-redis uses the well-known MD5 algorithm, which, although maybe not the perfect fit for a Bloom filter (a little slow, overkill on bits), will work fine.
Unique Usernames
Usernames, especially those that identify a user in a URL, need to be unique. If you build an application that allows users to change the username, then you’ll likely want a username that has never been used to avoid confusion and sniping of usernames.
Without a Bloom filter, you’d need to reference a table that has every username ever used, and at scale this can be very expensive. Bloom filters allow you to add an item every time a user adopts a new name. When a user checks to see if a username is taken, all you need to do is check the Bloom filter. It will be able to tell you, with absolute certainty, if the requested username has been previously added. It is possible that the filter will falsely return that a username has been used when it hasn’t, but this errs on the side of caution and can cause no real harm (aside from a user maybe not being able to claim ‘k3w1d00d47’).
To illustrate this, let’s build a quick REST server with Express. First, create your package.json file and then run the following terminal commands.
npm install bloom-redis --save
npm install express --save
npm install redis --save
The default options for bloom-redis have the size set at two megabytes. This errs on the side of caution, but it’s quite large. Setting up the size of the Bloom filter is critical: too large and you waste memory, too small and your false positive rate will be too high. The math involved in determining the size is quite involved and beyond the scope of this tutorial, but thankfully there is a Bloom filter size calculator to get the job done without cracking a textbook.
Now, create your app.js as follows:
To run this server: node app.js. Go to your browser and point it to: https://localhost:8010/check?username=kyle. The response should be: {"username":"kyle","status":"free"}.
Now, let’s save that username by pointing your browser at http://localhost:8010/save?username=kyle. The response will be: {"username":"kyle","status":"created"}. If you go back to the address http://localhost:8010/check?username=kyle, the response will be {"username":"kyle","status":"used"}. Similarly, going back to http://localhost:8010/save?username=kyle will result in {"username":"kyle","status":"not-created"}.
From the terminal, you can see the size of the filter:
redis-cli strlen username-bloom-filter.
Right now, with one item, it should show 338622.
Now, go ahead and try adding more usernames with the /save route. Try as many as you’d like.
If you then check the size again, you may notice that your size has gone up slightly, but not for every addition. Curious, right? Internally, a Bloom filter sets individual bits (1’s / 0’s) at different positions in the string saved at username-bloom. However, these are not contiguous, so if you set a bit at index 0 and then one at index 10,000, everything between will be 0. For practical uses, it’s not initially important to understand the precise mechanics of every operation—just know that this is normal and that your storage in Redis will never exceed the value you specified.
Fresh Content
Fresh content on a website keeps a user coming back, so how do you show a user something new every time? Using a traditional database approach, you could add a new row to a table with the user identifier and the identifier of the story, and then you would query that table when deciding to show a piece of content. As you might imagine, your database will grow extremely quickly, especially with growth of both users and content.
In this case, a false negative (e.g. not showing an unseen piece of content) has very little consequence, making Bloom filters a viable option. At first blush, you may think that you’d need a Bloom filter for every user, but we’ll use a simple concatenation of the user identifier and the content identifier, and then insert that string into our filter. This way we can use a single filter for all users.
In this example, let’s build another basic Express server that displays content. Every time you visit the route /show-content/any-username (with any-username being any URL-safe value), a new piece of content will be shown until the site is out of content. In the example, the content is the first line of the top ten books on Project Gutenberg.
We’ll need to install one more npm module. From the terminal, run:
npm install async --save
Your new app.js file:
If you carefully pay attention to round-trip time in Dev Tools, you’ll notice that the more you request a single path with a username, the longer it takes. While checking the filter takes a fixed time, in this example, we’re checking for the presence of more items. Bloom filters are limited in what they can tell you, so you’re testing for the presence of each item. Of course, in our example it is fairly simple, but testing for hundreds of items would be inefficient.
Stale Data
In this example, we’ll build a small Express server that will do two things: accept new data via POST, and display the current data (with a GET request). When the new data is POST’ed to the server, the application will check for its presence in the filter. If it is not present, we’ll add it to a set in Redis, otherwise we’ll return null. The GET request will fetch it from Redis and send it to the client.
This is different than the previous two situations, in that false positives would not be okay. We’ll be using the Bloom filter as a first line of defense. Given the properties of Bloom filters, we’ll only know for certain that something isn’t in the filter, so in this case we can go ahead and let the data in. If the Bloom filter returns that is probably in the filter, we’ll do a check versus the actual data source.
So, what do we gain? We gain the speed of not having to check versus the actual source every time. In situations where the data source is slow (external APIs, pokey databases, the middle of a flat file), the speed increase is really needed. To demonstrate the speed, let’s add in a realistic delay of 150ms in our example. We’ll also use the console.time / console.timeEnd to log the differences between a Bloom filter check and a non-Bloom filter check.
In this example, we’ll also use an extremely constrained number of bits: just 1024. It will fill up quickly. As it fills, it will show more and more false positives—you’ll see the response time increase as the false positive rate fills up.
This server uses the same modules as before, so set the app.js file to:
Since POSTing to a server can be tricky with a browser, let’s use curl to test.
curl --data “your data goes here" --header "Content-Type: text/plain" http://localhost:8012/
A quick bash script can be used to show how filling up the entire filter looks:
Looking at a filling or full filter is interesting. Since this one is small, you can easily view it with redis-cli. By running redis-cli get stale-filter from the terminal between adding items, you’ll see the individual bytes increase. A full filter will be \xff for every byte. At this point, the filter will always return positive.
Conclusion
Bloom filters aren’t a panacea solution, but in the right situation, a Bloom filter can provide a speedy, efficient complement to other data structures.