A Bloom filter is. probabilistic data-structure used to test whether an element is present in a set. Blooms are affected by false positives, meaning that when a bloom returns true it does not mean that the searched element is part of the set but that it is “likely” to be part of the set. nDPI (and most tools ntop develops) uses Bloom filters in order to speed-up search operations by using a quick membership check that avoids slower checks. For instance if ntopng needs to know whether host A has talked with host B, instead of doing slow database queries or keeping memory hungry data-structures it uses blooms.
Bloom filters are implements as a fixed length bit array, where individual bits are set using a few hash functions. The false positive rate (p) is defined as (1 – e(-((k * n)/m)))^k where (k) = number of hash functions ,(m) = number of bloom filter bits, (n) = number of filter items. A Bloom filter is usually built with a given capacity in order to match a given false-positive probability objective.
On one hand Bloom are efficient are they are both fast and compact in space. On the other hand, minimising the false positive rate requires to know in advance the number of filter items, that is usually unknown when the filter is created, as we do not know in advance with how many different hosts a given host will talk to. The result is that either the filter is too short (in terms of number of bits) and this the false positive rate is much higher than expected, or too big and thus a lot of memory is wasted (as well performance due to memory faults).
In order to address this problem, nDPI implements two different bitmaps (full API details in ndpi_api.h):
- ndpi_bitmap that can store 32 bit hash values (e.g. you can use an IPv4 address as hash value). This is based on roaring bitmaps, a compressed bitmap.
- ndpi_bitmap64 that instead can store 64-bit hash values. This is based on fuse filters.
The advantage of these data-structures is that:
- Contrary to Blooms where you need to know in advance the value of (m) here we support the whole 32 or 64 bit range. So with the above bitmaps the false positive rate is constant according to (p).
- The space used is limited as these are compressed data-structures that can be manipulated (e.g. for membership checks) without decompressing them.
- Being them compact in memory and efficient during operations, they do offer the above advantages without a speed penalty with respect to Blooms.
Just to give you an idea, with ndpi_bitmap64 you can store 1 million 64 bit hash values using 11% of the space used to store such values as 64 bit integers. This means that not only using it you need ~1/10th of the memory used to keep the raw integers in memory, while being able to do membership queries such as you would do with hashes.
On a nutshell, the above nDPI bitmaps implement Bloom filters with a fixed/lowest false probability rate, limited memory usage and fast (~45 nsec on a Mac M1) membership query.
In a future blog post, we’ll explain how using them enabled nDPI to use < 200 kB for indexing in memory 30’000 Internet domain names, with respect to 24 MB previously used, this while keeping a fast(er) lookup time.