Probabilistic to-visit Queue

A probabilistic collection leveraging Bloom Filters to store a set of 1 billion visited URLs in 18mb.

  ยท   6 min read

I’ve been in the process of coding a Web Crawler in Rust and faced a problem with keeping an in-memory to-visit queue. The idea is to have a collection of nodes to visit, which discards duplicates, with minimal memory overhead.

The Problem

A problem like this could be solved by using a simple set. Before inserting a new element, we check for membership into the visited set and discard duplicates.

In the following examples, rust is used, and elements are Strings, coming from crawled web pages.

pub struct Sieve {
    filter: HashSet<String>,
    urls: VecDeque<String>,
}

impl Sieve {
    pub fn new() -> Sieve {
        Sieve {
            filter: HashSet::new(),
            urls: VecDeque::new(),
        }
    }

    pub fn push(&mut self, url: String) {
        if self.filter.contains(&url) {
            return;
        }

        self.filter.insert(url.clone());
        self.urls.push_back(url);
    }

    pub fn pop(&mut self) -> Option<String> {
        self.urls.pop_front()
    }
}

This should be fine right? The answer is yes if the elements you add to it are limited, otherwise, the memory consumption of the set grows indefinitely.

We can do better by relying on the same ideas but with a different Set implementation.

Bloom Filters

A Bloom Filters is a space-efficient probabilistic data structure. It can be used to answer membership queries with one caveat, it makes errors with a certain probability.

The Algorithm

An Bloom Filter is composed by two elements:

  • A bit array of $m$ bits, initially set to zero;
  • $d$ hash functions, where $d$ is a small constant.

Adding an element

To add a new element, we feed it to the $d$ hash functions, obtaining $d$ indexes in the range $[0, m-1]$, used to set these positions to one in the bit array.

Checking for membership

As for insertion, we compute $d$ hash values, and then check if all those bits are set to one in the bit array.

Dealing with False Positives

A Filter is constructed with two parameters in mind: the number of expected insertions, and the wanted false-positive rate. Those two quantities determine the required number of bits and hash functions in the structure.

Given $n$ the number of wanted elements, and $\epsilon$ the wanted false positive rate, the following holds: $$ m = -\frac{n * \ln(p)}{\ln(2)^2} $$

The number of hash functions can be computed as: $$ d = -\frac{\ln(p)}{\ln(2)} $$

As those relations hold, we can balance the tradeoffs between speed, as more hash functions mean more bits to set and false-positive rate.

Practical Hash Computation

We do not want to compute an excessive number of hash functions, as they can be quite costly. An easy way to reduce the required number of hash computations is to compute two hash values and combine them like so:

$$ h_i = h_1 + h_2 * i$$

To compute two different hash values we can hash an element once and split the hash bits in two.

Implementation

Here is a practical implementation of the Bloom Filter. To represent a bit array we use a vector of 128 bit numbers where each bit is considered as a different position.

pub struct Filter {
    size: usize,
    d: u32,
    bits: Vec<u128>,
    set_bits: u32,
}

Construction

Given the expected number of insertions and false-positive rate, we compute the optimal number of bits to use.

pub fn new(n: usize, p: f64) -> Filter {
    let log_2 = 2_f64.ln();
    let log_p = p.ln();

    let size = ((-(n as f64) * log_p) / (log_2 * log_2)) as usize;
    let d = (-log_p / log_2).ceil() as u32;
    let bits = vec![0; (size as f64 / 128.0).ceil() as usize];
    let set_bits = 0;

    Filter {
        size,
        d,
        bits,
        set_bits,
    }
}

Insertion

We compute $d$ hash values and use them to set the corresponding bits in the bit array.

pub fn add(&mut self, data: &[u8]) {
    let (h1, h2) = Self::hash(data);

    for i in 0..self.d {
        let bit = (h1 as u128 + h2 as u128 * i as u128) as usize % self.size;
        self.bits[bit / 128] |= 1 << (bit % 128);
    }
}

Membership Queries

We compute $d$ hash values and use them to check that the bits are all set.

pub fn contains(&mut self, data: &[u8]) -> bool {
    let (h1, h2) = Self::hash(data);

    for i in 0..self.d {
        let bit = (h1 as u128 + h2 as u128 * i as u128) as usize % self.size;
        if self.bits[bit / 128] & (1 << (bit % 128)) == 0 {
            return false;
        }
    }
    true
}

Estimating size

Size can be estimated with a simple formula:

$$s = - \frac{m}{d} \ln \Bigg [ 1 - \frac{X}{m} \Bigg ] $$

Where $X$ is the number of bits set to one.

pub fn estimated_size(&self) -> usize {
    (-(self.size as f64 / self.d as f64)
        * (1f64 - self.set_bits as f64 / self.size as f64).ln()) as usize
}

We hence need a new value, set_bits, which can be updated on each insertion. Here is the tweaked add function.

pub fn add(&mut self, data: &[u8]) {
    ...
    // subtract old number of set bits in this block
    self.set_bits -= self.bits[bit / 128].count_ones();
    // set bit to one
    self.bits[bit / 128] |= 1 << (bit % 128);
    // add new number of set bits in this block
    self.set_bits += self.bits[bit / 128].count_ones();    
    ...
}

Memory Footprint

A great property of those Filters is that their memory footprint is fixed, independent of how many elements we store. What degrades is the false positive rate, until it reaches one. What degrades is the false positive rate, until it reaches one.

The space efficiency is amazing, for instance, 1 Billion elements, with a false positive rate of 0.01% can be stored in just over 18mb.

The Solution

Here is the final solution to this problem. The Sieve struct now accepts a parameter indicating the expected number of elements that are enqueued. Once we surpass this threshold, the filter performance starts to deteriorate.

Here you can find the full working code.

use log::warn;

pub struct Sieve {
    filter: Filter,
    urls: VecDeque<String>,
    expected_size: usize,
}

impl Sieve {
    pub fn new(expected_urls_num: usize) -> Sieve {
        Sieve {
            filter: Filter::new(expected_urls_num, 0.01),
            urls: VecDeque::new(),
            expected_size: expected_urls_num,
        }
    }

    pub fn push(&mut self, url: String) {
        let url_bytes = url.as_bytes();

        if self.filter.contains(url_bytes) {
            return;
        }

        self.filter.add(url_bytes);
        self.urls.push_back(url);

        let filter_size = self.filter.estimated_size();
        if filter_size >= self.expected_size {
            warn!(
                "Filter size ({}) exceeds Sieve expected size ({})",
                filter_size, self.expected_size
            );
        }
    }

    pub fn pop(&mut self) -> Option<String> {
        self.urls.pop_front()
    }
}

Thank you for reading this far, feel free to get in touch for suggestions or clarifications! Have a nice day ๐Ÿ˜ƒ