Sep 12, 2021

Bloom Filter

简介

Bloom Filter 由 Burton Howard Bloom 在 1970 年提出，用少量空间换取判断元素是否在集合内的时间开销。布隆过滤器在时间空间、保密、可伸缩性上都有很好的表现。

当一个元素被加入集合时，通过K个散列函数将这个元素映射成一个位数组中的K个点，把它们置为1。检索时，我们只要看看这些点是不是都是1就（大约）知道集合中有没有它了：如果这些点有任何一个0，则被检元素一定不在；如果都是1，则被检元素很可能在。这就是布隆过滤器的基本思想。

分析

结论

为了获得最优的准确率，当 \(k = ln2 \cdot (m/n)\) 时，布隆过滤器获得最优的准确性。

\(n\) 一般不变，而由分析如果给定 \(k\) ，要求此时错误率最小的话 \(m\) 和 \(k\) 是一一对应，也就是说，可以认为需要权衡如何用最少的 \(m\) 或 \(k\) ，满足 \(n\) 下的错误率上界 \(\epsilon\) 。

参数

哈希函数的个数 \(k\)
布隆过滤器位数组的容量 \(m\)
布隆过滤器插入的数据数量 \(n\)

错误率

对于一次 hash ，一个 bit 被设为 \(0\) 的概率为：

\begin{equation} 1 - \frac 1 m \end{equation}

当我们讨论错误率时，我们分析最大错误率的情况，即是已经写入 \(n-1\) 个独立的值的时候，计算时以写入 \(n\) 个值做近似，则对于一个 bit ，如果我们新写入一个值前，它为 \(1\) 的概率为：

\begin{equation} 1-(1-\frac 1 m)^{kn} \end{equation}

那么，如果我们新写入一个值之前，其对应的 \(k\) 个 bit 都为 \(1\) ，那么后续判断这个值就会误判，那么这个错误率的上界为：

\begin{equation} (1-[1-\frac 1 m]^{kn})^k \approx (1-e^{- \frac {kn} m})^k \end{equation}

改进

Counting BloomFilter: 支持删除，不存 bit 而是一个 int。

实现

LevelDB Bloom Filter

特别之处是使用了 double-hashing ，添加一个元素只使用一次哈希函数。

  class BloomFilterPolicy : public FilterPolicy {
   public:
    explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
      // We intentionally round down to reduce probing cost a little bit
      // k = ln(2) * m / n
      k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
      if (k_ < 1) k_ = 1;
      if (k_ > 30) k_ = 30;
    }

    const char* Name() const override { return "leveldb.BuiltinBloomFilter2"; }

    // 用一个 Key Slices 一次创建出一个 Bloomfilter
    void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
      // Compute bloom filter size (in both bits and bytes)
      // m = n * m / n
      size_t bits = n * bits_per_key_;

      // For small n, we can see a very high false positive rate.  Fix it
      // by enforcing a minimum bloom filter length.
      if (bits < 64) bits = 64;

      size_t bytes = (bits + 7) / 8;
      bits = bytes * 8;  // 补足

      const size_t init_size = dst->size();
      dst->resize(init_size + bytes, 0);
      dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
      char* array = &(*dst)[init_size];
      for (int i = 0; i < n; i++) {
        // Use double-hashing to generate a sequence of hash values.
        // See analysis in [Kirsch,Mitzenmacher 2006].
        uint32_t h = BloomHash(keys[i]);
        const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
        for (size_t j = 0; j < k_; j++) {
          const uint32_t bitpos = h % bits;
          array[bitpos / 8] |= (1 << (bitpos % 8));
          h += delta;
        }
      }
    }

    // 匹配
    bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override {
      const size_t len = bloom_filter.size();
      if (len < 2) return false;

      const char* array = bloom_filter.data();
      const size_t bits = (len - 1) * 8;

      // Use the encoded k so that we can read filters generated by
      // bloom filters created using different parameters.
      const size_t k = array[len - 1];
      if (k > 30) {
        // Reserved for potentially new encodings for short bloom filters.
        // Consider it a match.
        return true;
      }

      uint32_t h = BloomHash(key);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k; j++) {
        const uint32_t bitpos = h % bits;
        if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
        h += delta;
      }
      return true;
    }

   private:
    size_t bits_per_key_;
    size_t k_;
  };

参考资料

《Space/Time Trade-offs in Hash Coding with Allowable Errors》by Burton Howard Bloom