In this problem, you’ll design and analyze a cardinality estimator that works on a different principle than the HyperLogLog estimator we saw in lecture. For the purposes of this problem, let’s assume our elements are drawn from the set 𝒰 and that the true cardinality of the set seen is n.
Here’s an initial data structure for cardinality estimation. Choose a hash function h uniformly at random from a family of 2-independent hash functions ℋ, where every function in ℋ maps 𝒰 to the open interval of real numbers (0, 1). (Hashing to a uniformly-random real number poses some theoretical challenges; in practice, you’d use a slightly different strategy. For the purposes of this problem, assume this is possible.)
Our data structure works by hashing the elements it sees using h and doing some internal bookkeeping to keep track of the kth-smallest hash code out of all the hash codes seen so far, ignoring duplicates. The fact that we ignore duplicate hash codes is important; we’d like it to be the case that if we call see(x) multiple times, it has the same effect as just calling see(x) a single time. (The fancy term for this is that the see operation is idempotent.) We’ll implement estimate() by returning the value n̂ = hk/k , where hk denotes the kth smallest hash code seen.
i.Explain, intuitively, why n̂ is a somewhat reasonable guess for the actual number of elements.Let ε ∈ (0, 1) be some accuracy parameter that’s provided to us.
ii.Prove that Pr[^n ≥ (1+ε)n ] ≤ k2ε2. This shows that by tuning k, we can make it unlikely that we overestimate the true value of n.
Use the techniques we covered in class: use indicator variables and some sort of concentration inequality. What has to happen for the estimate n̂ to be too large? As a reminder, your hash function is only assumed to be 2-independent, so you can’t assume it behaves like a truly random function and can only use properties of 2-independent hash functions.
As a hint, n̂ is not an unbiased estimator and computing E[n̂] is extremely challenging – as in,we’re not sure how to do it! See if you can solve this problem without computing E[n̂].
Using a proof analogous to the one you did in part (ii) of this problem, we can also prove that
Pr[^n ≤ (1−ε)n] ≤ k2ε 2.
The proof is very similar to the one you did in part (ii), so we won’t ask you to write this one up. However, these two bounds collectively imply that by tuning k, you can make it fairly likely that you get an estimate within ±εn of the true value! All that’s left to do now is to tune our confidence in our answer.
iii. Using the above data structure as a starting point, design a cardinality estimator with tunable parameters ε ∈ (0, 1) and δ ∈ (0, 1) such that
see(x) takes time O(poly(ε-1, log δ-1)), where poly(x, y) denotes “something bounded from above by a polynomial in x and y,” such as x3y + y 2 log x;
estimate() takes time O(poly(log δ-1)), and if C denotes the estimate returned this way, then
Pr[ |C – n| > εn ] < δ; and
the total space usage is O(poly(ε-1, log δ-1)).
You’ve just built a tunable cardinality estimator that just needs 2-independent hash functions. Nicely done!
EasyDue™ 支持PayPal, AliPay, WechatPay, Taobao等各种付款方式!
E-mail: email@example.com 微信:easydue