mirror of
https://github.com/balkian/balkian.github.com.git
synced 2025-09-13 13:22:21 +00:00
85 lines
33 KiB
HTML
85 lines
33 KiB
HTML
<!doctype html><html lang=en-us dir=ltr><head><meta charset=utf-8><meta name=viewport content='width=device-width,initial-scale=1'><meta name=description content=" TL;DR HyperLogLogs (hll) can be a replacement for a set when only the number of unique elements (cardinality) is needed, and an approximation is enough.\nIt is a very efficient structure both in time (add is O(1)) and space (fixed size typically ~12KB, depending on the desired error rate).\nMoreover, two hll can be merged very efficiently ($O(1)$), with no loss of precision.\nThe problem A simplification Imagine you are a store owner, and you want to count how many unique customers you get each day. Simple enough, you only need some pen and a piece of paper. Every time a customer walks in, you check whether their name is already on your paper[^1]. If it is, you don’t need to do anything. If it isn’t, you add it.\n"><title>HyperLogLog: magical (kind of) unique counters</title><link rel=canonical href=https://balkian.com/p/hyperloglog/><link rel=stylesheet href=/scss/style.min.314a81d1fef50606da5df138ce819c12f9ed0c4f2487c1964bccb4c3cc737879.css><meta property='og:title' content="HyperLogLog: magical (kind of) unique counters"><meta property='og:description' content=" TL;DR HyperLogLogs (hll) can be a replacement for a set when only the number of unique elements (cardinality) is needed, and an approximation is enough.\nIt is a very efficient structure both in time (add is O(1)) and space (fixed size typically ~12KB, depending on the desired error rate).\nMoreover, two hll can be merged very efficiently ($O(1)$), with no loss of precision.\nThe problem A simplification Imagine you are a store owner, and you want to count how many unique customers you get each day. Simple enough, you only need some pen and a piece of paper. Every time a customer walks in, you check whether their name is already on your paper[^1]. If it is, you don’t need to do anything. If it isn’t, you add it.\n"><meta property='og:url' content='https://balkian.com/p/hyperloglog/'><meta property='og:site_name' content='J. Fernando Sánchez'><meta property='og:type' content='article'><meta property='article:section' content='Post'><meta property='article:published_time' content='2025-09-08T20:08:22+02:00'><meta property='article:modified_time' content='2025-09-08T20:08:22+02:00'><meta name=twitter:title content="HyperLogLog: magical (kind of) unique counters"><meta name=twitter:description content=" TL;DR HyperLogLogs (hll) can be a replacement for a set when only the number of unique elements (cardinality) is needed, and an approximation is enough.\nIt is a very efficient structure both in time (add is O(1)) and space (fixed size typically ~12KB, depending on the desired error rate).\nMoreover, two hll can be merged very efficiently ($O(1)$), with no loss of precision.\nThe problem A simplification Imagine you are a store owner, and you want to count how many unique customers you get each day. Simple enough, you only need some pen and a piece of paper. Every time a customer walks in, you check whether their name is already on your paper[^1]. If it is, you don’t need to do anything. If it isn’t, you add it.\n"><link rel="shortcut icon" href=/img/favicon.ico></head><body class=article-page><script>(function(){const e="StackColorScheme";localStorage.getItem(e)||localStorage.setItem(e,"auto")})()</script><script>(function(){const t="StackColorScheme",e=localStorage.getItem(t),n=window.matchMedia("(prefers-color-scheme: dark)").matches===!0;e=="dark"||e==="auto"&&n?document.documentElement.dataset.scheme="dark":document.documentElement.dataset.scheme="light"})()</script><div class="container main-container flex on-phone--column extended"><aside class="sidebar left-sidebar sticky"><button class="hamburger hamburger--spin" type=button id=toggle-menu aria-label="Toggle Menu">
|
||
<span class=hamburger-box><span class=hamburger-inner></span></span></button><header><figure class=site-avatar><a href=/><img src=/img/me_hu_57f477f2a0e68f7e.png width=300 height=300 class=site-logo loading=lazy alt=Avatar>
|
||
</a><span class=emoji>💭</span></figure><div class=site-meta><h1 class=site-name><a href=/>J. Fernando Sánchez</a></h1><h2 class=site-description>My ramblings and reflections</h2></div></header><ol class=menu-social><li><a href=https://github.com/CaiJimmy/hugo-theme-stack target=_blank title=GitHub rel=me><svg class="icon icon-tabler icon-tabler-brand-github" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z" fill="none"/><path d="M9 19c-4.3 1.4-4.3-2.5-6-3m12 5v-3.5c0-1 .1-1.4-.5-2 2.8-.3 5.5-1.4 5.5-6a4.6 4.6.0 00-1.3-3.2 4.2 4.2.0 00-.1-3.2s-1.1-.3-3.5 1.3a12.3 12.3.0 00-6.2.0C6.5 2.8 5.4 3.1 5.4 3.1a4.2 4.2.0 00-.1 3.2A4.6 4.6.0 004 9.5c0 4.6 2.7 5.7 5.5 6-.6.6-.6 1.2-.5 2V21"/></svg></a></li><li><a href=https://git.sinpapel.es/balkian target=_blank title=gitea rel=me><svg viewBox="0 0 640 640" width="32" height="32"><path d="m395.9 484.2-126.9-61c-12.5-6-17.9-21.2-11.8-33.8l61-126.9c6-12.5 21.2-17.9 33.8-11.8 17.2 8.3 27.1 13 27.1 13l-.1-109.2 16.7-.1.1 117.1s57.4 24.2 83.1 40.1c3.7 2.3 10.2 6.8 12.9 14.4 2.1 6.1 2 13.1-1 19.3l-61 126.9c-6.2 12.7-21.4 18.1-33.9 12" style="fill:#fff"/><path d="M622.7 149.8c-4.1-4.1-9.6-4-9.6-4s-117.2 6.6-177.9 8c-13.3.3-26.5.6-39.6.7v117.2c-5.5-2.6-11.1-5.3-16.6-7.9.0-36.4-.1-109.2-.1-109.2-29 .4-89.2-2.2-89.2-2.2s-141.4-7.1-156.8-8.5c-9.8-.6-22.5-2.1-39 1.5-8.7 1.8-33.5 7.4-53.8 26.9C-4.9 212.4 6.6 276.2 8 285.8c1.7 11.7 6.9 44.2 31.7 72.5 45.8 56.1 144.4 54.8 144.4 54.8s12.1 28.9 30.6 55.5c25 33.1 50.7 58.9 75.7 62 63 0 188.9-.1 188.9-.1s12 .1 28.3-10.3c14-8.5 26.5-23.4 26.5-23.4S547 483 565 451.5c5.5-9.7 10.1-19.1 14.1-28 0 0 55.2-117.1 55.2-231.1-1.1-34.5-9.6-40.6-11.6-42.6M125.6 353.9c-25.9-8.5-36.9-18.7-36.9-18.7S69.6 321.8 60 295.4c-16.5-44.2-1.4-71.2-1.4-71.2s8.4-22.5 38.5-30c13.8-3.7 31-3.1 31-3.1s7.1 59.4 15.7 94.2c7.2 29.2 24.8 77.7 24.8 77.7s-26.1-3.1-43-9.1m300.3 107.6s-6.1 14.5-19.6 15.4c-5.8.4-10.3-1.2-10.3-1.2s-.3-.1-5.3-2.1l-112.9-55s-10.9-5.7-12.8-15.6c-2.2-8.1 2.7-18.1 2.7-18.1L322 273s4.8-9.7 12.2-13c.6-.3 2.3-1 4.5-1.5 8.1-2.1 18 2.8 18 2.8L467.4 315s12.6 5.7 15.3 16.2c1.9 7.4-.5 14-1.8 17.2-6.3 15.4-55 113.1-55 113.1" style="fill:#609926"/><path d="M326.8 380.1c-8.2.1-15.4 5.8-17.3 13.8s2 16.3 9.1 20c7.7 4 17.5 1.8 22.7-5.4 5.1-7.1 4.3-16.9-1.8-23.1l24-49.1c1.5.1 3.7.2 6.2-.5 4.1-.9 7.1-3.6 7.1-3.6 4.2 1.8 8.6 3.8 13.2 6.1 4.8 2.4 9.3 4.9 13.4 7.3.9.5 1.8 1.1 2.8 1.9 1.6 1.3 3.4 3.1 4.7 5.5 1.9 5.5-1.9 14.9-1.9 14.9-2.3 7.6-18.4 40.6-18.4 40.6-8.1-.2-15.3 5-17.7 12.5-2.6 8.1 1.1 17.3 8.9 21.3s17.4 1.7 22.5-5.3c5-6.8 4.6-16.3-1.1-22.6 1.9-3.7 3.7-7.4 5.6-11.3 5-10.4 13.5-30.4 13.5-30.4.9-1.7 5.7-10.3 2.7-21.3-2.5-11.4-12.6-16.7-12.6-16.7-12.2-7.9-29.2-15.2-29.2-15.2s0-4.1-1.1-7.1c-1.1-3.1-2.8-5.1-3.9-6.3 4.7-9.7 9.4-19.3 14.1-29-4.1-2-8.1-4-12.2-6.1-4.8 9.8-9.7 19.7-14.5 29.5-6.7-.1-12.9 3.5-16.1 9.4-3.4 6.3-2.7 14.1 1.9 19.8z" style="fill:#609926"/></svg></a></li><li><a href='https://scholar.google.com/citations?user=JLNusZ8AAAAJ&hl=en' target=_blank title="Google scholar" rel=me><svg aria-label="Google Scholar" role="img" viewBox="0 0 512 512"><rect width="512" height="512" rx="15%" fill="#4285f4"/><path fill="#fff" d="M213 111l-107 94h69c5 45 41 64 78 67-7 18-4 27 7 39-43 1-103 26-103 67 4 45 63 54 92 54 38 1 81-19 90-54 4-35-10-54-31-71-23-18-28-28-21-40 15-17 35-27 39-51 2-17-2-28-6-43l45-38-1 16c-3 2-5 6-5 9v103c2 13 22 11 23 0V160c0-3-2-7-5-8v-25l16-16zm58 141c-61 10-87-87-38-99 56-11 83 86 38 99zm-5 73c60 13 61 63 10 78-44 9-82-4-81-30 0-25 35-48 71-48z"/></svg></a></li></ol><ol class=menu id=main-menu><li><a href=/><svg class="icon icon-tabler icon-tabler-home" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><polyline points="5 12 3 12 12 3 21 12 19 12"/><path d="M5 12v7a2 2 0 002 2h10a2 2 0 002-2v-7"/><path d="M9 21v-6a2 2 0 012-2h2a2 2 0 012 2v6"/></svg>
|
||
<span>Home</span></a></li><li><a href=/search/><svg class="icon icon-tabler icon-tabler-search" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><circle cx="10" cy="10" r="7"/><line x1="21" y1="21" x2="15" y2="15"/></svg>
|
||
<span>Search</span></a></li><li><a href=/links/><svg class="icon icon-tabler icon-tabler-link" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><path d="M10 14a3.5 3.5.0 005 0l4-4a3.5 3.5.0 00-5-5l-.5.5"/><path d="M14 10a3.5 3.5.0 00-5 0l-4 4a3.5 3.5.0 005 5l.5-.5"/></svg>
|
||
<span>Links</span></a></li><li><a href=/cheatsheet/><svg class="icon icon-tabler icon-tabler-infinity" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><path d="M9.828 9.172a4 4 0 100 5.656A10 10 0 0012 12a10 10 0 012.172-2.828 4 4 0 110 5.656A10 10 0 0112 12 10 10 0 009.828 9.172"/></svg>
|
||
<span>Cheatsheets</span></a></li><li><a href=/projects/><svg class="icon icon-tabler icon-tabler-clock" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><circle cx="12" cy="12" r="9"/><polyline points="12 7 12 12 15 15"/></svg>
|
||
<span>Projects</span></a></li><li><a href=/archives/><svg class="icon icon-tabler icon-tabler-archive" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><rect x="3" y="4" width="18" height="4" rx="2"/><path d="M5 8v10a2 2 0 002 2h10a2 2 0 002-2V8"/><line x1="10" y1="12" x2="14" y2="12"/></svg>
|
||
<span>Archives</span></a></li><li><a href=/about/><svg class="icon icon-tabler icon-tabler-user" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><circle cx="12" cy="7" r="4"/><path d="M6 21v-2a4 4 0 014-4h4a4 4 0 014 4v2"/></svg>
|
||
<span>About</span></a></li><li class=menu-bottom-section><ol class=menu><li id=dark-mode-toggle><svg class="icon icon-tabler icon-tabler-toggle-left" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><circle cx="8" cy="12" r="2"/><rect x="2" y="6" width="20" height="12" rx="6"/></svg>
|
||
<svg class="icon icon-tabler icon-tabler-toggle-right" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><circle cx="16" cy="12" r="2"/><rect x="2" y="6" width="20" height="12" rx="6"/></svg>
|
||
<span>Dark Mode</span></li></ol></li></ol></aside><aside class="sidebar right-sidebar sticky"><section class="widget archives"><div class=widget-icon><svg class="icon icon-tabler icon-tabler-hash" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><line x1="5" y1="9" x2="19" y2="9"/><line x1="5" y1="15" x2="19" y2="15"/><line x1="11" y1="4" x2="7" y2="20"/><line x1="17" y1="4" x2="13" y2="20"/></svg></div><h2 class="widget-title section-title">Table of contents</h2><div class=widget--toc><nav id=TableOfContents><ol><li><a href=#the-problem>The problem</a><ol><li><a href=#a-simplification>A simplification</a></li><li><a href=#the-real-problem>The real problem</a></li></ol></li><li><a href=#enter-hyperloglog>Enter HyperLogLog</a></li><li><a href=#drawbacks>Drawbacks</a></li><li><a href=#additional-properties>Additional properties</a></li><li><a href=#slidinghyperloglog>SlidingHyperLogLog</a></li><li><a href=#python>Python</a></li></ol></nav></div></section></aside><main class="main full-width"><article class=main-article><header class=article-header><div class=article-details><div class=article-title-wrapper><h2 class=article-title><a href=/p/hyperloglog/>HyperLogLog: magical (kind of) unique counters</a></h2></div><footer class=article-time><div><svg class="icon icon-tabler icon-tabler-calendar-time" width="56" height="56" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><path d="M11.795 21H5a2 2 0 01-2-2V7a2 2 0 012-2h12a2 2 0 012 2v4"/><circle cx="18" cy="18" r="4"/><path d="M15 3v4"/><path d="M7 3v4"/><path d="M3 11h16"/><path d="M18 16.496V18l1 1"/></svg>
|
||
<time class=article-time--published>08 Sep 2025</time></div><div><svg class="icon icon-tabler icon-tabler-clock" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><circle cx="12" cy="12" r="9"/><polyline points="12 7 12 12 15 15"/></svg>
|
||
<time class=article-time--reading>10 minute read</time></div></footer></div></header><section class=article-content><blockquote><p><strong>TL;DR</strong>
|
||
<a class=link href=https://en.wikipedia.org/wiki/HyperLogLog target=_blank rel=noopener>HyperLogLogs</a> (hll) can be a replacement for a set when only the number of unique elements (cardinality) is needed, <strong>and an approximation is enough</strong>.</p><p>It is a very efficient structure both in time (<code>add</code> is <code>O(1)</code>) and space (fixed size typically ~12KB, depending on the desired error rate).</p><p>Moreover, two hll can be <strong>merged very efficiently ($O(1)$)</strong>, with no loss of precision.</p></blockquote><h2 id=the-problem>The problem</h2><h3 id=a-simplification>A simplification</h3><p>Imagine you are a store owner, and you want to count how many unique customers you get each day.
|
||
Simple enough, you only need some pen and a piece of paper.
|
||
Every time a customer walks in, you check whether their name is already on your paper[^1].
|
||
If it is, you don’t need to do anything.
|
||
If it isn’t, you add it.</p><p>[^1] You’re a great store keeper, so you know every single customer by name!</p><p>At the end of the day, you simply count the number of names in your paper to get your answer.</p><p>One problem you may find is that checking whether a customer is already in the list can be slow, and you may not be able to keep up with the ludicrous thoroughput of customers to your successful store.
|
||
There are some tricks to alleviate this:</p><ul><li>Sorting. Keeping your list of names always sorted. This can be very difficult with pen and paper :)</li><li>Partitioning. Separating your clients by some criteria (e.g., first letter in their surname) and assigning a portion of your paper for each possible value. You may repeat this process for each portion.</li><li>Deferring deduplication. Write their names every time, and wait until a later time to check if you have already seen them. That later time can be a less busy moment of the day, or some time after the day is over.</li></ul><p>The second problem is that the list of users</p><p>In a programming context, the first problem translate to time complexity.
|
||
The typical approach is to have something like a a hash set.
|
||
That makes checking very fast ($O(1)$).
|
||
Adding a new name is also fast, unless the set needs to grow (amortized $O(1)$).
|
||
A drawback is that hash sets need to be reasonably empty to work as intended, so we are trading speed for space.</p><p>The second problem translates to memory utilization.
|
||
If each name has a length of $l$ bytes, and we have a total number of $u$ unique users, we would need <strong>at least</strong> $l \times u$ bytes, so $O(l \times u)$.
|
||
For simplification, we can express $u$ as a product of how many times customers enter the store ($n$) and the ratio ($r$) of those visits that are from a unique customer: $u = n * r$.</p><h3 id=the-real-problem>The real problem</h3><p>You are such a great owner that your business has grown and you now have ten ($10$) more stores in town.
|
||
Naturally, you want to see how each of the ten locations is doing, and you want to keep track of how unique customers your business has as a whole.
|
||
So you need to:</p><ul><li>Keep a record of each individual store. In our case, that would mean $10$ times more total records.</li><li>Somehow merge the records from each location into one. If we have two counters, we need to go through all the elements of the smallest one, and add them to the bigger one. (plus optionally copying the biggest counter or starting anew and adding both to avoid modifying it)</li></ul><p>This compounds the problems from the previous section:</p><ul><li>Each store does roughly the same, so the time and space needed grow $s$ times (for $s$ stores).</li><li>To merge the counters, we roughly need to go through each unique customer of each store.</li><li>We need to store the result. Depending on how many repetitions from store to store you get, this can go from as big as your biggest counter (all others are contained), up to the sum of all the counters (no repetition).</li></ul><p>So, in summary, assuming we can insert with $O(1)$ and we get on average $n$ customers per store, we need roughly:</p><ul><li>$O(s \times n)$ operations to get unique customers for every store.</li><li>$O(s \times n \times r)$ operations to join them.</li><li>$O((s+1) \times n \times r \times l)$ total space to save all of this information.</li></ul><p>The two key factores here are that:</p><ul><li>Space grows linearly with the number of unique customers $n \times r$</li><li>Space grows lineraly with the number of stores ($s$)</li><li>Space grows linearly with the size of the key we are storing ($l$, length of names)</li><li>Merging time grows linearly with the number of stores ($s$)</li><li>Merging time grows linearly with the number of members in each counter ($n \times r$)</li></ul><h2 id=enter-hyperloglog>Enter HyperLogLog</h2><p>Let’s look at a simpler problem before we explain the hyperlog.</p><p>Imagine you toss a coin until you get tails (or up to 100 throws).
|
||
How likely are you to draw a sequence of at least 10 heads?
|
||
For a fair coin, the probability of that event would be $1/2^{10}$
|
||
And 5 heads?
|
||
$1/2^5$.
|
||
And, in general, $1/2^h$, for $h$ heads.</p><p>We can flip the problem on its head (pun intended).
|
||
Let’s say I repeat the same coin tossing multiple times, each time stopping when I get tails.
|
||
If I tell you I got <strong>at most</strong> 10 heads and tell you to guess <strong>how many tails I got</strong> (i.e., total sequences), what would you say?</p><p>A naive answer would be that, in order to get an event with $1/2^10$ chance, I probably did about $2^10$ attempts.
|
||
Much fewer or much more, and I would have gotten a different number.</p><blockquote><p>[!Warning]
|
||
This part is based on my (limited) understanding of the math behind HLL.
|
||
Please, take it with a grain of salt, and feel free to correct me.</p></blockquote><p>The problem with this approach is that it produces very rough estimates (see the <a class=link href=https://en.wikipedia.org/wiki/Flajolet%E2%80%93Martin_algorithm target=_blank rel=noopener>Flajolet–Martin algorithm</a>, especially for higher values of $h$.</p><p>The key idea behind HyperLogLog is that, instead of keeping track of a single number, we separate our attempts into multiple groups (registers).
|
||
Say, $m$ registers (for maxima).
|
||
The first attempt affects the first register, the second the second one, and so on, until we run out of registers and we start again.
|
||
Then, we get $m$ estimates of $n_attempts / m$.
|
||
It turns out using the harmonic mean of all the attempts and multiplying it by $m$ is a great estimate of the total number!</p><p>Could we use the same idea to solve our original problem?
|
||
We only need a way to transform our input (the names) into a set of tosses.
|
||
The output has to be deterministic (to work as a unique detector), uniform (assumption).</p><p>But that is precisely the definition of a hashing function!
|
||
It turns an input into a series of ones and zeros of a fixed length.
|
||
And the distribution of values of a good hashing function should be uniform in order to minimize collisions.</p><p>We can turn a hash into a sequence of tosses by keeping only the leading part up to the first 1.</p><p>Actually, another practical difference is that, instead of doing a round robin selection of registers, we will use part of the hash to select which register to update.
|
||
The distribution of values should still be uniform, and we eliminate the need to keep a counter.
|
||
So, we use just enough bits from the beginning to select the register ($log_2(m)$ for $m$ registers), and we convert the rest into a sequence.</p><p>It turns out that counting the number of leading zeros (most significant bit) is very fast on modern computers due to hardware support.
|
||
And, no, that is not a coincidence.</p><p>In practical terms, we also need to apply some correction correct for hashing collisions.</p><p>Another practical difference is that there are hashing collisions. between and multiple hashes that start with the same sequence of zeros (and a one).</p><p>That means that, a HLL has $m$ registers (maxima), and for each element that is added it:</p><ul><li>Calculates a hash of each element (to get a uniform distribution and a fixed size)</li><li>Assigns the element to one of the $R$ registers, $j$, based on the head of that hash (first $log_2(R)$ bits).</li><li>Counts $z$, the number of leading zeros in the tail of the hash</li><li>Increments the maximum in that register ($M[j] = max(z, M[j])$</li></ul><p>To get an estimate of the cardinality we have three situations:</p><ul><li>The HLL does not have enough elements: $V > 0$, where $V$ is the number of zero elements. We use linear Counting ($E = m \times log( m / V))$)</li><li>The HLL has enough elements. We estimate the cardinality $E$ as: $E = m \times harmonic_mean(M) \times c$. Where $c$ is a factor that corrects for hash collisions.</li><li>The HLL has too many elements. In this case, we need to correct the estimate from the second case.</li></ul><p>The error of our estimation will be a function of the number of registers: $sigma = 1.04 / sqrt(m)$.</p><p>To merge two HLLs with the same number of registers and hashing functions, you only need to compare each register position, and get the maximum of the two values.
|
||
This operation is very fast ($O(m)$, which is small in practice).
|
||
There is also no loss of relative precision.</p><p>Going back to our original problem, we will analyze the effect of using HLLs to estimate each the number of unique customers in each store.
|
||
If we assume our hashing is $O(1)$ and we get on average $n$ customers per store, we get roughly:</p><ul><li>$O(s \times n)$ operations to get each store’s unique customers</li><li>$O(s \times m)$ operations to join them. Where $m$ is the number of registers in each HLL.</li><li>$O((s+1) \times m)$ total space to save all of this information.</li></ul><p>Hence, compared to our solution using sets:</p><ul><li><strong>Space <del>grows linearly with</del> does not depend on the number of unique customers</strong></li><li>Space grows lineraly with the number of stores ($m$)</li><li><strong>Space <del>grows linearly with</del> does not depend on the size of the key we are storing ($l$)</strong></li><li>Space depends linearly with the number of buckets, which is a function of the precision/error rate</li><li>Merging time grows linearly with the number of stores ($m$)</li><li><strong>Merging time <del>grows linearly with</del> does not depend on the number of members in each counter ($n$)</strong></li></ul><h2 id=drawbacks>Drawbacks</h2><p>As magical as HLLs are, there are some caveats:</p><ul><li>The cardinality provided is an <strong>estimation</strong> up to an <strong>error_rate</strong>, usually in the rage of 1-2%</li><li>In order to merge two HLLs, they need to have:<ul><li>The same number of registers</li><li>The same hash function</li></ul></li><li>HyperLogLogs are append-only. You cannot remove elements</li><li>You cannot calculate the difference of two HLLs (but you can calculate its cardinality)</li></ul><h2 id=additional-properties>Additional properties</h2><ul><li>The result of merging two HLLs has the same <strong>error_rate</strong></li><li>It is possible to calculate the cardinality of:<ul><li>the difference of two HLLs ($cardinality(a - b) = cardinality(a ∪ b) - cardinality(b)$)</li><li>the intersection of two HLLs ($cardinality(a ∩ b) = cardinality(a) + cardinality(b) - cardinality(a ∪ b)$)</li></ul></li></ul><h2 id=slidinghyperloglog>SlidingHyperLogLog</h2><p>There are some really neat variations over HLLs for specific use cases.
|
||
One I am specially interested in is the SlidingHyperLogLog, which can calculate the cardinality of events over a rolling window.
|
||
An application of this would be to keep counters of unique IPs seen on a network over the past $W$ minutes.</p><p>Since HLL are only additive, it is not possible to apply the typical approach in sliding windows of removing old elements that are no longer within the window.
|
||
The naive solution to get a Sliding Window counter of width $W$ ould be be to get all events that happened in the past $W$ minutes and construct a HLL with them.
|
||
You would need to do that every $T$ minutes or for every incoming event, depending on your strategy/needs.
|
||
That is unnecessarily wasteful.
|
||
Imagine having a window of $W=120$ minutes with a resolution of $T=1$ minute.</p><p>Instead, the SlidingHyperLogLog exploits the fact that HLLs only keep track of the maximum count in each register.
|
||
The intuition is that, when we add an element at time $t$ with count $V$ and belonging to register $r$, we can:</p><ul><li>Forget any elements in register $r$ that:<ul><li>Were added before $t-W$</li><li>Had a count $<=V$</li></ul></li><li>Add this event and time ($(t, V)$) to a list of values for the register</li><li>Update the count in the register to the maximum of all the remaining elements</li></ul><p>Using some clever data structures (e.g., a binary heap) and algorithms, we can very efficiently perform the insertion and removal of events.</p><h2 id=python>Python</h2><p>For now, I’ve only used the <a class=link href=https://github.com/svpcom/hyperloglog/ target=_blank rel=noopener>HyperLogLog</a> library, which is a python implementation that relies on $numpy$:</p><div class=highlight><div class=chroma><table class=lntable><tr><td class=lntd><pre tabindex=0 class=chroma><code><span class=lnt>1
|
||
</span><span class=lnt>2
|
||
</span><span class=lnt>3
|
||
</span><span class=lnt>4
|
||
</span><span class=lnt>5
|
||
</span><span class=lnt>6
|
||
</span><span class=lnt>7
|
||
</span><span class=lnt>8
|
||
</span><span class=lnt>9
|
||
</span></code></pre></td><td class=lntd><pre tabindex=0 class=chroma><code class=language-python data-lang=python><span class=line><span class=cl><span class=err>!</span><span class=n>pip</span> <span class=n>install</span> <span class=n>hyperloglog</span>
|
||
</span></span><span class=line><span class=cl>
|
||
</span></span><span class=line><span class=cl><span class=kn>from</span> <span class=nn>hyperloglog</span> <span class=kn>import</span> <span class=n>HyperLogLog</span>
|
||
</span></span><span class=line><span class=cl>
|
||
</span></span><span class=line><span class=cl><span class=n>h</span> <span class=o>=</span> <span class=n>HyperLogLog</span><span class=p>(</span><span class=n>error_rate</span><span class=o>=</span><span class=mf>0.01</span><span class=p>)</span>
|
||
</span></span><span class=line><span class=cl>
|
||
</span></span><span class=line><span class=cl><span class=n>h</span><span class=o>.</span><span class=n>add</span><span class=p>(</span><span class=s2>"Hello from my blog"</span><span class=p>)</span>
|
||
</span></span><span class=line><span class=cl>
|
||
</span></span><span class=line><span class=cl><span class=nb>print</span><span class=p>(</span><span class=nb>len</span><span class=p>(</span><span class=n>h</span><span class=p>))</span>
|
||
</span></span></code></pre></td></tr></table></div></div></section><footer class=article-footer><section class=article-copyright><svg class="icon icon-tabler icon-tabler-copyright" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round"><path stroke="none" d="M0 0h24v24H0z"/><circle cx="12" cy="12" r="9"/><path d="M14.5 9a3.5 4 0 100 6"/></svg>
|
||
<span>Licensed under CC BY-NC-SA 4.0</span></section></footer><link rel=stylesheet href=https://cdn.jsdelivr.net/npm/katex@0.16.9/dist/katex.min.css integrity=sha384-n8MVd4RsNIU0tAv4ct0nTaAbDJwPJzDEaqSD1odI+WdtXRGWt2kTvGFasHpSy3SV crossorigin=anonymous><script src=https://cdn.jsdelivr.net/npm/katex@0.16.9/dist/katex.min.js integrity=sha384-XjKyOOlGwcjNTAIQHIpgOno0Hl1YQqzUOEleOLALmuqehneUG+vnGctmUb0ZY0l8 crossorigin=anonymous defer></script><script src=https://cdn.jsdelivr.net/npm/katex@0.16.9/dist/contrib/auto-render.min.js integrity=sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05 crossorigin=anonymous defer></script><script>window.addEventListener("DOMContentLoaded",()=>{const e=document.querySelector(".main-article");renderMathInElement(e,{delimiters:[{left:"$$",right:"$$",display:!0},{left:"$",right:"$",display:!1},{left:"\\(",right:"\\)",display:!1},{left:"\\[",right:"\\]",display:!0}],ignoredClasses:["gist"]})})</script></article><script src=https://giscus.app/client.js data-repo=balkian/balkian.github.com data-repo-id=MDEwOlJlcG9zaXRvcnk2OTQxMTEw data-category="Blog comments" data-category-id=DIC_kwDOAGnpts4Cnm1b data-mapping=pathname data-strict=0 data-reactions-enabled=1 data-emit-metadata=0 data-input-position=top data-theme=light data-lang=en data-loading crossorigin=anonymous async></script><script>function setGiscusTheme(e){let t=document.querySelector("iframe.giscus-frame");t&&t.contentWindow.postMessage({giscus:{setConfig:{theme:e}}},"https://giscus.app")}(function(){addEventListener("message",t=>{if(event.origin!=="https://giscus.app")return;e()}),window.addEventListener("onColorSchemeChange",e);function e(){setGiscusTheme(document.documentElement.dataset.scheme==="light"?"light":"dark_dimmed")}})()</script><footer class=site-footer><section class=copyright>©
|
||
2012 -
|
||
2025 J. Fernando Sánchez</section><section class=powerby>Built with <a href=https://gohugo.io/ target=_blank rel=noopener>Hugo</a><br>Theme <b><a href=https://github.com/CaiJimmy/hugo-theme-stack target=_blank rel=noopener data-version=3.30.0>Stack</a></b> designed by <a href=https://jimmycai.com target=_blank rel=noopener>Jimmy</a></section></footer><div class=pswp tabindex=-1 role=dialog aria-hidden=true><div class=pswp__bg></div><div class=pswp__scroll-wrap><div class=pswp__container><div class=pswp__item></div><div class=pswp__item></div><div class=pswp__item></div></div><div class="pswp__ui pswp__ui--hidden"><div class=pswp__top-bar><div class=pswp__counter></div><button class="pswp__button pswp__button--close" title="Close (Esc)"></button>
|
||
<button class="pswp__button pswp__button--share" title=Share></button>
|
||
<button class="pswp__button pswp__button--fs" title="Toggle fullscreen"></button>
|
||
<button class="pswp__button pswp__button--zoom" title="Zoom in/out"></button><div class=pswp__preloader><div class=pswp__preloader__icn><div class=pswp__preloader__cut><div class=pswp__preloader__donut></div></div></div></div></div><div class="pswp__share-modal pswp__share-modal--hidden pswp__single-tap"><div class=pswp__share-tooltip></div></div><button class="pswp__button pswp__button--arrow--left" title="Previous (arrow left)">
|
||
</button>
|
||
<button class="pswp__button pswp__button--arrow--right" title="Next (arrow right)"></button><div class=pswp__caption><div class=pswp__caption__center></div></div></div></div></div><script src=https://cdn.jsdelivr.net/npm/photoswipe@4.1.3/dist/photoswipe.min.js integrity="sha256-ePwmChbbvXbsO02lbM3HoHbSHTHFAeChekF1xKJdleo=" crossorigin=anonymous defer></script><script src=https://cdn.jsdelivr.net/npm/photoswipe@4.1.3/dist/photoswipe-ui-default.min.js integrity="sha256-UKkzOn/w1mBxRmLLGrSeyB4e1xbrp4xylgAWb3M42pU=" crossorigin=anonymous defer></script><link rel=stylesheet href=https://cdn.jsdelivr.net/npm/photoswipe@4.1.3/dist/default-skin/default-skin.min.css crossorigin=anonymous><link rel=stylesheet href=https://cdn.jsdelivr.net/npm/photoswipe@4.1.3/dist/photoswipe.min.css crossorigin=anonymous></main></div><script src=https://cdn.jsdelivr.net/npm/node-vibrant@3.1.6/dist/vibrant.min.js integrity="sha256-awcR2jno4kI5X0zL8ex0vi2z+KMkF24hUW8WePSA9HM=" crossorigin=anonymous></script><script type=text/javascript src=/ts/main.1e9a3bafd846ced4c345d084b355fb8c7bae75701c338f8a1f8a82c780137826.js defer></script><script>(function(){const e=document.createElement("link");e.href="https://fonts.googleapis.com/css2?family=Lato:wght@300;400;700&display=swap",e.type="text/css",e.rel="stylesheet",document.head.appendChild(e)})()</script></body></html> |