Still driven by that post, I decided to do something about it. So I went and tried to read the original paper. It was mind blowing. Not because it was great (I am sure it is), but because it’s been a while since I had to read some serious math.
Then I came across this article, which could also be named “HyperLogLog for dummies”: Fast, Cheap, and 98% Right: Cardinality Estimation for Big Data. It was much more readable. From it I learned about Murmur hash and was surprised that there’s no ruby gem for this. Fortunately, C++ source code was available, so I decided to wrap this as a gem.
It went much smoother than I anticipated and I had working version within 3 or 4 hours. I didn’t have experience with native extensions before, someone experienced probably could have done it in 5 minutes, but still I consider this as an achievement! :)
So, only when I tried to push the gem to RubyGems, I discovered that there IS a gem with this name: murmurhash3. The public API is too cumbersome if you ask me, but it works and probably has less bugs than my version.
Here’s how you use it:
1 2 3 4 5 6 7 8 |
|
You can turn this array of integers into a hex string like this:
1
|
|
If you know a better/faster/more idiomatic way, please let me know :)