-
Notifications
You must be signed in to change notification settings - Fork 3
Spell out rationale for recommending cp32 over rrs1 #26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi, Author of https://github.com/chmduquesne/rollinghash here. I am not very active on the project nowadays, but I am still trying to keep it in good shape. I did tons of benchmark experiments back int the days, one of them being https://github.com/chmduquesne/rollinghash/blob/master/roll/main.go, where I wanted to see at which frequency of the n lower bits of a checksum are all 1 (the approach taken by bup). I agree with the recommendation to always stick to buzhash{32,64} unless you know what you are doing. One important thing about what you call
Mathematically, it is easy to see why. We have:
What is the maximal value of this A? Let us ignore the modulus for now and focus on the sum. Pennarun was choosing a window size of 64 bytes. Let's assume all of those 64 bytes have a value of 256: this makes for
Since 16384 < 65521, the result remains the same after the modulus. Since the checksum is the result of appending B to A but A has a maximum value of 16384, some of the middle bits can never flip to 1. This gets better if you swap A and B, because
The sum is bigger and wraps around faster. However, B is taken modulo 65521, which also prevents some of the bits to flip to 1. Long story short, one should never use Experimentally, buzhash behaves much better and covers the whole space. |
The best way to convince yourself about my statements is experimentally. Just run a checksum through a stream of random data and see how often the n lower bits of a checksum are all 1. This is the file I linked in the previous comment. |
Also, I never had the opportunity to discuss about this with anyone, but I think that all these projects are making it hard for the newcomer to understand what is going on by using a criterion "all the lower bits should be 1". It would be much easier to understand if the criterion was based on a number of leading zeros, because one could use an analogy with bitcoin: all we are doing, by sliding a window with a required number of digits equal to some number, is finding a block of data with a requirement on the checksum. The same way bitcoin does when mining blocks. If I were to design my own tool, I would switch to such a criterion. |
Yeah, we discovered some of this (#3 (comment)). I feel like I proposed somewhere just dropping |
Per @bobg's comment in #24, we should spell out why cp32 is the recommended hash function.
The text was updated successfully, but these errors were encountered: