-
-
Notifications
You must be signed in to change notification settings - Fork 25
Make brotli stream matcher more robust to uncompressed binary and ASCII data files #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the Brotli stream matcher to better distinguish uncompressed ASCII data from valid Brotli-compressed streams by adding a more robust detection function and extensive fuzzy tests.
- Consolidate repetitive quality-level tests into a loop and add
TestBrotli_Fuzzy_Both
for deterministic ASCII vs. compressed checks - Introduce
isValidBrotliStream
with an ASCII-only filter and helper functions (isASCII
,isASCIIByte
) - Update
Match
to call the new stream-based validation without breaking backward compatibility
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
brotli_test.go | Refactor per-quality tests into a loop, add comprehensive fuzzy testing suite |
brotli.go | Implement isValidBrotliStream method with ASCII detection, wire it into Match |
Comments suppressed due to low confidence (1)
brotli_test.go:58
- The fuzzy tests cover ASCII and Brotli-compressed data, but they don’t verify how pure non-ASCII binary data is handled. Consider adding a test case for random non-ASCII binary inputs to ensure they aren’t misidentified as Brotli-compressed.
func TestBrotli_Fuzzy_Both(t *testing.T) {
That's weird, I have not enabled automatic review from Copilot. |
Must be from a setting I have enabled, my bad! I thought it should only have happened on my own repositories. I find it now provides helpful feedback about 10-15% of the time as of the last few weeks. Prior to that it was about 0.01% of the time. Although obviously it still gets a lot wrong as shown here. |
I added some fuzzy binary data generation as a new test for #36 All of the compressed data is being accurately identified however there are a number of uncompressed binary data tests which are being incorrectly identified as brotli in both main branch and this PR. So far I have been unsuccessful finding a fix for that case. If we could pass a configuration to archives which allows us to specify which of the registered formats we actually want enabled for use with the Kind of crazy such a widely used compression format has no reliable method of detection?? |
I tried adding a final check which tries to read significantly more data from the stream, decompresses it, and compares the compressed versus decompressed output sizes. The assumption being that if its a legitimate compressed stream that the output would be some reasonable ratio larger. However I couldn't get it to work reliably. |
I encountered the same issue as well. There are no uncompressed files, and the br compression was detected via bystream. |
No worries, surprising to me too! Thanks for the improvements. It looks like the tests are failing though. We'll need to get them to pass before we can merge this. |
Yes, the uncompressed binary case is proving tricky. I have tried a number of different things to try and catch that edge case but so far have been unsuccessful. I will keep trying periodically when I have some spare time. |
OK, I was able to get something working that passes all existing tests as well as a bunch of new ones I added, and I have updated the PR description. What are your thoughts on adding configuration which disables brotli stream matching, perhaps via a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice -- thanks for the effort on this!
Let's merge this and try it out. As for a new exported API to customize formats in the Identify routine, I'm open to a new API (though probably a little different) -- but first let's see how this goes. If it still gets too many false positives, then I'm down for a refactoring/additional API.
Closes #36