Skip to content

Conversation

dpgarrick
Copy link
Contributor

@dpgarrick dpgarrick commented Jul 11, 2025

  • Reduces false positives when detecting brotli compression on ASCII and binary data
  • Comprehensive test coverage with deterministic fuzzy testing and checks for small streams (as the original logic included several branches for small stream detection stuff)
  • The changes maintain backward compatibility while providing more accurate detection of brotli-compressed streams.

Closes #36

@Copilot Copilot AI review requested due to automatic review settings July 11, 2025 03:11
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the Brotli stream matcher to better distinguish uncompressed ASCII data from valid Brotli-compressed streams by adding a more robust detection function and extensive fuzzy tests.

  • Consolidate repetitive quality-level tests into a loop and add TestBrotli_Fuzzy_Both for deterministic ASCII vs. compressed checks
  • Introduce isValidBrotliStream with an ASCII-only filter and helper functions (isASCII, isASCIIByte)
  • Update Match to call the new stream-based validation without breaking backward compatibility

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
brotli_test.go Refactor per-quality tests into a loop, add comprehensive fuzzy testing suite
brotli.go Implement isValidBrotliStream method with ASCII detection, wire it into Match
Comments suppressed due to low confidence (1)

brotli_test.go:58

  • The fuzzy tests cover ASCII and Brotli-compressed data, but they don’t verify how pure non-ASCII binary data is handled. Consider adding a test case for random non-ASCII binary inputs to ensure they aren’t misidentified as Brotli-compressed.
func TestBrotli_Fuzzy_Both(t *testing.T) {

@mholt
Copy link
Owner

mholt commented Jul 11, 2025

Copilot AI review requested due to automatic review settings 15 hours ago

That's weird, I have not enabled automatic review from Copilot.

@dpgarrick
Copy link
Contributor Author

Copilot AI review requested due to automatic review settings 15 hours ago

That's weird, I have not enabled automatic review from Copilot.

Must be from a setting I have enabled, my bad! I thought it should only have happened on my own repositories.

I find it now provides helpful feedback about 10-15% of the time as of the last few weeks. Prior to that it was about 0.01% of the time.

Although obviously it still gets a lot wrong as shown here.

@dpgarrick
Copy link
Contributor Author

dpgarrick commented Jul 14, 2025

I added some fuzzy binary data generation as a new test for #36

All of the compressed data is being accurately identified however there are a number of uncompressed binary data tests which are being incorrectly identified as brotli in both main branch and this PR.

So far I have been unsuccessful finding a fix for that case.

If we could pass a configuration to archives which allows us to specify which of the registered formats we actually want enabled for use with the Identify routine and whether we want ByName and/or ByStream matching enabled for each of those that would be one workaround...

Kind of crazy such a widely used compression format has no reliable method of detection??

@dpgarrick
Copy link
Contributor Author

I tried adding a final check which tries to read significantly more data from the stream, decompresses it, and compares the compressed versus decompressed output sizes. The assumption being that if its a legitimate compressed stream that the output would be some reasonable ratio larger. However I couldn't get it to work reliably.

@18672881844
Copy link

I encountered the same issue as well. There are no uncompressed files, and the br compression was detected via bystream.

@mholt
Copy link
Owner

mholt commented Jul 14, 2025

Must be from a setting I have enabled, my bad! I thought it should only have happened on my own repositories.

No worries, surprising to me too!

Thanks for the improvements. It looks like the tests are failing though. We'll need to get them to pass before we can merge this.

@dpgarrick
Copy link
Contributor Author

Yes, the uncompressed binary case is proving tricky. I have tried a number of different things to try and catch that edge case but so far have been unsuccessful. I will keep trying periodically when I have some spare time.

@dpgarrick dpgarrick changed the title Make brotli stream matcher more robust to ASCII data files Make brotli stream matcher more robust to uncompressed binary and ASCII data files Jul 14, 2025
@dpgarrick dpgarrick marked this pull request as draft July 14, 2025 21:38
@dpgarrick dpgarrick marked this pull request as ready for review August 25, 2025 04:56
@dpgarrick
Copy link
Contributor Author

OK, I was able to get something working that passes all existing tests as well as a bunch of new ones I added, and I have updated the PR description.

What are your thoughts on adding configuration which disables brotli stream matching, perhaps via a new IdentifyWithOptions routine that allows providing some configuration with respect to the registered formats, e.g. which are registered and which match options should be used for each? Perhaps slightly overkill given most other formats seem to be able to be reliably detected, just brotli is not...

Copy link
Owner

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice -- thanks for the effort on this!

Let's merge this and try it out. As for a new exported API to customize formats in the Identify routine, I'm open to a new API (though probably a little different) -- but first let's see how this goes. If it still gets too many false positives, then I'm down for a refactoring/additional API.

@mholt mholt merged commit f07cd8e into mholt:main Aug 25, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Identify appears to match a binary file as application/x-br
3 participants