feat(csharp/src/Drivers/Apache/Spark): Add Lz4 compression support to arrow batch reader #2669
eric-wang-1990 commented Apr 3, 2025
- Create a new file, Lz4Utilities.cs, to hold the common LZ4 decompression utility functions shared by CloudFetch and Arrow batch reading.
- Add support for decompressing Arrow batches with LZ4.
- Rename adbc.spark.cloudfetch.lz4.enabled to adbc.spark.lz4Compression.enabled, since the setting is not specific to CloudFetch.
- Add a test in StatementTests that covers both CloudFetch and Arrow batch.
Just curious, does Spark/Databricks not use Arrow IPC compression? Does it compress the file as a whole separately?
Yeah, I asked a similar question in a previous PR, but apparently it doesn't (otherwise it wouldn't need a separate out-of-band flag for the compression setting).
Thanks for the change! I've asked for some changes and made some (optional) suggestions for improvements.
Thanks!
// Get the underlying buffer and its valid length without copying
return new ReadOnlyMemory<byte>(outputStream.GetBuffer(), 0, (int)outputStream.Length);
// Note: We're not disposing the outputStream here because we're returning its buffer.
// The memory will be reclaimed when the ReadOnlyMemory is no longer referenced.
I don't know that we necessarily need to remove or change the comment, but this isn't strictly correct. There are a lot of misunderstandings about IDisposable, but disposing a MemoryStream does nothing to its internal buffer. Because the buffer is a managed array, it will not get garbage-collected until there are no more references to it, even if the MemoryStream that created it is disposed.
It's fair to not want to assume anything about what MemoryStream.Dispose might do to its internal buffer, but we're now assuming instead that a MemoryStream doesn't need to be disposed. (This happens to be true.)
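
To make the point about disposal concrete, here is a minimal sketch, not the code from this PR. The class name MemoryStreamBufferDemo and the method CopyToMemory are hypothetical, and a plain byte-array write stands in for the LZ4 decompression step. It disposes the MemoryStream before the caller ever sees the result, yet the returned ReadOnlyMemory<byte> still reads correctly, because the backing array stays alive for as long as something references it.

```csharp
using System;
using System.IO;

// Hypothetical demo type; not part of the driver.
public static class MemoryStreamBufferDemo
{
    // Writes the payload into a MemoryStream, then returns a view over the
    // stream's internal buffer. The stream is disposed when the using block
    // exits, which is after the return expression has been evaluated.
    public static ReadOnlyMemory<byte> CopyToMemory(byte[] payload)
    {
        using (var outputStream = new MemoryStream())
        {
            // Stand-in for copying decompressed data into the stream.
            outputStream.Write(payload, 0, payload.Length);

            // GetBuffer() exposes the internal array without copying; the
            // array may be larger than Length, so the view is bounded.
            return new ReadOnlyMemory<byte>(
                outputStream.GetBuffer(), 0, (int)outputStream.Length);
        }
    }
}
```

Calling MemoryStreamBufferDemo.CopyToMemory(new byte[] { 1, 2, 3, 4 }) should give back a four-byte view with the same contents. Whether to dispose the stream is then a style choice rather than a correctness concern for the returned buffer.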