feat(csharp/src/Drivers/Apache/Spark): Add Lz4 compression support to arrow batch reader #2669

Merged
merged 5 commits into from
Apr 4, 2025

eric-wang-1990
Contributor

  1. Create a new file, Lz4Utilities.cs, that factors out the common LZ4 decompression utility functions shared by cloud fetch and Arrow batch reading.
  2. Add support for decompressing Arrow batches with LZ4.
  3. Rename adbc.spark.cloudfetch.lz4.enabled to adbc.spark.lz4Compression.enabled, since the option is not specific to cloud fetch.
  4. Add tests covering both cloudFetch and arrowBatch in StatementTests.
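
As a rough illustration of the renamed option in item 3, the sketch below reads the flag from a plain string-valued property bag. Only the key name comes from this PR; the `Lz4OptionSketch` class, the `IsLz4Enabled` helper, and the caller-supplied default are hypothetical and are not the driver's actual API.

```csharp
using System;
using System.Collections.Generic;

public static class Lz4OptionSketch
{
    // Option name as renamed in this PR.
    public const string Lz4CompressionEnabled = "adbc.spark.lz4Compression.enabled";

    // Hypothetical helper (not part of the driver): returns the parsed flag,
    // or defaultValue when the key is absent or its value is not a boolean.
    public static bool IsLz4Enabled(IReadOnlyDictionary<string, string> properties, bool defaultValue)
    {
        if (properties.TryGetValue(Lz4CompressionEnabled, out var value)
            && bool.TryParse(value, out bool parsed))
        {
            return parsed;
        }
        return defaultValue;
    }
}
```

A caller would pass its connection properties and whatever default it considers appropriate; the default used by the real driver is not stated in this conversation.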

@github-actions github-actions bot added this to the ADBC Libraries 18 milestone Apr 3, 2025
@lidavidm
Member

lidavidm commented Apr 3, 2025

Just curious, does Spark/Databricks not use Arrow IPC compression? It compresses the file as a whole separately?

@CurtHagenlocher
Contributor

> Just curious, does Spark/Databricks not use Arrow IPC compression? It compresses the file as a whole separately?

Yeah I asked a similar question in a previous PR, but apparently it doesn't (or it wouldn't need a separate out-of-band flag for the compression setting).

Contributor

@CurtHagenlocher CurtHagenlocher left a comment


Thanks for the change! I've asked for some changes and made some (optional) suggestions for improvements.

Contributor

@CurtHagenlocher CurtHagenlocher left a comment


Thanks!

// Get the underlying buffer and its valid length without copying
return new ReadOnlyMemory<byte>(outputStream.GetBuffer(), 0, (int)outputStream.Length);
// Note: We're not disposing the outputStream here because we're returning its buffer.
// The memory will be reclaimed when the ReadOnlyMemory is no longer referenced.
Contributor


I don't know that we necessarily need to remove or change the comment, but this isn't strictly correct. There are a lot of misunderstandings about IDisposable, but disposing a MemoryStream does nothing to its internal buffer. Because the buffer is a managed array, it will not get garbage-collected until there are no more references to it -- even if the MemoryStream that created it is disposed.
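
To make the point above concrete, here is a minimal sketch, separate from the PR's code, that disposes a `MemoryStream` and then keeps using the buffer obtained from `GetBuffer()`. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class MemoryStreamBufferSketch
{
    // Writes the payload into a MemoryStream, lets the stream be disposed,
    // and returns a view over the stream's internal buffer without copying.
    public static ReadOnlyMemory<byte> WriteAndDispose(byte[] payload)
    {
        ReadOnlyMemory<byte> view;
        using (var outputStream = new MemoryStream())
        {
            outputStream.Write(payload, 0, payload.Length);
            // Capture the buffer and its valid length while the stream is still open.
            view = new ReadOnlyMemory<byte>(outputStream.GetBuffer(), 0, (int)outputStream.Length);
        } // outputStream.Dispose() runs here

        // The managed array is still referenced by 'view', so it stays alive and readable.
        return view;
    }
}
```

Calling `MemoryStreamBufferSketch.WriteAndDispose(new byte[] { 1, 2, 3 })` returns a three-byte view whose contents are still readable after the stream is gone, because the garbage collector only reclaims the array once nothing references it.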

Contributor


It's fair to not want to assume anything about what MemoryStream.Dispose might do to its internal buffer, but we're now assuming instead that a MemoryStream doesn't need to be disposed. (This happens to be true.)

@CurtHagenlocher CurtHagenlocher merged commit 028d22f into apache:main Apr 4, 2025
7 checks passed