feat(csharp/src/Drivers/Apache/Spark): Add Lz4 compression support to arrow batch reader #2669

eric-wang-1990 · 2025-04-03T08:32:35Z

Create a new file, Lz4Utilities.cs, to hold the common Lz4 decompression utility functions shared by cloud fetch and arrow batch.
Add support for decompressing arrow batches with Lz4.
Rename adbc.spark.cloudfetch.lz4.enabled to adbc.spark.lz4Compression.enabled, since it is not specific to cloud fetch.
Add tests covering both cloudFetch and arrowBatch in StatementTests.

lidavidm · 2025-04-03T10:32:56Z

Just curious, does Spark/Databricks not use Arrow IPC compression? It compresses the file as a whole separately?

CurtHagenlocher · 2025-04-03T14:16:57Z

> Just curious, does Spark/Databricks not use Arrow IPC compression? It compresses the file as a whole separately?

Yeah I asked a similar question in a previous PR, but apparently it doesn't (or it wouldn't need a separate out-of-band flag for the compression setting).

CurtHagenlocher

Thanks for the change! I've asked for some changes and made some (optional) suggestions for improvements.

csharp/src/Drivers/Apache/Spark/Lz4Utilities.cs

csharp/src/Drivers/Apache/Spark/SparkDatabricksReader.cs

csharp/src/Drivers/Apache/Spark/SparkStatement.cs

CurtHagenlocher

Thanks!

CurtHagenlocher · 2025-04-04T17:53:02Z

csharp/src/Drivers/Apache/Spark/Lz4Utilities.cs

+                // Get the underlying buffer and its valid length without copying
+                return new ReadOnlyMemory<byte>(outputStream.GetBuffer(), 0, (int)outputStream.Length);
+                // Note: We're not disposing the outputStream here because we're returning its buffer.
+                // The memory will be reclaimed when the ReadOnlyMemory is no longer referenced.


I don't know that we necessarily need to remove or change the comment, but this isn't strictly correct. There are a lot of misunderstandings about IDisposable, but disposing a MemoryStream does nothing to its internal buffer. Because the buffer is a managed array, it will not get garbage-collected until there are no more references to it -- even if the MemoryStream that created it is disposed.

It's fair to not want to assume anything about what MemoryStream.Dispose might do to its internal buffer, but we're now assuming instead that a MemoryStream doesn't need to be disposed. (This happens to be true.)
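
To make the point above concrete, here is a minimal sketch, not the code from this PR. The helper name `CopyToMemory` and the sample payload are hypothetical, and the stream is filled with a plain `Write` instead of Lz4 decompression. It wraps the `MemoryStream` in a `using` block and still returns a `ReadOnlyMemory<byte>` over its internal buffer; the returned view keeps the managed array alive after the stream has been disposed.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only (not the PR's Lz4Utilities code).
// It fills a MemoryStream, disposes it, and returns a view over its buffer.
static ReadOnlyMemory<byte> CopyToMemory(byte[] source)
{
    using (var outputStream = new MemoryStream())
    {
        outputStream.Write(source, 0, source.Length);

        // GetBuffer() returns the internal array without copying. The array can
        // be longer than the data written, so the view is limited to Length.
        return new ReadOnlyMemory<byte>(
            outputStream.GetBuffer(), 0, (int)outputStream.Length);
    } // The stream is disposed here; the array stays reachable through the view.
}

byte[] payload = { 1, 2, 3, 4, 5 };
ReadOnlyMemory<byte> view = CopyToMemory(payload);

// The view is still readable after the stream that produced it was disposed.
Console.WriteLine(view.Length);
Console.WriteLine(view.Span.SequenceEqual(payload));
```

Whether to dispose the stream in the real code is a style choice; either way the buffer's lifetime is governed by the references to the array, not by the stream.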

add Lz4 compression to arrow batch reader

45e4d8d

eric-wang-1990 requested a review from CurtHagenlocher as a code owner April 3, 2025 08:32

github-actions bot added this to the ADBC Libraries 18 milestone Apr 3, 2025

remove unneed function

04d940b

CurtHagenlocher requested changes Apr 3, 2025

eric-wang-1990 added 3 commits April 3, 2025 14:22

Optimize Spark readers: improve memory usage and error handling

ce9a736

Merge branch 'main' of https://github.com/apache/arrow-adbc into add_…

c29c2e7

…lz4_to_arrowbatch

lint

627cc10

eric-wang-1990 requested a review from CurtHagenlocher April 4, 2025 07:32

CurtHagenlocher approved these changes Apr 4, 2025

CurtHagenlocher merged commit 028d22f into apache:main Apr 4, 2025
7 checks passed
