Simplify Write API: Add `InsertExec`, port in memory insert to use `DataSink` #6347

alamb · 2023-05-12T21:10:54Z

Which issue does this PR close?

Rationale for this change

I want to make it easier to extend DataFusion with new write sources. I want to use this for COPY ... TO ... statements, and @JanKaul is looking into using it for Delta.rs: #6339 (comment)

What changes are included in this PR?

Add DataSink API (thanks @tustvold and @JanKaul for this discussion)
Port MemTable write to be in terms of DataSink (it was really helpful to have a specific implementation to try this with -- thanks @metesynnada)
Remove MemoryWriteExec, the MemTable specific execution plan
Update tests (TODO)
Ensure there is a test for inserting the wrong schema into a table

Are these changes tested?

Yes

Are there any user-facing changes?

The API for insert changes.

alamb · 2023-05-12T21:12:21Z

datafusion/core/src/datasource/datasource.rs

-        _state: &SessionState,
-        _input: Arc<dyn ExecutionPlan>,
-    ) -> Result<Arc<dyn ExecutionPlan>> {
+    /// Return a [`DataSink`] suitable for writing to this table


Here is the new interface -- I think this should probably also get some sort of "options" parameter so that we can provide a way to pass in settings (like row_group_size) from the various inserts.

I prefer to add that parameter at some future point when I have an actual usecase with COPY rather than guessing exactly what is needed.

alamb · 2023-05-12T21:13:33Z

datafusion/core/src/datasource/memory.rs

+        let num_partitions = self.batches.len();
+
+        // buffer up the data round robin style into num_partitions new buffers
+        let mut new_batches = vec![vec![]; num_partitions];


I am quite pleased that, by following @tustvold's guidance and pushing the partitioning choice into the DataSink implementations, the logic becomes quite a bit simpler and more flexible, with seemingly no performance penalty.
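To illustrate the round robin buffering that the hunk above describes, here is a minimal standalone sketch. It is not the code from this PR: `Batch` is a hypothetical stand-in for an Arrow `RecordBatch`, and `round_robin` is an illustrative helper name, assuming the sink simply cycles incoming batches across its existing partitions.

```rust
/// Hypothetical stand-in for an Arrow `RecordBatch`: a batch is just a vector of rows.
type Batch = Vec<i64>;

/// Distribute incoming batches round robin across `num_partitions` new buffers.
/// Batch `i` lands in buffer `i % num_partitions`, so arrival order is kept
/// within each buffer.
fn round_robin(
    batches: impl IntoIterator<Item = Batch>,
    num_partitions: usize,
) -> Vec<Vec<Batch>> {
    assert!(num_partitions > 0, "need at least one partition");

    // One empty buffer per target partition.
    let mut new_batches: Vec<Vec<Batch>> = vec![vec![]; num_partitions];

    for (i, batch) in batches.into_iter().enumerate() {
        new_batches[i % num_partitions].push(batch);
    }
    new_batches
}
```

Because the sink makes this choice itself, the plan that feeds it does not need to know how many partitions the table has.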

alamb · 2023-05-12T21:15:02Z

datafusion/core/src/physical_plan/insert.rs

+use crate::physical_plan::Distribution;
+use datafusion_common::DataFusionError;
+
+/// Execution plan for writing record batches to a [`DataSink`]


This is basically a generic version of MemoryWriteExec that calls a dyn DataSink to do the actual writing
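The shape of that delegation can be sketched with the standard library alone. Everything below is a simplified assumption rather than the API in this PR: the real trait works with async record batch streams, while this `DataSink`, `MemSink`, and `InsertExec` are synchronous, hypothetical stand-ins, and returning a row count is an illustrative choice.

```rust
use std::fmt::Debug;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an Arrow `RecordBatch`.
type Batch = Vec<i64>;

/// Simplified, synchronous stand-in for the `DataSink` idea:
/// something that consumes batches and knows how to store them.
trait DataSink: Debug + Send + Sync {
    /// Consume all input batches and return the number of rows written.
    fn write_all(&self, data: &mut dyn Iterator<Item = Batch>) -> Result<u64, String>;
}

/// In-memory sink, loosely analogous to writing into a `MemTable`.
#[derive(Debug, Default)]
struct MemSink {
    batches: Mutex<Vec<Batch>>,
}

impl DataSink for MemSink {
    fn write_all(&self, data: &mut dyn Iterator<Item = Batch>) -> Result<u64, String> {
        let mut stored = self.batches.lock().map_err(|e| e.to_string())?;
        let mut rows = 0u64;
        for batch in data {
            rows += batch.len() as u64;
            stored.push(batch);
        }
        Ok(rows)
    }
}

/// Generic insert operator: it knows nothing about the destination
/// and only forwards its input to whatever sink it was given.
struct InsertExec {
    sink: Arc<dyn DataSink>,
}

impl InsertExec {
    fn new(sink: Arc<dyn DataSink>) -> Self {
        Self { sink }
    }

    /// Run the insert by handing every input batch to the sink.
    fn execute(&self, input: Vec<Batch>) -> Result<u64, String> {
        self.sink.write_all(&mut input.into_iter())
    }
}
```

With this split, supporting a new destination means implementing the sink trait only; the operator itself stays unchanged.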

alamb · 2023-05-12T21:27:12Z

datafusion/core/src/physical_plan/memory.rs

-
-    // Test the less-lock mode by inserting a large number of batches into a table.
-    #[tokio::test]
-    async fn test_one_to_one_mode() -> Result<()> {


I am not quite sure how to port these tests yet -- I am thinking about it

I studied these tests carefully. I believe these cases with different partitioning strategies are already covered by https://github.com/apache/arrow-datafusion/blob/eb918ab217213d5e07e71e53c118a8409d2f71a0/datafusion/core/src/datasource/memory.rs#L455-L476

alamb · 2023-05-15T16:29:36Z

Given that the API change proposal does not have consensus, I am closing this PR for now until we reach one. I think I can get most of the benefit I was looking for from #6354.

alamb added the "api change" label (Changes the API exposed to users of the crate) May 12, 2023

github-actions bot added the "core" label (Core DataFusion crate) May 12, 2023

github-actions bot added the "sqllogictest" label (SQL Logic Tests (.slt)) May 12, 2023

alamb mentioned this pull request May 15, 2023

Simplified TableProvider::Insert API #6339

Closed

alamb added 2 commits May 15, 2023 11:23

Add InsertExec, port in memory insert to use DataSink

2b0ed37

Update tests

925c01b

alamb force-pushed the alamb/simplified_insert branch from f70741b to 925c01b Compare May 15, 2023 15:46

alamb mentioned this pull request May 15, 2023

INSERT returns number of rows written, add InsertExec to handle common case. #6354

Merged

alamb closed this May 15, 2023
