Skip to content

RFC: Implement initial support for COPY ... TO ... statement #6313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 10, 2023

Which issue does this PR close?

Closes #5654
Closes #5988

Rationale for this change

  1. I would like to to make it easier to test out datafusion (both for new comers as well as when I am trying to benchmark all the things)

What changes are included in this PR?

(I'll try and break this up into smaller pieces for easier review but I want to show it all working together)

  • Initial parser support for COPY .. TO ... statements
  • New LogicalPlan::CopyTo variant
  • Planing / Runtime support for copying to Parquet files
    -[ ] Parser tests
    -[ ] Add end user documentation
    -[ ] Properly support writing single parquet files
    -[ ] sqllogictests

I also plan to file follow on tasks (like support for other file formats, options, etc)

Are these changes tested?

Yes

Are there any user-facing changes?

Yes there is a new (documented) statement

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels May 10, 2023
@@ -450,6 +457,36 @@ impl SessionContext {
self.read_batch(record_batch)
}

// Execute a COPY TO statement, returning the number of rows
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is the mechanism as used by CREATE TABLE AS SELECT (aka LogicalPlan::CreateMemTable). It is different than the mechanism used by INSERT INTO ... ` (added in #5520 by @metesynnada ) that uses an ExecutionPlan.

The difference bothers me, but I can see the benefits of both approaches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was playing with this more this evening and I think I came up with something that is half way between that I like even better. Will keep iterating and report back

Copy link
Contributor

@ozankabak ozankabak May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have something you may be able to leverage, check this PR out. It extends the ExecutionPlan approach to writing files, I think you can leverage that work here too. With that change, COPY TO and INSERT INTO will use the same ExecutionPlan-based approach -- the only difference would be related to appending vs overwriting.

FYI, if you are curious about timing, we plan to finalize and submit to upstream in a week or so.

Copy link
Contributor Author

@alamb alamb May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ozankabak -- I have reviewed https://github.com/synnada-ai/arrow-datafusion/pull/89 and I have thought about how to incorporate the same structure

I really like your idea to use the the same plans for COPY TO and INSERT INTO. After some more thought, I have an idea of how to plan COPY statements using the same plans as an INSERT.

Here is a proposal for a simplified API: #6339

I'll try and bang out a PR shortly

@alamb
Copy link
Contributor Author

alamb commented May 11, 2023 via email

@alamb alamb changed the title Implement initial support for COPY ... TO ... statement RFC: Implement initial support for COPY ... TO ... statement May 11, 2023
@alamb
Copy link
Contributor Author

alamb commented May 15, 2023

i have enough feedback for now and I working to add this functionality in pieces, so no need for this PR now

@alamb alamb closed this May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LogicalPlan COPY TO statement support Implement COPY ... TO statement
2 participants