Skip to content

Create new empty external table #7228

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
alamb opened this issue Aug 8, 2023 · 7 comments · Fixed by #7276 or #7294
Closed

Create new empty external table #7228

alamb opened this issue Aug 8, 2023 · 7 comments · Fixed by #7276 or #7294
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Aug 8, 2023

Is your feature request related to a problem or challenge?

Thanks to the great work from @metesynnada @devinjdangelo and others, it is now possible to have DataFusion insert data into CSV and JSON tables.

However, the User Experience / UX is tough as it is not easy to write datasets because: External tables require an existing location to exist, even if empty

So for example, if I wanted to write to json files in /tmp/my_table I currently need to create a the target file / directory externally

mkdir /tmp/my_table
echo "" > /tmp/my_table/1.json
datafusion-cli
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.

❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.006 seconds.

❯
\q
$ cat /tmp/my_table/1.json

{"x":1,"y":2}
{"x":3,"y":4}

Describe the solution you'd like

I would like to be able to have datafusion create the targets directly

Something like this, without any setup:

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.

❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.006 seconds.
$ cat /tmp/my_table/1.json

{"x":1,"y":2}
{"x":3,"y":4}

Describe alternatives you've considered

Option 1: Automatically create the target

One option is simply to remove the existence check on CREATE EXTERNAL TABLE

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.

❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.006 seconds.

The downside is that the reason CREATE EXTERNAL TABLE errors if the target doesn't exist is to help people debug errors when reading

Option 2: Add new DDL to CREATE EXTERNAL TABLE

Perhaps we could add a phrase to CREATE EXTERNAL TABLE like FOR WRITE

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table' FOR WRITE;
0 rows in set. Query took 0.002 seconds.

The semantics would be if the target file/directory doesn't already exist, then create it rather than error

Option 3: Add a config parameter to control the default behavior

We could also add a config parameter like

set catalog.auto_create_tables = true;
SET

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table' FOR WRITE;
0 rows in set. Query took 0.002 seconds.

with the same semantics as option 2 (create it target doesn't exist)

Additional context

No response

@alamb alamb added the enhancement New feature or request label Aug 8, 2023
@alamb
Copy link
Contributor Author

alamb commented Aug 8, 2023

I think OPTON 2 would be ideal and the easiest to use. Maybe we could even add a HINT in the message if CREATE TABLE failed with "not found" that FOR WRITE could be added to make a new files

@devinjdangelo
Copy link
Contributor

devinjdangelo commented Aug 8, 2023

I agree UX needs work. An additional problem is ListingTableInsertMode cannot be controlled from SQL or sesision config parameters. So while you can append new files to a table via rust like so:

ctx.register_json(
        "json_table_sink",
        out_path,
        NdJsonReadOptions::default()
            .insert_mode(ListingTableInsertMode::AppendNewFiles)
    )
    .await?;

ctx.sql("insert into json_table_sink ....")

Registering a table via SQL doesn't expose this setting, so it always defaults to ListingTableInsertModeAppendToFile.

I am thinking Option3 makes the most sense to solve these issues. My concern with option2 is we will have to add a lot of new SQL grammar to support these and future write options (e.g. compression, parquet row group size ect...). For parquet read configs, we have a lot of options controlled via session config, and we could follow a similar pattern for default write behaviors (both parquet and non parquet specific write behaviors).

For option 2, maybe some syntax like create external table WITH (...) where (...) is arbitrary config tuples could work. I believe trino has some syntax like this for creating tables.

@flashmouse
Copy link

I would like to implement this feature, I'll try work on it this weekend. for now I prefer to implement option3, any advice are welcome at any time. thx!

@2010YOUY01
Copy link
Contributor

The recent work on data export looks awesome 👀
I'm wondering is the current work (ctx.register_*(); ctx.sql("insert into...")') just building reusable infra for create external table... and the TODO COPY ... TO ... syntax? Is that the case they mostly have the same functionality but with different syntax

@2010YOUY01
Copy link
Contributor

For option 2, maybe some syntax like create external table WITH (...) where (...) is arbitrary config tuples could work. I believe trino has some syntax like this for creating tables.

This syntax looks really nice 👍🏼
Looks like Option 3 can be used for default settings, and this syntax can do local config override for individual queries
DuckDB's COPY ... TO ... syntax has very similar syntax for those arbitrary config tuples
https://duckdb.org/docs/sql/statements/copy

@alamb
Copy link
Contributor Author

alamb commented Aug 10, 2023

I would like to implement this feature, I'll try work on it this weekend. for now I prefer to implement option3, any advice are welcome at any time. thx!

I think option 3 should be a relatively straightforward exercise of:

  1. Adding the new config options
  2. threading them down to the creation of the listing table (look in the source code for the error string)
  3. Writing some tests -- ideally by extending create_external_table.slt -- documentation is in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests

@devinjdangelo
Copy link
Contributor

I did not intend to close this issue with my recent PR. Can we reopen it?

#7276 provides a possible workaround to the issue here, but not a complete solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
4 participants