feat(sql): support inserts with default constraints #9844

Merged
merged 17 commits into from
Aug 25, 2024

IndexSeek
Member

Description of changes

This change supports inserting into a table whose columns carry DEFAULT values, in which case the incoming object may have fewer columns than the target table schema.

It's common to insert into tables with DEFAULT constraints; in many cases, these are sequences, and it is a bit tricky to provide the sequence "nextval" approach today.

This topic was initially brought up in Zulip.

Here is an example of this in practice:

import ibis
import pandas as pd

con = ibis.duckdb.connect()
con.raw_sql("CREATE TABLE example (a INTEGER DEFAULT 1, b VARCHAR NOT NULL);")

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": ["foo", "bar", "baz"],
    }
)

Backend error

If the user omits a required column (e.g., NOT NULL without a DEFAULT) from the object being inserted, the backend rejects the insert and raises its own error.

con.insert("example", df[["a"]])

Successful insert with excluded columns

con.insert("example", df[["b"]])
con.table("example").to_pandas()
a b
1 foo
1 bar
1 baz
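
For readers without an Ibis backend at hand, the same two outcomes can be reproduced with only the standard library. This is a minimal sketch that uses sqlite3 in place of DuckDB (an assumption made purely for illustration; it is not part of this PR) and writes by hand the column-list INSERT that the change is meant to generate:

```python
import sqlite3

# In-memory SQLite database standing in for the DuckDB connection above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE example (a INTEGER DEFAULT 1, b VARCHAR NOT NULL)")

# Listing only "b" in the insert lets the database fill "a" from its DEFAULT.
con.executemany(
    "INSERT INTO example (b) VALUES (?)",
    [("foo",), ("bar",), ("baz",)],
)
rows = con.execute("SELECT a, b FROM example ORDER BY rowid").fetchall()

# Omitting the NOT NULL column "b" is rejected by the database itself.
try:
    con.execute("INSERT INTO example (a) VALUES (?)", (1,))
    error = None
except sqlite3.IntegrityError as exc:
    error = exc
```

After running this, `rows` holds three rows with `a` set to 1 by the default, and `error` holds the constraint violation raised by SQLite.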

I'm still working through the tests for this, but wanted to put this out here in case anyone wanted to take a look before I can return to it.

@IndexSeek IndexSeek force-pushed the feat/insert-provided-columns-only branch from fc06dbe to 5c6d1eb August 16, 2024 01:07
@IndexSeek
Member Author

IndexSeek commented Aug 16, 2024

I have made more progress on the tests. I got the Oracle backend passing, but it required a hacky solution to ensure the identifier was being quoted.

import ibis.backends.sql.compilers as sc

quoted = getattr(sc, con.dialect.__name__.lower()).compiler.quoted

It works, but I wonder if there is a better way.

I will try to look into the backends that are still failing: ClickHouse, Trino, Druid, and Exasol.

Edit:
I was able to get the ClickHouse query to compile by suffixing it with ORDER BY "a", but I noticed in another line that create table was not implemented, so I will mark this as notimpl as well.

@IndexSeek IndexSeek force-pushed the feat/insert-provided-columns-only branch from ade767e to dc4a244 August 23, 2024 00:35
columns=[
sg.to_identifier(col, quoted=quoted)
for col in columns
if col in source_cols
Member

Let's make a variable for source.schema and then use col in variable to do this lookup.

Member Author

I believe the change I have made reduces the time complexity, since it no longer needs to check each column against the source_cols list. I have rewritten things a bit here, using a variable named columns to hold the column list for the insert SQL expression.

@IndexSeek IndexSeek force-pushed the feat/insert-provided-columns-only branch from b302ab5 to 258271f August 24, 2024 15:10
@IndexSeek
Member Author

IndexSeek commented Aug 24, 2024

I swapped things around here to use issubset - my idea was that if the source contains fewer columns than the target, but the column names are the same, that's fine, as we can use the source's columns in the insert list.

issubset will also evaluate to True if the columns are the same in both source and target, regardless of order. In this scenario, we want to use the order of source, so that the SQL expression will be written properly.

So if my source schema is:

ibis.Schema {
  b  string
  a  int64
}

But my target schema is:

ibis.Schema {
  a  int64
  b  string
}

This is okay, because the underlying SQL expression comes out to:

INSERT INTO target (b, a)
SELECT * 
FROM source;

And if the target schema has a default constraint on column "a" and our source schema looks like

ibis.Schema {
  b  string
}

the query will be written like so:

INSERT INTO target (b)
SELECT * 
FROM source;
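
The column selection described above can be sketched in plain Python. The helper names `insert_columns` and `insert_sql` are hypothetical and are not the functions in this PR; falling back to the target's column order when the source is not a subset is also an assumption of this sketch, and identifier quoting is left out for brevity:

```python
def insert_columns(source_cols, target_cols):
    """Pick the column list for the INSERT statement (hypothetical helper).

    If every source column exists in the target, use the source columns in
    source order, so a plain SELECT * from the source lines up with the list.
    Otherwise fall back to the target's columns (an assumption of this sketch).
    """
    if set(source_cols).issubset(target_cols):
        return list(source_cols)
    return list(target_cols)


def insert_sql(table, source_cols, target_cols):
    """Render the INSERT ... SELECT statement for the chosen column list."""
    cols = ", ".join(insert_columns(source_cols, target_cols))
    return f"INSERT INTO {table} ({cols}) SELECT * FROM source"


# Same columns in a different order: the source order wins.
reordered = insert_sql("target", ["b", "a"], ["a", "b"])

# Source omits "a", which the target fills from its DEFAULT constraint.
partial = insert_sql("target", ["b"], ["a", "b"])
```

Using the source order matters because `SELECT *` returns columns in the source's order, so the insert list has to match it position by position.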

@IndexSeek IndexSeek marked this pull request as ready for review August 24, 2024 17:17
@IndexSeek
Member Author

I have marked this one ready for review, but I have a couple of questions.

  1. The Oracle backend is still failing. I wasn't sure whether SQLGlot offers a clean way to quote the identifiers in the Create expression returned by parse_one, or whether I should instead adjust the ct_sql variable to enclose the object identifiers in double quotes. My concern is that the Snowflake backend may hit a similar failure when the cloud runs are triggered.
  2. Druid is marked "notyet," but it fails because raw_sql is not supported. Should I include it in the list with Exasol?

@IndexSeek IndexSeek changed the title feat(sql): support inserts with default constraints [WIP] feat(sql): support inserts with default constraints Aug 24, 2024
Review thread on ibis/backends/tests/test_client.py (outdated, resolved)
@IndexSeek IndexSeek force-pushed the feat/insert-provided-columns-only branch from 8c0a4b8 to cabd583 August 24, 2024 20:58
@IndexSeek IndexSeek requested a review from cpcloud August 25, 2024 00:21
@cpcloud
Member

cpcloud commented Aug 25, 2024

I'll add an xfail marker to the Druid backend, since it doesn't support CREATE TABLE.

@cpcloud
Member

cpcloud commented Aug 25, 2024

Oh, looks like you did that already. Let me see what's failing then.

@cpcloud
Member

cpcloud commented Aug 25, 2024

@IndexSeek I improved the implementation a bit: we can use our Schema.keys() method (whose output behaves like a set) instead of constructing a set and calling issubset with a sequence (which itself will construct a set if the input argument isn't already a set).

This avoids constructing two throwaway sets, the effects of which can show up in wide-table use cases.
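
The set-like behavior mentioned here can be illustrated with ordinary dictionaries, whose `keys()` views support subset comparison directly. This sketch uses plain dicts as stand-ins for Ibis schemas (an assumption for illustration only) and does not reproduce the code in this PR:

```python
# Plain dicts standing in for the source and target schemas.
target = {"a": "int64", "b": "string"}
partial_source = {"b": "string"}
reordered_source = {"b": "string", "a": "int64"}
unrelated_source = {"c": "float64"}

# dict.keys() returns a set-like view, so <= tests "is a subset of"
# without building an intermediate set for either operand.
partial_ok = partial_source.keys() <= target.keys()
reordered_ok = reordered_source.keys() <= target.keys()
unrelated_ok = unrelated_source.keys() <= target.keys()

# Iterating the view still yields the source's own column order.
source_order = list(reordered_source.keys())
```

The view keeps insertion order when iterated, so the same object serves both the subset check and the ordered column list.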

@cpcloud cpcloud added this to the 9.4 milestone Aug 25, 2024
@cpcloud cpcloud added the feature (Features or general enhancements), ddl (Issues related to creating or altering data definitions), and sql (Backends that generate SQL) labels Aug 25, 2024
@IndexSeek
Member Author

> @IndexSeek I improved the implementation a bit: we can use our Schema.keys() method (whose output behaves like a set) instead of constructing a set and calling issubset with a sequence (which itself will construct a set if the input argument isn't already a set).
>
> This avoids constructing two throwaway sets, the effects of which can show up in wide-table use cases.

This is a very nice improvement. Thank you for the assistance here and the explanation, @cpcloud!

@cpcloud cpcloud added the ci-run-cloud (Run BigQuery, Snowflake, Databricks, and Athena backend tests) label Aug 25, 2024
@cpcloud
Member

cpcloud commented Aug 25, 2024

Running the cloud backend test suite, then if that's all green this is good to merge!

@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud (Run BigQuery, Snowflake, Databricks, and Athena backend tests) label Aug 25, 2024
try:
db = getattr(con, "current_database", None)
except NotImplementedError:
db = None
Member

I'm going to put up a separate PR to fix the reason why this is so gross.
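
One likely reason the try/except is needed at all: the default passed to `getattr` only covers a missing attribute (AttributeError), so a property that raises NotImplementedError still propagates. The sketch below shows that Python behavior with two hypothetical backend classes; they are illustrative stand-ins, not Ibis classes:

```python
class NoDatabaseConcept:
    """Hypothetical backend whose property exists but is not implemented."""

    @property
    def current_database(self):
        raise NotImplementedError("this backend has no current database")


class NoAttribute:
    """Hypothetical backend that lacks the attribute entirely."""


# A missing attribute is handled by the getattr default.
db_missing = getattr(NoAttribute(), "current_database", None)

# A property that raises NotImplementedError bypasses the default,
# so the call has to be wrapped as in the snippet under review.
try:
    db_unimplemented = getattr(NoDatabaseConcept(), "current_database", None)
except NotImplementedError:
    db_unimplemented = None
```

Both variables end up as None, but only the second case requires the explicit exception handler.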

@cpcloud cpcloud added the ci-run-cloud (Run BigQuery, Snowflake, Databricks, and Athena backend tests) label Aug 25, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud (Run BigQuery, Snowflake, Databricks, and Athena backend tests) label Aug 25, 2024
@cpcloud cpcloud merged commit 86a3c06 into ibis-project:main Aug 25, 2024
87 checks passed
@IndexSeek IndexSeek deleted the feat/insert-provided-columns-only branch August 26, 2024 01:27