Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata wrongly used in INSERT INTO statement column lineage #648

Closed
JoelJan opened this issue Sep 3, 2024 · 9 comments
Closed

Metadata wrongly used in INSERT INTO statement column lineage #648

JoelJan opened this issue Sep 3, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@JoelJan
Copy link

JoelJan commented Sep 3, 2024

Describe the bug

  • In SQL INSERT INTO statement, the metadata provided with DummyMetadataProvider results in incorrect column-level lineage when the order of columns in the metadata does not match the column order in the INSERT statement.

SQL

INSERT INTO analyze
(
    a,
    c,
    b
)
SELECT 
    a,
    c,
    b
FROM foo;

To Reproduce
Note here we refer to SQL provided in prior step as stored in a file named test.sql

  • API (Application Programming Interface: provide the python code you're using and the output.
    For example:
from sqllineage.runner import LineageRunner
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
with open("test.sql") as f:
    sql = f.read()
result = LineageRunner(sql, metadata_provider = DummyMetadataProvider({analyze.a,analyze.b,analyze.c}))
print(result.get_column_lineage())

Output:

[(Column: <default>.foo.a, Column: <default>.analyze.a), (Column: <default>.foo.c, Column: <default>.analyze.b), (Column: <default>.foo.b, Column: <default>.analyze.c)]

Expected behavior
The INSERT INTO statement with a list of columns should map columns by name, not by order. The expected lineage should correctly map as follows:

 [(Column: <default>.foo.a, Column: <default>.analyze.a), (Column: <default>.foo.c, Column: <default>.analyze.c), (Column: <default>.foo.b, Column: <default>.analyze.b)]

This result is consistent with the behavior when the metadata provider is omitted.

Python version (available via python --version)

  • 3.11.0rc1

SQLLineage version (available via sqllineage --version):

  • 1.5.3

Additional context
I tested the SQL behavior on Oracle, and it correctly inserts values by column names, not by order.

@JoelJan JoelJan added the bug Something isn't working label Sep 3, 2024
@reata
Copy link
Owner

reata commented Feb 7, 2025

Can you provide some additional info on {analyze.a,analyze.b,analyze.c}?

They're referring to variable that doesn't exist so I cannot reproduce your case. Also DummyMetaDataProvider is supposed to accept dict as input, not set.

@reata reata added question Further information is requested and removed bug Something isn't working labels Feb 7, 2025
@JoelJan
Copy link
Author

JoelJan commented Feb 7, 2025

Sorry, for my mistake. I guess it should be {"analyze":"a", "analyze"':"b", "analyze":"c"}. Just definition of table analyze with three columns, a, b and c.

@reata
Copy link
Owner

reata commented Feb 7, 2025

Seems I still can't reproduce it. Here's the changed runnable python script:

from sqllineage.runner import LineageRunner
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
with open("test.sql") as f:
    sql = f.read()
result = LineageRunner(sql, metadata_provider = DummyMetaDataProvider({"analyze": ["a", "b", "c"]}))
print(result.get_column_lineage())

with the sql you provided, the output is as expected.

[(Column: <default>.foo.a, Column: <default>.analyze.a), (Column: <default>.foo.b, Column: <default>.analyze.b), (Column: <default>.foo.c, Column: <default>.analyze.c)]

@JoelJan
Copy link
Author

JoelJan commented Feb 7, 2025

Found my old code to test it more. The issue is only there with owner/schema being included.

INSERT INTO oracle_owner.analyze
(
    a,
    c,
    b
)
SELECT 
    a,
    c,
    b
FROM foo;
from sqllineage.runner import LineageRunner
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
with open("test.sql") as f:
    sql = f.read()
result = LineageRunner(sql, metadata_provider = DummyMetaDataProvider({"oracle_owner.analyze": ["a", "b", "c"]}))
print(result.get_column_lineage())

@reata
Copy link
Owner

reata commented Feb 7, 2025

Well I don't see how output could be like this, where c and b are not correct mapping:

[(Column: <default>.foo.a, Column: <default>.analyze.a), (Column: <default>.foo.c, Column: <default>.analyze.b), (Column: <default>.foo.b, Column: <default>.analyze.c)]

It's consistently the result below with lineage being correct and order being a,b,c:

[(Column: <default>.foo.a, Column: <default>.analyze.a), (Column: <default>.foo.b, Column: <default>.analyze.b), (Column: <default>.foo.c, Column: <default>.analyze.c)]

Or do you expect the lineage like above and the order being a,c,b?

[(Column: <default>.foo.a, Column: <default>.analyze.a), (Column: <default>.foo.c, Column: <default>.analyze.c), (Column: <default>.foo.c, Column: <default>.analyze.c)]

@JoelJan
Copy link
Author

JoelJan commented Feb 7, 2025

I don´t think you added owner/schema. In your last result there is still "default.analyze". My result look like this:

[(Column: <default>.foo.a, Column: oracle_owner.analyze.a), (Column: <default>.foo.c, Column: oracle_owner.analyze.b), (Column: <default>.foo.b, Column: oracle_owner.analyze.c)]

@reata
Copy link
Owner

reata commented Feb 7, 2025

Oh, I see it now! Thanks. Let me spend some time to figure it out tomorrow.

@JoelJan
Copy link
Author

JoelJan commented Feb 7, 2025

Great! Thanks a lot. Sorry for the non-executable code.

@reata
Copy link
Owner

reata commented Feb 8, 2025

It turns out this is already fixed in master branch. I will release v1.5.4 today.

@reata reata closed this as completed Feb 8, 2025
@reata reata added bug Something isn't working and removed question Further information is requested labels Feb 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants