Skip to content

Find a safe alternative to LogicalPlan::using_columns() #14118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jonahgao opened this issue Jan 14, 2025 · 5 comments
Open

Find a safe alternative to LogicalPlan::using_columns() #14118

jonahgao opened this issue Jan 14, 2025 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@jonahgao
Copy link
Member

jonahgao commented Jan 14, 2025

Describe the bug

Background

Select with wildcard over a USING/NATURAL JOIN should deduplicate join columns.
For example:

with t as(select 1 as a, 2 as b) select * from t t1 join t t2 using(a)

This query above should output the column 'a' only once.

LogicalPlan::using_columns() is used to find these join columns and to help exclude duplicated columns when expanding wildcards.

Problem

using_columns() works by traversing the plan tree. This manner might be unsafe as it could incorrectly find columns that are not relevant to the current SQL context. This may lead to some output columns being incorrectly excluded.

For example, the result of the query below is different from other databases.

create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;

To Reproduce

Run query in CLI (compiled from the latest main: 722307f)

> create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
0 row(s) fetched.
Elapsed 0.008 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.015 seconds.

++
++
++
3 row(s) fetched.
Elapsed 0.013 seconds.

It outputs no columns.

Expected behavior

In PostgreSQL it does output one column.

psql (16.6 (Ubuntu 16.6-0ubuntu0.24.04.1))
Type "help" for help.

psql=> create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
CREATE TABLE
INSERT 0 3
 a
---
 3
 4
 5
(3 rows)

Additional context

No response

@jonahgao jonahgao added the bug Something isn't working label Jan 14, 2025
@logan-keede
Copy link
Contributor

logan-keede commented Jan 16, 2025

Okay, this definitely sent me on a ride but here are the results after multiple logging statement, function analysis and postgresql's planner analysis, most of which were rather useless the bug stems from what should be an incorrect regex (have not found the code for it yet).

Why do I think so?

CREATE TABLE t(a int) AS VALUES
    (1),
    (2),
    (3);
select * from (select t.a+2 as a from t join t bc using(a)) as bc;

-->
+---+
| a |
+---+
| 3 |
| 4 |
| 5 |
+---+

it seems like this bug occurs when beginning of alias matches with actual table name(not vice versa) and culprit code ends up excluding all the columns
further inspection reveals that using_columns function seems to be working fine for this as it returns

[{Column { relation: Some(Bare { table: "t" }), name: "a" }, Column { relation: Some(Bare { table: "t2" }), name: "a" }}]

another function exclude_using_columns()(which finds which columns should be excluded) returns
{Column { relation: Some(Bare { table: "t2" }), name: "a" }}

which also seems fine the problematic part should be the piece of code where this exclusion actually happens(MIA)

btw this is the output of logger when we use bc alias instead of t2

[{Column { relation: Some(Bare { table: "t" }), name: "a" }, Column { relation: Some(Bare { table: "bc" }), name: "a" }}]
{Column { relation: Some(Bare { table: "t" }), name: "a" }}

and bug does not occur when alias is t but tablename is t2
Please assign this issue to me, let me know if I am missing something as I am rather new to this code base or if you have any idea where I can find my "patient" code.
Thanks,
Logan

@logan-keede
Copy link
Contributor

Hi @jonahgao
I think I have found a potential solution, which is kinda weird

Solution:-
In exprlist_to_fields() do not exclude columns from wildcard_schema more specifically do not explicitly call exclude_using_columns()

Rationale:-
wildcard_schema already does not have any extra columns most probably due to how schema and plan are built (atleast in this case), trying to exclude_using_columns sometime result in deletion of a necessary column with no duplicate depending on naming of tables and aliases.

I am not sure if this will affect some other cases or not, but even if it does it will only show extra columns which is better than not showing any column at all.

also this does not seem to affect tests negatively, so should I go with this or not?(I would put a PR by tomorrow, if you would like to see that first.)
Thanks,
Logan

@jonahgao
Copy link
Member Author

jonahgao commented Jan 17, 2025

Hi @logan-keede , thanks for your investigation.

Solution:-
In exprlist_to_fields() do not exclude columns from wildcard_schema more specifically do not explicitly call exclude_using_columns()

exclude_using_columns is necessary, for example, for the following query, we need to ensure that it outputs only one a column.

DataFusion CLI v44.0.0

> create table t1(a int, b int);

> create table t2(a int, c int);

> select * from t1 join t2 using(a);
+---+---+---+
| a | b | c |
+---+---+---+
+---+---+---+

I am not sure if this will affect some other cases or not, but even if it does it will only show extra columns which is better than not showing any column at all.

I think displaying extra columns may introduce more bugs, especially since the case in this issue is quite rare

@logan-keede
Copy link
Contributor

logan-keede commented Jan 17, 2025

I thought so, I looked at some other avenues,

Here are some things that might help anyone trying to solve this in future.
Problem:-
exclude_using_columns return the column from lexicographical largest table so if the join is between two table t2 and t then it will return common columns from t2

If we make the sub-query alias name t2 too it tries to exclude a from that table too.

Possible Solution:-
Currently excluded_columns from sub-query is leaking into super query.

A normal statement without join should not have the need to use exclude_using_columns, and even if it does, it should not need to look at sub-queries within it since they have already excluded redundant columns.

Possible implementation Strategies:-

  1. After calculating sub-query The expression should become logical equivalent of select * from t2.
  2. Make exclude using columns shallower/non-recursive i.e. do not let it search for redundant column in sub-query.

@jonahgao, please let me know if you think this approach might work. Regardless, I would like to be unassigned from this issue, as I believe it is beyond my current capabilities. I plan to look for more beginner-friendly issues and spend some time familiarizing myself with the codebase first.

PS: I might have circled back to the original issue/Problem, regardless I hope the process contributed something. ^_^

PS2: I think #1468 has the potential to resolve this issue though it does not seem like we can expect it anytime soon.

@jonahgao
Copy link
Member Author

  1. Make exclude using columns shallower/non-recursive i.e. do not let it search for redundant column in sub-query.

I think it is the correct approach. Since unnamed subqueries do not appear in the logical plan tree, we need to do this check earlier. The fix likely requires some refactoring of the existing code, which is not friendly for people unfamiliar with the codebase. Unassigned and thanks @logan-keede again ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants