Find a safe alternative to `LogicalPlan::using_columns()` #14118

jonahgao · 2025-01-14T02:43:20Z

Describe the bug

Background

Select with wildcard over a USING/NATURAL JOIN should deduplicate join columns.
For example:

with t as(select 1 as a, 2 as b) select * from t t1 join t t2 using(a)

This query above should output the column 'a' only once.

LogicalPlan::using_columns() is used to find these join columns and to help exclude duplicated columns when expanding wildcards.

Problem

using_columns() works by traversing the plan tree. This manner might be unsafe as it could incorrectly find columns that are not relevant to the current SQL context. This may lead to some output columns being incorrectly excluded.

For example, the result of the query below is different from other databases.

create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;

To Reproduce

Run query in CLI (compiled from the latest main: 722307f)

> create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
0 row(s) fetched.
Elapsed 0.008 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.015 seconds.

++
++
++
3 row(s) fetched.
Elapsed 0.013 seconds.

It outputs no columns.

Expected behavior

In PostgreSQL it does output one column.

psql (16.6 (Ubuntu 16.6-0ubuntu0.24.04.1))
Type "help" for help.

psql=> create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
CREATE TABLE
INSERT 0 3
 a
---
 3
 4
 5
(3 rows)

Additional context

No response

The text was updated successfully, but these errors were encountered:

logan-keede · 2025-01-16T12:46:23Z

Okay, this definitely sent me on a ride but here are the results after multiple logging statement, function analysis and postgresql's planner analysis, most of which were rather useless the bug stems from what should be an incorrect regex (have not found the code for it yet).

Why do I think so?

CREATE TABLE t(a int) AS VALUES
    (1),
    (2),
    (3);
select * from (select t.a+2 as a from t join t bc using(a)) as bc;

-->
+---+
| a |
+---+
| 3 |
| 4 |
| 5 |
+---+

it seems like this bug occurs when beginning of alias matches with actual table name(not vice versa) and culprit code ends up excluding all the columns
further inspection reveals that using_columns function seems to be working fine for this as it returns

[{Column { relation: Some(Bare { table: "t" }), name: "a" }, Column { relation: Some(Bare { table: "t2" }), name: "a" }}]

another function exclude_using_columns()(which finds which columns should be excluded) returns
{Column { relation: Some(Bare { table: "t2" }), name: "a" }}

which also seems fine the problematic part should be the piece of code where this exclusion actually happens(MIA)

btw this is the output of logger when we use bc alias instead of t2

[{Column { relation: Some(Bare { table: "t" }), name: "a" }, Column { relation: Some(Bare { table: "bc" }), name: "a" }}]
{Column { relation: Some(Bare { table: "t" }), name: "a" }}

and bug does not occur when alias is t but tablename is t2
Please assign this issue to me, let me know if I am missing something as I am rather new to this code base or if you have any idea where I can find my "patient" code.
Thanks,
Logan

logan-keede · 2025-01-16T21:30:19Z

Hi @jonahgao
I think I have found a potential solution, which is kinda weird

Solution:-
In exprlist_to_fields() do not exclude columns from wildcard_schema more specifically do not explicitly call exclude_using_columns()

Rationale:-
wildcard_schema already does not have any extra columns most probably due to how schema and plan are built (atleast in this case), trying to exclude_using_columns sometime result in deletion of a necessary column with no duplicate depending on naming of tables and aliases.

I am not sure if this will affect some other cases or not, but even if it does it will only show extra columns which is better than not showing any column at all.

also this does not seem to affect tests negatively, so should I go with this or not?(I would put a PR by tomorrow, if you would like to see that first.)
Thanks,
Logan

jonahgao · 2025-01-17T02:19:47Z

Hi @logan-keede , thanks for your investigation.

Solution:-
In exprlist_to_fields() do not exclude columns from wildcard_schema more specifically do not explicitly call exclude_using_columns()

exclude_using_columns is necessary, for example, for the following query, we need to ensure that it outputs only one a column.

DataFusion CLI v44.0.0

> create table t1(a int, b int);

> create table t2(a int, c int);

> select * from t1 join t2 using(a);
+---+---+---+
| a | b | c |
+---+---+---+
+---+---+---+

I am not sure if this will affect some other cases or not, but even if it does it will only show extra columns which is better than not showing any column at all.

I think displaying extra columns may introduce more bugs, especially since the case in this issue is quite rare

logan-keede · 2025-01-17T19:15:20Z

I thought so, I looked at some other avenues,

Here are some things that might help anyone trying to solve this in future.
Problem:-
exclude_using_columns return the column from lexicographical largest table so if the join is between two table t2 and t then it will return common columns from t2

If we make the sub-query alias name t2 too it tries to exclude a from that table too.

Possible Solution:-
Currently excluded_columns from sub-query is leaking into super query.

A normal statement without join should not have the need to use exclude_using_columns, and even if it does, it should not need to look at sub-queries within it since they have already excluded redundant columns.

Possible implementation Strategies:-

After calculating sub-query The expression should become logical equivalent of select * from t2.
Make exclude using columns shallower/non-recursive i.e. do not let it search for redundant column in sub-query.

@jonahgao, please let me know if you think this approach might work. Regardless, I would like to be unassigned from this issue, as I believe it is beyond my current capabilities. I plan to look for more beginner-friendly issues and spend some time familiarizing myself with the codebase first.

PS: I might have circled back to the original issue/Problem, regardless I hope the process contributed something. ^_^

PS2: I think #1468 has the potential to resolve this issue though it does not seem like we can expect it anytime soon.

jonahgao · 2025-01-20T02:07:51Z

Make exclude using columns shallower/non-recursive i.e. do not let it search for redundant column in sub-query.

I think it is the correct approach. Since unnamed subqueries do not appear in the logical plan tree, we need to do this check earlier. The fix likely requires some refactoring of the existing code, which is not friendly for people unfamiliar with the codebase. Unassigned and thanks @logan-keede again ❤️

jonahgao added the bug Something isn't working label Jan 14, 2025

jonahgao mentioned this issue Jan 14, 2025

fix: incorrect NATURAL/USING JOIN schema #14102

Merged

jonahgao assigned logan-keede Jan 16, 2025

jonahgao unassigned logan-keede Jan 20, 2025

jonahgao mentioned this issue Feb 7, 2025

Create UNION plan node with correct schema #14380

Open

jonahgao self-assigned this Feb 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Find a safe alternative to `LogicalPlan::using_columns()` #14118

Find a safe alternative to `LogicalPlan::using_columns()` #14118

jonahgao commented Jan 14, 2025 •

edited

Loading

logan-keede commented Jan 16, 2025 •

edited

Loading

logan-keede commented Jan 16, 2025

jonahgao commented Jan 17, 2025 •

edited

Loading

logan-keede commented Jan 17, 2025 •

edited

Loading

jonahgao commented Jan 20, 2025

Find a safe alternative to LogicalPlan::using_columns() #14118

Find a safe alternative to LogicalPlan::using_columns() #14118

Comments

jonahgao commented Jan 14, 2025 • edited Loading

Describe the bug

Background

Problem

To Reproduce

Expected behavior

Additional context

logan-keede commented Jan 16, 2025 • edited Loading

logan-keede commented Jan 16, 2025

jonahgao commented Jan 17, 2025 • edited Loading

logan-keede commented Jan 17, 2025 • edited Loading

jonahgao commented Jan 20, 2025

Find a safe alternative to `LogicalPlan::using_columns()` #14118

Find a safe alternative to `LogicalPlan::using_columns()` #14118

jonahgao commented Jan 14, 2025 •

edited

Loading

logan-keede commented Jan 16, 2025 •

edited

Loading

jonahgao commented Jan 17, 2025 •

edited

Loading

logan-keede commented Jan 17, 2025 •

edited

Loading