[SPARK-52817][SQL] Fix Like Expression performance #51510

Closed
wants to merge 4 commits into from

Conversation

zhixingheyi-tian
Contributor

@zhixingheyi-tian zhixingheyi-tian commented Jul 16, 2025

What changes were proposed in this pull request?

Allow Like expressions with multiple consecutive '%' wildcards to be converted to the contains function.

Why are the changes needed?

In some customer cases, users write multiple consecutive '%' characters in a like expression.

For Example:

SELECT * FROM testData where value not like '%%HotFocus%%'
SELECT * FROM testData where value not like '%%%HotFocus%%%'

For these SQL queries, the Like expressions cannot be converted to the contains function during logical planning, so the performance is very poor.
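
To illustrate the gap, here is a minimal sketch, not Spark's actual optimizer code. The object and method names are hypothetical, and escape characters are ignored. It assumes the single-'%' contains regex quoted in this PR's diff context and compares it with a relaxed form that accepts one or more '%' on each side.

```scala
object LikeClassifySketch {
  // Single-'%' form, matching the `contains` regex quoted in the diff context.
  private val containsSingle = "%([^_%]+)%".r
  // Hypothetical relaxed form: one or more '%' on each side of the literal.
  private val containsMulti = "%+([^_%]+)%+".r

  // Returns the literal infix if the whole pattern has the shape %literal%.
  def infixSingle(pattern: String): Option[String] = pattern match {
    case containsSingle(infix) => Some(infix)
    case _ => None
  }

  // Same check, but runs of consecutive '%' are accepted as well.
  def infixMulti(pattern: String): Option[String] = pattern match {
    case containsMulti(infix) => Some(infix)
    case _ => None
  }
}
```

Under these assumptions, `infixSingle("%%HotFocus%%")` finds no match because Scala's regex extractor requires the whole pattern to match, so the query would fall back to general LIKE evaluation, while `infixMulti` extracts the literal `HotFocus`.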

How was this patch tested?

Added new UTs and ran the existing UTs.

@github-actions github-actions bot added the SQL label Jul 16, 2025
@wangyum
Member

wangyum commented Jul 16, 2025

Could you add a test to cover this change?

@zhixingheyi-tian
Contributor Author

Hi @wangyum

I have added UTs.

cc @cloud-fan @baibaichen @dongjoon-hyun

@beliefer
Contributor

Could you add a description?

@cloud-fan
Contributor

Can you provide more context in the PR description? I don't understand what you are doing in this PR.

@zhixingheyi-tian
Contributor Author

@beliefer @cloud-fan

I have added it.

private val endsWith = "%([^_%]+)".r
private val startsAndEndsWith = "([^_%]+)%([^_%]+)".r
private val contains = "%([^_%]+)%".r
private val startsWith = "([^_%]+)%+".r
Contributor

@cloud-fan cloud-fan Jul 17, 2025

So a single % is the same as more than one %? Can we leave a code comment to explain this case?

Contributor Author

According to automata theory, consecutive wildcard characters are equivalent to a single wildcard character.

I have added the code comment.
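
The equivalence can be checked with a small sketch. This is an illustration only, not code from this PR: the object and its methods are hypothetical, escape characters are ignored, and the LIKE pattern is translated to a Java regex ('%' becomes `.*`, '_' becomes `.`) purely for comparison.

```scala
import java.util.regex.Pattern

object WildcardCollapseSketch {
  // Collapse each run of consecutive '%' into a single '%'.
  def collapse(like: String): String = like.replaceAll("%+", "%")

  // Translate a LIKE pattern to a Java regex:
  // '%' matches any sequence of characters, '_' matches exactly one character,
  // and every other character is matched literally.
  def toRegex(like: String): Pattern = {
    val sb = new StringBuilder
    like.foreach {
      case '%' => sb.append(".*")
      case '_' => sb.append(".")
      case c => sb.append(Pattern.quote(c.toString))
    }
    Pattern.compile(sb.toString, Pattern.DOTALL)
  }

  // True if the whole input matches the LIKE pattern.
  def matches(like: String, input: String): Boolean =
    toRegex(like).matcher(input).matches()
}
```

Because '%' already matches zero or more characters, `matches(p, s)` and `matches(collapse(p), s)` agree for every input `s`, which is why a pattern such as '%%%HotFocus%%%' can be treated like '%HotFocus%'.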

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-52817][SQL] Fix Like Expression performance [SPARK-52817][SQL] Fix Like Expression performance Jul 18, 2025
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

cc @peter-toth

@wangyum wangyum closed this in 72ce64e Jul 19, 2025
@wangyum
Member

wangyum commented Jul 19, 2025

Merged to master.

haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
Closes apache#51510 from zhixingheyi-tian/fix-like.

Authored-by: zhixingheyi-tian <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
6 participants