Skip to content

Fix for dissect processor overeager delimiter consumption #128885

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lukewhiting
Copy link
Contributor

This PR addresses #119264 and the overly greedy consumption of delimiters by the ES implementation of the dissect processor.

This code modifies the implementation to correctly match the spec and only attempts to greedily consume repeated delimiters if the skip right padding operator is used.

In the process of this, 2 new unit tests were created to highlight the current incorrect behaviour and one tests which did not match the specification was modified to correctly reflect the desired behaviour.

To ensure our implementation aligns with the rest of the product stack, logstash-plugins/logstash-filter-dissect#92 was merged to add the new units tests to Logstash's implimentation of Disect where it passed without the issues raised in #119264, further indicating that our current implementation is faulty.

@lukewhiting lukewhiting added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jun 4, 2025
@lukewhiting lukewhiting requested a review from Copilot June 4, 2025 11:03
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses #119264 by fixing the overly greedy consumption of delimiters in the dissect processor implementation to correctly adhere to the spec.

  • Updated the delimiter consumption logic in DissectParser.java to only greedily consume delimiters when the skip right padding operator is active.
  • Modified and added unit tests in DissectParserTests.java to reflect and validate the corrected behavior.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
libs/dissect/src/test/java/org/elasticsearch/dissect/DissectParserTests.java Updated test assertions and added tests for delimiter consumption edge cases.
libs/dissect/src/main/java/org/elasticsearch/dissect/DissectParser.java Modified while loop logic to incorporate skipRightPadding() checks for delimiter consumption.
Comments suppressed due to low confidence (2)

libs/dissect/src/main/java/org/elasticsearch/dissect/DissectParser.java:230

  • [nitpick] Consider adding a clarifying comment to explain that the loop only processes consecutive delimiters when the skip right padding operator is enabled, ensuring the behavior aligns with the updated spec.
while (dissectPair.key.skipRightPadding() && i < input.length) {

libs/dissect/src/test/java/org/elasticsearch/dissect/DissectParserTests.java:319

  • [nitpick] Update expected output has been modified to reflect the revised delimiter consumption behavior; consider adding a comment to detail the rationale behind this test change.
assertMatch("%{+a/1},%{+a/3}-%{+a/2} %{b}", "foo,bar----baz lol", Arrays.asList("a", "b"), Arrays.asList("foo---bazbar", "lol"));

@lukewhiting lukewhiting force-pushed the 119264-disect-delimiter-behaviour branch from fc673d0 to f682972 Compare June 4, 2025 12:50
@lukewhiting lukewhiting changed the title Potential fix Fix for dissect processor overeager delimiter consumption Jun 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants