
Implement BufferedTokenizer to return an iterable that can verify size limit for every token emitted #17229

Open · wants to merge 19 commits into main from fix/bufftok_to_return_itereator
Conversation

@andsel (Contributor) commented on Mar 5, 2025

Release notes

Reimplements BufferedTokenizer to leverage pure Java classes instead of the JRuby runtime's classes.

What does this PR do?

Reimplements BufferedTokenizerExt in pure Java using an iterable, leaving BufferedTokenizerExt as a thin shell around the new class.

The principal method, extract, which splits the data by separator, now returns an Iterable instead of a RubyArray. The Iterable wraps a chain of two iterators.
The first iterator (DataSplitter) accumulates data in a StringBuilder and splits it by separator.
The second iterator, cascaded after the first, validates that the size of each returned token respects the sizeLimit parameter.

To remain compliant with usage patterns that expect an empty? method on the object returned by extract, like this, the extract method of BufferedTokenizerExt returns a custom Iterable adapter class providing such a method, sketched below.

On the test side, the code that tested BufferedTokenizerExt was moved to test the new BufferedTokenizer, so some test classes were renamed:

  • BufferedTokenizerExtTest mostly becomes BufferedTokenizerTest, though a small BufferedTokenizerExtTest remains to cover charset-conversion use cases.
  • BufferedTokenizerExtWithDelimiterTest -> BufferedTokenizerWithDelimiterTest
  • BufferedTokenizerExtWithSizeLimitTest -> BufferedTokenizerWithSizeLimitTest
  • the test used to verify the overflow condition, givenTooLongInputExtractDoesntOverflow (code ref), was removed because it is no longer applicable.

On the benchmarking side:

  • BufferedTokenizerExtBenchmark -> BufferedTokenizerBenchmark, adapted to the new tokenizer class.

As the benchmark reports in the Logs section show, this PR provides roughly a 6x improvement over the previous implementation.

Why is it important/What is the impact to the user?

As a developer, I want the BufferedTokenizer implementation to be simpler than the existing one.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

How to test this PR locally

Run the same tests as in:

bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"
  • then use the following script to send the £ sign in Latin-1:
require 'socket'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

text = "\xA3" # the £ symbol in ISO-8859-1 aka Latin-1
text.force_encoding("ISO-8859-1")
socket.puts(text)

socket.close

Related issues

Logs

Benchmarks

The benchmarks were updated to run for 3 seconds (instead of 100 ms) and to report in milliseconds (instead of nanoseconds).

baseline

Ran with:

./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerExtBenchmark.*"
Benchmark                                                               Mode  Cnt     Score   Error   Units
BufferedTokenizerExtBenchmark.multipleTokenPerFragment                 thrpt   10   553.913 ± 6.223  ops/ms
BufferedTokenizerExtBenchmark.multipleTokensCrossingMultipleFragments  thrpt   10   222.815 ± 4.411  ops/ms
BufferedTokenizerExtBenchmark.onlyOneTokenPerFragment                  thrpt   10  1549.777 ± 9.237  ops/ms

this PR

Ran with:

./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerBenchmark.*"
Benchmark                                                            Mode  Cnt     Score     Error   Units
BufferedTokenizerBenchmark.multipleTokenPerFragment                 thrpt   10  3308.716 ± 167.549  ops/ms
BufferedTokenizerBenchmark.multipleTokensCrossingMultipleFragments  thrpt   10  1245.505 ±  52.843  ops/ms
BufferedTokenizerBenchmark.onlyOneTokenPerFragment                  thrpt   10  9468.777 ± 182.184  ops/ms


mergify bot commented Mar 5, 2025

This pull request does not have a backport label. Could you fix it @andsel? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.


mergify bot commented Mar 5, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Mar 5, 2025
@andsel andsel removed the backport-8.x Automated backport to the 8.x branch with mergify label Mar 5, 2025
@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Mar 5, 2025
@andsel andsel self-assigned this Mar 5, 2025
@andsel andsel changed the title from "Fix/bufftok to return itereator" to "Implement BufferedTokenizer to return an iterable that can verify size limit for every token emitted" on Mar 6, 2025
@andsel andsel marked this pull request as ready for review March 12, 2025 14:29
@andsel andsel force-pushed the fix/bufftok_to_return_itereator branch 2 times, most recently from 1fcb6a8 to 3fe0d5a on March 31, 2025 10:53
@yaauie yaauie self-requested a review April 9, 2025 20:10
@yaauie (Member) left a comment:

I haven't fully-validated yet, but wanted to pass on some bits from my first pass:

  • BufferedTokenizerExt§IterableAdapterWithEmptyCheck#isEmpty is inverted
  • specs can be improved (yaauie@b524a67) with a custom matcher that validates both empty? (which maps to isEmpty) and entries (which is provided by the jruby shim extending java-Iterator with RubyEnumerable)

@yaauie (Member) left a comment:

I think this is on the right track, and appreciate the clean-Java implementation.

While the previous implementations have not been thread-safe and had undefined behaviour when contending threads invoked BufferedTokenizer#extract and/or BufferedTokenizer#flush, making BufferedTokenizer#extract return a lazy iterator introduces some risk, as interacting with that iterator mutates the underlying buffer.

Looking at all of the current uses of FileWatch::BufferedTokenizer in core and plugins, I don't see this as a significant risk, but if we wanted to mitigate it we would need to synchronize all of the methods on BufferedTokenizer§DataSplitter that deal with mutable state.


I've added some notes about reducing overhead, correctly reporting when the buffer is non-empty with unprocessed bytes, and clearing the accumulator during a flush operation.

    }

    @JRubyMethod(name = "empty?")
    public IRubyObject isEmpty(final ThreadContext context) {
-       return RubyUtil.RUBY.newBoolean(headToken.toString().isEmpty() && (inputSize == 0));
+       return RubyUtil.RUBY.newBoolean(tokenizer.isEmpty());
@yaauie (Member) commented:

🤔 previously FileWatch::BufferedTokenizer#empty? returned true if there was unterminated input in the buffer, but now it doesn't.

@andsel (Contributor, Author) replied on Apr 10, 2025:

The original implementation stated that the token was empty,

so inputSize is 0 iff the head token is empty and no token fragments were provided on input.

So, if I'm not wrong, isEmpty being true effectively states that no token parts are available.
The change proposed in this PR instead returns true when no more tokens are available; with respect to the original semantics, that is a slip in the implementation which will be fixed.

@yaauie (Member) replied:

🤔

Before this change, the return value of FileWatch::BufferedTokenizer#empty? was not deterministic while FileWatch::BufferedTokenizer#extract was being invoked, but was deterministic outside of that. Since FileWatch::BufferedTokenizer#extract always consumed all terminated tokens, FileWatch::BufferedTokenizer#empty? only needed to consider the remaining unterminated buffer.

The proposed change in BufferedTokenizer#isEmpty changes that: it says that the BufferedTokenizer is empty if the terminated tokens in the iterator have all been consumed (and does not consider the unterminated buffer).

My proposed BufferedTokenizer§DataSplitter#isBufferEmpty() here (and the wiring through to BufferedTokenizer#isEmpty here) considers the unconsumed input in the accumulator, which effectively considers both unconsumed terminated tokens and any trailing unterminated buffer.


Worth noting:

Just as the return value wasn't deterministic while BufferedTokenizer#extract was being invoked before, it's now also not stable while iterating over the newly-lazy iterator. I think that's an expected side-effect; if a caller wants it to be stable before iterating, then they can first send it Enumerable#entries to consume all available tokens into an array.

    public String flush() {
        return accumulator.substring(currentIdx);
    }

@yaauie (Member) commented on Apr 9, 2025:

If we need to report that the BufferedTokenizer is not empty when it has unprocessed bytes in its buffer, but the DataSplitter, which implements Iterator<String>, is passed through the JRuby bridge, I'm wary of confusing an isEmpty() that means !hasNext() with one that means there is unprocessed data in the buffer.

Here's a DataSplitter#isBufferEmpty() that should make it clearer exactly what it means.

        // considered empty if caught up to the accumulator
        public boolean isBufferEmpty() {
            return currentIdx <= accumulator.length();
        }

@andsel (Contributor, Author) replied:

Shouldn't it be that currentIdx has reached the end of the accumulator?

@yaauie (Member) replied:

Lol. Yes. Wrong direction.

Even though currentIdx should never be > accumulator.length() due to the rest of the implementation, I elected to use >= here for safety.

Suggested change:

        // considered empty if caught up to the accumulator
        public boolean isBufferEmpty() {
-           return currentIdx <= accumulator.length();
+           return currentIdx >= accumulator.length();
        }

    }

    public boolean isEmpty() {
        return !dataSplitter.hasNext();
@yaauie (Member) commented:

Suggested change:

-       return !dataSplitter.hasNext();
+       return dataSplitter.isBufferEmpty();

andsel added 3 commits April 10, 2025 16:43
…, but an OOM error is thrown from JDK libraries if an int overflow happens.
- specs improved (yaauie/logstash@b524a67) with a custom matcher that validates both `empty?` (which maps to isEmpty) and `entries` (which is provided by the jruby shim extending java-Iterator with RubyEnumerable)
@andsel andsel force-pushed the fix/bufftok_to_return_itereator branch from 3fe0d5a to 9741517 on April 10, 2025 15:33
@elasticmachine (Collaborator) commented:

💚 Build Succeeded

History

cc @andsel

Labels: backport-8.x (Automated backport to the 8.x branch with mergify), enhancement

Successfully merging this pull request may close these issues:

  • BufferedTokenizerExt applies sizeLimit check only of first token of input fragment