
Implement BufferedTokenizer to return an iterable that can verify size limit for every token emitted #17229

Open · wants to merge 19 commits into main from fix/bufftok_to_return_itereator
Conversation

@andsel (Contributor) commented on Mar 5, 2025

Release notes

Reimplements BufferedTokenizer to leverage pure Java classes instead of the JRuby runtime's classes.

What does this PR do?

Reimplements BufferedTokenizerExt in pure Java using an iterable, leaving BufferedTokenizerExt as a thin shell around the new class.

The principal method, extract, which splits the data by separator, now returns an Iterable instead of a RubyArray. The Iterable wraps a chain of two iterators.
The first iterator (DataSplitter) accumulates data in a StringBuilder and splits it by separator.
The second iterator, cascaded after the first, validates that the size of each returned token respects the sizeLimit parameter.

To remain compliant with usage patterns that expect an empty? method on the object returned by extract, like this, the extract method of BufferedTokenizerExt returns a custom Iterable adapter class providing such a method, sketched below.

On the test side, the code that tested BufferedTokenizerExt was moved to test the new BufferedTokenizer, so some test classes were renamed:

  • BufferedTokenizerExtTest mostly becomes BufferedTokenizerTest, though a small BufferedTokenizerExtTest remains to cover charset-conversion use cases.
  • BufferedTokenizerExtWithDelimiterTest -> BufferedTokenizerWithDelimiterTest
  • BufferedTokenizerExtWithSizeLimitTest -> BufferedTokenizerWithSizeLimitTest
  • the test used to verify the overflow condition, givenTooLongInputExtractDoesntOverflow (code ref), was removed because it is no longer applicable.

On the benchmarking side:

  • BufferedTokenizerExtBenchmark -> BufferedTokenizerBenchmark, adapted to the new tokenizer class.

As the benchmark reports in the Logs section show, this PR provides roughly a 6x improvement over the previous implementation.

Why is it important/What is the impact to the user?

As a developer, I want the BufferedTokenizer implementation to be simpler than the existing one.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

How to test this PR locally

Run the same tests as in:

bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"
  • then use the following script to send the £ sign in Latin-1:
require 'socket'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

text = "\xA3" # the £ symbol in ISO-8859-1 aka Latin-1
text.force_encoding("ISO-8859-1")
socket.puts(text)

socket.close

Related issues

Logs

Benchmarks

The benchmarks were updated to run for 3 seconds (instead of 100 ms) and to report in milliseconds (instead of nanoseconds).

baseline

Ran with:

./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerExtBenchmark.*"
Benchmark                                                               Mode  Cnt     Score   Error   Units
BufferedTokenizerExtBenchmark.multipleTokenPerFragment                 thrpt   10   553.913 ± 6.223  ops/ms
BufferedTokenizerExtBenchmark.multipleTokensCrossingMultipleFragments  thrpt   10   222.815 ± 4.411  ops/ms
BufferedTokenizerExtBenchmark.onlyOneTokenPerFragment                  thrpt   10  1549.777 ± 9.237  ops/ms

this PR

Ran with:

./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerBenchmark.*"
Benchmark                                                            Mode  Cnt     Score     Error   Units
BufferedTokenizerBenchmark.multipleTokenPerFragment                 thrpt   10  3308.716 ± 167.549  ops/ms
BufferedTokenizerBenchmark.multipleTokensCrossingMultipleFragments  thrpt   10  1245.505 ±  52.843  ops/ms
BufferedTokenizerBenchmark.onlyOneTokenPerFragment                  thrpt   10  9468.777 ± 182.184  ops/ms


mergify bot commented Mar 5, 2025

This pull request does not have a backport label. Could you fix it @andsel? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.


mergify bot commented Mar 5, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Mar 5, 2025
@andsel andsel removed the backport-8.x Automated backport to the 8.x branch with mergify label Mar 5, 2025
@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Mar 5, 2025
@andsel andsel self-assigned this Mar 5, 2025
@andsel andsel changed the title from "Fix/bufftok to return itereator" to "Implement BufferedTokenizer to return an iterable that can verify size limit for every token emitted" on Mar 6, 2025
@andsel andsel marked this pull request as ready for review March 12, 2025 14:29
@andsel andsel force-pushed the fix/bufftok_to_return_itereator branch 2 times, most recently from 1fcb6a8 to 3fe0d5a on March 31, 2025 10:53
@yaauie yaauie self-requested a review April 9, 2025 20:10
@yaauie (Member) left a comment:

I haven't fully-validated yet, but wanted to pass on some bits from my first pass:

  • BufferedTokenizerExt§IterableAdapterWithEmptyCheck#isEmpty is inverted
  • specs can be improved (yaauie@b524a67) with a custom matcher that validates both empty? (which maps to isEmpty) and entries (which is provided by the jruby shim extending java-Iterator with RubyEnumerable)

@yaauie (Member) left a comment:

I think this is on the right track, and appreciate the clean-Java implementation.

While the previous implementations have not been thread-safe and had undefined behaviour when contending threads invoked BufferedTokenizer#extract and/or BufferedTokenizer#flush, making BufferedTokenizer#extract return a lazy iterator introduces some risk, as interacting with that iterator mutates the underlying buffer.

Looking at all of the current uses of FileWatch::BufferedTokenizer in core and plugins, I don't see this as a significant risk, but if we wanted to mitigate it we would need to synchronize all of the methods on BufferedTokenizer§DataSplitter that deal with mutable state.


I've added some notes about reducing overhead, correctly reporting when the buffer is non-empty with unprocessed bytes, and clearing the accumulator during a flush operation.

    }

    @JRubyMethod(name = "empty?")
    public IRubyObject isEmpty(final ThreadContext context) {
-       return RubyUtil.RUBY.newBoolean(headToken.toString().isEmpty() && (inputSize == 0));
+       return RubyUtil.RUBY.newBoolean(tokenizer.isEmpty());
@yaauie (Member) commented:

🤔 previously FileWatch::BufferedTokenizer#empty? returned true if there was unterminated input in the buffer, but now it doesn't.

@andsel (Contributor, Author) replied on Apr 10, 2025:

The original implementation stated that the token was empty,

so inputSize is 0 iff the head token is empty and no token fragments were provided on input.

So, if I'm not wrong, isEmpty being true effectively states that no token parts are available.
The change proposed in this PR instead returns true when no more tokens are available; with respect to the original semantics, that is a slip in the implementation which will be fixed.

@yaauie (Member) replied:

🤔

Before this change, the return value of FileWatch::BufferedTokenizer#empty? was not deterministic while FileWatch::BufferedTokenizer#extract was being invoked, but was deterministic outside of that. Since FileWatch::BufferedTokenizer#extract always consumed all terminated tokens, FileWatch::BufferedTokenizer#empty? only needed to consider the remaining unterminated buffer.

The proposed change in BufferedTokenizer#isEmpty changes that: it says that the BufferedTokenizer is empty if the terminated tokens in the iterator have all been consumed (and does not consider the unterminated buffer).

My proposed BufferedTokenizer§DataSplitter#isBufferEmpty() here (and the wiring through to BufferedTokenizer#isEmpty here) considers the unconsumed input in the accumulator, which effectively considers both unconsumed terminated tokens and any trailing unterminated buffer.


Worth noting:

Just as the return value wasn't deterministic while BufferedTokenizer#extract was being invoked before, it's now also not stable while iterating over the newly-lazy iterator. I think that's an expected side-effect; if a caller wants it to be stable before iterating, then they can first send it Enumerable#entries to consume all available tokens into an array.

    public String flush() {
        return accumulator.substring(currentIdx);
    }

@yaauie (Member) commented on Apr 9, 2025:

If we need to report that the BufferedTokenizer is not empty when it has unprocessed bytes in its buffer, but the DataSplitter, which implements Iterator<String>, is passed through the JRuby bridge, I'm wary of confusing an isEmpty() that means !hasNext() with one that means there is unprocessed data in the buffer.

Here's a DataSplitter#isBufferEmpty() that should make it clearer exactly what it means.

        // considered empty if caught up to the accumulator
        public boolean isBufferEmpty() {
            return currentIdx <= accumulator.length();
        }

@andsel (Contributor, Author) replied:

Shouldn't it be that currentIdx has reached the end of the accumulator?

@yaauie (Member) replied:

Lol. Yes. Wrong direction.

Even though currentIdx should never be > accumulator.length() due to the rest of the implementation, I elected to use >= here for safety.

Suggested change:

        // considered empty if caught up to the accumulator
        public boolean isBufferEmpty() {
-           return currentIdx <= accumulator.length();
+           return currentIdx >= accumulator.length();
        }

    }

    public boolean isEmpty() {
        return !dataSplitter.hasNext();
@yaauie (Member) commented:

Suggested change:

-       return !dataSplitter.hasNext();
+       return dataSplitter.isBufferEmpty();

andsel added 3 commits April 10, 2025 16:43
…, but an OOM error is thrown from JDK libraries if an int overflow happens.
- specs improved (yaauie/logstash@b524a67) with a custom matcher that validates both `empty?` (which maps to isEmpty) and `entries` (which is provided by the jruby shim extending java-Iterator with RubyEnumerable)
@andsel andsel force-pushed the fix/bufftok_to_return_itereator branch from 3fe0d5a to 9741517 on April 10, 2025 15:33
@elasticmachine (Collaborator) commented:

💚 Build Succeeded

History

cc @andsel

Labels: backport-8.x (Automated backport to the 8.x branch with mergify), enhancement

Successfully merging this pull request may close these issues:

  • BufferedTokenizerExt applies sizeLimit check only of first token of input fragment