Implement BufferedTokenizer to return an iterable that can verify size limit for every token emitted #17229
base: main
Conversation
This pull request does not have a backport label. Could you fix it @andsel? 🙏
Force-pushed from 1fcb6a8 to 3fe0d5a
I haven't fully validated yet, but wanted to pass on some bits from my first pass:
- `BufferedTokenizerExt$IterableAdapterWithEmptyCheck#isEmpty` is inverted
- specs can be improved (yaauie@b524a67) with a custom matcher that validates both `empty?` (which maps to `isEmpty`) and `entries` (which is provided by the JRuby shim extending java `Iterator` with `RubyEnumerable`)
logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
I think this is on the right track, and I appreciate the clean-Java implementation.
While the previous implementations were not thread-safe and had undefined behaviour when contending threads invoked `BufferedTokenizer#extract` and/or `BufferedTokenizer#flush`, making `BufferedTokenizer#extract` return a lazy iterator introduces some risk, as interacting with that iterator mutates the underlying buffer.
Looking at all of the current uses of `FileWatch::BufferedTokenizer` in core and plugins, I don't see this as a significant risk, but if we wanted to mitigate it we would need to synchronize all of the methods on `BufferedTokenizer$DataSplitter` that deal with mutable state.
I've added some notes about reducing overhead, correctly reporting when the buffer is non-empty with unprocessed bytes, and clearing the accumulator during a flush operation.
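A minimal sketch of that mitigation, assuming a `DataSplitter` shaped roughly as discussed in this thread (a `StringBuilder` accumulator plus a `currentIdx` cursor); the method names and bodies here are illustrative assumptions, not the PR's exact code:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: every method touching the mutable state
// (accumulator, currentIdx) is synchronized, so contending threads
// calling append()/next()/flush() cannot interleave mid-mutation.
class DataSplitter implements Iterator<String> {
    private final String separator;
    private final StringBuilder accumulator = new StringBuilder();
    private int currentIdx = 0;

    DataSplitter(String separator) {
        this.separator = separator;
    }

    public synchronized void append(String data) {
        accumulator.append(data);
    }

    @Override
    public synchronized boolean hasNext() {
        return accumulator.indexOf(separator, currentIdx) >= 0;
    }

    @Override
    public synchronized String next() {
        int sepIdx = accumulator.indexOf(separator, currentIdx);
        if (sepIdx < 0) {
            throw new NoSuchElementException();
        }
        String token = accumulator.substring(currentIdx, sepIdx);
        currentIdx = sepIdx + separator.length();
        return token;
    }

    public synchronized String flush() {
        String tail = accumulator.substring(currentIdx);
        accumulator.setLength(0); // clear the accumulator on flush
        currentIdx = 0;
        return tail;
    }

    // empty only when caught up to the accumulator
    public synchronized boolean isBufferEmpty() {
        return currentIdx >= accumulator.length();
    }
}
```

Synchronizing on the instance keeps the buffer consistent, though callers iterating lazily still observe state that other threads may advance between calls.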
}

@JRubyMethod(name = "empty?")
public IRubyObject isEmpty(final ThreadContext context) {
-    return RubyUtil.RUBY.newBoolean(headToken.toString().isEmpty() && (inputSize == 0));
+    return RubyUtil.RUBY.newBoolean(tokenizer.isEmpty());
🤔 previously `FileWatch::BufferedTokenizer#empty?` reflected whether there was unterminated input in the buffer, but now it doesn't.
The original implementation stated that the token was empty:
- `headToken.toString().isEmpty()` should be interpreted as "no token was collected"
- `inputSize == 0`: the size of the collected head (it's assigned the `headToken` length on each extract: logstash/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java, Lines 150 to 153 in 187c925)

headToken.append(input.pop(context)); // put the leftovers in headToken for later
inputSize = headToken.length();
return input;
}

So `inputSize` is 0 iff the head token is empty and there also weren't any token fragments in the provided input. So, if I'm not wrong, `isEmpty` being true effectively states that no token parts are available.
The change proposed in this PR effectively returns true when there are no more tokens available; with respect to that, this is a slip in the implementation that will be fixed.
🤔 Before this change, the return value of `FileWatch::BufferedTokenizer#empty?` was not deterministic while `FileWatch::BufferedTokenizer#extract` was being invoked, but was deterministic outside of that. Since `FileWatch::BufferedTokenizer#extract` always consumed all terminated tokens, `FileWatch::BufferedTokenizer#empty?` only needed to consider the remaining unterminated buffer.
The proposed change in `BufferedTokenizer#isEmpty` changes that: it says that the `BufferedTokenizer` is empty if the terminated tokens in the iterator have all been consumed (and does not consider the unterminated buffer).
My proposed `BufferedTokenizer$DataSplitter#isBufferEmpty()` here (and the wiring through to `BufferedTokenizer#isEmpty` here) considers the unconsumed input in the accumulator, which effectively covers both unconsumed terminated tokens and any trailing unterminated buffer.
Worth noting: just as the return value wasn't deterministic while `BufferedTokenizer#extract` was being invoked before, it's now also not stable while iterating over the newly-lazy iterator. I think that's an expected side-effect; if a caller wants it to be stable before iterating, they can first send it `Enumerable#entries` to consume all available tokens into an array.
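On the Java side, that "snapshot first" pattern amounts to draining the lazy iterable into a list before inspecting it. A sketch (the helper name is illustrative, not part of the PR):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: drain a lazy Iterable into a List so the caller
// gets a stable snapshot that no longer changes as the tokenizer's
// underlying buffer is mutated by further appends or iteration.
final class TokenSnapshot {
    static List<String> snapshot(Iterable<String> lazyTokens) {
        List<String> tokens = new ArrayList<>();
        for (String token : lazyTokens) {
            tokens.add(token);
        }
        return tokens;
    }
}
```

After the snapshot, `tokens.isEmpty()` and `tokens.size()` are stable, at the cost of eagerly consuming everything the iterator had available.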
logstash-core/src/main/java/org/logstash/common/BufferedTokenizer.java
public String flush() {
    return accumulator.substring(currentIdx);
}
If we need to report that the BufferedTokenizer is not empty when it has unprocessed bytes in its buffer, but the `DataSplitter` that implements `Iterator<String>` is passed through the JRuby bridge, I'm wary of confusing an `isEmpty()` that means `!hasNext()` with one that means there is unprocessed data in the buffer.
Here's a `DataSplitter#isBufferEmpty()` that should make it more clear exactly what it means.
// considered empty if caught up to the accumulator
public boolean isBufferEmpty() {
return currentIdx <= accumulator.length();
}
Shouldn't it be that `currentIdx` has reached the end of the accumulator?
Lol. Yes. Wrong direction.
Even though `currentIdx` should never be > `accumulator.length()` due to the rest of the implementation, I elected to use `>=` here for safety.
// considered empty if caught up to the accumulator
public boolean isBufferEmpty() {
    return currentIdx >= accumulator.length();
}
}

public boolean isEmpty() {
    return !dataSplitter.hasNext();
-    return !dataSplitter.hasNext();
+    return dataSplitter.isBufferEmpty();
…once reached the next separator
…Ext because it's expected in some use cases, like: https://github.com/logstash-plugins/logstash-input-file/blob/55a4a7099f05f29351672417036c1342850c7adc/lib/filewatch/watched_file.rb#L250
…, but an OOM error is thrown from JDK libraries if an int overflow happens.
- specs improved (yaauie/logstash@b524a67) with a custom matcher that validates both `empty?` (which maps to isEmpty) and `entries` (which is provided by the jruby shim extending java-Iterator with RubyEnumerable)
Force-pushed from 3fe0d5a to 9741517
… DataSplitter with synchronized so that it can be used in multithreaded contexts
💚 Build Succeeded
cc @andsel
Release notes
Reimplements `BufferedTokenizer` to leverage pure Java classes instead of the JRuby runtime's classes.
What does this PR do?
Reimplements the `BufferedTokenizerExt` in pure Java using an iterable, while the `BufferedTokenizerExt` becomes a shell around this new class. The principal method `extract`, which dices the data by separator, now returns an `Iterable` instead of a `RubyArray`; the `Iterable` wraps a chain of a couple of iterators. The first iterator (`DataSplitter`) accumulates data in a `StringBuilder` and then dices the data by separator. The second iterator, used in cascade with the first, validates that each returned token's size respects the `sizeLimit` parameter.
To be compliant with some usage patterns which expect an `empty?` method to be present on the object returned from `extract`, like this, the `extract` method of the `BufferedTokenizerExt` returns a custom `Iterable` adapter class with such a method.
On the test side, the code that tested `BufferedTokenizerExt` moved to test the new `BufferedTokenizer`, so some test classes were renamed:
- `BufferedTokenizerExtTest` mostly becomes `BufferedTokenizerTest`, but a small `BufferedTokenizerExtTest` remains to test charset conversion use cases.
- `BufferedTokenizerExtWithDelimiterTest` -> `BufferedTokenizerWithDelimiterTest`
- `BufferedTokenizerExtWithSizeLimitTest` -> `BufferedTokenizerWithSizeLimitTest`
- `givenTooLongInputExtractDoesntOverflow` (code ref) was removed because it is no longer applicable.
On the benchmarking side: `BufferedTokenizerExtBenchmark` -> `BufferedTokenizerBenchmark`, with adaptation to the new tokenizer class. As can be seen from the benchmark reports in the Logs section, this PR provides an improvement of almost 6x over the previous implementation.
Why is it important/What is the impact to the user?
As a developer, I want the BufferedTokenizer implementation to be simpler than the existing one.
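The two-stage extract chain described above can be condensed into a sketch: a splitting iterator feeding a size-validating iterator. The class and parameter names mirror the PR's description (`DataSplitter`, `sizeLimit`), but the bodies here are illustrative assumptions, not the PR's actual code:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class SketchTokenizer {
    // First stage: accumulate data in a StringBuilder, slice on the separator.
    static final class DataSplitter implements Iterator<String> {
        private final String separator;
        private final StringBuilder accumulator = new StringBuilder();
        private int currentIdx = 0;

        DataSplitter(String separator) { this.separator = separator; }

        void append(String data) { accumulator.append(data); }

        @Override public boolean hasNext() {
            return accumulator.indexOf(separator, currentIdx) >= 0;
        }

        @Override public String next() {
            int sep = accumulator.indexOf(separator, currentIdx);
            if (sep < 0) throw new NoSuchElementException();
            String token = accumulator.substring(currentIdx, sep);
            currentIdx = sep + separator.length();
            return token;
        }
    }

    // Second stage: validate each emitted token against sizeLimit as it
    // passes through, instead of checking a fully-materialized array.
    static Iterator<String> withSizeLimit(Iterator<String> upstream, int sizeLimit) {
        return new Iterator<String>() {
            @Override public boolean hasNext() { return upstream.hasNext(); }
            @Override public String next() {
                String token = upstream.next();
                if (token.length() > sizeLimit) {
                    throw new IllegalStateException("token exceeds size limit");
                }
                return token;
            }
        };
    }
}
```

Because validation happens per token on the way out, an oversized token fails exactly when it is emitted, rather than after the whole input has been split.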
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files (and/or docker env variables)
Author's Checklist
How to test this PR locally
Run same tests as in: `£` sign in latin1
Related issues
Logs
Benchmarks
The benchmarks were updated to run for 3 seconds (instead of 100 ms) and to report in milliseconds (instead of nanoseconds).
baseline
Ran with:
this PR
Ran with: