GH-5291: Make it possible to disable transactions in LuceneSail #5572
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
GitHub issue resolved: #5291
Secondthird take at solving #5219, after #5446 and #5473. The original idea was to wrap Lucene's Directory into a class that would defer fsyncs and then execute them on a timer. This would allow us to keep full transactional support, and also reap the benefits of improved performance.While that solution seemed to work initially, now I'm pretty sure it's impossible to implement in a 100% correct manner that would be reliable in the long term. Lucene does a lot more with files aside from fsyncing them – it also renames and deletes them. If we intervene only in the fsync process, we see inconsistent state, where, for example, we try to fsync files that no longer exist, or never fsync files that were renamed in the meantime.
Here I took a simpler approach, where we simply add the option to disable transaction/rollback support in LuceneSail. Then, during transaction commit, we call
flush()on the index (not a full commit), and the real commit is done on a timer. Changes are made visible to readers after the commit only.This sounds like a pretty big compromise, but for my use case this is totally fine. If we have slight inconsistencies in the text index because of a missed rollback or two, that's fine, nothing is going to collapse. Similarly, having a 10 second delay between insert and the text being visible in queries is also totally acceptable.
I think that with enough engineering effort you could make this better (e.g., have instant visibility of the changes in readers), but quite honestly... this is good enough.
I ran some end-to-end tests and verified that this indeed does result in a ~10x throughput speedup in insert transactions.
PR Author Checklist (see the contributor guidelines for more details):
mvn process-resourcesto format from the command line)