Definitely not solution 1. In some cases with a single large space the cloning process would take up 2x the memory. Besides, solution 1 basically makes the user write all the code we already have for the index build (which is rather complicated).
Solution 2 may be better memory-wise but still has a problem: we would have to take care of master changes, which seems rather complicated: finding the right instance to continue the process (there might be multiple writable instances), taking care of "original" writes which should be duplicated to the new space versus the ones coming from an existing master, and so on. Solutions 3 and 4 look good to me.
If the index is already built and turned global on all instances of the replicaset, new replicas will simply receive it during the join process, like they always do. OTOH, we don't even have to make the index global if we say that the whole schema is defined in centralized configuration. In this case each instance will have the same set of indexes built locally, and everything will work as expected, no?
@sergepetrenko, @locker, @Serpentian, please have another look. The idea about lazy index auto-completion I had to drop, see the Alternatives section. It would also be cool if you could delete your old comments unless there is still something unresolved, so we could keep the discussion page clean. Unfortunately, GitHub doesn't yet have a nice way to "resolve" comments without deleting them.
Reviewers
Tickets
Summary
When a space is large enough, building a new index on it can take quite long - minutes or hours, depending on the space size. The same applies to index alter - it might require an index rebuild or a space fullscan. That isn't a big deal locally on the instance, because the build is asynchronous - transactions can still be processed, even on the changing space.
But it gets complicated in a cluster, for the following reasons.
Replication gets stuck in a replicated cluster. Yes, the index build is async fiber-wise, but it blocks the current fiber. The blockage happens on-replace into the `_index` space, not on-commit. Because of that, the applier's feature of committing txns asynchronously doesn't help - the longest part happens before the commit. The replica's lag will grow, and it won't receive any new data until the build is finished. But the replication is still alive, and at least it doesn't block the transaction processing on the master when the replication is asynchronous. Unlike the next problem.
Master transaction processing gets stuck in a synchronously replicated cluster, because the index build transaction on the master blocks the limbo until the appliers also apply it and write it to their WALs. And that will last until a quorum of replicas have finished the index build.
Essentially, in a synchro cluster with large spaces it becomes impossible to create new indexes. It requires hacks, like creating a new space with all the needed indexes and the same format, then slowly copying the data from the old space in multiple small transactions, then deleting the old space. It doesn't sound complex, really, but it requires the user to change their code to maintain this "migration" process by writing into both the old and new spaces while the copying is in progress.
This document suggests a solution for how people could create large indexes in a replicaset without blocking the replication.
✅ Solution: lazy index
Lua API and behaviour
Consider the actual problem in one sentence - a long index build blocks replication because the transaction can't be committed until the index is built. The solution is right here - let's allow the transaction to be committed before the index is complete. Build the index lazily, in the "background" (blocking neither the current fiber nor the transaction). The index entry is added to `_index` instantly, and the txn gets committed as fast as if the space was empty.

The behaviour is going to be enabled with a new index option `build`.

This quick DDL txn gets replicated like any other and goes through the limbo too, if it is enabled. The replicas would then build the same index on their own, all instances in parallel, without blocking their appliers or limbos either.
Such an index would be visible and droppable, but can't be used until the building is complete. This is reflected in its status. When the build finishes (success or failure), the status is updated.
The index can be dropped while being constructed or afterwards. But it can not be altered in any way, with one exception - the user can change the build mode from `'lazy'` to `'now'` at any moment. Then the current fiber is blocked until the index is finished (if it is still not), and then its build mode is updated. If the index couldn't be built due to an error, or the index was dropped while waiting, then this alter would throw an error.

If an index is lazy and a restart happens, then `box.cfg()` won't wait for this index to get ready. Its construction would still be lazy, even if before the restart it managed to get fully built. To make `box.cfg()` block on it like on any normal index, the build mode must be altered to `'now'` manually.

Users who need a new index on a large space would have to create the lazy index on one writable instance, wait for it to get complete and ready (ideally on all replicas), and then bump its build mode to `'now'`. And only after that is it recommended to start using it in the code. Otherwise, even if an index is ready but still has the `'lazy'` build mode, any restart would make it non-ready again for a while. The user's code would then break if it starts relying on the index being functional right after `box.cfg()`.
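To make the intended workflow concrete, here is a minimal Lua sketch. The `build` option with the `'lazy'`/`'now'` values is the proposal described above; the exact name and values of the index status field are assumptions, not a finalized API.

```lua
local fiber = require('fiber')

local s = box.schema.space.create('big_space', {is_sync = true})
s:create_index('pk')

-- The DDL txn commits right away; the index is built in the background.
s:create_index('by_name', {
    parts = {{2, 'string'}},
    build = 'lazy',  -- proposed option
})

-- Wait until the background build is done. The 'status' field and its
-- 'ready' value are assumptions used only for illustration.
while s.index.by_name.status ~= 'ready' do
    fiber.sleep(1)
end

-- Promote the index to a normal one, so box.cfg() waits for it after restart
-- and the application code may start relying on it.
s.index.by_name:alter({build = 'now'})
```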
Internal details
Let's dive a bit deeper into how this would actually work.
The initial code investigation shows that it doesn't need to be too different from how indexes get built now. Both memtx and vinyl already have an async index build. It is just that they use the current fiber and its txn for it. The txn presumably is necessary to abort conflicting txns and to have a read-view, but this has to be figured out during the implementation. Either way, a temporary empty txn could be created for the build, without any statements, just for a read-view, if needed.
Memtx
In memtx the build is relatively trivial, because it doesn't leave any artifacts on disk (at least not yet, given https://github.com/orgs/tarantool/discussions/11001). If a restart happens, the index would start building from scratch, with nothing to clean up from the last attempt.
Vinyl
In vinyl this is obviously not the case. Index creation leaves run-files and vylog entries. The proposal is to make vinyl, on instance restart, drop all the LSM-tree files left from the previous build attempt before starting a new build. They have to be physically removed from disk, and the vylog must be updated to reflect that. This probably already happens now as well, with the current build process.
In theory it is possible to make the vinyl lazy index build survive a restart: after the storage is booted, it would continue building the index from the last dump of the in-memory level of the LSM tree. But that doesn't seem necessary at the moment.
Build fiber
It is suggested to create one fiber for each lazy index build, and store it directly in `struct index`. This would be easier to control in the code, when the entire index state is in one place, easily visible. The fiber would be forcefully cancelled and joined when the index gets dropped. And it would delete itself automatically if the build runs to completion (success or fail, doesn't matter).

Having one fiber per index would also simplify having a txn/read-view in each of them. If there were a single fiber for all indexes, it would have to store some sort of context for each index, attach/detach txns/read-views, and this would look clumsy.
It is still an open question though. In reality it might happen that having a fiber per index would be harder to control - to shut them all down at once, for example, when the instance is being shut down.
Pros:
Cons:
❓ Open questions
There is one similar issue which is lurking somewhere in the darkness and might eventually surface, same as this one - space format alteration. As Sergey P. noticed, it has the same effect on the replication. If the new space format is not included in the old one, then the space data must be validated. It surely takes less time than an index build, but can nonetheless take considerable time, especially with vinyl.
And this issue can not be solved by a lazy index. For this the users would have to fall back to solving it on their side, by having 2 spaces, writing into both somehow, and then decommissioning the old space. Not good.
Some not-yet-discarded solutions below suggest how to fix any sort of space alteration, be it an index creation, re-build, or format alteration.
⭐️ Naming suggestions
"Lazy" is a bit strange name perhaps, for an index build mode. Alternatives:
disabled
- more "official", not really fitting, because it makes an impression that the index is just dead, not doing anything. While it actually is being built.shadow
- a cooler name, and gives the right impression that the index follows the space like a shadow until brought to light."Lazy" name change might also require to reconsider the naming of the build mode
'now'
. For example,shadow
andnow
don't go together well.⭐️ Solution: force async
Why not make index creation and space format alter `TXN_FORCE_ASYNC` txns? I.e. they wouldn't block the limbo. Right before their completion on the master we could lock the limbo, wait until all its txns are confirmed, and then commit this one.

I see no reason why it wouldn't be safe. If a master at the time of this DDL commit had all the data of this space replicated and confirmed, it means all the replicas must be able to apply this format/index too.
There is a waiting time until the limbo is locked and gets emptied, but it only depends on the replication and the replicas' WAL speed, not on the space size.
It wouldn't work right away, because the appliers on the replicas would still be executing those txns for a very long time, but this can be solved by executing them in their own fiber, not in the applier's main fiber. Perhaps this would bring other complications, and yet the solution looks worthy of consideration.
Pros
Cons
⭐️ Solution: space clone
The proposal attacks the issue from another angle - if a space alteration is long, then users would typically, in another DB, make a second space with all the same meta + the alterations, and then copy the data from the old to the new one + repeat changes to the already copied keys. We could do the same, but automate it.
That is, Tarantool would allow to clone a space with any of its indexes and metadata altered. Once the cloning is done, the user could do the final "drop + rename" themselves.
If designed carefully, this could be an interesting tool to do more than just a new index creation, like:
In Lua code it could look like this:
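The original snippet from this section did not survive the export; below is a hypothetical sketch of what the API might look like. The `clone()` and `wait_cloned()` methods and their options are made up for illustration - only `drop()` and `rename()` exist today.

```lua
local old = box.space.users

-- Hypothetical: clone the space with the same data but one extra index.
local new = old:clone('users_new', {
    indexes = {
        {name = 'pk',       parts = {{1, 'unsigned'}}},
        {name = 'by_email', parts = {{3, 'string'}}},  -- the wanted new index
    },
})

-- Hypothetical: the clone copies the data in the background and repeats
-- the ongoing changes. Wait for it to catch up, then do the final switch.
new:wait_cloned()
old:drop()
new:rename('users')
```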
More interesting outcomes:
Pros:
Cons:
⭐️ Solution: lazy index, lazy format
Index-wise it is identical to the primary solution about lazy indexes. But what if the same can be applied to the space format? Imagine a space can have multiple formats (it kind of does right now - when a space is altered, its old tuples keep the old format). They would be stored in a `_space_format` space; or in the `_space` options we could allow more than one format to be specified.

Then let's say that among the multiple formats the space can have only one primary one. It is used for creating tuples.
And then let's say a format can be lazy, like an index. It gets created, and its building happens in the background. Then its in-memory status is updated to reflect whether it is ready. Then the user would switch the primary format, and optionally delete the old one.
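Purely for illustration, a sketch of how this might look in Lua; every function and option name here (`format_add()`, `format_set_primary()`, the `build` option and the `status` field) is a made-up placeholder, not a proposed interface.

```lua
local fiber = require('fiber')
local s = box.space.users

-- Add a second, lazily validated format to the space (hypothetical API).
local new_format = s:format_add({
    {name = 'id',    type = 'unsigned'},
    {name = 'name',  type = 'string'},
    {name = 'email', type = 'string'},  -- a stricter field in the new format
}, {build = 'lazy'})

-- Validation of the existing tuples runs in the background.
while new_format.status ~= 'ready' do
    fiber.sleep(1)
end

-- Make the validated format the primary one; the old one can be dropped.
s:format_set_primary(new_format)
```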
This is not an alternative to the primary solution, but rather an exploration of an idea which is in sync with that solution, solves the related problem, and might provoke some thoughts in the readers about the primary solution.
Alternatives
♻️ Solution: lazy index with auto-complete
The currently chosen solution with "lazy indexes" requires the user not only to create the lazy index, but also to manually make it non-lazy when the build is complete.
Why not make the index automatically update the build mode when it gets ready? Just make an internal transaction which changes `opts.build` to `'now'`.

There are a few reasons why not. One is that the instances finish the build at different times, so the auto-completion would end up as independent updates of the same `_index` entry on multiple nodes. That doesn't look safe.
♻️ Solution: nothing
The issue in the ticket isn't really a bug. It is an inconvenience, which has a workaround explained right in the intro - copy the space manually, write to both spaces, then replace the old one.
The only problem is that the user would have to support that in their code.
Let's repeat the solution here for clarity. When a user wants a new index, wants to alter an existing one in a non-trivial way, or wants to change a space format, they do this (a sketch follows below):
- Create a new space with the needed format and indexes.
- Set an `on_replace` trigger on the old space, which does the same work on the new space.
- Copy the data from the old space in multiple small transactions.
- Drop the old space and switch the code to the new one.

Pros: don't need to do anything, already works.
Cons:
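For illustration, a rough Lua sketch of this manual migration, assuming a memtx space named `users` with an unsigned primary key in the first field. The names, chunk size and final switch are illustrative, and a real migration would also need to handle races between the copy loop and concurrent writers.

```lua
local fiber = require('fiber')

local old = box.space.users
local new = box.schema.space.create('users_new', {format = old:format()})
new:create_index('pk', {parts = {{1, 'unsigned'}}})
new:create_index('by_email', {parts = {{3, 'string'}}})  -- the wanted new index

-- Duplicate the ongoing writes into the new space while the copy is running.
old:on_replace(function(old_tuple, new_tuple)
    if new_tuple ~= nil then
        new:replace(new_tuple)
    elseif old_tuple ~= nil then
        new:delete(old_tuple[1])
    end
end)

-- Copy the existing data in small chunks, each chunk in its own transaction.
local CHUNK = 1000
local last_key = nil
while true do
    local batch = old:select(last_key, {iterator = last_key and 'GT' or 'GE',
                                        limit = CHUNK})
    if #batch == 0 then
        break
    end
    box.begin()
    for _, tuple in ipairs(batch) do
        new:replace(tuple)
    end
    box.commit()
    last_key = batch[#batch][1]
    fiber.yield()
end

-- Final switch: stop the writers, then drop the old space and rename the new one.
-- old:drop()
-- new:rename('users')
```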
♻️ Solution: replica-local index
The problem of index creation/alter is hitting the replication hard. One approach could be to attack the replication shortcomings then. That is, drop the replication from the process.
Let's imagine that the replicas and the master could build the same indexes independently, fully locally. And when finished, the master would "enable" this index in a single small DDL transaction.
The index creation would then be a 2-step process: 1 - create a local index on all replicas; 2 - turn the local index into a global one on the master.
This needs 2 features which aren't available yet, but aren't hard to add:
Replica-local DDL is not unusual for Tarantool. There is right now a space type `temporary` (not to be confused with `data-temporary`). It can be created on read-only replicas, can have its own indexes, is visible in `_space` and its indexes in `_index`, but it is not replicated, and its data isn't stored in the WAL.

Replica-local persistent data is also not a new thing. Tarantool does have "local" spaces. They have replicaset-global meta (`_space` and `_index` rows) and their data is persisted, but not replicated. They can only be created by the master, but can take DML on any instance, and that DML is not replicated.

The proposal is to introduce replica-local indexes. They can be created by any replica, even a read-only one, on absolutely any space. Such an index is persisted in `_index` and is not replicated.
and is not replicated.Creation of the index will not affect replication at all, and won't block the limbo, because replica-local transactions are not synchronous by definition.
To create a new global index, the user would then go and create a replica-local index on each instance.
Then, to make it global, the user would on the master instance do `index:alter{is_global = true}`. Locally it works instantly. When this txn comes to a replica, it will try to find a replica-local index in this space with all the same meta besides the index ID. If found, it also works instantly, by changing the index ID to the global one (the ID is part of the primary key, so it would mean moving the local index's data to the new global index with the global ID, and dropping the now-empty local index). If not found, a new index is built as usual.

The solution not only allows creating/altering indexes in the cluster bypassing the replication, but also allows the user to purposefully create replica-local indexes without ever making them global. That could be handy to reduce memory usage on the master and speed up the master's DML. The master would only store the unique indexes and handle DML, and the replicas would store the other indexes + serve DQL.
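A sketch of the two-step flow in Lua; the `alter{is_global = true}` call comes from the text above, while the `is_global = false` creation option is an assumption about how the replica-local creation could be spelled.

```lua
-- Step 1: run on every instance of the replicaset, including read-only replicas.
-- 'is_global = false' is a hypothetical way to ask for a replica-local index.
box.space.users:create_index('by_email', {
    parts = {{3, 'string'}},
    is_global = false,
})

-- Step 2: run once on the master. The tiny DDL txn is replicated; each replica
-- finds its matching replica-local index and just flips it to the global ID.
box.space.users.index.by_email:alter({is_global = true})
```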
The con is that the user has to visit each replica to create the replica-local indexes in the first step.
Pros: introduces a new feature - replica-local indexes, which can be used not only for replicaset-wide index building.
Cons: needs 2 steps, one of them to be done on each instance in the replicaset. Including new instances, where this index won't appear automatically.