RFC: Improving the build of large indexes on replicas #11018

Gerold103 · 2025-01-08T19:00:42Z

Gerold103
Jan 8, 2025
Collaborator

Reviewers

Main Reviewer: @locker
Second Reviewer: @Serpentian
Team Lead: @sergepetrenko
CTO: @sergos

Tickets

Non-blocking index build blocks synchro queue (and can even crash) #10766

Summary

When a space is large enough, building a new index on it can be quite long. Minutes, hours, depending on the space size. The same is about index alter - it might require index rebuild, space fullscan. That isn't a big deal locally on the instance, because the build is asynchronous - transactions can still be processed, even on the changing space.

But it gets complicated in a cluster due to the following reasons.

Replication gets stuck in a replicated cluster. Yes, the index build is async fiber-wise, but it blocks the current fiber. The blockage happens on-replace into _index space. Not on-commit. Because of that the applier's feature of committing the txns asynchronously doesn't help. The longest part happens before the commit.

The replica's lag will grow, it won't receive any new data until the build is finished. But the replication still is alive, and at least it doesn't block the transaction processing on the master when the replication is asynchronous. Unlike the next problem.

Master transaction processing gets stuck in a synchronously replicated cluster. Because the index build transaction on the master blocks the limbo until the appliers also apply it and write to their WALs. And that will last until the quorum of replicas have finished the index build.

Essentially, in a synchro cluster with large spaces it becomes impossible to create new indexes. It requires hacks, like creating a new space with all the needed indexes and same format, then slowly copy the data from the old space, in multiple small transactions, then delete the old space. Sounds not complex really, but it requires the user to change their code to maintain this "migration" process by writing into both old and new spaces while the copying is in progress.

The document tries to suggest a solution how people could create large indexes in a replicaset without blocking the replication.

✅ Solution: lazy index

Lua API and behaviour

Consider the actual problem in one sentence - a long index build blocks replication because the transaction can't be committed until the index is built. The solution is right here - lets allow to commit the transaction before the index is complete. Build the index lazily, in "background" (not blocking neither the current fiber nor the transaction). The index entry is added to _index instantly and the txn gets committed as fast as if the space was empty.

The behaviour is going to be enabled with a new index option build.

--
-- Creation of a lazy index. User should specify `build` option, which can be
-- * 'lazy' (non-blocking, non-yielding);
-- * 'now' (default, build it right away, blocking).
--
idx = space:create_index(name, {build = 'lazy'})
--
-- The index exists. Its txn is committed.
--
assert(box.space._index:get{idx.id} ~= nil)

This quick DDL txn gets replicated like any other, goes through the limbo too if it is enabled. The replicas would then build the same index on their own, all instances in parallel, without blocking their appliers or limbos either.

Such an index would be visible and droppable but can't be used until the building is complete. It is reflected in its status. When it gets finished (success or failure), the status is updated.

--
-- Index object gets a new field `status`. It is one of the following values:
-- * 'ready' - index is fully usable, default status when building indexes 'now'.
-- * 'building' - index creation was 'lazy', it is being built in background. Not usable right now.
-- * 'failed' - index was 'lazy', and its construction has failed. Duplicates, type mismatch, OOM.
--    In this status the index also has non-nil field `error`. It stores an error object with the failure
--    reason. Same as would be thrown as an exception, if the index would be created 'now'.
--
assert(idx.status == 'building')
--
-- Usage is not possible yet.
--
assert(not pcall(idx.get, idx, key))
--
-- Construction happens in a system fiber. Eventually status becomes not 'building'.
--
wait_until_the_status_changes()
--
-- Checking the result.
--
assert(idx.status == 'ready' or idx.status == 'failed')
if idx.status == 'failed' then
    assert(idx.error == error_object)
else
    -- Becomes usable.
    assert(pcall(idx.get, idx, key))
end
--
-- The build mode remains unchanged.
--
assert(idx.build == 'lazy')

The index can be dropped while being constructed or afterwards. But it can not be altered in any way with one exception - user can change the build mode from 'lazy' to 'now' in any moment. Then the current fiber is blocked until the index is finished (if it is still not), and then its build mode is updated. If the index couldn't be built due to an error or the index was dropped while waiting, then this alter would throw an error.

assert(idx.status == 'building')
ok, err = pcall(idx.alter, idx, {build = 'now'})
if not ok then
    -- Error could be DDL WAL error, while the index building still continues fine.
    assert(idx.status == 'building' or idx.status == 'failed')
    -- Nothing changed.
    assert(idx.build == 'lazy')
else
    assert(idx.status == 'ready')
    assert(idx.build == 'now')
end

If an index is lazy, and restart happens, then box.cfg() won't wait for this index to get ready. Its construction would still be lazy. Even if before restart it managed to get fully built. To make box.cfg() block on it like on any normal index, the build-mode must be altered to 'now' manually.

Users, who need a new index on a large space, would have to create the lazy index on one writable instance, wait for it to get complete and ready (ideally on all replicas), and then bump its build mode to 'now'. And only after that it is recommended to start using it in the code. Otherwise even if an index is ready, but still has 'lazy' build mode, any restart would make it non-ready again for a while. User's code would then break if it starts relying on the index being functional right after box.cfg().

Internal details

Lets dive a bit deeper into how this would actually work with more details.

The initial code investigation shows, that it doesn't need to be too different from how the indexes get built now. Both memtx and vinyl already have async index build. It is just that they use the current fiber and its txn for it. The txn presumably is necessary to abort conflicting txns, and to have a read-view, but this has to be figured out during the implementation. Either way, a temporary empty txn could be created for the build, without any statements, just for a read-view, if needed.

Memtx

In memtx the build is relatively trivial, because it doesn't leave any artifacts on disk (at least not yet, given https://github.com/orgs/tarantool/discussions/11001). If restart happens, the index would start building from scratch, with nothing to cleanup from the last attempt.

Vinyl

In vinyl this is obviously not the case. Index creation leaves run-files and vylog entries. The proposal is to make vinyl on instance restart drop all the LSM-tree files left from the previous build attempt, before starting a new build. They have to be physically removed from disk, and vylog must be updated to reflect that. This probably already happens now as well, with the current way.

In theory it is possible to make vinyl lazy index build able to survive restart. Thus after the storage is booted, it would continue building the index from the last dump of the in-memory level of the LSM-tree. But that doesn't seem necessary at the moment.

Build fiber

It is suggested to create one fiber for each lazy index build, and store it directly in struct index. This would be easier to control in the code, when the entire index state is in one place, easily visible. The fiber would be forcefully cancelled and joined when the index gets dropped. And it would delete itself automatically, if the build completes to the end (success or fail, doesn't matter).

Having one fiber per index also would simplify having a txn/read-view in each of them. If the fiber would be one for all indexes, it would have to store some sort of a context for each index, attach/detach txns/read-views, and this would look clumsy.

It is still an open question though. In reality it might happen, that having fiber per index would be harder to control, to shut them all down at once for example, when the instance is being shut down.

Pros:

Not so hard to implement really.
Solves the problem for indexes.
Not much action needed from the user.
Replication is untouched.
It opens an interesting possibility to start the instance fast, and make its data usable later. Imagine all indexes are lazy for every large enough space. The instance could restart very fast and even start serving some requests, participate in the synchro quorum. Just index uniqueness might be a problem when replaces are coming, and the index is not ready.

Cons:

Only works for index-related DDL. Not for altering the space itself.
Users need to do more than one step.

❓ Open questions

There is one similar issue which is lurking somewhere in the darkness and might eventually appear same as this one - space format alteration. As Sergey P. noticed, it has the same effect on the replication. If the new space format is not included into the old one, then the space data must be validated. It takes surely less time than an index build, but nonetheless can take considerable time. Especially with vinyl.

And this issue can not be solved by a lazy index. For this the users would have to fallback to solving this on their side, by having 2 spaces, writing into both somehow, and then decommissioning the old space. Not good.

Some not-yet-discarded solutions below suggest how to fix any sort of space alteration, be it an index creation, re-build, or format alteration.

⭐️ Naming suggestions

"Lazy" is a bit strange name perhaps, for an index build mode. Alternatives:

disabled - more "official", not really fitting, because it makes an impression that the index is just dead, not doing anything. While it actually is being built.
shadow - a cooler name, and gives the right impression that the index follows the space like a shadow until brought to light.

"Lazy" name change might also require to reconsider the naming of the build mode 'now'. For example, shadow and now don't go together well.

⭐️ Solution: force async

Why not make index creation and space format alter TXN_FORCE_ASYNC txns? I.e. they wouldn't block the limbo. Right before their completion on the master we could lock the limbo, wait until all its txns are confirmed, and then commit this one.

I see no reason why it wouldn't be safe. If a master at the time of this DDL commit had all the data of this space replicated and confirmed, it means all the replicas must be able to apply this format/index too.

There is waiting time until the limbo is locked and gets emptied, but it only depends on the replication and replicas' WALs speed. Not on the space size.

It wouldn't work right away, because the appliers on the replicas would still be executing those txns for a very long time, but this can be solved by executing them in their own fiber, not in the applier's main fiber. Perhaps this would bring other complications, and yet the solution looks worthy of consideration.

Pros

Doesn't seem too hard to implement.
Solves the problem of any space-related DDL.
No need for user action at all. It just works.

Cons

There might be hidden consistency complications. For example, what happens, if master commits this txn basically async, and then stops being the leader? Then another leader would reject that txn and cause a split-brain. Can this be mitigated?

⭐️ Solution: space clone

The proposal attacks the issue from another angle - if a space alteration is long, then users would typically, in another DB, make a second space with all the same meta + the alterations, and then copy the data from the old to the new one + repeat changes to the already copied keys. We could do the same, but automate it.

That is, Tarantool would allow to clone a space with any of its indexes and metadata altered. Once the cloning is done, the user could do the final "drop + rename" themselves.

If designed carefully, this could be an interesting tool to do more than just a new index creation, like:

Add more than one index.
Rebuild an index (change its parts in some not forward-inclusive way).
Alter space format.

In Lua code it could look like this:

-- User creates a new space with any indexes and format.
new_space = create_new_space()

-- User starts non-blocking copying using space:clone() method.
-- It works in transactions max 1000 rows each.
--
-- The new space is attached to the old one, and all the changes to the
-- already copied data are also repeated, so nothing is lost.
--
-- The new space during this process is read-only. User can't alter it,
-- can't write to it.
old_space:clone(new_space, {batch_size = 1000})

-- When the cloning is complete, the old space will only keep replicating
-- the DML. Background copying is done.
--
-- Now the user can replace the old one with the new one:
new_space:replace(old_space)
-- This operation renames the new space and changes its ID to
-- match the old space. The old one is dropped. All happens in one
-- transaction. ID has to be preserved so the remote ID-based connectors
-- would continue writing to the right space.

More interesting outcomes:

Copying could be done for spaces of different engines (as long as we introduce cross-engine transactions).
Memtx space cloning might be relatively cheap if the spaces would have the same format. The new space would just ref the tuples of the old one.

Pros:

Solves the problem of any space-related DDL.
Replication is untouched (depends on how to implement, see Cons).
Not much action from user needed.

Cons:

Potential memory waste on the duplicate data until the old space is dropped.
User needs to make 2 steps.
Need to design what happens if the master is changed. Is this whole process happening in-memory? Should it be somehow continued on the replicas if master dies? But then to which extent? Replicas can't write safely even if are writable, because it would break master-master.

⭐️ Solution: lazy index, lazy format

Index-wise it is identical to the primary solution about lazy indexes. But what if the same can be applied to the space format? Imagine a space can have multiple formats (it kind of does right now - when a space is altered, its old tuples keep the old format). And they are stored in _space_format space or in _space options we could have more than one format specified.

Then lets say that among multiple formats the space can have only one primary. It is used for creating tuples.

And then lets say a format can be lazy, like an index. It gets created, and its building happens in the background. Then its in-memory status is updated to reflect if it is ready. Then the user would switch the primary format. Optionally delete the old one.

This is not an alternative to the primary solution, but rather an exploration of an idea which is in sync with that solution, solves the related problem, and might provoke some thoughts in the readers about the primary solution.

Alternatives

♻️ Solution: lazy index with auto-complete

The currently chosen solution with "lazy indexes" requires the user not only to create the lazy index, but also to manually make it non-lazy when the build is complete.

Why not make the index automatically update the build mode, when it gets ready? Just make an internal transaction, which changes the opts.build to 'now'.

There is a few of reasons, why not:

If the master builds the index much earlier than the replicas (for example, a replica could restart in the middle of the building, and had to start building the index from zero), and it updates the build-mode, then this DDL transaction would be stuck in the limbo until a quorum of replicas builds the index too. And it returns us back to the original problem. This means the build mode alter txn must not even try to commit until the index is finished on a quorum of instances.
It is not always possible. If the cluster becomes read-only due to any reason, the mode update won't work. It would have to wait for the instance to become writable, which makes the indexing logic slip into the scope which shouldn't be of its concern. Spaces and indexes and their locally stored data in general shouldn't care about instance status relative to other nodes.
It can potentially lead to a mess in master-master (the original ticket is about synchro, but as it was shown in the intro, this also concerns async replication). AFAIU, in such setups users must be very careful not to do txns on random nodes on the same data, or it can lead to races (like the easiest one, make different replaces on the same key on 2 nodes - they would "exchange" the values, not sync to one value). Assuming a user managed to do that successfully until now, we would break this setup. Because Tarantool would start doing internal txns on the same data (same _index entry on multiple nodes). That doesn't look safe.

♻️ Solution: nothing

The issue in the ticket isn't really a bug. It is an inconvenience, which has a workaround explained right in the intro - copy the space manually, write to both spaces, then replace the old one.

The only problem is that the user would have to support that in their code.

Lets repeat the solution here for clarity. When a user wants a new index or alter an existing one in a non-trivial way, or to change a space format, they do this:

Create a new space with all the same metadata and indexes as the original one.
User creates the needed new index on the new space (or alters the one that needs a change) or alters the format.
User duplicates all the replaces and deletes of the old space to the new space. This can be done, for example, by setting an on_replace trigger on the old space, which does the same work on the new space.
User iterates over the old space and copies the tuples into the new space, with yields every N tuples.
When the work is done, the user in a single transaction drops the old space, and renames the new space to the old one's name.

Pros: don't need to do anything, already works.

Cons:

Memory waste on the duplicate data until the old space is dropped.
Inconvenient, easy to make a mistake in the migration process.

♻️ Solution: replica-local index

The problem of index creation/alter is hitting the replication hard. One approach could be to attack the replication shortcomings then. That is, drop the replication from the process.

Lets imagine that the replicas and master could build the same indexes independently, fully local. And when finished, the master would in a single small DDL transaction "enable" this index.

The index creation would then be a 2 step process. 1 - create a local index on all replicas. 2 - turn the local index into a global one on the master.

This needs 2 features which aren't available yet, but aren't hard to add:

Replica-local index.
Local->global index promotion.

Replica-local DDL is not unusual for Tarantool. There is right now a space type temporary (not to confuse with data-temporary). It can be created on read-only replicas, can have its own indexes, is visible in _space and its indexes in _index, but it is not replicated, and its data isn't stored in WAL.

Replica-local persistent data also is not a new thing. Tarantool does have "local" spaces. They have replicaset-global meta (_space and _index rows) and their data is persisted, but not replicated. They can only be created by master, but can take DML on any instance and it is not replicated.

The proposal is to introduce replica-local indexes. They can be created by any replica, even read-only, on absolutely any space. This index is persisted in _index and is not replicated.

Creation of the index will not affect replication at all, and won't block the limbo, because replica-local transactions are not synchronous by definition.

To create a new global index, the user then would then go and create a replica-local index on each instance.

Then to make it global the user would on the master instance make index:alter{is_global = true}. Locally it works instantly. When this txn comes to the replicas, it will try to find a replica-local index in this space with all the same meta besides index ID. If found, it also works instantly, by changing the index ID to the global one (ID is primary, so it would mean moving local index's data to the new global index with the global ID, and dropping the empty local index). If not found, a new index is created as usual.

The solution not only allows to create/alter indexes in the cluster bypassing the replication, but also allows the user to purposefully create replica-local indexes without ever making them global. It could be handy to reduce memory usage on the master and speed-up master's DML. Master would only store the unique indexes and handle DML, and the replicas would store the other indexes + serve DQL.

The cons is that the user has to visit each replica to create the replica-local indexes on the first step.

Pros: introduces a new feature - replica-local indexes, which can be used not only for replicaset-wide index building.

Cons: needs 2 steps, one of them to be done on each instance in the replicaset. Including new instances, where this index won't appear automatically.

sergepetrenko · 2025-01-14T09:23:02Z

sergepetrenko
Jan 14, 2025
Maintainer

Definitely not solution 1. In some cases with a single large space the cloning process would take up 2x the memory.

Besides, solution 1 basically makes the user write all the code we already have for index build (which's rather complicated):

Take care of transactions which got started before we started the cloning process, but are not yet committed.
Take care of various cases when a transaction comes before or after the current processed position (I suppose the user would either copy the write to both spaces if it falls into the already copied range or only put it into the original space if it falls into not yet processed range).
so on

Solution 2 may be better memory-wise but still has a problem: we would have to take care of master changes, which seems rather complicated: finding the right instance to continue the process (there might be multiple writable instances), taking care of "original" writes which should be duplicated to the new space, versus the ones coming from an existing master and so on.

Solutions 3 and 4 look good to me.
I'd even argue with one of the cons of solution 3:

Cons: needs 2 steps, one of them to be done on each instance in the replicaset. Including new instances, where this index won't appear automatically.

If the index is already built and turned global on all instances of the replicaset, new replicas will simply receive it during join process, like they always do.

OTOH, we don't even have to make the index global if we say that all schema is defined in centralized configuration. In this case each instance will have the same set of indexes built locally, and everything will work as expected, no?

3 replies

Gerold103 Jan 20, 2025
Collaborator Author

If schema is centralized somewhere, and it allows to define replica-local objects (such as an index), then yes, it would be much simpler for people. And the number of steps to do would be the same as with lazy-indexes - create the index (local or lazy) and wait for it to be built, remove the local/lazy flag.

So what would you choose? 3 or 4?

sergepetrenko Jan 22, 2025
Maintainer

It seems solutions 3 and 4 are basically the same: you make the replicas pre-build an index prior to using it. The difference is that solution 3 requires the user to manually tell all the replicas to build an index (via config or not, doesn't matter. Definitely via some side channel). And solution 4 does this automatically: master creates an index entry and replicates it as usual. So solution 4 looks better, because it doesn't depend on external sources of truth (config or user) to unify schema between all replicas.

Besides, solution 4 does not block implementation of replica-local indexes, if we ever find them valuable. It's just that replica-local indexes aren't the best solution to the problem described in this RFC.

Long story short, I vote for option №4

sergepetrenko Jan 22, 2025
Maintainer

Please, add some details to solution 4.

Why is it necessary to turn lazy indexes into normal ones? What is a difference between a fully built lazy index and a normal one? AFAIU once a lazy index is built (doesn't matter, on replica or on master), it becomes accessible for user requests. So there's no point in an extra step.
What happens if you try to turn a lazy index into normal one on a replica, while the index is still being built? Will this operation block replication until the index build finishes?

Gerold103 · 2025-01-24T21:05:41Z

Gerold103
Jan 24, 2025
Collaborator Author

@sergepetrenko, @locker, @Serpentian, please, have another look.

The idea about lazy index auto completion I had to drop, see the Alternatives way.
Also I paid more attention to a broader problem - space alter in general. Including space format change.

It would also be cool, if you could delete your old comments unless there is still something unresolved. So we could keep the discussion's page clean. Unfortunately, GitHub doesn't yet have a nice way to "resolve" comments without deleting them.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tarantool

RFC: Improving the build of large indexes on replicas #11018

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 3 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Tarantool

RFC: Improving the build of large indexes on replicas #11018

Gerold103 Jan 8, 2025 Collaborator

Reviewers

Tickets

Summary

✅ Solution: lazy index

Lua API and behaviour

Internal details

Memtx

Vinyl

Build fiber

❓ Open questions

⭐️ Naming suggestions

⭐️ Solution: force async

⭐️ Solution: space clone

⭐️ Solution: lazy index, lazy format

Alternatives

♻️ Solution: lazy index with auto-complete

♻️ Solution: nothing

♻️ Solution: replica-local index

Replies: 2 comments · 3 replies

sergepetrenko Jan 14, 2025 Maintainer

Gerold103 Jan 20, 2025 Collaborator Author

sergepetrenko Jan 22, 2025 Maintainer

sergepetrenko Jan 22, 2025 Maintainer

Gerold103 Jan 24, 2025 Collaborator Author

Gerold103
Jan 8, 2025
Collaborator

Replies: 2 comments 3 replies

sergepetrenko
Jan 14, 2025
Maintainer

Gerold103 Jan 20, 2025
Collaborator Author

sergepetrenko Jan 22, 2025
Maintainer

sergepetrenko Jan 22, 2025
Maintainer

Gerold103
Jan 24, 2025
Collaborator Author