Cherry pick for 0.1.34 #553

Serpentian · 2025-05-20T14:18:41Z

No description provided.

During upgrade from Tarantool 2.X to 3.X instance doesn't have name for some time (until box.schema.upgrade() is run). During this time master throws the alert UNREACHABLE_REPLICA, even though it's fully functional. This happens due to the fact, that named identification is used and vshard supposes, that box.info.name is accessible. It uses it in order to identify the current instance and fails. Let's use box.info.id in order to identify the current instance. Closes tarantool#13 NO_DOC=bugfix

When name is not set, UNREACHABLE_REPLICA alert is shown as "Replica cdata<void *>: NULL isn't active", which is not very informative. Let's fallback to box.info.id, when named identification is used, but instance name is not set yet. Follow-up tarantool#13 NO_DOC=bugfix

In test `test_named_hot_reload` there is no reconfiguration to uuid_as_key identification_mode in the end of the test. This test negatively influences other tests which expect uuid_as_key identification_mode. Needed for tarantool#493 NO_DOC=test

The function `assert_server_no_alerts` is moved from `persistent_names_test` into `test/luatest_helpers/asserts.lua`. This change has been made in order to use this function in subsequent commit in `test_alerts_for_non_vshard_config_template`. Also some usages of this function are changed in `persistent_names_test` with `asserts` module. Needed for tarantool#493 NO_DOC=test

This test expected alert for non-vshard replica to be shown, which is incorrect behavior and will be fixed in the subsequent commit. Needed for tarantool#493 NO_DOC=test

Before this patch vshard throwed alerts for replicas, which are not in its config. E.g. disconnected CDC replica caused `UNREACHABLE_REPLICA` alert. This patch fixes it by skipping unknown replicas. Closes tarantool#493 NO_DOC=bugfix

This test checks scenario when replica disconnects from replicaset in named vshard config. In this situation two alerts should be thrown: `UNREACHABLE_REPLICA` and `UNREACHABLE_REPLICASET`. It is necessary to add this test in order to reduce untested places in the codebase. Also a function `info_assert_alert` is removed from `router_2_2_test` since there is a more appropriate function in `asserts` module. Follow-up tarantool#493 NO_DOC=test

Serpentian · 2025-05-20T14:19:23Z

Blocked with #544

Serpentian · 2025-05-20T14:25:06Z

@Gerold103, we're blocked on backporting with #544. What do you think, if we merge it as is and then, in the separate commit, add all review fixes. This commit will be pushed to vshard-ee and vshard-ce.

This way we'll get the same commit history in ee and ce. Otherwise, in CE the fixes will be done in the scope of PR, without additional commits and in EE it'll be additional one

The vshard config in `replicaset_3_test.lua` has `uuid_as_key` identification_mode. In the test `test_named_replicaset` the config is changed its mode to `name_as_key` and master node is changed its name on `replica_1_a`. After that config is become `uuid_as_key` but the name of master node remains the same. This test negatively influences other tests which don't expect that master node has its own instance_name in `uuid_as_key` vshard config. This patch fixes it by dropping instance_name of `g.replica_1_a`. Needed for tarantool#513 NO_DOC=test

Before this patch `storage_info` failed due to `replica.upstream` being nil. This error happened when replica disconnected from master or when non-vshard replica (with name identical to the master name) connected to a replicaset with `uuid_as_key` identification_mode. This patch fixes this issue by skipping all non-master replicas (even those that have the name like master node) and checking `replica.upstream` connection. With nil connection, an alert `UNREACHABLE_MASTER` is thrown and `replication` estimation is skipped. Closes tarantool#513 NO_DOC=bugfix

This commit introduces basic performance test for replace and get requests. Now it supports only one router, which is needed for example cluster. It also explicitly hardcodes tuples and space name. So, now the performance test works only with example cluster. Closes tarantool#538 NO_DOC=example NO_TEST=example

All deleted buckets are saved in the `route_map` to redirect routers that arrive at the wrong replicaset. However, before this patch, the map was cleared once every 500 years. The issue was that if the map is empty, the clear deadline was set to infinity. Let's set the timeout for cache to be `BUCKET_SENT_GARBAGE_DELAY` and not `TIMEOUT_INFINITY`, as it was before the patch. Note, that deadline in the patch have been replaced with timeout in order to allow testing. Otherwise, as soon as deadline is set with big timeout, it cannot be reset. Additionally, the `BUCKET_SENT_GARBAGE_DELAY` constant has been increased from 0.5 seconds to 60. It doesn't make sense to save the cache just for half of the second: if GC works actively, the cache will be cleared almost immediately. Closes tarantool#546 NO_DOC=bugfix

This commit makes bucket status functions accept status instead of the bucket tuple. There's no need to index the cdata several times. It also moves the functions before `bucket_commit_update`, since in the subsuquent commit they will be used there. Needed for tarantool#285 NO_DOC=refactoring NO_TEST=refactoring

Refs are used to protect buckets from rebalancing and GC. They're taken on every user request and released after it. This is the most hot place in vshard. Ref has an optimization that if there is already > 0 refs of the requested type, it works as an increment. But if the current ref count is 0, it makes _bucket:get() to check bucket state. That might be expensive - a tuple is returned to Lua as cdata, which increases GC pressure. Also it might be relatively slow if bucket count is millions. We can completely get rid of accessing the space, when ref exists. For that this commit makes the ref locks to be automatically set, when they're needed. Closes tarantool#285 Related tarantool#229 NO_DOC=refactoring

This commit replaces all occurencies of the consts.BUCKET. This way we can index 0 tables, instead of 2 as it was before this patch: `consts` and `BUCKET`. NO_DOC=refactoring NO_TEST=refactoring

The part of the router/router test, where it's checked, that write request for `SENDING` bucket fails, has become very flaky after the implementation of the tarantool#285. Before the patch there was TRANSFER_IN_PROGRESS error, now - BUCKET_IS_LOCKED. However, this BUCKET_IS_LOCKED is retryable, TRANSFER_IN_PROGRESS is not. This causes the flakiness of the test, since now we can get either TimedOut or BUCKET_IS_LOCKED error. Let's change the test, so that it becomes expected behavior. It's also required to reset the bucket, since all of the subsequent tests expect this bucket location to be unknown. Follow-up tarantool#285 NO_DOC=test

At the end test drops the weight of one replicaset to 0 and checks, that all other replicasets has exactly 100 buckets. However, we didn't wait for rebalancing to end, which causes the flakiness: it may happen, that not all buckets are ACTIVE, when we check them. Let's wait for rebalancing to end and check the states of buckets only after that. NO_DOC=test

The configuration of the router has not been reset at the end of the `test_info_with_named_identification` case, which affects all subsequent tests. Let's properly reset it. Needed for tarantool#548 NO_DOC=testfix

A retry is only illegal when we can't say if it was already applied or not. TRANSFER_IS_IN_PROGRESS states that the request wasn't applied. So we can we can safely retry it. Let's make the TRANSFER_IS_IN_PROGRESS consistent with BUCKET_IS_LOCKED error, they both should be retryable. Closes tarantool#548 Follow-up tarantool#285 NO_DOC=bugfix

When the old master hangs, we firstly make a call to it to check if it's still alive. After that, we contact each replica in the replicaset and wait for their responses. As a result, we end up calling the dead master twice, which causes the master discovery process to lag by 3 * MASTER_SEARCH_TIMEOUT seconds (2 times in `locate_master()` while waiting for responses from dead master and 1 before waking up of the master search fiber). We can reduce the time required to discover the new master by skipping the dead old master during the iteration over replicas. This would limit the delay on the router to a maximum of 2 * MASTER_SEARCH_TIMEOUT per master search iteration, if only one node is down. Closes tarantool#549 NO_DOC=bugfix

Serpentian · 2025-05-20T14:37:07Z

https://github.com/tarantool/vshard-ee/commit/6198845f5ec098b529306f99ad5feecb14647bab is intentionally not backported, it doesn't work on GitHub CI runners:

Run # Dynamically linked libraries.
Reading package lists...
Building dependency tree...
Reading state information...
libicu-dev is already the newest version (70.1-2).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.
'/home/runner/tnt-release/usr/local/bin/tarantool' -> '/usr/local/bin/tarantool'
cp: cannot create directory '/usr/local/include/tarantool': Permission denied
cp: cannot create directory '/usr/local/lib/tarantool': Permission denied
cp: cannot create directory '/usr/local/share/tarantool': Permission denied
cp: cannot create regular file '/usr/local/share/man/man1/tarantool.1': Permission denied
'/home/runner/tnt-release/usr/local/share/man/man1/tarantool.1' -> '/usr/local/share/man/man1/tarantool.1'
Error: Process completed with exit code 1.

Gerold103 · 2025-05-21T16:41:31Z

What do you think

Looks good. Just lets polish those fixes and push them to EE first. And then backport all of that together.

Before this patch several small remarks needed fixing. Now, the `luatest_helpers.asserts` module is removed as unused and the `assert_server_no_alerts` is moved into `luatest_helpers.server` module because it takes the server as an argument. Also the `info_assert_alerts` is moved into `luatest_helpers.vtest` module. Needed for tarantool#493 NO_DOC=bugfix

Serpentian · 2025-05-26T13:41:28Z

@Gerold103, cherry-picks are ready to be merged

@TarantoolBot

Closes tarantool#408 @TarantoolBot document Title: vshard: introduce `connection_fetch_schema` config option Root document: https://www.tarantool.io/ru/doc/latest/reference/reference_rock/vshard/vshard_ref/ A new configuration option `connection_fetch_schema` has been added. It controls whether vshard connections should fetch the database schema (to allow using `conn.schema.<...>` or `conn.space.<space_name>`). vshard itself doesn’t need this fetching at all, so you can safely set `connection_fetch_schema = false` - as long as you’re not relying on its internal connections. This will improve performance. By default, it’s set to true (enabled) for backward compatibility.

Serpentian and others added 7 commits May 20, 2025 17:16

test: delete test_id_is_shown_in_alerts as unnecessary

723606d

This test expected alert for non-vshard replica to be shown, which is incorrect behavior and will be fixed in the subsequent commit. Needed for tarantool#493 NO_DOC=test

storage: disable alerts for non vshard replicas

bee6b32

Before this patch vshard throwed alerts for replicas, which are not in its config. E.g. disconnected CDC replica caused `UNREACHABLE_REPLICA` alert. This patch fixes it by skipping unknown replicas. Closes tarantool#493 NO_DOC=bugfix

Serpentian added the blocked Not ready to be implemented label May 20, 2025

Serpentian mentioned this pull request May 20, 2025

Optimize bucket_ref() to not make _bucket:get() when _bucket does not change #285

Closed

mrForza and others added 12 commits May 20, 2025 17:35

storage: access bucket states faster

5bb3f84

This commit replaces all occurencies of the consts.BUCKET. This way we can index 0 tables, instead of 2 as it was before this patch: `consts` and `BUCKET`. NO_DOC=refactoring NO_TEST=refactoring

test: fix router/router_2_2_test

ddb1e53

The configuration of the router has not been reset at the end of the `test_info_with_named_identification` case, which affects all subsequent tests. Let's properly reset it. Needed for tarantool#548 NO_DOC=testfix

Serpentian force-pushed the cherry-pick-for-0-1-34 branch from cea2c63 to 0253ae2 Compare May 20, 2025 14:35

Serpentian mentioned this pull request May 26, 2025

Backport storage-alerts #544

Closed

Serpentian removed the blocked Not ready to be implemented label May 26, 2025

Serpentian requested a review from Gerold103 May 26, 2025 13:41

Serpentian assigned Gerold103 May 26, 2025

Gerold103 approved these changes May 28, 2025

View reviewed changes

Gerold103 merged commit c83deb1 into tarantool:master May 28, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cherry pick for 0.1.34 #553

Cherry pick for 0.1.34 #553

Uh oh!

Serpentian commented May 20, 2025

Uh oh!

Serpentian commented May 20, 2025

Uh oh!

Serpentian commented May 20, 2025

Uh oh!

Serpentian commented May 20, 2025

Uh oh!

Gerold103 commented May 21, 2025

Uh oh!

Serpentian commented May 26, 2025

Uh oh!

Uh oh!

Uh oh!

Cherry pick for 0.1.34 #553

Cherry pick for 0.1.34 #553

Uh oh!

Conversation

Serpentian commented May 20, 2025

Uh oh!

Serpentian commented May 20, 2025

Uh oh!

Serpentian commented May 20, 2025

Uh oh!

Serpentian commented May 20, 2025

Uh oh!

Gerold103 commented May 21, 2025

Uh oh!

Serpentian commented May 26, 2025

Uh oh!

Uh oh!

Uh oh!