Skip to content

Conversation

Serpentian
Copy link
Collaborator

No description provided.

Serpentian and others added 7 commits May 20, 2025 17:16
During upgrade from Tarantool 2.X to 3.X instance doesn't have name
for some time (until box.schema.upgrade() is run). During this time
master throws the alert UNREACHABLE_REPLICA, even though it's fully
functional.

This happens due to the fact, that named identification is used and
vshard supposes, that box.info.name is accessible. It uses it in
order to identify the current instance and fails.

Let's use box.info.id in order to identify the current instance.

Closes tarantool#13

NO_DOC=bugfix
When name is not set, UNREACHABLE_REPLICA alert is shown as "Replica
cdata<void *>: NULL isn't active", which is not very informative.

Let's fallback to box.info.id, when named identification is used, but
instance name is not set yet.

Follow-up tarantool#13

NO_DOC=bugfix
In test `test_named_hot_reload` there is no reconfiguration to
uuid_as_key identification_mode in the end of the test. This
test negatively influences other tests which expect uuid_as_key
identification_mode.

Needed for tarantool#493

NO_DOC=test
The function `assert_server_no_alerts` is moved from
`persistent_names_test` into `test/luatest_helpers/asserts.lua`.
This change has been made in order to use this function in subsequent
commit in `test_alerts_for_non_vshard_config_template`. Also some usages
of this function are changed in `persistent_names_test` with `asserts`
module.

Needed for tarantool#493

NO_DOC=test
This test expected alert for non-vshard replica to be shown, which is
incorrect behavior and will be fixed in the subsequent commit.

Needed for tarantool#493

NO_DOC=test
Before this patch vshard throwed alerts for replicas, which are not in
its config. E.g. disconnected CDC replica caused `UNREACHABLE_REPLICA`
alert. This patch fixes it by skipping unknown replicas.

Closes tarantool#493

NO_DOC=bugfix
This test checks scenario when replica disconnects from replicaset
in named vshard config. In this situation two alerts should be thrown:
`UNREACHABLE_REPLICA` and `UNREACHABLE_REPLICASET`. It is necessary to
add this test in order to reduce untested places in the codebase. Also
a function `info_assert_alert` is removed from `router_2_2_test` since
there is a more appropriate function in `asserts` module.

Follow-up tarantool#493

NO_DOC=test
@Serpentian Serpentian added the blocked Not ready to be implemented label May 20, 2025
@Serpentian
Copy link
Collaborator Author

Blocked with #544

@Serpentian
Copy link
Collaborator Author

@Gerold103, we're blocked on backporting with #544. What do you think, if we merge it as is and then, in the separate commit, add all review fixes. This commit will be pushed to vshard-ee and vshard-ce.

This way we'll get the same commit history in ee and ce. Otherwise, in CE the fixes will be done in the scope of PR, without additional commits and in EE it'll be additional one

mrForza and others added 12 commits May 20, 2025 17:35
The vshard config in `replicaset_3_test.lua` has `uuid_as_key`
identification_mode. In the test `test_named_replicaset` the config
is changed its mode to `name_as_key` and master node is changed its name
on `replica_1_a`. After that config is become `uuid_as_key` but the name
of master node remains the same.

This test negatively influences other tests which don't expect that
master node has its own instance_name in `uuid_as_key` vshard config.
This patch fixes it by dropping instance_name of `g.replica_1_a`.

Needed for tarantool#513

NO_DOC=test
Before this patch `storage_info` failed due to `replica.upstream` being
nil. This error happened when replica disconnected from master or when
non-vshard replica (with name identical to the master name) connected to
a replicaset with `uuid_as_key` identification_mode.

This patch fixes this issue by skipping all non-master replicas (even
those that have the name like master node) and checking
`replica.upstream` connection. With nil connection, an alert
`UNREACHABLE_MASTER` is thrown and `replication` estimation is skipped.

Closes tarantool#513

NO_DOC=bugfix
This commit introduces basic performance test for replace and
get requests. Now it supports only one router, which is needed
for example cluster. It also explicitly hardcodes tuples and
space name.

So, now the performance test works only with example cluster.

Closes tarantool#538

NO_DOC=example
NO_TEST=example
All deleted buckets are saved in the `route_map` to redirect routers
that arrive at the wrong replicaset. However, before this patch, the
map was cleared once every 500 years. The issue was that if the map
is empty, the clear deadline was set to infinity.

Let's set the timeout for cache to be `BUCKET_SENT_GARBAGE_DELAY` and
not `TIMEOUT_INFINITY`, as it was before the patch.

Note, that deadline in the patch have been replaced with timeout in
order to allow testing. Otherwise, as soon as deadline is set with big
timeout, it cannot be reset.

Additionally, the `BUCKET_SENT_GARBAGE_DELAY` constant has been
increased from 0.5 seconds to 60. It doesn't make sense to save
the cache just for half of the second: if GC works actively, the
cache will be cleared almost immediately.

Closes tarantool#546

NO_DOC=bugfix
This commit makes bucket status functions accept status instead of
the bucket tuple. There's no need to index the cdata several times.

It also moves the functions before `bucket_commit_update`, since in
the subsuquent commit they will be used there.

Needed for tarantool#285

NO_DOC=refactoring
NO_TEST=refactoring
Refs are used to protect buckets from rebalancing and GC. They're
taken on every user request and released after it. This is the most
hot place in vshard.

Ref has an optimization that if there is already > 0 refs of
the requested type, it works as an increment. But if the current
ref count is 0, it makes _bucket:get() to check bucket state.
That might be expensive - a tuple is returned to Lua as cdata,
which increases GC pressure. Also it might be relatively slow if
bucket count is millions.

We can completely get rid of accessing the space, when ref
exists. For that this commit makes the ref locks to be automatically
set, when they're needed.

Closes tarantool#285
Related tarantool#229

NO_DOC=refactoring
This commit replaces all occurencies of the consts.BUCKET. This way
we can index 0 tables, instead of 2 as it was before this patch:
`consts` and `BUCKET`.

NO_DOC=refactoring
NO_TEST=refactoring
The part of the router/router test, where it's checked, that write
request for `SENDING` bucket fails, has become very flaky after
the implementation of the tarantool#285. Before the
patch there was TRANSFER_IN_PROGRESS error, now - BUCKET_IS_LOCKED.

However, this BUCKET_IS_LOCKED is retryable, TRANSFER_IN_PROGRESS
is not. This causes the flakiness of the test, since now we can get
either TimedOut or BUCKET_IS_LOCKED error. Let's change the test, so
that it becomes expected behavior.

It's also required to reset the bucket, since all of the subsequent
tests expect this bucket location to be unknown.

Follow-up tarantool#285

NO_DOC=test
At the end test drops the weight of one replicaset to 0 and
checks, that all other replicasets has exactly 100 buckets. However,
we didn't wait for rebalancing to end, which causes the flakiness:
it may happen, that not all buckets are ACTIVE, when we check them.

Let's wait for rebalancing to end and check the states of buckets
only after that.

NO_DOC=test
The configuration of the router has not been reset at the end of the
`test_info_with_named_identification` case, which affects all
subsequent tests. Let's properly reset it.

Needed for tarantool#548

NO_DOC=testfix
A retry is only illegal when we can't say if it was already applied or
not. TRANSFER_IS_IN_PROGRESS states that the request wasn't applied. So we
can  we can safely retry it.

Let's make the TRANSFER_IS_IN_PROGRESS consistent with BUCKET_IS_LOCKED
error, they both should be retryable.

Closes tarantool#548
Follow-up tarantool#285

NO_DOC=bugfix
When the old master hangs, we firstly make a call to it to check if it's
still alive. After that, we contact each replica in the replicaset and
wait for their responses. As a result, we end up calling the dead master
twice, which causes the master discovery process to lag by 3 *
MASTER_SEARCH_TIMEOUT seconds (2 times in `locate_master()` while
waiting for responses from dead master and 1 before waking up of the
master search fiber).

We can reduce the time required to discover the new master by skipping
the dead old master during the iteration over replicas. This would limit
the delay on the router to a maximum of 2 * MASTER_SEARCH_TIMEOUT per
master search iteration, if only one node is down.

Closes tarantool#549

NO_DOC=bugfix
@Serpentian Serpentian force-pushed the cherry-pick-for-0-1-34 branch from cea2c63 to 0253ae2 Compare May 20, 2025 14:35
@Serpentian
Copy link
Collaborator Author

https://github.com/tarantool/vshard-ee/commit/6198845f5ec098b529306f99ad5feecb14647bab is intentionally not backported, it doesn't work on GitHub CI runners:

Run # Dynamically linked libraries.
Reading package lists...
Building dependency tree...
Reading state information...
libicu-dev is already the newest version (70.1-2).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.
'/home/runner/tnt-release/usr/local/bin/tarantool' -> '/usr/local/bin/tarantool'
cp: cannot create directory '/usr/local/include/tarantool': Permission denied
cp: cannot create directory '/usr/local/lib/tarantool': Permission denied
cp: cannot create directory '/usr/local/share/tarantool': Permission denied
cp: cannot create regular file '/usr/local/share/man/man1/tarantool.1': Permission denied
'/home/runner/tnt-release/usr/local/share/man/man1/tarantool.1' -> '/usr/local/share/man/man1/tarantool.1'
Error: Process completed with exit code 1.

@Gerold103
Copy link
Collaborator

What do you think

Looks good. Just lets polish those fixes and push them to EE first. And then backport all of that together.

@Serpentian Serpentian removed the blocked Not ready to be implemented label May 26, 2025
Before this patch several small remarks needed fixing.

Now, the `luatest_helpers.asserts` module is removed as unused and the
`assert_server_no_alerts` is moved into `luatest_helpers.server` module
because it takes the server as an argument. Also the
`info_assert_alerts` is moved into `luatest_helpers.vtest` module.

Needed for tarantool#493

NO_DOC=bugfix
@Serpentian Serpentian requested a review from Gerold103 May 26, 2025 13:41
@Serpentian
Copy link
Collaborator Author

@Gerold103, cherry-picks are ready to be merged

Closes tarantool#408

@TarantoolBot document
Title: vshard: introduce `connection_fetch_schema` config option

Root document: https://www.tarantool.io/ru/doc/latest/reference/reference_rock/vshard/vshard_ref/

A new configuration option `connection_fetch_schema` has been added.

It controls whether vshard connections should fetch the database schema
(to allow using `conn.schema.<...>` or `conn.space.<space_name>`).

vshard itself doesn’t need this fetching at all, so you can safely set
`connection_fetch_schema = false` - as long as you’re not relying on its
internal connections. This will improve performance.

By default, it’s set to true (enabled) for backward compatibility.
@Gerold103 Gerold103 merged commit c83deb1 into tarantool:master May 28, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants