Cherry pick for 0.1.34 #553
Conversation
During an upgrade from Tarantool 2.X to 3.X, an instance doesn't have a name for some time (until box.schema.upgrade() is run). During this period the master throws the UNREACHABLE_REPLICA alert even though it's fully functional. This happens because named identification is used and vshard assumes that box.info.name is accessible: it uses it to identify the current instance and fails. Let's use box.info.id to identify the current instance instead. Closes tarantool#13 NO_DOC=bugfix
When the name is not set, the UNREACHABLE_REPLICA alert is shown as "Replica cdata&lt;void *&gt;: NULL isn't active", which is not very informative. Let's fall back to box.info.id when named identification is used but the instance name is not set yet. Follow-up tarantool#13 NO_DOC=bugfix
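A minimal sketch of the fallback idea from the two commits above; the helper name and structure are illustrative, not the actual vshard code:

```lua
-- Illustrative sketch: prefer box.info.name under named identification, but
-- fall back to box.info.id while the name is not set yet (e.g. before
-- box.schema.upgrade() during a 2.X -> 3.X upgrade).
local function instance_key(identification_mode)
    local info = box.info
    if identification_mode == 'name_as_key' and info.name ~= nil then
        return info.name
    end
    -- box.info.id is always available on a configured instance.
    return info.id
end
```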
In the `test_named_hot_reload` test there is no reconfiguration back to the uuid_as_key identification_mode at the end of the test. This negatively influences other tests, which expect the uuid_as_key identification_mode. Needed for tarantool#493 NO_DOC=test
The `assert_server_no_alerts` function is moved from `persistent_names_test` into `test/luatest_helpers/asserts.lua` so that it can be used in a subsequent commit in `test_alerts_for_non_vshard_config_template`. The existing usages of this function in `persistent_names_test` are switched to the `asserts` module. Needed for tarantool#493 NO_DOC=test
This test expected an alert to be shown for a non-vshard replica, which is incorrect behavior and will be fixed in the subsequent commit. Needed for tarantool#493 NO_DOC=test
Before this patch vshard threw alerts for replicas that are not in its config, e.g. a disconnected CDC replica caused an `UNREACHABLE_REPLICA` alert. This patch fixes it by skipping unknown replicas. Closes tarantool#493 NO_DOC=bugfix
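A rough sketch of the skip logic; the names (`known_replicas`, `add_alert`) and the exact status check are hypothetical stand-ins for the real vshard internals:

```lua
-- Illustrative sketch: rows of box.info.replication that are not present in
-- the vshard config (e.g. a CDC replica) are ignored instead of producing
-- UNREACHABLE_REPLICA alerts.
local function check_replication_alerts(known_replicas, add_alert)
    for _, r in pairs(box.info.replication) do
        local cfg_replica = known_replicas[r.uuid]
        if cfg_replica == nil then
            -- Unknown to the vshard config: skip, no alert.
        elseif r.downstream == nil or r.downstream.status == 'stopped' then
            add_alert('UNREACHABLE_REPLICA', cfg_replica.name or r.uuid)
        end
    end
end
```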
This test checks the scenario where a replica disconnects from a replicaset in a named vshard config. In this situation two alerts should be thrown: `UNREACHABLE_REPLICA` and `UNREACHABLE_REPLICASET`. The test is added to reduce untested places in the codebase. Also, the `info_assert_alert` function is removed from `router_2_2_test`, since there is a more appropriate function in the `asserts` module. Follow-up tarantool#493 NO_DOC=test
Blocked by #544
@Gerold103, we're blocked on backporting by #544. What do you think about merging it as is and then adding all review fixes in a separate commit? That commit would be pushed to both vshard-ee and vshard-ce, so we'd get the same commit history in EE and CE. Otherwise, in CE the fixes will be done in the scope of the PR without additional commits, while in EE there will be an additional one.
The vshard config in `replicaset_3_test.lua` has the `uuid_as_key` identification_mode. In the `test_named_replicaset` test the config's mode is changed to `name_as_key` and the master node's name is changed to `replica_1_a`. After that the config becomes `uuid_as_key` again, but the master node keeps its name. This negatively influences other tests, which don't expect the master node to have its own instance_name in a `uuid_as_key` vshard config. This patch fixes it by dropping the instance_name of `g.replica_1_a`. Needed for tarantool#513 NO_DOC=test
Before this patch `storage_info` failed due to `replica.upstream` being nil. This error happened when a replica disconnected from the master or when a non-vshard replica (with a name identical to the master's name) connected to a replicaset with the `uuid_as_key` identification_mode. This patch fixes the issue by skipping all non-master replicas (even those whose name matches the master node's) and checking the `replica.upstream` connection. With a nil connection, an `UNREACHABLE_MASTER` alert is thrown and the `replication` estimation is skipped. Closes tarantool#513 NO_DOC=bugfix
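A simplified sketch of the resulting flow; the names are illustrative and the actual `storage_info` code is more involved:

```lua
-- Illustrative sketch: only the master's row of box.info.replication is
-- inspected; a missing upstream yields an UNREACHABLE_MASTER alert instead
-- of a nil-index error, and the replication estimation is skipped.
local function master_upstream_status(master_uuid, add_alert)
    for _, r in pairs(box.info.replication) do
        -- Non-master rows (even ones named like the master) are skipped.
        if r.uuid == master_uuid then
            if r.upstream == nil then
                add_alert('UNREACHABLE_MASTER', master_uuid)
                return nil
            end
            return r.upstream.status, r.upstream.lag
        end
    end
    return nil
end
```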
This commit introduces a basic performance test for replace and get requests. For now it supports only one router, which is enough for the example cluster, and it explicitly hardcodes the tuples and the space name, so the performance test currently works only with the example cluster. Closes tarantool#538 NO_DOC=example NO_TEST=example
All deleted buckets are saved in the `route_map` to redirect routers that arrive at the wrong replicaset. However, before this patch the map was cleared once every 500 years: if the map was empty, the clear deadline was set to infinity. Let's set the cache timeout to `BUCKET_SENT_GARBAGE_DELAY` instead of `TIMEOUT_INFINITY`, as it was before the patch. Note that the deadline has been replaced with a timeout in this patch to allow testing; otherwise, as soon as the deadline is set with a big timeout, it cannot be reset. Additionally, the `BUCKET_SENT_GARBAGE_DELAY` constant has been increased from 0.5 seconds to 60. It doesn't make sense to keep the cache for just half a second: if GC works actively, the cache will be cleared almost immediately. Closes tarantool#546 NO_DOC=bugfix
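A minimal sketch of the timeout-based cache drop, assuming `BUCKET_SENT_GARBAGE_DELAY = 60` as in the commit message; the variable and function names are illustrative, not the literal vshard code:

```lua
local fiber = require('fiber')

-- Assumed constant from the commit message: cache lifetime in seconds.
local BUCKET_SENT_GARBAGE_DELAY = 60

local route_map = {}
local route_map_update_time = fiber.clock()

local function route_map_set(bucket_id, destination)
    route_map[bucket_id] = destination
    route_map_update_time = fiber.clock()
end

local function route_map_gc()
    -- Drop the cache a fixed timeout after the last update instead of
    -- keeping it forever when the deadline was computed from an empty map.
    if fiber.clock() - route_map_update_time >= BUCKET_SENT_GARBAGE_DELAY then
        route_map = {}
    end
end
```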
This commit makes the bucket status functions accept a status instead of the bucket tuple, so there's no need to index the cdata several times. It also moves the functions before `bucket_commit_update`, since in the subsequent commit they will be used there. Needed for tarantool#285 NO_DOC=refactoring NO_TEST=refactoring
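A small sketch of the signature change; the status strings mirror vshard's public bucket statuses, while the predicate names are illustrative:

```lua
-- Before: the predicates took the bucket tuple and indexed the cdata inside.
-- After: the caller reads bucket.status once and passes the plain string.
local BUCKET = {
    ACTIVE = 'active',
    PINNED = 'pinned',
    SENDING = 'sending',
    RECEIVING = 'receiving',
}

local function bucket_is_writable(status)
    return status == BUCKET.ACTIVE or status == BUCKET.PINNED
end

local function bucket_is_transfer_in_progress(status)
    return status == BUCKET.SENDING or status == BUCKET.RECEIVING
end

-- Usage: local b = box.space._bucket:get{bucket_id}
--        local ok = b ~= nil and bucket_is_writable(b.status)
```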
Refs are used to protect buckets from rebalancing and GC. They're taken on every user request and released after it, which makes this the hottest place in vshard. Ref has an optimization: if there are already > 0 refs of the requested type, taking another one works as an increment. But if the current ref count is 0, it calls _bucket:get() to check the bucket state. That might be expensive - the tuple is returned to Lua as cdata, which increases GC pressure - and relatively slow if the bucket count is in the millions. We can completely avoid accessing the space when a ref already exists. For that, this commit makes the ref locks be set automatically when they're needed. Closes tarantool#285 Related tarantool#229 NO_DOC=refactoring
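A schematic illustration of the fast path; the field names and the simplified state check are illustrative, not the actual ref structure:

```lua
-- Illustrative sketch: when a ref of the requested type already exists, the
-- counter is incremented without touching box.space._bucket at all; the
-- space is read only to create the very first ref.
local function bucket_ref_rw(bucket_refs, bucket_id)
    local ref = bucket_refs[bucket_id]
    if ref ~= nil and ref.rw > 0 then
        ref.rw = ref.rw + 1       -- fast path: no _bucket:get(), no cdata
        return true
    end
    -- Slow path: validate the bucket state once and create the ref.
    local bucket = box.space._bucket:get{bucket_id}
    if bucket == nil or bucket.status ~= 'active' then
        return nil, 'bucket is not writable'
    end
    bucket_refs[bucket_id] = {rw = 1, ro = 0}
    return true
end
```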
This commit replaces all occurrences of consts.BUCKET. This way we index 0 tables instead of 2 (`consts` and then `BUCKET`), as it was before this patch. NO_DOC=refactoring NO_TEST=refactoring
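For illustration, the pattern looks roughly like this (the predicate name is made up for the example):

```lua
local consts = require('vshard.consts')

-- Before: two table lookups on every comparison (consts -> BUCKET -> value).
-- After: the value is cached as a local, so the hot path indexes no tables.
local BUCKET_ACTIVE = consts.BUCKET.ACTIVE
local BUCKET_SENDING = consts.BUCKET.SENDING

local function bucket_is_active(status)
    return status == BUCKET_ACTIVE
end
```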
The part of the router/router test that checks that a write request for a `SENDING` bucket fails has become very flaky after the implementation of tarantool#285. Before that patch the error was TRANSFER_IN_PROGRESS, now it is BUCKET_IS_LOCKED. However, BUCKET_IS_LOCKED is retryable while TRANSFER_IN_PROGRESS is not, which causes the flakiness: now we can get either a TimedOut or a BUCKET_IS_LOCKED error. Let's change the test so that this becomes the expected behavior. It's also required to reset the bucket, since all subsequent tests expect this bucket's location to be unknown. Follow-up tarantool#285 NO_DOC=test
At the end the test drops the weight of one replicaset to 0 and checks that all other replicasets have exactly 100 buckets. However, we didn't wait for rebalancing to end, which causes flakiness: it may happen that not all buckets are ACTIVE when we check them. Let's wait for rebalancing to end and check the bucket states only after that. NO_DOC=test
The configuration of the router has not been reset at the end of the `test_info_with_named_identification` case, which affects all subsequent tests. Let's properly reset it. Needed for tarantool#548 NO_DOC=testfix
A retry is only illegal when we can't tell whether the request was already applied or not. TRANSFER_IS_IN_PROGRESS states that the request wasn't applied, so we can safely retry it. Let's make TRANSFER_IS_IN_PROGRESS consistent with the BUCKET_IS_LOCKED error: they both should be retryable. Closes tarantool#548 Follow-up tarantool#285 NO_DOC=bugfix
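A small sketch of the rule; the table and function names are illustrative, not the actual vshard error module:

```lua
-- Illustrative sketch: an error is retryable when it guarantees the request
-- was NOT applied on the storage; a retry is only illegal when that is
-- unknown.
local RETRYABLE_ERRORS = {
    BUCKET_IS_LOCKED = true,
    TRANSFER_IS_IN_PROGRESS = true,
}

local function request_can_be_retried(err)
    return err ~= nil and RETRYABLE_ERRORS[err.name] == true
end
```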
When the old master hangs, we first make a call to it to check whether it's still alive. After that, we contact each replica in the replicaset and wait for their responses. As a result, we end up calling the dead master twice, which makes the master discovery process lag by 3 * MASTER_SEARCH_TIMEOUT seconds (2 times in `locate_master()` while waiting for responses from the dead master and 1 before the master search fiber wakes up). We can reduce the time required to discover the new master by skipping the dead old master during the iteration over replicas. This limits the delay on the router to at most 2 * MASTER_SEARCH_TIMEOUT per master search iteration if only one node is down. Closes tarantool#549 NO_DOC=bugfix
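Schematically, the change looks like this (names are illustrative, not the real `locate_master()` code):

```lua
-- Illustrative sketch: the old master was already probed directly, so it is
-- excluded from the per-replica poll; with a single dead node this caps the
-- router-side delay at 2 * MASTER_SEARCH_TIMEOUT per master search iteration.
local function master_search_candidates(replicaset)
    local candidates = {}
    for _, replica in pairs(replicaset.replicas) do
        -- Skip the known (possibly dead) old master: waiting on it again
        -- would add another MASTER_SEARCH_TIMEOUT to the search.
        if replica ~= replicaset.master then
            table.insert(candidates, replica)
        end
    end
    return candidates
end
```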
cea2c63 to 0253ae2
https://github.com/tarantool/vshard-ee/commit/6198845f5ec098b529306f99ad5feecb14647bab is intentionally not backported; it doesn't work on GitHub CI runners:
Looks good. Just let's polish those fixes and push them to EE first, and then backport all of that together.
This patch addresses several small review remarks. The `luatest_helpers.asserts` module is removed as unused, and `assert_server_no_alerts` is moved into the `luatest_helpers.server` module because it takes the server as an argument. Also, `info_assert_alerts` is moved into the `luatest_helpers.vtest` module. Needed for tarantool#493 NO_DOC=bugfix
@Gerold103, cherry-picks are ready to be merged.
Closes tarantool#408

@TarantoolBot document
Title: vshard: introduce `connection_fetch_schema` config option
Root document: https://www.tarantool.io/ru/doc/latest/reference/reference_rock/vshard/vshard_ref/

A new configuration option `connection_fetch_schema` has been added. It controls whether vshard connections should fetch the database schema (to allow using `conn.schema.<...>` or `conn.space.<space_name>`). vshard itself doesn't need this fetching at all, so you can safely set `connection_fetch_schema = false`, as long as you're not relying on its internal connections. This will improve performance. By default, it's set to true (enabled) for backward compatibility.
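For example, in a minimal illustrative router configuration (the UUIDs, URI, and instance name below are placeholders, not part of the patch):

```lua
local vshard = require('vshard')

-- Placeholder UUIDs and URI - replace with your own cluster layout.
local cfg = {
    connection_fetch_schema = false,  -- do not fetch schema on connections
    sharding = {
        ['cbf06940-0790-498b-948d-042b62cf3d29'] = {
            replicas = {
                ['8a274925-a26d-47fc-9e1b-af88ce939412'] = {
                    uri = 'storage:storage@127.0.0.1:3301',
                    name = 'storage_1_a',
                    master = true,
                },
            },
        },
    },
}

vshard.router.cfg(cfg)
```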