Migration stuck in CONNECTING state #4797

Open
andydunstall opened this issue Mar 18, 2025 · 6 comments

@andydunstall
Contributor

Since updating our system tests to v1.28, some migrations are getting stuck in the CONNECTING state for 15m+, even though both the source and target nodes are healthy. About 50% of our test runs hit this issue.

On the source node, Migration initiating and Connecting to target node are logged in a busy loop for a few seconds (~30k times in 7 seconds), then there is no further output, though SLOT-MIGRATION-STATUS still reports the state as CONNECTING.

There could be a regression in the control plane, though I don't see any related changes that could have caused this. As far as I can see, the cluster configuration looks valid.

Will keep looking and trying to reproduce, so will add more info...
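For catching this in a system test without waiting 15m+, a bounded poll is one option. The following is only a minimal sketch: `get_state` is a hypothetical callable that returns whatever state SLOT-MIGRATION-STATUS reports for the migration, and treating FINISHED as the only terminal state is an assumption based on the states named in this thread.

```python
import time


class MigrationStuckError(TimeoutError):
    """Raised when a migration does not reach FINISHED before the deadline."""


def wait_for_finished(get_state, timeout_s, poll_interval_s=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns "FINISHED" or timeout_s elapses.

    get_state is a hypothetical hook: it should return the state string
    reported for the migration (e.g. "CONNECTING", "SYNC", "FINISHED").
    clock and sleep are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    last_state = get_state()
    while last_state != "FINISHED":
        if clock() >= deadline:
            # Include the last observed state so the failure says where it stuck.
            raise MigrationStuckError(
                f"migration still in state {last_state} after {timeout_s}s"
            )
        sleep(poll_interval_s)
        last_state = get_state()
    return last_state
```

Raising with the last observed state makes a stuck CONNECTING run distinguishable from a stuck SYNC run in the test output.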

@andydunstall andydunstall added the bug Something isn't working label Mar 18, 2025
@BorysTheDev
Contributor

da-staging datastore artifacts dst_4or9o54g2 --download ./logs

migration:

migrations:
  - migration_id: migration_oktfku418
    started_at: 2025-03-18 09:32:00
    finished_at: (unset)
    status:
      state: in-progress
      error: ""
    config:
      datastore_id: dst_4or9o54g2
      source:
        shard_id: shard_ivuduo6oh
        node_id: node_j53hhxntj
      target:
        shard_id: shard_5i8o7vbr5
        node_id: node_h6ta41rse
      slot_ranges:
        - start: 10921
          end: 10921

i.e. node_j53hhxntj -> node_h6ta41rse

@andydunstall
Contributor Author

Seeing another issue where the target says the migration has state FINISHED, but the source says it has state SYNC. Again it means the migration is just stuck forever (dst_j8a9dr440/migration_31hzqm2op). Maybe related?

@adiholden adiholden added the Next Up task that is ready to be worked on and should be added to working queue label Mar 19, 2025
@BorysTheDev
Contributor

I20250319 09:45:21.473965 1720 scheduler.cc:480] ------------ Fiber outgoing_migration (suspended:1056085ms) ------------
0x555555f7e29c util::fb2::detail::FiberInterface::SwitchTo()
0x555555f7aa93 util::fb2::detail::Scheduler::Preempt()
0x555555fbb208 util::fb2::FiberCall::Get()
0x555555fc3984 util::fb2::UringSocket::Recv()
0x5555559f0349 dfly::ProtocolClient::ReadRespReply()
0x5555559f0755 dfly::ProtocolClient::SendCommandAndReadResponse()
0x55555590bd44 dfly::cluster::OutgoingMigration::SyncFb()

@BorysTheDev
Contributor

It looks like we can't read from the socket at all
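To illustrate the failure mode in the stack trace (a reader parked in Recv() because the peer never sends a reply), here is a small Python sketch. It is not Dragonfly's fiber-based C++ client, and `recv_with_deadline` is a hypothetical helper. It only shows that a read with no deadline waits indefinitely on a silent peer, while a read with a deadline surfaces the stall.

```python
import socket


def recv_with_deadline(sock, nbytes, timeout_s):
    """Read up to nbytes from sock, giving up after timeout_s seconds.

    Returns the bytes read (b"" means the peer closed the connection),
    or None if nothing arrived before the deadline.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The connection is up, but the peer stayed silent.
        return None


if __name__ == "__main__":
    reader, silent_peer = socket.socketpair()
    # The peer has sent nothing, so a read without a deadline would block here.
    print(recv_with_deadline(reader, 16, timeout_s=0.1))
    reader.close()
    silent_peer.close()
```

Returning None for a timeout and b"" for a closed connection keeps "peer is silent" distinct from "peer went away", which matters when deciding whether to retry or fail the migration.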

@BorysTheDev
Contributor

migrations:

  - migration_id: migration_7dxbpv7c3
    started_at: 2025-03-19 13:34:59
    finished_at: (unset)
    status:
      state: in-progress
      error: ""
    config:
      datastore_id: dst_lj0vl2vi3
      source:
        shard_id: shard_htfg7xztu
        node_id: node_1dkylhwoc
      target:
        shard_id: shard_tg7e7h1mu
        node_id: node_fifk3846c
      slot_ranges:
        - start: 10922
          end: 13651

dst_lj0vl2vi3.zip

@BorysTheDev
Contributor

BorysTheDev commented Mar 20, 2025

I've tried to reproduce it locally in the following ways:

  1. Adding a random delay when sending the config to some nodes: no results.
  2. Sending the config only to the 2 source nodes and not to the target node: no results.
  3. Simulating network issues using a proxy: got a "Connection refused" error and the migration could not finish.
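The "Connection refused" result from the proxy attempt differs from the stack trace earlier in the thread, where the connection is established but the read never returns. One possible variation, offered as an untested assumption and not as something tried in this thread, is a listener that accepts connections and swallows everything it receives without ever replying. `start_black_hole` below is a hypothetical helper for such a setup.

```python
import socket
import threading


def start_black_hole(host="127.0.0.1"):
    """Start a listener that accepts connections but never sends a reply.

    Returns (port, stop). Hypothetical test helper, not part of any project.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, 0))          # port 0: let the OS pick a free port
    server.listen()
    server.settimeout(0.1)          # short timeout so the loop can notice stop()
    port = server.getsockname()[1]
    stopping = threading.Event()

    def swallow(conn):
        # Read and discard whatever the client sends; never write back.
        conn.settimeout(0.1)
        with conn:
            while not stopping.is_set():
                try:
                    if not conn.recv(4096):
                        return      # client closed the connection
                except socket.timeout:
                    continue
                except OSError:
                    return

    def accept_loop():
        with server:
            while not stopping.is_set():
                try:
                    conn, _ = server.accept()
                except socket.timeout:
                    continue
                threading.Thread(target=swallow, args=(conn,), daemon=True).start()

    thread = threading.Thread(target=accept_loop, daemon=True)
    thread.start()

    def stop():
        stopping.set()
        thread.join()

    return port, stop
```

A client that connects to the returned port will have its connection accepted and its writes succeed, but any read will wait until its own timeout, if it has one.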
