Description
problem
I've got a fairly slow test cluster I've been mocking up CloudStack on (an 8-node Supermicro MicroCloud with Xeon-D processors and 64G RAM each). It's running hyperconverged with Ceph, KVM, and CloudStack manager nodes (manager nodes are only on 3 of the members); all are interconnected with VXLAN-EVPN on dual 25G Mellanox ConnectX-4 NICs.
During high-load events, for example when I bring up a bunch of network tiers in a VPC and also provision 8-20 VM instances (all with Terraform), I'll notice both VPC Virtual Routers go to PRIMARY instead of one being PRIMARY and the other BACKUP as is normally the case.
This is really bad: the VIP is then owned by both Virtual Routers, so traffic is dropped constantly and the entire VPC becomes unusable. Restarting the VPC or killing one of the VRs recovers it to a good state.
I understand that if the VRs get starved they stop responding to the redundancy protocol and the backup then promotes itself to primary, but realistically, once load drops again, the routers should detect the conflict and demote one back to the BACKUP state.
I haven't looked under the covers at what is being used, but I assume it is something like keepalived, and this feels like a configuration issue of some sort.
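As far as I know, CloudStack's redundant routers do use keepalived for VRRP, and a dual-PRIMARY state normally self-heals once advertisements flow again, provided preemption is enabled and the two routers have distinct priorities. A minimal sketch of the keepalived.conf knobs involved (the interface name, VIP, and priority values here are placeholders, not CloudStack's actual generated config):

```
vrrp_instance inside_vip {
    state BACKUP            # both routers start as BACKUP; election picks the primary
    interface eth2          # placeholder interface
    virtual_router_id 51
    priority 100            # the peer should use a lower value, e.g. 50
    advert_int 1            # seconds between advertisements; a starved router
                            # missing ~3 intervals triggers the backup's takeover
    # preemption is keepalived's default; a "nopreempt" setting (or equal
    # priorities on both peers) is the kind of thing that could let a
    # dual-PRIMARY state persist after load subsides
    virtual_ipaddress {
        10.1.1.1/24         # placeholder VIP
    }
}
```

If both routers keep advertising as primary with equal priority, VRRP falls back to comparing source IPs to break the tie, so checking the generated config on each VR for the actual priority and preempt settings would be a useful next diagnostic step.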
versions
4.21.0
The steps to reproduce the bug
Use Terraform to create the VPC, network tiers, and VM Instances.
Terraform configuration being used is here: https://github.com/bradh352/terraform-config
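For a smaller self-contained reproduction without cloning the repo above, something along these lines with the CloudStack Terraform provider should exercise the same path; all offering, template, and zone names are placeholders for whatever exists in the target environment, and the VPC offering must be a redundant-router one for the PRIMARY/BACKUP pair to exist at all:

```
resource "cloudstack_vpc" "repro" {
  name         = "repro-vpc"
  cidr         = "10.0.0.0/16"
  vpc_offering = "Redundant VPC offering"   # placeholder; must enable redundant VRs
  zone         = "zone1"                    # placeholder zone
}

resource "cloudstack_network" "tier" {
  count            = 4
  name             = "tier-${count.index}"
  cidr             = "10.0.${count.index}.0/24"
  network_offering = "DefaultIsolatedNetworkOfferingForVpcNetworks"  # placeholder
  zone             = "zone1"
  vpc_id           = cloudstack_vpc.repro.id
}

resource "cloudstack_instance" "vm" {
  count            = 12
  name             = "vm-${count.index}"
  service_offering = "Small Instance"       # placeholder
  template         = "ubuntu-22.04"         # placeholder
  network_id       = cloudstack_network.tier[count.index % 4].id
  zone             = "zone1"
}
```

Applying this in one shot creates the tiers and instances concurrently, which is the burst of load that seems to starve the VRs.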
What to do about it?
No response