KVM HA: VM remains Running after virsh destroy; PowerReportMissing schedules HA but KVMInvestigator reports alive=true and restart is cancelled #12920
Description
problem
ISSUE TYPE
- Bug Report
COMPONENT NAME
KVM, Orchestration, HA
CLOUDSTACK VERSION
4.22.x
CONFIGURATION
- KVM hypervisor
- Shared primary storage on NFS
- HA-enabled user VM
- sync.interval = 60
- no ha.tag configured
- tested with an HA-enabled VM deployed on a healthy KVM host
- multiple management servers in the zone/cluster
OS / ENVIRONMENT
- CloudStack management servers on Ubuntu 24.04
- MySQL 8
- KVM hosts on Linux/libvirt
- Primary storage: NFS
SUMMARY
On CloudStack 4.22.x, if a KVM VM is stopped unexpectedly on the hypervisor using virsh destroy, CloudStack detects PowerReportMissing, waits for the grace period, and schedules HA restart work. However, the HA worker then fails to restart the VM because KVMInvestigator reports the VM as alive (alive? true) while the host is still up.
As a result:
- the VM remains in `Running` state in CloudStack/UI
- the VM is not transitioned to `Stopped`
- HA does not restart it
- the same HA scheduling/investigation loop repeats on subsequent sync cycles
This appears related to #10406 / #10407, which were intended to fix cases where VMs were not moving to Stopped when PowerReportMissing is processed.
EXPECTED RESULTS
After the grace period passes, CloudStack should process PowerReportMissing, transition the VM to Stopped, and, because HA is enabled, restart the VM automatically.
Expected behavior for this test case:
- `virsh destroy <domain>` removes the libvirt domain.
- CloudStack detects the VM as missing.
- After the graceful period expires, CloudStack updates the VM power report to `PowerReportMissing`.
- CloudStack transitions the VM state from `Running` to `Stopped`.
- HA schedules a restart for the VM.
- The VM is restarted automatically on an eligible host.
- The CloudStack UI/API reflects the VM state correctly and does not continue to show the VM as `Running`.
ACTUAL RESULTS
CloudStack detects the VM as missing and the graceful period is working correctly:
2026-03-31 02:28:43,791 DEBUG ... Detected missing VM. host: 6, vm id: 91(...), power state: PowerReportMissing, last state update: 2026-03-31T02:27:43+0000
2026-03-31 02:28:43,791 DEBUG ... vm id: 91 - time since last state update(60791 ms) has not passed graceful period yet
2026-03-31 02:29:43,722 DEBUG ... Detected missing VM. host: 6, vm id: 91(...), power state: PowerReportMissing, last state update: 2026-03-31T02:27:43+0000
2026-03-31 02:29:43,722 DEBUG ... vm id: 91 - time since last state update(120722 ms) has passed graceful period
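The elapsed times in these log lines can be reproduced from the timestamps alone. The following is a minimal sketch, not CloudStack code: the 120,000 ms threshold is an assumption inferred from the two log lines (60,791 ms "has not passed", 120,722 ms "has passed"), and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: inferred from the log output above, not read from CloudStack.
ASSUMED_GRACEFUL_PERIOD_MS = 120_000


def elapsed_ms(last_update: str, detected_at: str) -> int:
    """Milliseconds between the last state update and a detection log line."""
    start = datetime.strptime(last_update, "%Y-%m-%dT%H:%M:%S%z")
    # Log timestamps use a comma before the milliseconds; this sketch
    # assumes they are in UTC, like the "last state update" value.
    end = datetime.strptime(detected_at, "%Y-%m-%d %H:%M:%S,%f")
    end = end.replace(tzinfo=timezone.utc)
    return (end - start) // timedelta(milliseconds=1)


def has_passed_graceful_period(ms: int) -> bool:
    """Compare an elapsed time against the assumed threshold."""
    return ms > ASSUMED_GRACEFUL_PERIOD_MS


first = elapsed_ms("2026-03-31T02:27:43+0000", "2026-03-31 02:28:43,791")
second = elapsed_ms("2026-03-31T02:27:43+0000", "2026-03-31 02:29:43,722")
print(first, has_passed_graceful_period(first))    # 60791 False
print(second, has_passed_graceful_period(second))  # 120722 True
```

Both results match the values printed by the management server, which supports the observation that the graceful-period handling itself is not where the problem lies.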
After the graceful period passes, CloudStack updates the VM power report and schedules HA restart work:
2026-03-31 02:29:43,742 DEBUG ... VM state report is updated. Host {...}, VM instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...}, power state: PowerReportMissing
2026-03-31 02:29:43,775 INFO ... Detected out-of-band stop of a HA enabled VM ... will schedule restart.
2026-03-31 02:29:43,798 INFO ... Schedule vm for HA: VM instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
2026-03-31 02:29:43,820 INFO ... HA on VM instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
The HA worker checks the VM, and the host-side agent confirms that the libvirt domain no longer exists:
2026-03-31 02:29:43,855 DEBUG ... Unable to get vm state on VM instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
KVM host agent log:
2026-03-31 02:29:43,928 ERROR ... Could not get state for VM [i-2-91-VM] (retry=0) due to: org.libvirt.LibvirtException: Domain not found: no domain with matching name 'i-2-91-VM'
However, KVMInvestigator then reports the VM as alive, and the HA restart is cancelled:
2026-03-31 02:29:43,859 INFO ... KVMInvestigator found VM instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...} to be alive? true
2026-03-31 02:29:43,860 INFO ... VM instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...} is alive and host is up. No need to restart it.
This same pattern repeats on later sync cycles, including 02:31:43, 02:33:43, and 02:36:43.
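To check how often this schedule-then-cancel pattern occurs in a management server log, something like the sketch below could help. It is only an illustration: the helper name is hypothetical, and because the log prefixes are elided ("...") in the excerpts above, it matches only on message substrings that appear in this report.

```python
import re
from collections import Counter

# Matches the instanceName field as it appears in the excerpts above.
INSTANCE_RE = re.compile(r'"instanceName":"([^"]+)"')


def count_cancelled_ha(lines):
    """Return {instance name: (times HA was scheduled, times reported alive)}."""
    scheduled = Counter()
    reported_alive = Counter()
    for line in lines:
        match = INSTANCE_RE.search(line)
        if match is None:
            continue
        name = match.group(1)
        if "Schedule vm for HA" in line:
            scheduled[name] += 1
        elif "to be alive? true" in line:
            reported_alive[name] += 1
    names = scheduled.keys() | reported_alive.keys()
    return {name: (scheduled[name], reported_alive[name]) for name in names}


# Two lines copied from the excerpts in this report.
sample = [
    '2026-03-31 02:29:43,798 INFO ... Schedule vm for HA: VM instance '
    '{"id":91,"instanceName":"i-2-91-VM","state":"Running"...}',
    '2026-03-31 02:29:43,859 INFO ... KVMInvestigator found VM instance '
    '{"id":91,"instanceName":"i-2-91-VM","state":"Running"...} to be alive? true',
]
print(count_cancelled_ha(sample))  # {'i-2-91-VM': (1, 1)}
```

A VM whose two counts keep growing together across sync cycles would show the same loop described here.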
Final observed behavior:
- the VM remains in Running state in CloudStack/UI
- the VM is not transitioned to Stopped
- HA does not restart the VM
- the missing-domain / HA-scheduled / KVMInvestigator alive=true loop repeats continuously
versions
cloudstack-management 4.22.0.0
cloudstack-agent 4.22.0.0
libvirt 10.0.0-2ubuntu8.11
ubuntu 24.04 LTS
The steps to reproduce the bug
- Deploy a user VM on a KVM host with HA enabled.
- Confirm the VM is in `Running` state in CloudStack.
- On the KVM host, destroy the domain unexpectedly:
virsh destroy <domain-name>
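The host-side step can also be scripted when repeating the test. This is a minimal sketch with a hypothetical helper name; it assumes `virsh` is on the PATH of the KVM host and that the domain is transient, so that it disappears from `virsh list --all` once destroyed (the agent's "Domain not found" error above suggests this is the case here).

```python
import subprocess


def destroy_and_confirm(domain, run=subprocess.run):
    """Destroy a libvirt domain with virsh and report whether it is gone.

    The `run` parameter exists only so the function can be exercised
    without a real hypervisor; it defaults to subprocess.run.
    """
    # Forcefully stop the domain, as in the reproduction step above.
    run(["virsh", "destroy", domain], check=True)
    # List all domains (running or not) by name and look for the target.
    listing = run(
        ["virsh", "list", "--all", "--name"],
        check=True,
        capture_output=True,
        text=True,
    )
    remaining = {name.strip() for name in listing.stdout.splitlines() if name.strip()}
    return domain not in remaining
```

On the KVM host, `destroy_and_confirm("i-2-91-VM")` would return `True` once the domain no longer exists, after which the management server logs can be watched for the `PowerReportMissing` sequence.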
What to do about it?
No response