Skip to content

feat(vm_healthcheck): add VM health validation role#15

Open
stevefulme1 wants to merge 7 commits into
redhat-cop:mainfrom
stevefulme1:feat/vm-healthcheck-role
Open

feat(vm_healthcheck): add VM health validation role#15
stevefulme1 wants to merge 7 commits into
redhat-cop:mainfrom
stevefulme1:feat/vm-healthcheck-role

Conversation

@stevefulme1

Copy link
Copy Markdown
Contributor

Summary

Adds a new vm_healthcheck role that performs comprehensive health validation checks on Virtual Machines running in OpenShift Virtualization.

Checks performed

  • Status: VM phase (Running), Ready condition, guest agent presence, node assignment
  • Networking: Interface IP assignment, interface name consistency, masquerade pod network, bridge/multus attachment
  • Storage: PVC bound status, DataVolume succeeded status, volume attachment verification
  • Resources: CPU/memory requests vs limits, node capacity, overcommit detection

Deliverables

  • roles/vm_healthcheck/ - Full role with defaults, vars, tasks, templates, tests, meta, README
  • playbooks/vm_healthcheck.yml - Playbook to invoke the role
  • HTML report template for healthcheck results

Testing

  • ansible-lint --profile production passes with 0 errors (warnings are yaml[line-length] which is in warn_list)
  • All task names use FQCNs (kubernetes.core.k8s_info, ansible.builtin.set_fact, etc.)
  • Loop variables use __vm_healthcheck_ prefix per loop_var_prefix config
  • Code style matches existing vm_lifecycle role patterns

@sabre1041 sabre1041 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevefulme1 This is a very useful role. A few comments for consideration

  1. Are we only concerned about running VM's? Since VirtualMachineInstances are being targeted, only running VM's have these resources associated. All checks fail against a VM if it is not running. If the same checks can be performed against a VirtualMachine it might be a good fallback where an indicator could be added to denote VMI vs VM
  2. For the storage checks where there are, for example, volumes listed. There is inconsistent ordering between declared vs attached. It might be nice to have the items sorted alphabetically for consistency
  3. What about adding a timestamp to the report to state when the report was generated
  4. Might be good to check that the destination that the report will be created in is writable. In one of my tests, the destination was invalid and an error arose only after the processing took place
  5. Attend to the conflicted files

stevefulme1 and others added 6 commits July 1, 2026 09:39
- Handle stopped VMs gracefully: fall back to VirtualMachine data when
  VMI does not exist, report meaningful status instead of failing
- Add source indicator (vm/vmi) to all healthcheck results
- Sort declared and attached volume lists alphabetically for consistent
  comparison and display
- Add report generation timestamp to HTML report header and footer
- Validate report output directory exists and is writable before
  running healthcheck processing
@stevefulme1 stevefulme1 force-pushed the feat/vm-healthcheck-role branch from e559867 to d1bb24a Compare July 1, 2026 13:43
@stevefulme1

Copy link
Copy Markdown
Contributor Author

@sabre1041 Thanks for the thorough review! All five items addressed in the latest push:

  1. Stopped VMs / VMI fallback — The role now gracefully handles VMs without a running VMI. When no VMI exists, status checks report meaningful details (e.g., "VM stopped (printableStatus: Stopped)") instead of failing opaquely. A source field (vm or vmi) is added to every result so the report shows where data came from. Networking and resource checks that require a live VMI are skipped with a clear explanation rather than erroring.

  2. Volume ordering — Both _vm_healthcheck_declared_volumes and _vm_healthcheck_attached_volumes are now sorted alphabetically (| sort) before comparison and display, ensuring consistent ordering regardless of API return order.

  3. Report timestamp — Added _vm_healthcheck_report_timestamp (UTC ISO 8601) rendered in both the report header and footer.

  4. Report destination validation — Added a pre-flight check early in _healthcheck.yml that uses ansible.builtin.stat to verify the report output directory exists and is writable before any healthcheck processing begins.

  5. Conflicts — Rebased on latest upstream/main. Resolved CONTRIBUTING.md conflict by keeping the upstream version (which is more comprehensive).

@stevefulme1 stevefulme1 deployed to external-ci July 2, 2026 00:37 — with GitHub Actions Active

@sabre1041 sabre1041 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevefulme1 A few comments

  1. Some of the logic defined in the vm_resources.yml look a bit weird. Especially, surrounding overcommit. It is not checking the value of the resources but checking to see if resources or limits are specified
  2. Linting is failing mainly due to line length

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants