New API "checkVolume" to check and repair any leaks or issues reported by qemu-img check #8577
Conversation
@blueorangutan package
@blueorangutan package
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8476
Force-pushed from 6e5c36d to 13597b3
@blueorangutan package
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8478
@blueorangutan package
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8487
@harikrishna-patnala I might be missing something, but how will this new API be handled when called in a Xen or VMware environment? As you state, "Currently this is supported only for KVM". I am sure you implemented this somewhere, but all I can find is an implicit error and no graceful message.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
@blueorangutan package
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8729
@blueorangutan test
@harikrishna-patnala a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
sureshanaparti left a comment:
code lgtm
[SF] Trillian test result (tid-9314)
@blueorangutan package
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8738
@JoaoJandre is this ok for you now?
JoaoJandre left a comment:
CLGTM, didn't test it
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8753
@blueorangutan test
@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[SF] Trillian test result (tid-9332)
kiranchavala left a comment:
LGTM. Tested the checkVolume API with the following scenarios and it is working fine.
Introduced leaks as follows:
- Launched a VM with a root disk and a data disk
- Logged in to the VM and created a partition, formatted and mounted the disk
- Installed the fio tool (yum install fio -y)
- Executed the command:
fio --filename=/dev/vdb1 --direct=1 --rw=randwrite --bsrange=512-4k --ioengine=libaio --iodepth=32 --runtime=120 --numjobs=8 --time_based --group_reporting --name=iops-test-job --norandommap --allow_mounted_write=1
- Killed the VM process from the hypervisor
- Executed the checkVolume API to check for leaks and fix them
Tested the checkVolume API on a stopped VM and a detached volume.
Tested the checkVolume API on a running VM with an attached data disk.
Tested the checkVolume API after introducing leaks on the disks.
Tested the checkVolume API on an encrypted volume.
Tested the checkVolume API with different storage pool types (NFS/Local/PowerFlex).
Tested the checkVolume API with provisioning types (thin/fat).
Tested the checkVolume API with high-capacity volumes (1 TB).
Tested the checkVolume API with user-level and admin-level access.
Tested the checkVolume API during VM start and volume attach operations (global setting and storage-level setting volume.check.and.repair.leaks.before.use).
I just realised after merging that the base branch of the PR is 4.19 and not main. Is that an issue - should we revert? @harikrishna-patnala @DaanHoogland @kiranchavala
It is a new API, with no backwards incompatibility, and it is also an improvement. Not ideal, but I think it is fine. cc @sureshanaparti @JoaoJandre
New API "checkVolume" to check and repair any leaks or issues reported by qemu-img check (apache#8577)
* Introduced a new API checkVolumeAndRepair that allows users or admins to check and repair if any leaks observed. Currently this is supported only for KVM
* some fixes
* Added unit tests
* addressed review comments
* add repair volume while granting access
* Changed repair parameter to accept both leaks/all
* Introduced new global setting volume.check.and.repair.before.use to do volume check and repair before VM start or volume attach operations
* Added volume check and repair changes only during VM start and volume attach operations
* Refactored the names to look similar across the code
* Some code fixes
* remove unused code
* Renamed repair values
* Fixed unit tests
* changed version
* Address review comments
* Code refactored
* used volume name in logs
* Changed the API to Async and the setting scope to storage pool
* Fixed exit value handling with check volume command
* Fixed storage scope to the setting
* Fix volume format issues
* Refactored the log messages
* Fix formatting
Description
This PR introduces a new API, "checkVolume", that allows users or admins to check a volume and repair it if any leaks are observed. Currently this is supported only for KVM.
Doc PR link : apache/cloudstack-documentation#380
There are a few cases when VMs shut down uncleanly; volumes using qcow2 in particular can leak clusters. This may sometimes lead to volumes taking up much more space than they are supposed to. When we use the qcow2 format for thin provisioning and the volume size is pretty close to the actual formatted size, leaked clusters can run us out of space, so we need a way to check/repair.
To address this, we have introduced a new "checkVolume" API which takes a volume id and a repair parameter (possible values are leaks/all).
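As a rough sketch of what this corresponds to on the KVM host (the path below is a placeholder and the exact invocation lives in LibvirtCheckAndRepairVolumeCommandWrapper; this mapping is an illustrative assumption based on qemu-img's documented repair modes), the repair parameter lines up with qemu-img check's -r option:
qemu-img check /path/to/volume.qcow2             # check only, report leaks and errors
qemu-img check -r leaks /path/to/volume.qcow2    # repair leaked clusters only
qemu-img check -r all /path/to/volume.qcow2      # repair all kinds of errors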
API name: checkVolume
Parameters: id (the UUID of the volume) and repair (optional; possible values are leaks/all)
There is also an option to repair the volume during VM start or while attaching the volume to a VM. A new boolean global setting, volume.check.and.repair.leaks.before.use, has been introduced with a default of false.
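As an illustration (not part of the PR's own test run), the setting could be toggled with CloudMonkey via the standard updateConfiguration API; the per-pool form assumes the storage-pool scope is addressed through the storageid parameter, and the UUID is a placeholder:
(localcloud) 🐱 > update configuration name=volume.check.and.repair.leaks.before.use value=true
(localcloud) 🐱 > update configuration name=volume.check.and.repair.leaks.before.use value=true storageid=<storage-pool-uuid>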
STEPS TO REPRODUCE:
- Create a VM on local storage or NFS storage.
- Attach a data disk.
- Run a write benchmark on the data disk in the guest, e.g.:
fio --filename=/dev/vdb --direct=1 --rw=randwrite --bsrange=512-4k --ioengine=libaio --iodepth=32 --runtime=120 --numjobs=8 --time_based --group_reporting --name=iops-test-job --norandommap
- Immediately kill the VM (from the host, try virsh shutdown or "kill -9" of the qemu process).
- Run a check on the underlying qcow2 file and observe the "leaks" count (see the example after these steps).
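For example, assuming the volume file lives under the pool mount point (the path is a placeholder), either of the following reports the leak count; the JSON form uses the same field names that appear in the API responses below:
qemu-img check /mnt/<pool-uuid>/<volume-uuid>
qemu-img check --output=json /mnt/<pool-uuid>/<volume-uuid>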
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Screenshots (if appropriate):
How Has This Been Tested?
(localcloud) 🐱 > check volume id=55937826-2f08-414a-9eef-4c6b7d6fd3b1
{
  .
  .
  "volumecheckresult": {
    "allocated-clusters": "110",
    "check-errors": "0",
    "leaks": 73,
    "filename": "/mnt/e72364b6-eab0-369f-af0b-2ec8bed9d8ac/55937826-2f08-414a-9eef-4c6b7d6fd3b1",
    "format": "qcow2",
    "fragmented-clusters": "32",
    "image-end-offset": "7995392",
    "total-clusters": "131072"
  },
(localcloud) 🐱 > check volume id=55937826-2f08-414a-9eef-4c6b7d6fd3b1 repair=leaks
{
  "volumecheckresult": {
    "allocated-clusters": "110",
    "check-errors": "0",
    "leaks": 73,
    "filename": "/mnt/e72364b6-eab0-369f-af0b-2ec8bed9d8ac/55937826-2f08-414a-9eef-4c6b7d6fd3b1",
    "format": "qcow2",
    "fragmented-clusters": "32",
    "image-end-offset": "7995392",
    "total-clusters": "131072"
  },
  "volumerepairresult": {
    "allocated-clusters": "110",
    "check-errors": "0",
    "leaks-fixed": 73,
    "filename": "/mnt/e72364b6-eab0-369f-af0b-2ec8bed9d8ac/55937826-2f08-414a-9eef-4c6b7d6fd3b1",
    "format": "qcow2",
    "fragmented-clusters": "32",
    "image-end-offset": "7995392",
    "total-clusters": "131072"
  },
}
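For completeness, the same volume could also be repaired with repair=all (invocation only, not run here; per the description above, the response shape would be analogous to the repair=leaks call):
(localcloud) 🐱 > check volume id=55937826-2f08-414a-9eef-4c6b7d6fd3b1 repair=all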
How did you try to break this feature and the system with this change?