@harikrishna-patnala
Contributor

@harikrishna-patnala harikrishna-patnala commented Jan 30, 2024

Description

This PR introduces a new API, "checkVolume", that lets users or admins check a volume and repair any leaks that are found. Currently this is supported only on KVM.

Doc PR link : apache/cloudstack-documentation#380

When VMs shut down uncleanly, their volumes, particularly those using qcow2, can leak clusters. This can lead to volumes taking up much more space than they are supposed to. When qcow2 is used for thin provisioning and the volume size is already close to the actual formatted size, leaked clusters can run us out of space, so we need a way to check and repair volumes.

To address this, we have introduced a new "checkVolume" API, which takes a volume ID and an optional repair parameter (possible values: leaks or all).

API name: checkVolume
Parameters:

  • id : volume ID
  • repair : parameter to repair the volume, leaks or all are the possible values
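
As a rough illustration of the two parameters above, here is a minimal Python sketch that assembles and validates the request parameters. Only the API name and the `id` and `repair` parameters come from this description; the helper name `build_check_volume_params` and the `command` key are assumptions made for the example, and authentication, signing, and transport are left out entirely.

```python
from typing import Dict, Optional

# The two repair modes named in the PR description.
ALLOWED_REPAIR_VALUES = ("leaks", "all")


def build_check_volume_params(volume_id: str, repair: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: return request parameters for a checkVolume call.

    Rejects repair values other than "leaks" or "all". Omitting `repair`
    yields a check-only request.
    """
    if not volume_id:
        raise ValueError("volume id is required")
    # Assumption: the API name is passed as the "command" parameter.
    params = {"command": "checkVolume", "id": volume_id}
    if repair is not None:
        if repair not in ALLOWED_REPAIR_VALUES:
            raise ValueError(
                f"repair must be one of {ALLOWED_REPAIR_VALUES}, got {repair!r}"
            )
        params["repair"] = repair
    return params
```

Validating the value on the client side is only a convenience in this sketch; the server remains the authority on which values are accepted.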

There is also an option to repair the volume during VM start or when attaching the volume to a VM. This is controlled by a new boolean global setting,
volume.check.and.repair.leaks.before.use, which defaults to false.

STEPS TO REPRODUCE:

  1. Create a VM on local storage, or NFS storage.

  2. Attach a data disk.

  3. Run a write benchmark on the data disk in the guest, e.g.:
    fio --filename=/dev/vdb --direct=1 --rw=randwrite --bsrange=512-4k --ioengine=libaio --iodepth=32 --runtime=120 --numjobs=8 --time_based --group_reporting --name=iops-test-job --norandommap

  4. Immediately kill the VM (from the host, try virsh shutdown or "kill -9" on the qemu process).

  5. Run a check on the underlying qcow2 file and observe the "leaks" count:

# qemu-img check /var/lib/libvirt/images/26be20c7-b9d0-43f6-a76e-16c70737a0e0 --output=json 2>/dev/null
{
    "image-end-offset": 6442582016,
    "total-clusters": 163840,
    "check-errors": 0,
    "leaks": 124,
    "allocated-clusters": 98154,
    "filename": "/var/lib/libvirt/images/26be20c7-b9d0-43f6-a76e-16c70737a0e0",
    "format": "qcow2",
    "fragmented-clusters": 96135
}
  6. Repair the leaks:
# qemu-img check /var/lib/libvirt/images/26be20c7-b9d0-43f6-a76e-16c70737a0e0 --output=json -r leaks 2>/dev/null
{
    "image-end-offset": 6442582016,
    "total-clusters": 163840,
    "check-errors": 0,
    "leaks-fixed": 124,
    "allocated-clusters": 98154,
    "filename": "/var/lib/libvirt/images/26be20c7-b9d0-43f6-a76e-16c70737a0e0",
    "format": "qcow2",
    "fragmented-clusters": 96135
}
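
The before/after reports above can be compared programmatically. The following Python sketch parses `qemu-img check --output=json` style reports and estimates how much space the leaked clusters occupy. The function name `summarize_check` is made up for the example, and the 64 KiB cluster size is an assumption (it is the usual qcow2 default, but the image above may use a different one); only the JSON field names and the counts are taken from the output shown.

```python
import json

# Assumption: 64 KiB clusters (the common qcow2 default). Pass a different
# value if the image was created with another cluster size.
ASSUMED_CLUSTER_SIZE = 64 * 1024


def summarize_check(report_json: str, cluster_size: int = ASSUMED_CLUSTER_SIZE) -> dict:
    """Extract leak-related counters from a qemu-img check JSON report."""
    report = json.loads(report_json)
    # "leaks" appears in a plain check, "leaks-fixed" after a repair run.
    leaks = int(report.get("leaks", 0))
    return {
        "leaks": leaks,
        "leaks_fixed": int(report.get("leaks-fixed", 0)),
        "check_errors": int(report.get("check-errors", 0)),
        "estimated_leaked_bytes": leaks * cluster_size,
    }


# Trimmed copies of the two reports shown above.
before = summarize_check('{"check-errors": 0, "leaks": 124, "total-clusters": 163840}')
after = summarize_check('{"check-errors": 0, "leaks-fixed": 124, "total-clusters": 163840}')

print(before)
print(after)
```

Under the 64 KiB assumption, the 124 leaked clusters correspond to roughly 7.75 MiB of space that the repair makes reusable.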

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Screenshots (if appropriate):

How Has This Been Tested?

(localcloud) 🐱 > check volume id=55937826-2f08-414a-9eef-4c6b7d6fd3b1
{
.
.
"volumecheckresult": {
"allocated-clusters": "110",
"check-errors": "0",
"leaks": 73,
"filename": "/mnt/e72364b6-eab0-369f-af0b-2ec8bed9d8ac/55937826-2f08-414a-9eef-4c6b7d6fd3b1",
"format": "qcow2",
"fragmented-clusters": "32",
"image-end-offset": "7995392",
"total-clusters": "131072"
},

(localcloud) 🐱 > check volume id=55937826-2f08-414a-9eef-4c6b7d6fd3b1 repair=leaks
{
"volumecheckresult": {
"allocated-clusters": "110",
"check-errors": "0",
"leaks": 73,
"filename": "/mnt/e72364b6-eab0-369f-af0b-2ec8bed9d8ac/55937826-2f08-414a-9eef-4c6b7d6fd3b1",
"format": "qcow2",
"fragmented-clusters": "32",
"image-end-offset": "7995392",
"total-clusters": "131072"
},
"volumerepairresult": {
"allocated-clusters": "110",
"check-errors": "0",
"leaks-fixed": 73,
"filename": "/mnt/e72364b6-eab0-369f-af0b-2ec8bed9d8ac/55937826-2f08-414a-9eef-4c6b7d6fd3b1",
"format": "qcow2",
"fragmented-clusters": "32",
"image-end-offset": "7995392",
"total-clusters": "131072"
},
}
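
A caller could verify that a repair fixed everything the check found by comparing the two result blocks. This is a minimal Python sketch, assuming the response has already been decoded into a dictionary shaped like the output above; the function name `leaks_fully_repaired` is hypothetical. The output above shows some counters as strings and others as numbers, so the sketch coerces them with `int()`.

```python
def leaks_fully_repaired(response: dict) -> bool:
    """Return True when every leak reported by the check was fixed by the repair.

    Expects the "volumecheckresult" and "volumerepairresult" keys seen in the
    sample output. Missing blocks or counters are treated as zero.
    """
    check = response.get("volumecheckresult", {})
    repair = response.get("volumerepairresult", {})
    # Counters may arrive as strings ("0") or numbers (73), so normalise them.
    found = int(check.get("leaks", 0))
    fixed = int(repair.get("leaks-fixed", 0))
    return found == fixed


# Trimmed copy of the repair=leaks response shown above.
sample = {
    "volumecheckresult": {"check-errors": "0", "leaks": 73},
    "volumerepairresult": {"check-errors": "0", "leaks-fixed": 73},
}
print(leaks_fully_repaired(sample))
```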

How did you try to break this feature and the system with this change?

@harikrishna-patnala
Contributor Author

@blueorangutan package

@harikrishna-patnala harikrishna-patnala marked this pull request as ready for review January 30, 2024 13:27
@codecov

codecov bot commented Jan 30, 2024

Codecov Report

Attention: 298 lines in your changes are missing coverage. Please review.

Comparison is base (1a11311) 30.90% compared to head (30fa612) 30.91%.

Files | Patch % | Lines
...n/java/com/cloud/storage/VolumeApiServiceImpl.java | 33.96% | 69 Missing and 1 partial ⚠️
...s/src/main/java/com/cloud/utils/script/Script.java | 0.00% | 69 Missing ⚠️
...per/LibvirtCheckAndRepairVolumeCommandWrapper.java | 36.78% | 48 Missing and 7 partials ⚠️
...i/command/user/volume/CheckAndRepairVolumeCmd.java | 3.03% | 32 Missing ⚠️
...java/org/apache/cloudstack/utils/qemu/QemuImg.java | 0.00% | 18 Missing ⚠️
...e/cloudstack/storage/volume/VolumeServiceImpl.java | 57.57% | 10 Missing and 4 partials ⚠️
...ils/src/main/java/com/cloud/utils/StringUtils.java | 0.00% | 11 Missing ⚠️
.../agent/api/storage/CheckAndRepairVolumeAnswer.java | 42.85% | 8 Missing ⚠️
...agent/api/storage/CheckAndRepairVolumeCommand.java | 52.94% | 7 Missing and 1 partial ⚠️
.../java/com/cloud/vm/VmWorkCheckAndRepairVolume.java | 0.00% | 6 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##               4.19    #8577    +/-   ##
==========================================
  Coverage     30.90%   30.91%            
- Complexity    34202    34252    +50     
==========================================
  Files          5347     5353     +6     
  Lines        375621   376032   +411     
  Branches      54627    54684    +57     
==========================================
+ Hits         116093   116249   +156     
- Misses       244240   244490   +250     
- Partials      15288    15293     +5     
Flag Coverage Δ
simulator-marvin-tests 24.73% <4.07%> (-0.02%) ⬇️
uitests 4.39% <ø> (ø)
unit-tests 16.58% <24.94%> (+0.01%) ⬆️

@harikrishna-patnala
Contributor Author

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8476

@harikrishna-patnala
Contributor Author

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8478

@harikrishna-patnala
Contributor Author

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8487

@DaanHoogland
Contributor

@harikrishna-patnala I might be missing something, but how will this new API be handled when called in a Xen or VMware environment? As you state, "Currently this is supported only for KVM". I am sure you implemented this somewhere, but all I can find is an implicit error and no graceful message.

see https://github.com/apache/cloudstack/pull/8577/files#diff-63d1a7ba0fc6bbd393feffaaf961d192431e8d343280b7cdc8d3817c2d6b7f1cR2801

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@harikrishna-patnala harikrishna-patnala changed the base branch from main to 4.19 February 21, 2024 04:50
@harikrishna-patnala
Contributor Author

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@harikrishna-patnala harikrishna-patnala modified the milestones: 20.0.0, 4.19.1.0 Feb 21, 2024
@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8729

@harikrishna-patnala
Contributor Author

@blueorangutan test

@blueorangutan

@harikrishna-patnala a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Contributor

@sureshanaparti sureshanaparti left a comment

code lgtm

@blueorangutan

[SF] Trillian test result (tid-9314)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 46931 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8577-t9314-kvm-centos7.zip
Smoke tests completed. 127 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
test_01_condensed_drs_algorithm | Failure | 164.54 | test_cluster_drs.py
test_02_trigger_shutdown | Failure | 336.65 | test_safe_shutdown.py

@harikrishna-patnala
Contributor Author

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8738

@DaanHoogland
Contributor

@JoaoJandre is this ok for you now?
@kiranchavala do you have any qa work for this planned?

Contributor

@JoaoJandre JoaoJandre left a comment

CLGTM, didn't test it

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8753

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-9332)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43396 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8577-t9332-kvm-centos7.zip
Smoke tests completed. 129 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Contributor

@kiranchavala kiranchavala left a comment

LGTM. Tested the check volume API with the following scenarios and it's working fine.

Introduced leaks with the following steps:

  1. Launched a VM with a root disk and a data disk
  2. Logged in to the VM, created a partition, then formatted and mounted the disk
  3. Installed the fio tool ( yum install fio -y )
  4. Executed the command

fio --filename=/dev/vdb1 --direct=1 --rw=randwrite --bsrange=512-4k --ioengine=libaio --iodepth=32 --runtime=120 --numjobs=8 --time_based --group_reporting --name=iops-test-job --norandommap --allow_mounted_write=1

  5. Killed the VM process from the hypervisor
  6. Executed the check volume API to check for leaks and fix them

Tested the check volume API on a stopped VM and a detached volume
Tested the check volume API on a running VM with an attached data disk
Tested the check volume API after introducing leaks on the disks
Tested the check volume API on an encrypted volume
Tested the check volume API with storage pool types (NFS/Local/PowerFlex)
Tested the check volume API with provisioning types (Thin/Fat)
Tested the check volume API with high-capacity volumes (1 TB)
Tested the check volume API with user-level and admin-level access
Tested the check volume API during VM start and volume attach operations (global setting and storage-level setting > volume.check.and.repair.leaks.before.use)

@rohityadavcloud rohityadavcloud merged commit c462be1 into apache:4.19 Feb 29, 2024
@rohityadavcloud rohityadavcloud deleted the CheckVolumeAPI branch February 29, 2024 09:11
@rohityadavcloud
Member

I just realised after merging the base branch of the PR is 4.19 and not main, is that an issue - should we revert @harikrishna-patnala @DaanHoogland @kiranchavala ?

@DaanHoogland
Contributor

I just realised after merging the base branch of the PR is 4.19 and not main, is that an issue - should we revert @harikrishna-patnala @DaanHoogland @kiranchavala ?

It is a new API, but no backwards incompatibility and it is also an improvement. Not ideal, but I think it is fine. cc @sureshanaparti @JoaoJandre

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Mar 5, 2024
…d by qemu-img check (apache#8577)

* Introduced a new API checkVolumeAndRepair that allows users or admins to check and repair if any leaks observed.
Currently this is supported only for KVM

* some fixes

* Added unit tests

* addressed review comments

* add repair volume while granting access

* Changed repair parameter to accept both leaks/all

* Introduced new global setting volume.check.and.repair.before.use to do volume check and repair before VM start or volume attach operations

* Added volume check and repair changes only during VM start and volume attach operations

* Refactored the names to look similar across the code

* Some code fixes

* remove unused code

* Renamed repair values

* Fixed unit tests

* changed version

* Address review comments

* Code refactored

* used volume name in logs

* Changed the API to Async and the setting scope to storage pool

* Fixed exit value handling with check volume command

* Fixed storage scope to the setting

* Fix volume format issues

* Refactored the log messages

* Fix formatting