Importing a containerdisk onto a block volume loses sparseness #3614
Comments
@stefanha Is there a particular version of qemu-img that has this support, or has it been there a long time? We should make sure the version of qemu-img used in CDI supports this flag.

It is available starting from RHEL 8, Ubuntu 22.04, OpenSUSE Leap 15.4, Debian 10 (backports) or 11.
Super interesting, thanks for opening the issue! I am wondering how a certain test we have isn't catching this: containerized-data-importer/tests/import_test.go line 1643 (at 41b96ed), from #3213.
This du(1) command-line probably isn't working as expected on a block device: https://github.com/kubevirt/containerized-data-importer/blob/main/tests/framework/pvc.go#L552
This PR gets rid of the …

This explains it: SizeOnDisk being …

There is no generic way in Linux to query a block device to find out how many blocks are allocated, so SizeOnDisk will not have a useful value.
Maybe this trick will work: create a sparse file using truncate(1) and then create a corresponding loopback block device using losetup(8). The test would be able to look at the blocks allocated in the underlying sparse file to get an approximation of the number of blocks touched on the loopback block device, modulo file system effects like its block size.
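A minimal sketch of the sparse-file half of that trick (losetup itself needs root, so only the backing-file accounting is shown; file names are illustrative):

```shell
# A truncated file starts with no blocks allocated; writes to it (or to a
# loop device backed by it) show up in du's allocated-bytes count.
truncate -s 1G backing.img
alloc_before=$(du -B1 backing.img | cut -f1)
dd if=/dev/urandom of=backing.img bs=1M count=4 conv=notrunc status=none
alloc_after=$(du -B1 backing.img | cut -f1)
echo "allocated: $alloc_before -> $alloc_after bytes"
# With root, the loopback device for the import test would come from:
#   losetup --find --show backing.img
```

The before/after delta approximates how many blocks a sparse-aware importer actually touched.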
Can't we just use …?
If I understand correctly, the test is attempting to verify that the block device was written sparsely (zero blocks were skipped). Simply using dd to copy the block device to a file won't show whether the block device was written sparsely, so I don't think that approach works. You could first populate the block device with a pattern, import the containerdisk, and then check to see whether the pattern is still visible in blocks where the containerdisk is zero. That last step could be a single checksum comparison of the contents of the whole disk.
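The pattern-based check can be sketched with regular files standing in for the block device (names are illustrative); here `dd conv=sparse` plays the role of a sparse-aware writer:

```shell
# Pre-fill the "device" with a random pattern and keep a reference copy.
dd if=/dev/urandom of=target.img bs=1M count=8 status=none
cp target.img target.before
# Source disk: 8 MiB logical, only the first 1 MiB is non-zero.
dd if=/dev/urandom of=src.img bs=1M count=1 status=none
truncate -s 8M src.img
# A sparse-aware copy seeks over zero blocks instead of writing them...
dd if=src.img of=target.img bs=1M conv=sparse,notrunc status=none
# ...so the old pattern must survive wherever the source was zero.
tail -c +1048577 target.img    > tail.now
tail -c +1048577 target.before > tail.ref
cmp -s tail.now tail.ref && echo "pattern preserved: sparse write detected"
```

If the writer had blindly written every block, the comparison would fail because the zero region of the source would have clobbered the pattern.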
Looks like this issue is somehow hidden with ceph rbd (using the same 10Gi image):

```shell
# rook-ceph-toolbox pod
$ ceph df -f json | jq .pools[0].stats.bytes_used
2036678656
```
Ceph might be doing zero detection or deduplication? Even if this is the case, you should be able to see the issue by running the same qemu-img command-line as CDI under strace(1) and looking at the pattern of write syscalls.
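A rough sketch of that syscall-level comparison; the traced program here is a stand-in, while the real run would trace CDI's qemu-img convert invocation with and without the flag:

```shell
# Count write syscalls issued while copying a fully-zero source; comparing the
# counts for a sparse-aware and a naive writer makes the difference visible.
if command -v strace >/dev/null; then
  dd if=/dev/zero of=sparse-src.img bs=1M count=4 status=none
  strace -f -e trace=write -o strace.log \
    dd if=sparse-src.img of=sparse-dst.img bs=1M conv=sparse status=none || true
  writes=$(grep -c 'write(' strace.log || true)
else
  writes="strace not available"
fi
echo "write syscalls observed: $writes"
```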
/assign @noamasu |
Discussed this in the sig-storage meeting; we should be okay with just unit testing the cmd line args to qemu-img and trust that the rest is verified on the virt layer. I just think a functest for this is going to get too involved/coupled to a specific storage type |
This change updates the qemu-img convert command by adding the --target-is-zero flag along with the required -n option. When the target block device is pre-zeroed, the flag enables qemu-img to skip writing zero blocks, thereby reducing unnecessary I/O and speeding up sparse image conversions. This improvement addresses the performance concerns noted in kubevirt#3614. Signed-off-by: Noam Assouline <[email protected]>
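The flag pair can be exercised outside CDI; in this sketch a truncated file stands in for the pre-zeroed block device (all file names are placeholders, and --target-is-zero needs QEMU 5.0 or newer):

```shell
# -n tells qemu-img not to create/overwrite the target; --target-is-zero
# asserts the target already reads as zeros, so zero blocks need not be written.
if command -v qemu-img >/dev/null; then
  qemu-img create -q -f qcow2 demo-src.qcow2 64M   # empty source image
  truncate -s 64M demo-target.raw                  # stand-in pre-zeroed "device"
  qemu-img convert -n --target-is-zero -O raw demo-src.qcow2 demo-target.raw
  echo "bytes allocated on target: $(du -B1 demo-target.raw | cut -f1)"
else
  echo "qemu-img not installed; skipping"
fi
```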
A quick update. After performing various tests, here is what was observed:

- I/O: 3144 I/O (with) vs 677 I/O (without).
- Block device usage (an rbd): both configurations show a provisioned size of 10 GiB with 1.9 GiB in use.
- Block device usage (local, via a loop device): both methods resulted in a physical size of 1988940 blocks for the test image (presented as a loop device).

Incorporating the --target-is-zero flag (along with -n to ensure only non-zero sectors are written) effectively reduces I/O operations and the number of sectors written, regardless of the destination device type or whether a sparseness mechanism is available at the block driver level. For now, a PR was made just to add these flags, but further verification is needed to determine whether: …
The loop device results are surprising to me. I expected the number of allocated blocks to be less for …
Maybe it has to do with the filesystem test.img was created on? Anyway, this is most observable with LVM storage when invoking …
Testing on thinly provisioned LVM ("thinlv"): indeed, we easily see the difference. And even this time, the size of test.img (the loop0 device that has the LVM on it) is 11GB when copying to the LVM. Interesting, so it looks like the raw /dev/loop0 device will preserve the sparseness of the truncated file (even without --target-is-zero), but not when it has LVM on it.
Short summary from the sig-storage meeting: we should be okay to assume we get a device fit for --target-is-zero.
@akalenyu and I came across an issue specific to preserving sparseness on LVM thin LVs. Using …, the following test looks at the ability to preserve sparseness on LVM thin LVs:

Relevant output: …

And strace.log contains: …

The bulk of the data was allocated due to qemu-img convert. The reason was that fallocate(2) failed to write zeroes with unmap, and qemu-img fell back to writing zero buffers to the device.
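The efficient path qemu-img attempts can be poked at from the shell on a regular file (illustrative names): on a filesystem that supports hole punching, the allocated size drops while the logical size stays put, which is exactly the deallocating zero-write that failed on the thin LV.

```shell
# Allocate 8 MiB of real data, then try to deallocate the first half the way a
# zero-with-unmap would; the logical size must stay 8 MiB either way.
dd if=/dev/urandom of=punch-demo.img bs=1M count=8 status=none
size_before=$(stat -c %s punch-demo.img)
fallocate --punch-hole --offset 0 --length 4M punch-demo.img \
  || echo "punch-hole not supported here; a writer would fall back to zero buffers"
size_after=$(stat -c %s punch-demo.img)
echo "logical size: $size_before -> $size_after"
du -B1 punch-demo.img   # allocated bytes shrink only if the punch succeeded
```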
What happened:
Importing a containerdisk onto a block volume loses sparseness. When I imported the centos-stream:9 containerdisk, which only uses 2 GB of non-zero data, onto an empty 10 GB block volume, all 10 GB were written by CDI. Preallocation was not enabled.
What you expected to happen:
Only the non-zero data should be written to the block volume. This saves space on the underlying storage.
How to reproduce it (as minimally and precisely as possible):
Create a DataVolume from the YAML below and observe the amount of storage allocated. I used KubeSAN as the CSI driver, so the LVM lvs command can be used to see the thin provisioned storage usage. If you don't have thin provisioned storage you could use I/O stats or tracing to determine how much data is being written.

Additional context:
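The referenced YAML did not survive this capture; a DataVolume of the kind described (registry source, 10 Gi block volume) might look like the following, with the name and image URL being assumptions:

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9          # assumed name
spec:
  source:
    registry:
      url: "docker://quay.io/containerdisks/centos-stream:9"  # assumed containerdisk
  storage:
    volumeMode: Block           # import onto a raw block volume
    resources:
      requests:
        storage: 10Gi
```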
I discussed this with @aglitke and we looked at the qemu-img command that is invoked: …

Adding the --target-is-zero option should avoid writing every block in the target block volume.

If there are concerns that some new block volumes come uninitialized (blocks not zeroed), then it should be possible to run blkdiscard --zeroout /path/to/block/device before invoking qemu-img with --target-is-zero. I have not tested this, but blkdiscard should zero the device efficiently and fall back to writing zero buffers on old hardware. On modern devices this would still be faster and preserve sparseness compared to writing all zeroes. On old devices it would be slower, depending on how much non-zero data the input disk image has.

Environment:
- kubectl get deployments cdi-deployment -o yaml: 4.17.3
- kubectl version: v1.30.5
- uname -a: N/A