Skip to content

Volume Snapshot for DATA disk error timed out after 3600 #7593

@tschacara

Description

@tschacara

ISSUE TYPE

Bug Report

COMPONENT NAME

Management, Storage, KVM

CLOUDSTACK VERSION

4.17.2.0

CONFIGURATION

1 X Management Server
2 X Primary Storage´s 
1 X Secondary Storage
3 X KVM Server.
Advanced Network - Default Security Group.

OS / ENVIRONMENT

Both Management, KVM and Storage´s run Rocklinux 8.7

SUMMARY

Recurring volume snapshots fails and times out after 3600 seconds.

STEPS TO REPRODUCE

Scheduler DATA Volume Snapshot VM (1TB) while VM is online.

LOG HERE

2023-06-05 15:56:57,212 DEBUG [c.c.s.s.SnapshotSchedulerImpl] (SnapshotPollTask:ctx-f75b0142) (logid:1000c112) Got 0 snapshots to be executed at 2023-06-05 18:56:57 GMT
2023-06-05 15:56:59,143 DEBUG [o.a.c.s.d.d.CloudStackPrimaryDataStoreDriverImpl] (Work-Job-Executor-1:ctx-b15acf5a job-15633/job-15636 ctx-a16b6ca1) (logid:8443133e) Failed to take snapshot: 375
        at org.apache.cloudstack.storage.snapshot.SnapshotServiceImpl.takeSnapshot(SnapshotServiceImpl.java:208)
        at org.apache.cloudstack.storage.snapshot.DefaultSnapshotStrategy.takeSnapshot(DefaultSnapshotStrategy.java:439)
        at com.cloud.storage.snapshot.SnapshotManagerImpl.takeSnapshot(SnapshotManagerImpl.java:1240)
2023-06-05 15:56:59,147 DEBUG [o.a.c.s.s.SnapshotServiceImpl] (Work-Job-Executor-1:ctx-b15acf5a job-15633/job-15636 ctx-a16b6ca1) (logid:8443133e) create snapshot FileServer_DATA-414_20230605175657 failed: com.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to Agent:15, com.cloud.exception.OperationTimedoutException: Commands 4851502698584866841 to Host 15 timed out after 3600
2023-06-05 15:56:59,173 DEBUG [o.a.c.s.s.DefaultSnapshotStrategy] (Work-Job-Executor-1:ctx-b15acf5a job-15633/job-15636 ctx-a16b6ca1) (logid:8443133e) Failed to take snapshot: com.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to Agent:15, com.cloud.exception.OperationTimedoutException: Commands 4851502698584866841 to Host 15 timed out after 3600
2023-06-05 15:56:59,188 DEBUG [c.c.s.s.SnapshotManagerImpl] (Work-Job-Executor-1:ctx-b15acf5a job-15633/job-15636 ctx-a16b6ca1) (logid:8443133e) Failed to create snapshotcom.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to Agent:15, com.cloud.exception.OperationTimedoutException: Commands 4851502698584866841 to Host 15 timed out after 3600
2023-06-05 15:56:59,200 DEBUG [c.c.r.ResourceLimitManagerImpl] (Work-Job-Executor-1:ctx-b15acf5a job-15633/job-15636 ctx-a16b6ca1) (logid:8443133e) Updating resource Type = snapshot count for Account = 219 Operation = decreasing Amount = 1
2023-06-05 15:56:59,232 ERROR [o.a.c.s.v.VolumeServiceImpl] (Work-Job-Executor-1:ctx-b15acf5a job-15633/job-15636 ctx-a16b6ca1) (logid:8443133e) Take snapshot: 1191 failed
        at org.apache.cloudstack.storage.snapshot.DefaultSnapshotStrategy.takeSnapshot(DefaultSnapshotStrategy.java:442)
        at com.cloud.storage.snapshot.SnapshotManagerImpl.takeSnapshot(SnapshotManagerImpl.java:1240)
        at org.apache.cloudstack.storage.snapshot.DefaultSnapshotStrategy.takeSnapshot(DefaultSnapshotStrategy.java:442)
        at com.cloud.storage.snapshot.SnapshotManagerImpl.takeSnapshot(SnapshotManagerImpl.java:1240)
2023-06-05 15:56:59,303 ERROR [o.a.c.a.c.u.s.CreateSnapshotCmd] (API-Job-Executor-1:ctx-225af267 job-15633 ctx-cb317a27) (logid:8443133e) Failed to create snapshot due to an internal error creating snapshot for volume d2f310f4-af8e-41fa-a94d-a1863d566503
        at org.apache.cloudstack.api.command.user.snapshot.CreateSnapshotCmd.execute(CreateSnapshotCmd.java:219)
        at org.apache.cloudstack.storage.snapshot.DefaultSnapshotStrategy.takeSnapshot(DefaultSnapshotStrategy.java:442)
        at com.cloud.storage.snapshot.SnapshotManagerImpl.takeSnapshot(SnapshotManagerImpl.java:1240)
2023-06-05 15:56:59,322 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-1:ctx-225af267 job-15633) (logid:8443133e) Complete async job-15633, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Failed to create snapshot due to an internal error creating snapshot for volume d2f310f4-af8e-41fa-a94d-a1863d566503"}
2023-06-05 15:56:59,328 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-1:ctx-225af267 job-15633) (logid:8443133e) Done executing org.apache.cloudstack.api.command.user.snapshot.CreateSnapshotCmd for job-15633

EXPECTED RESULTS

Volume snapshot is created and transported to secondary storage.

ACTUAL RESULTS

Volume DATA snapshot process can successfully create VM Snapshot. 
Checking the logs, the operation is being timeout after 3600 seconds.
I noticed that even getting the error in the ACS log and in the UI. 
The process is interrupted by the ACS in 3600 seconds and does not copy the data to the secondary storage. 
Even receiving the error, the snapshot of the volume runs in the background, but is not transferred, giving the impression that the hypervisor continued to execute what was requested in that step, when it received the timeout it does not execute the next steps to complete the request .

SNAPSHOTS

virsh snapshot-list i-219-414-VM
 Name                                   Creation Time               State
-----------------------------------------------------------------------------------
 19290884-7bd7-4b29-863a-e2400f4db35f   2023-06-03 14:23:01 -0300   disk-snapshot
 3de38bcb-8e2c-4616-9ade-66524eae41b7   2023-05-27 00:08:01 -0300   disk-snapshot
 40a49d8b-294b-4139-97d1-3fb3d5318b5e   2023-05-20 00:08:01 -0300   disk-snapshot
 4561cfde-8e21-495c-878a-e3511b3a1b0b   2023-05-27 00:03:01 -0300   disk-snapshot
 60f9e8ea-a15c-4c2d-a5f9-320670c44c48   2023-06-05 14:56:59 -0300   disk-snapshot
 6cb5b473-9b47-42bb-90ce-99871ed42517   2023-06-03 17:44:44 -0300   disk-snapshot
 6d9f0c1a-15ff-4e9c-b739-95406ae86649   2023-05-15 08:15:40 -0300   disk-snapshot
 6e4efb0b-f17e-4488-b1a4-3b2cc8053c44   2023-05-06 00:00:38 -0300   disk-snapshot
 94d85369-9914-4ed2-9263-04823f992c38   2023-06-05 12:03:54 -0300   disk-snapshot
 9e3466e2-5479-4abd-8003-f37336234799   2023-06-03 12:18:01 -0300   disk-snapshot
 a722a007-0a8f-447c-96df-14e38360dfc5   2023-05-08 10:20:38 -0300   disk-snapshot
 af3236fc-a865-4916-bfcf-3433de227a70   2023-06-03 00:03:01 -0300   disk-snapshot
 b54669a5-d7bc-46c0-9fda-f7cf822ad512   2023-05-15 08:20:40 -0300   disk-snapshot
 cb237c76-0adb-4288-8a0d-699ff4e63d92   2023-06-03 00:12:11 -0300   disk-snapshot
 d61caeaa-5a0c-4b5e-852b-6069a5136bf3   2023-05-20 00:03:01 -0300   disk-snapshot

/mnt/a6c33213-6d0f-365a-845c-021a5b5c2aed/snapshots
drwxr-xr-x. 2 root root 4.0K Jun  5 14:56 .
drwxr-xr-x. 4 root root 4.0K Jun  5 14:56 ..
-rw-r--r--. 1 root root 217G Jun  3 16:26 19290884-7bd7-4b29-863a-e2400f4db35f
-rw-r--r--. 1 root root 243G Jun  5 17:08 60f9e8ea-a15c-4c2d-a5f9-320670c44c48
-rw-r--r--. 1 root root 217G Jun  3 19:46 6cb5b473-9b47-42bb-90ce-99871ed42517
-rw-r--r--. 1 root root 278G Jun  5 14:37 94d85369-9914-4ed2-9263-04823f992c38
-rw-r--r--. 1 root root 217G Jun  3 14:21 9e3466e2-5479-4abd-8003-f37336234799
-rw-r--r--. 1 root root 217G Jun  3 02:14 cb237c76-0adb-4288-8a0d-699ff4e63d92

GLOBAL SETTINGS

job.cancel.threshold.minutes	        Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long	Advanced	default(60)     set(1440)
copy.volume.wait	                In second, timeout for copy volume command	                                                Storage	default(10800)  set(86400)
backup.snapshot.wait	                In second, timeout for BackupSnapshotCommand	                                                Storage	default(21600)  set(86400)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions