In bug 1857347 we tried to fix a case where a libvirt block commit job failed
with an unrecoverable error. Unfortunately the fix was not correct and made
the situation even worse (bug 1945675). The fix was reverted, and vdsm now
retries the pivot after unexpected errors.
Retrying proved very useful for mitigating temporary errors, for example
bug 1945635, where the libvirt block job flips between the "ready" and
"standby" states. Testing shows that in all cases the pivot succeeded on the
second retry.
However, if the libvirt error is not temporary, retrying will not help and
the operation will never complete. In this case vdsm needs to abort the
current libvirt block job and fail the merge operation.
We don't have a way to detect an unrecoverable error in libvirt: the error
is typically caused by a bug in qemu or libvirt, so libvirt reports an
internal error for all unexpected cases.
The only way to tell whether the error is recoverable is to retry the
operation and fail after several retries.
I think the best way to fix this is:
- Keep the cleanup method in the job (e.g. "pivot", "abort").
- Keep the number of pivot attempts in the job (like extend attempts).
- When pivot fails, increase the pivot attempt counter.
- When starting cleanup, if the pivot attempt counter exceeds the maximum
  value, change the job cleanup method to "abort". From this point, the job
  will try to abort the libvirt block job without the pivot flag.
- There is no limit on the number of abort attempts; we must not
  leave the libvirt block job running.
Expected flow, starting at the point we start the cleanup, assuming a
maximum of 3 pivot attempts (the actual number of retries may need to be
larger):
00:00 try to pivot, fail: wait for next update
00:15 try to pivot, fail: wait for next update
00:30 try to pivot, fail: switch job to cleanup="abort"
00:45 try to abort, fail: wait for next update
01:00 try to abort, fail: wait for next update
01:15 try to abort, success: untrack job
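
The proposal and the flow above could be sketched roughly as follows. This is only an illustration, not vdsm code: `Job`, `try_cleanup`, `BlockJobError`, `MAX_PIVOT_ATTEMPTS` and the `abort_block_job` callable are hypothetical names, and the sketch assumes the limit counts as reached once the counter equals the maximum, so that a maximum of 3 allows exactly three pivot attempts as in the flow.

```python
from dataclasses import dataclass

# All names below are hypothetical, chosen for illustration only.
CLEANUP_PIVOT = "pivot"
CLEANUP_ABORT = "abort"
MAX_PIVOT_ATTEMPTS = 3  # assumed; the real limit may need to be larger


class BlockJobError(Exception):
    """Stand-in for an unexpected (internal) libvirt error."""


@dataclass
class Job:
    cleanup: str = CLEANUP_PIVOT  # current cleanup method
    pivot_attempts: int = 0       # failed pivot attempts so far


def try_cleanup(job, abort_block_job):
    """Run one cleanup attempt, once per update cycle.

    abort_block_job(pivot=...) is a hypothetical callable wrapping the
    libvirt call; in libvirt the pivot flag is
    VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT. It raises BlockJobError on failure.

    Returns True when the libvirt block job is gone and the job can be
    untracked, False when the caller should wait for the next update.
    """
    # After too many failed pivots, give up on pivoting and abort instead.
    if (job.cleanup == CLEANUP_PIVOT
            and job.pivot_attempts >= MAX_PIVOT_ATTEMPTS):
        job.cleanup = CLEANUP_ABORT

    pivot = job.cleanup == CLEANUP_PIVOT
    try:
        abort_block_job(pivot=pivot)
    except BlockJobError:
        if pivot:
            job.pivot_attempts += 1
        # Abort attempts are not counted: they are retried without limit
        # so the libvirt block job is never left running.
        return False
    return True
```

Calling `try_cleanup` on every update with a callable that always fails to pivot produces three pivot attempts followed by abort attempts until one succeeds. A real implementation would also need to fail the merge operation once the abort succeeds.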
More information:
It would be better if we limit the number of retries, but the only case
where it can help is a libvirt bug, and in this case trying to stop the
pivot attempt and abort the merge may also fail.
Original bug: https://bugzilla.redhat.com/1949470