Limit the number of pivot retries #197

Open
ahadas opened this issue May 23, 2022 · 0 comments
Labels
enhancement (Enhancing the system by adding new feature or improving performance or reliability), storage

ahadas commented May 23, 2022

In bug 1857347 we tried to fix a case where a libvirt block commit job failed
with an unrecoverable error. Unfortunately the fix was incorrect and made the
situation even worse (bug 1945675). The fix was reverted, and vdsm now
retries the pivot after unexpected errors.

Retrying proved very useful for mitigating temporary errors, for example
bug 1945635, where the libvirt block job flips between the "ready" and
"standby" states. Testing shows that in all cases the pivot succeeded on the
second retry.

However, if the libvirt error is not temporary, retrying will not help and
the operation will never complete. In this case vdsm needs to abort the
current libvirt block job and fail the merge operation.

We don't have a way to detect an unrecoverable error in libvirt. Such an error
is typically caused by a bug in qemu or libvirt, so libvirt reports an internal
error for all unexpected cases.

The only way to tell if the error is recoverable is to retry the operation,
and fail after several retries.

I think the best way to fix this is:

  • Keep the cleanup method in the job (e.g. "pivot", "abort")
  • Keep the number of pivot attempts in the job (like extend attempts)
  • When pivot fails, increase the pivot attempt counter.
  • When starting cleanup, if the pivot attempt counter exceeds the maximum
    value, change the job cleanup method to "abort". From this point, the job
    will try to abort the libvirt block job without the pivot flag.
  • There is no limit on the number of abort attempts; we must not
    leave the libvirt block job running.
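
The bookkeeping proposed in the list above could look roughly like the sketch below. This is only an illustration: `Job`, `pivot_failed`, `wants_pivot_flag` and `MAX_PIVOT_ATTEMPTS` are hypothetical names, not existing vdsm code, and the limit of 3 is an assumed value. The sketch switches to "abort" as soon as the counter reaches the limit, which is one possible reading of the rule.

```python
from dataclasses import dataclass

# Hypothetical names and values; the real vdsm job class may differ.
CLEANUP_PIVOT = "pivot"
CLEANUP_ABORT = "abort"
MAX_PIVOT_ATTEMPTS = 3  # assumed; the actual limit may need to be larger


@dataclass
class Job:
    # Cleanup method kept in the job, as proposed in the issue.
    cleanup: str = CLEANUP_PIVOT
    # Number of failed pivot attempts so far.
    pivot_attempts: int = 0


def pivot_failed(job):
    """Record a failed pivot and switch to abort once the limit is reached."""
    job.pivot_attempts += 1
    if job.pivot_attempts >= MAX_PIVOT_ATTEMPTS:
        job.cleanup = CLEANUP_ABORT


def wants_pivot_flag(job):
    """Return True while cleanup should still ask libvirt to pivot."""
    return job.cleanup == CLEANUP_PIVOT
```

Abort attempts are deliberately not counted here, since the job must keep trying until the libvirt block job is gone.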

Expected flow, starting at the point we start the cleanup, assuming a
maximum of 3 pivot attempts (the actual number of retries may need to be
larger):

00:00 try to pivot, fail: wait for next update
00:15 try to pivot, fail: wait for next update
00:30 try to pivot, fail: switch job to cleanup="abort"
00:45 try to abort, fail: wait for next update
01:00 try to abort, fail: wait for next update
01:15 try to abort, success: untrack job
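
To sanity check the flow above, here is a small simulation that reproduces the same timeline from a list of per-update outcomes. It is a sketch only: `simulate` and its parameters are made up for illustration, the 15 second interval is taken from the timestamps above, and no libvirt call is made.

```python
def simulate(outcomes, max_pivot_attempts=3, interval=15):
    """Return timeline lines for per-update outcomes (True means success)."""
    lines = []
    cleanup = "pivot"
    pivot_attempts = 0
    for tick, ok in enumerate(outcomes):
        seconds = tick * interval
        stamp = f"{seconds // 60:02d}:{seconds % 60:02d}"
        if ok:
            lines.append(f"{stamp} try to {cleanup}, success: untrack job")
            break
        if cleanup == "pivot":
            pivot_attempts += 1
            if pivot_attempts >= max_pivot_attempts:
                # Pivot limit reached: later updates abort without pivot.
                lines.append(
                    f'{stamp} try to pivot, fail: switch job to cleanup="abort"')
                cleanup = "abort"
                continue
        # Abort failures are never counted; we keep retrying.
        lines.append(f"{stamp} try to {cleanup}, fail: wait for next update")
    return lines


# Five failures followed by one success, as in the expected flow.
for line in simulate([False] * 5 + [True]):
    print(line)
```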

More information:
It would be better to limit the number of retries, but the only case
where this can help is a libvirt bug, and in that case trying to stop the
pivot attempt and abort the merge may also fail.

Original bug: https://bugzilla.redhat.com/1949470

ahadas added the storage label May 23, 2022
nirs added the enhancement label May 23, 2022