Skip to content

Retry instances not running add new Execution list endpoint #793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

olethanh
Copy link
Collaborator

@olethanh olethanh commented Apr 10, 2025

Jira ticket: ALEPH-287

This PR improve how Instance are started, making it more reliable and properly cleaning when they fail to start. It also allow to start again the VM if it failed to start the first time.

Changes introduced by this PR:

  1. Decouple instance wait_for_init from allocation. It now return directly after starting the VM controller and run wait_for_init in the background. wait_for_init has been renamed wait_for_boot in VM
  2. Stop the controller when the Instance fail to start. Before if the instance failed to respond to ping the Instance was reported as not running and removed from the list but the Instance systemd controller was still. This tied up resource and prevented the VM from starting again. Also it caused issue with the network.
  3. Introduce a new endpoint /v2/about/executions/list which also include the starting, started, end times etc.. so we can inspect the VM state. and includ non running VM It will allow us in the future to improve the clients
  4. Fix VM inconsistant state calculation between the boot and the end of boot.
  5. Properly clean up ressource if the VM crash and we tried to start it again. Earlier if it was not is_running we just started a new one regardless if it was shutting down or crashed and the resource were not cleaned.
  6. Save debug token inside a protected file so we can still access it if the log rotated
  7. Improve the instances tests:
    • Reduce the problem of contamination of settings between tests
    • ensure we tests in a separat temp dir we don't reuse the cache. Before that the tests missed issue where ressource could not be downloaded because of a coding error but the resource was already cached
    • Better output and debug info

Self proofreading checklist

  • The new code clear, easy to read and well commented.
  • New code does not duplicate the functions of builtin or popular libraries.
  • An LLM was used to review the new code and look for simplifications.
  • New classes and functions contain docstrings explaining what they provide.
  • All new code is covered by relevant tests.
  • Documentation has been updated regarding these changes.
  • Dependencies update in the project.toml have been mirrored in the Debian package build script packaging/Makefile

Changes

In addition to what was listed above, others refactor changes:

  • in tests Setup webapp take the vm_pool argument
  • vm_pool don't take the loop anymore (not needed since Python 3.10)
  • Execution call .enable_and_start and .stop_and_disable, not VmPool
  • .enable_and_start is now async
  • Remove some unused code
  • Better debugging output

How to test

Allocate VM, try allocating VM that fail to start and reallocate again. Kill VM during boot

@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch 9 times, most recently from 7694fb3 to f5d096f Compare April 15, 2025 08:41
Copy link

codecov bot commented Apr 15, 2025

Codecov Report

Attention: Patch coverage is 80.08130% with 49 lines in your changes missing coverage. Please review.

Project coverage is 65.02%. Comparing base (c10d8f4) to head (9f1acdd).

Files with missing lines Patch % Lines
src/aleph/vm/models.py 63.46% 15 Missing and 4 partials ⚠️
src/aleph/vm/orchestrator/run.py 0.00% 13 Missing ⚠️
src/aleph/vm/pool.py 54.54% 4 Missing and 1 partial ⚠️
src/aleph/vm/orchestrator/supervisor.py 42.85% 4 Missing ⚠️
tests/supervisor/test_qemu_instance.py 95.08% 2 Missing and 1 partial ⚠️
src/aleph/vm/orchestrator/cli.py 0.00% 2 Missing ⚠️
src/aleph/vm/hypervisors/firecracker/microvm.py 0.00% 1 Missing ⚠️
tests/supervisor/test_instance.py 92.85% 1 Missing ⚠️
tests/supervisor/views/test_operator.py 93.75% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #793      +/-   ##
==========================================
+ Coverage   64.54%   65.02%   +0.48%     
==========================================
  Files          78       78              
  Lines        7093     7180      +87     
  Branches      598      599       +1     
==========================================
+ Hits         4578     4669      +91     
- Misses       2315     2318       +3     
+ Partials      200      193       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch from f5d096f to f38053f Compare April 15, 2025 09:46
@olethanh olethanh marked this pull request as ready for review April 15, 2025 10:07
@olethanh
Copy link
Collaborator Author

I will rework the commit cleanly and update the desc but for me this is ready for review

@olethanh olethanh requested a review from nesitor April 15, 2025 10:07
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch 2 times, most recently from 50c0f9b to 637df1a Compare April 15, 2025 12:54
@olethanh olethanh changed the title Ol aleph 287 retry instance Retry instances not running add new Execution list endpoint Apr 15, 2025
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch 7 times, most recently from b4f1398 to d50c807 Compare April 22, 2025 13:48
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch from d50c807 to e5ec35a Compare April 23, 2025 12:48
Wait for boot, clean up  ressources if boot failed
Add a new executions list endpoint with more information on the state of the VM to be used by client
so we can better inform them on their instance state

This new endpoint list all executions in pool running or not
(but terminated VM are removed from the pool)
Handle VM that are stopping and starting before creating a new one
@olethanh olethanh force-pushed the ol-ALEPH-287-retry-instance branch from e5ec35a to 9f1acdd Compare April 24, 2025 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants