Retry instances not running add new Execution list endpoint #793

olethanh · 2025-04-10T12:14:46Z

Jira ticket: ALEPH-287

This PR improves how Instances are started, making startup more reliable and properly cleaning up when they fail to start. It also allows the VM to be started again if it failed to start the first time.

Changes introduced by this PR:

Decouple instance wait_for_init from allocation. Allocation now returns directly after starting the VM controller and runs wait_for_init in the background. wait_for_init has been renamed wait_for_boot in VM.
Stop the controller when the Instance fails to start. Previously, if the instance failed to respond to ping, the Instance was reported as not running and removed from the list, but the Instance systemd controller was still running. This tied up resources and prevented the VM from starting again. It also caused issues with the network.
Introduce a new endpoint /v2/about/executions/list, which also includes the starting, started and end times, etc., so we can inspect the VM state. It also includes non-running VMs. It will allow us to improve the clients in the future.
Fix inconsistent VM state calculation between the boot and the end of boot.
Properly clean up resources if the VM crashed and we try to start it again. Previously, if it was not is_running, we just started a new one regardless of whether it was shutting down or had crashed, and the resources were not cleaned up.
Save the debug token inside a protected file so we can still access it if the log has rotated.
Improve the instance tests:
- Reduce the problem of settings contamination between tests
- Ensure tests run in a separate temp dir so we don't reuse the cache. Previously the tests missed issues where a resource could not be downloaded because of a coding error, since the resource was already cached
- Better output and debug info
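
The first two items above can be illustrated with a minimal asyncio sketch. It is not the code from this PR: `FakeController`, `allocate`, `watch_boot` and the `ping` callable are hypothetical names, and only `wait_for_boot` is borrowed from the description. It assumes the controller exposes async `start()` and `stop()` methods.

```python
import asyncio


class FakeController:
    """Hypothetical stand-in for the systemd controller of one instance."""

    def __init__(self) -> None:
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False


async def wait_for_boot(ping, attempts: int = 3, delay: float = 0.01) -> bool:
    """Poll `ping` until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if await ping():
            return True
        await asyncio.sleep(delay)
    return False


async def allocate(controller: FakeController, ping) -> asyncio.Task:
    """Start the controller and return at once; boot is checked in the background."""
    await controller.start()

    async def watch_boot() -> bool:
        booted = await wait_for_boot(ping)
        if not booted:
            # Boot failed: stop the controller so it does not tie up
            # resources or block a later retry.
            await controller.stop()
        return booted

    return asyncio.create_task(watch_boot())


async def demo() -> tuple[bool, bool]:
    async def never_answers() -> bool:
        return False  # simulates an instance that never responds to ping

    controller = FakeController()
    task = await allocate(controller, never_answers)
    running_after_allocate = controller.running  # allocation did not wait for boot
    await task
    return running_after_allocate, controller.running


result = asyncio.run(demo())
```

In this sketch the controller is reported as running right after allocation and is stopped once the background boot check gives up, which is the behaviour the description aims for.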

Self proofreading checklist

The new code is clear, easy to read and well commented.
New code does not duplicate the functions of builtin or popular libraries.
An LLM was used to review the new code and look for simplifications.
New classes and functions contain docstrings explaining what they provide.
All new code is covered by relevant tests.
Documentation has been updated regarding these changes.
Dependency updates in the project.toml have been mirrored in the Debian package build script packaging/Makefile

Changes

In addition to what is listed above, other refactoring changes:

In tests, setup_webapp takes the vm_pool argument
vm_pool no longer takes the loop (not needed since Python 3.10)
Execution calls .enable_and_start and .stop_and_disable, not VmPool
.enable_and_start is now async
Remove some unused code
Better debugging output

How to test

Allocate a VM, try allocating a VM that fails to start, then reallocate it. Kill a VM during boot.

codecov · 2025-04-15T08:52:50Z

Codecov Report

Attention: Patch coverage is 80.08130% with 49 lines in your changes missing coverage. Please review.

Project coverage is 65.02%. Comparing base (c10d8f4) to head (9f1acdd).

Files with missing lines	Patch %	Lines
src/aleph/vm/models.py	63.46%	15 Missing and 4 partials ⚠️
src/aleph/vm/orchestrator/run.py	0.00%	13 Missing ⚠️
src/aleph/vm/pool.py	54.54%	4 Missing and 1 partial ⚠️
src/aleph/vm/orchestrator/supervisor.py	42.85%	4 Missing ⚠️
tests/supervisor/test_qemu_instance.py	95.08%	2 Missing and 1 partial ⚠️
src/aleph/vm/orchestrator/cli.py	0.00%	2 Missing ⚠️
src/aleph/vm/hypervisors/firecracker/microvm.py	0.00%	1 Missing ⚠️
tests/supervisor/test_instance.py	92.85%	1 Missing ⚠️
tests/supervisor/views/test_operator.py	93.75%	1 Missing ⚠️
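
The headline figures can be cross-checked against the table with a few lines of arithmetic. This is only a sanity check: it assumes partials count as missing lines, and the number of changed lines (246) is inferred from the percentage rather than stated in the report.

```python
# Missing lines plus partials per file, copied from the table above.
missing_by_file = {
    "src/aleph/vm/models.py": 15 + 4,
    "src/aleph/vm/orchestrator/run.py": 13,
    "src/aleph/vm/pool.py": 4 + 1,
    "src/aleph/vm/orchestrator/supervisor.py": 4,
    "tests/supervisor/test_qemu_instance.py": 2 + 1,
    "src/aleph/vm/orchestrator/cli.py": 2,
    "src/aleph/vm/hypervisors/firecracker/microvm.py": 1,
    "tests/supervisor/test_instance.py": 1,
    "tests/supervisor/views/test_operator.py": 1,
}

total_missing = sum(missing_by_file.values())

# Reported patch coverage, in percent.
reported_patch_coverage = 80.08130

# Inferred, not reported: changed lines = missing / (1 - coverage).
patch_lines = round(total_missing / (1 - reported_patch_coverage / 100))
covered_lines = patch_lines - total_missing
recomputed_coverage = 100 * covered_lines / patch_lines
```

Under these assumptions the per-file numbers add up to the 49 lines in the summary, and 197 covered lines out of 246 reproduce the reported percentage.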

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #793      +/-   ##
==========================================
+ Coverage   64.54%   65.02%   +0.48%     
==========================================
  Files          78       78              
  Lines        7093     7180      +87     
  Branches      598      599       +1     
==========================================
+ Hits         4578     4669      +91     
- Misses       2315     2318       +3     
+ Partials      200      193       -7

olethanh · 2025-04-15T10:07:44Z

I will rework the commits cleanly and update the description, but for me this is ready for review.

mod: VmPool remove unused loop argument

Wait for boot, clean up resources if boot failed

… rotated

Add a new executions list endpoint with more information on the state of the VM, to be used by clients so we can better inform them of their instance state. This new endpoint lists all executions in the pool, running or not (but terminated VMs are removed from the pool).
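
As a rough illustration of the kind of payload such an endpoint could return, here is a minimal sketch that derives one status per execution from its timestamps. The field names (`starting_at`, `started_at`, `stopping_at`, `stopped_at`), the status labels and the response shape are assumptions for illustration only, not the actual schema of /v2/about/executions/list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ExecutionTimes:
    """Hypothetical timestamps recorded for one execution."""

    starting_at: Optional[datetime] = None
    started_at: Optional[datetime] = None
    stopping_at: Optional[datetime] = None
    stopped_at: Optional[datetime] = None


def derive_status(times: ExecutionTimes) -> str:
    """Pick a single status, checking the latest lifecycle step first."""
    if times.stopped_at:
        return "stopped"
    if times.stopping_at:
        return "stopping"
    if times.started_at:
        return "running"
    if times.starting_at:
        return "starting"
    return "defined"


def _iso(moment: Optional[datetime]) -> Optional[str]:
    return moment.isoformat() if moment else None


def list_executions(pool: dict[str, ExecutionTimes]) -> dict[str, dict]:
    """Build a JSON-friendly listing of every execution, running or not."""
    return {
        vm_hash: {
            "status": derive_status(times),
            "starting_at": _iso(times.starting_at),
            "started_at": _iso(times.started_at),
            "stopping_at": _iso(times.stopping_at),
            "stopped_at": _iso(times.stopped_at),
        }
        for vm_hash, times in pool.items()
    }


# An execution that has begun booting but has not finished yet.
booting = ExecutionTimes(starting_at=datetime(2025, 4, 10, 12, 0, tzinfo=timezone.utc))
listing = list_executions({"example-vm": booting})
```

Deriving the status in a single function from the timestamps is one way to avoid reporting inconsistent states between the start and the end of boot.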

Handle VMs that are stopping or starting before creating a new one
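
The decision this commit describes could look roughly like the sketch below. The `State` enum, the `plan_allocation` helper and the step names are hypothetical, and waiting for a stopping VM to finish before creating a new one is an assumption about what "handle" means here.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Hypothetical lifecycle states of an existing execution."""

    STARTING = "starting"
    RUNNING = "running"
    STOPPING = "stopping"
    CRASHED = "crashed"


def plan_allocation(existing: Optional[State]) -> list[str]:
    """Return the steps to take when an allocation is requested for a VM."""
    if existing is None:
        return ["create"]
    if existing in (State.STARTING, State.RUNNING):
        # A usable execution already exists: reuse it instead of
        # starting a second one.
        return ["reuse"]
    if existing is State.STOPPING:
        # Assumption: let the shutdown finish before creating a new one.
        return ["wait_for_stop", "create"]
    # Crashed: release the leftover resources first, then start again.
    return ["clean_up", "create"]


steps_for_crashed = plan_allocation(State.CRASHED)
steps_for_starting = plan_allocation(State.STARTING)
```

The point of the sketch is that "not running" is no longer treated as a single case: starting, stopping and crashed executions each get their own path.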

Prevent the case where the VM was not stopped when running execution.stop() directly instead of VmPool.stop_vm(). Simplify code, add warning.

Change sig of enable_and_start to async
Adapt Firecracker instance test:
- Rename to test_create_firecracker_instance
- Use mocker to patch() settings so it doesn't contaminate other tests
- Ensure it pings properly to confirm it is working
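
To show why patching settings avoids contamination, here is a minimal sketch using `unittest.mock.patch.object` from the standard library in place of the `mocker` fixture from pytest-mock, which wraps the same mechanism and undoes patches at the end of each test. The `settings` object and its `CACHE_DIR` and `ALLOW_NETWORK` attributes are hypothetical stand-ins, not the project's real settings.

```python
import tempfile
from types import SimpleNamespace
from unittest import mock

# Hypothetical module-level settings shared by every test.
settings = SimpleNamespace(CACHE_DIR="/var/cache/shared", ALLOW_NETWORK=True)


def run_isolated() -> tuple[str, bool]:
    """Run with patched settings and a throwaway cache directory."""
    with tempfile.TemporaryDirectory() as tmp, \
            mock.patch.object(settings, "CACHE_DIR", tmp), \
            mock.patch.object(settings, "ALLOW_NETWORK", False):
        # Inside the block the test sees its own values.
        return settings.CACHE_DIR, settings.ALLOW_NETWORK


seen_inside = run_isolated()
# After the block, the original values are restored automatically.
restored = (settings.CACHE_DIR, settings.ALLOW_NETWORK)
```

Because the patches are reverted on exit, a later test starts from the original settings, and the temporary directory ensures that nothing is served from a cache left by an earlier run.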

…the test output

olethanh force-pushed the ol-ALEPH-287-retry-instance branch 9 times, most recently from 7694fb3 to f5d096f Compare April 15, 2025 08:41

olethanh force-pushed the ol-ALEPH-287-retry-instance branch from f5d096f to f38053f Compare April 15, 2025 09:46

olethanh marked this pull request as ready for review April 15, 2025 10:07

olethanh requested a review from nesitor April 15, 2025 10:07

olethanh force-pushed the ol-ALEPH-287-retry-instance branch 2 times, most recently from 50c0f9b to 637df1a Compare April 15, 2025 12:54

olethanh changed the title ~~Ol aleph 287 retry instance~~ Retry instances not running add new Execution list endpoint Apr 15, 2025

olethanh force-pushed the ol-ALEPH-287-retry-instance branch 7 times, most recently from b4f1398 to d50c807 Compare April 22, 2025 13:48

olethanh force-pushed the ol-ALEPH-287-retry-instance branch from d50c807 to e5ec35a Compare April 23, 2025 12:48

olethanh added 5 commits April 24, 2025 09:46

mod: setup_webapp take the VmPool as arg. VmPool remove loop arg

1376911

mod: VmPool remove unused loop argument

enh: Instance start Do not block instance start on ping ALEPH-287

ed7751e

Wait for boot, clean up resources if boot failed

enh: write debug login token to a file so we can use it even when log…

5a12884

… rotated

Instance creation: Reuse starting vm, handle stopping VM

62729d7

Handle VMs that are stopping or starting before creating a new one

olethanh added 12 commits April 24, 2025 09:46

stop controller from execution.stop()

ada05f0

Prevent the case where the VM was not stopped when running execution.stop() directly instead of VmPool.stop_vm(). Simplify code, add warning.

mod remove warning for unused var

d53dc2a

mod remove unused function stop_permanent_execution

f40df87

mod remove dead code on qemu instance

08803ce

mod: remove unused code: ping, instance wait_for_init

24bdb96

Test test_create_qemu_instance

928c1c6

Fix py.test setup message

fdaeb9e

firewall.py List failed rule insertion for debugging

f55c704

test: Patch journal.stream so the output of qemu process is shown in …

56ecad9

…the test output

test: Ensure that test are run inside their own directory

bb1193f

Test new endpoint, add fix for non prepared Execution

9f1acdd

olethanh force-pushed the ol-ALEPH-287-retry-instance branch from e5ec35a to 9f1acdd Compare April 24, 2025 07:46

nesitor assigned olethanh May 15, 2025

nesitor approved these changes May 15, 2025

Merge branch 'main' into ol-ALEPH-287-retry-instance

82b671e
