Submit Celery runs based on run_mode instead of API version #956

sambles · 2024-01-22T15:12:14Z

Celery run types are now based on run_mode instead of API version

Switch the Celery workflow selection to use the model.run_mode field instead of the API version used (/v1/analyses/{id}/run vs /v2/analyses/{id}/run).
Renamed the worker environment variable OASIS_API_VERSION to OASIS_RUN_MODE. This selects which run_mode workflow a container runs in (worker images 2.3.0 and above).

Documentation

Celery Workflows

OasisPlatform versions 2.3.0 and above support two types of execution workflow, selected using a new run_mode field (either V1 or V2) on an AnalysisModel.

Example: GET /v2/models/1/

{
  "id": 1,
  "supplier_id": "OasisLMF",
  "model_id": "PiWind",
  "version_id": "1.28.4",
       ...
  "run_mode": "V1"
}

Each model must have this field set before an analysis is created; otherwise the server returns an HTTP 400 - Error: Bad Request

{
  "model": ["Model pk \"5\" - 'run_mode' must not be null"]
}
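
A client can catch this case before creating an analysis. The following is a minimal sketch, assuming the model has already been fetched as a dictionary shaped like the GET /v2/models/{id}/ response; the helper name `ensure_run_mode_set` is hypothetical and not part of the platform.

```python
VALID_RUN_MODES = {"V1", "V2"}


def ensure_run_mode_set(model: dict) -> str:
    """Return the model's run_mode, or raise if it is missing or not V1/V2.

    Hypothetical client-side pre-check; the server performs its own validation.
    """
    run_mode = model.get("run_mode")
    if run_mode not in VALID_RUN_MODES:
        raise ValueError(
            f"Model pk {model.get('id')!r}: 'run_mode' must be 'V1' or 'V2', "
            f"got {run_mode!r}"
        )
    return run_mode


# Example: a model dictionary with the field set passes the check.
model = {"id": 1, "supplier_id": "OasisLMF", "model_id": "PiWind", "run_mode": "V1"}
checked_mode = ensure_run_mode_set(model)
```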

Run Mode V1 - Single server execution

When a model is set to "run_mode": "V1", a single task (generate_inputs or generate_losses) is dispatched to a single worker, which processes the task and returns the result to a WorkerMonitor. The WorkerMonitor stores the output and notifies the OasisPlatform API server.

graph TD;
    ApiServer-->Worker_v1;
    Worker_v1-->WorkerMonitor_v1;
    WorkerMonitor_v1-->ApiServer;

This is the same workflow used in all 1.x.x OasisPlatform versions. If a worker image is versioned from a stable branch starting with 1, e.g.

1.15.x
1.23.x
1.26.x
1.27.x
1.28.x

then it can only support the V1 workflow, and the model's run_mode must be set to V1.

Run Mode V2 - Distributed execution

Models set to "run_mode": "V2" use a newer workflow that scales horizontally to run on n workers in parallel.

WARNING: OasisPlatform 2.3.0 is not compatible with worker versions 2.1.0, 2.1.1, 2.1.2, 2.2.0 and 2.2.1. To use horizontal scaling, the worker version must be 2.3.0 or above.

graph TD;
    ApiServer-->TaskController
    TaskController-->Worker_v2_node-1;
    TaskController-->Worker_v2_node-2;
    TaskController-->Worker_v2_node-n;
    Worker_v2_node-1-->WorkerMonitor_v2;
    Worker_v2_node-2-->WorkerMonitor_v2;
    Worker_v2_node-n-->WorkerMonitor_v2;
    WorkerMonitor_v2-->ApiServer;
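
The worker version rules for the two run modes can be summarised in a small lookup. This is only a sketch of the compatibility notes in this description, not platform code: the function names are hypothetical, only fully numeric versions such as 1.28.4 are parsed, and treating every 2.x version below 2.3.0 as incompatible is an assumption (only 2.1.0 to 2.2.1 are listed explicitly).

```python
def parse_version(version: str) -> tuple:
    """Turn '2.3.0' into (2, 3, 0); only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def supported_run_modes(worker_version: str) -> set:
    """Run modes a worker image can serve against OasisPlatform 2.3.0+."""
    version = parse_version(worker_version)
    if version[0] == 1:
        # 1.x.x stable branches only know the single server workflow.
        return {"V1"}
    if version >= (2, 3, 0):
        # Assumption: OASIS_RUN_MODE selects one of these per container.
        return {"V1", "V2"}
    # Assumption: all other 2.x workers are treated as incompatible.
    return set()


for image_version in ("1.28.4", "2.2.1", "2.3.0"):
    print(image_version, sorted(supported_run_modes(image_version)))
```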

Versioned API (v1 / v2)

For backwards compatibility there are now two API versions, also labelled v1 and v2. The endpoints starting with v1 mirror the API specification from the 1.x.x OasisPlatforms and do not include the newer additions needed for horizontal scaling.

v1 endpoints
Only support "run_mode": "V1"

 /v1/models/    
    ...
 /v1/portfolios/
    ...
 /v1/analyses/
     ...

Validation has been added to ensure the v1 endpoints are 'locked' to only using models marked as "run_mode": "V1".
The /v1/models/ list operation filters out any models marked as V2.

If a model attached to an analysis is switched from "run_mode": "V1" to "run_mode": "V2", then a POST
to either /v1/analyses/{id}/generate_inputs/ or /v1/analyses/{id}/run/ will return an HTTP 400 - Error: Bad Request

{
  "model": [
    "Model pk \"1\" - Unsupported Operation, 'run_mode' must be 'V1', not 'V2'"
  ]
}
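
The two v1 behaviours described here (list filtering and the run-time check) can be sketched as plain functions. This is an illustration under stated assumptions, not the server implementation: the function names are hypothetical, models are plain dictionaries, and the error text only mirrors the example response, so the exact server wording may differ.

```python
from typing import Optional


def v1_visible_models(models: list) -> list:
    """Mimic the /v1/models/ list operation: drop models marked as V2."""
    return [m for m in models if m.get("run_mode") != "V2"]


def validate_v1_operation(model: dict) -> Optional[dict]:
    """Return a 400-style error body if the model is not V1, else None."""
    run_mode = model.get("run_mode")
    if run_mode != "V1":
        message = (
            f"Model pk \"{model['id']}\" - Unsupported Operation, "
            f"'run_mode' must be 'V1', not '{run_mode}'"
        )
        return {"model": [message]}
    return None


models = [{"id": 1, "run_mode": "V1"}, {"id": 2, "run_mode": "V2"}]
visible = v1_visible_models(models)
error = validate_v1_operation(models[1])
```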

v2 endpoints
Support both "run_mode": "V1" and "run_mode": "V2". When a request is posted to /v2/analyses/{id}/run/, the API server will check the value stored in the attached model's run_mode field and dispatch the Celery task matching that workflow.

 /v2/models/    
    ...
 /v2/portfolios/
    ...
 /v2/analyses/
     ...

However, there is one exception: the endpoint /v2/analyses/{id}/generate_and_run/, which executes both input generation and losses in a single call. This is only supported in the distributed workflow, so it is 'locked' to V2 only.

If a request is sent to an analysis linked to a model with "run_mode": "V1", then an HTTP 400 - Error: Bad Request is returned.

{
  "model": [
    "Model pk \"1\" - Unsupported Operation, \"run_mode\" must be \"V2\", not \"V1\""
  ]
}
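
The selection rule for the v2 endpoints can be condensed into one function. This is a sketch of the behaviour described in this section, not the actual view code: `select_v2_workflow` and the operation names passed to it are hypothetical, and a `ValueError` stands in for the HTTP 400 response.

```python
def select_v2_workflow(operation: str, model_run_mode: str) -> str:
    """Pick the workflow ('V1' or 'V2') for a request to a v2 analyses endpoint.

    operation is one of 'generate_inputs', 'run' or 'generate_and_run'.
    """
    if model_run_mode not in ("V1", "V2"):
        raise ValueError("'run_mode' must not be null")
    if operation == "generate_and_run" and model_run_mode != "V2":
        # generate_and_run only exists in the distributed workflow.
        raise ValueError(
            f"Unsupported Operation, 'run_mode' must be 'V2', not '{model_run_mode}'"
        )
    # For generate_inputs and run, the model's own run_mode decides.
    return model_run_mode


workflow_for_run = select_v2_workflow("run", "V1")
```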

How a model's `run_mode` is set.

1. Directly to model

Both /v1/models/ and /v2/models/ can update the run_mode field, either by POST or PATCH.

{
  "supplier_id": " .. ",
  "model_id": " .. ",
  "version_id": " .. ",
  "run_mode": "V1"
}
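
As an illustration, a PATCH that changes only run_mode could be prepared with the standard library as below. This is a sketch under assumptions: the base URL is the localhost address used in the examples in this description, the helper name is hypothetical, the request is only built (not sent), and the authentication headers a real deployment needs are omitted.

```python
import json
from urllib.request import Request


def build_run_mode_patch(base_url: str, model_id: int, run_mode: str,
                         api_version: str = "v2") -> Request:
    """Build (but do not send) a PATCH request that updates a model's run_mode."""
    body = json.dumps({"run_mode": run_mode}).encode("utf-8")
    return Request(
        url=f"{base_url}/{api_version}/models/{model_id}/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


request = build_run_mode_patch("http://localhost:8000", 1, "V2")
# Sending it would be urllib.request.urlopen(request), given a reachable server
# and valid credentials.
```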

2. auto-registration

If worker containers are set to auto-register, the WorkerMonitor knows which model queue a container is listening on, so it automatically sets run_mode to match.

This works because the WorkerMonitor-V2 can only receive registration tasks from workers connected to the priority queue Celery-v2. All worker containers running in distributed mode send their registration task here, so we know run_mode should be V2.

The same is true for WorkerMonitor-V1 and Celery (the non-priority queue), which is the default in all 1.x.x workers.
Workers deployed for single server workflow execution send their auto-registration task here instead, so run_mode must be V1.
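
The inference made during auto-registration amounts to a lookup from the queue a registration task arrived on to a run_mode. The sketch below illustrates that idea only: the function name is hypothetical, and the queue names are taken from the text above, so their exact spelling and casing in a deployment is an assumption (the lookup is therefore case-insensitive).

```python
# Registration queue -> run_mode, as described above.
QUEUE_RUN_MODES = {
    "celery-v2": "V2",  # priority queue read by WorkerMonitor-V2
    "celery": "V1",     # non-priority queue read by WorkerMonitor-V1
}


def run_mode_for_queue(queue_name: str) -> str:
    """Return the run_mode implied by the queue a registration task arrived on."""
    try:
        return QUEUE_RUN_MODES[queue_name.lower()]
    except KeyError:
        raise ValueError(f"Unknown registration queue: {queue_name!r}") from None


registered_mode = run_mode_for_queue("Celery-v2")
```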

3. URL parameter `run_mode_override`

The endpoints that support both V1 and V2 run modes have a URL parameter, run_mode_override, which forces a value and ignores the field set on the AnalysisModel. This applies to only two endpoints:

/v2/analyses/{id}/generate_inputs/?run_mode_override={V1|V2}
/v2/analyses/{id}/run/?run_mode_override={V1|V2}

WARNING: Using this bypasses the run_mode validation checks. If no worker containers are set up to process tasks for the selected run_mode, the analysis will be stuck with a status of INPUTS_GENERATION_QUEUED or RUN_QUEUED.
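
A small helper makes it harder to attach the override to an endpoint that does not accept it. This is a minimal sketch: `override_url` is a hypothetical name, and it only builds the path and query string for the two endpoints listed above.

```python
from urllib.parse import urlencode

# Only these two v2 operations accept run_mode_override.
OVERRIDE_OPERATIONS = {"generate_inputs", "run"}


def override_url(analysis_id: int, operation: str, run_mode: str) -> str:
    """Build the path and query string for a forced run_mode."""
    if operation not in OVERRIDE_OPERATIONS:
        raise ValueError(f"run_mode_override is not supported on {operation!r}")
    if run_mode not in ("V1", "V2"):
        raise ValueError(f"run_mode_override must be 'V1' or 'V2', got {run_mode!r}")
    query = urlencode({"run_mode_override": run_mode})
    return f"/v2/analyses/{analysis_id}/{operation}/?{query}"


forced_url = override_url(3, "run", "V1")
```

Because the override skips validation, it is worth confirming that a worker for the forced run_mode is actually deployed before posting to the resulting URL.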

4. posting model_settings

The PR OasisLMF/ODS_Tools#86 expanded model_settings.json to include a new model_run_mode key.

If this key is included in the settings, then when the data is posted to /v1/models/{id}/settings/ the run_mode value is updated to match the settings data.

Example:

GET /v1/models/1/
{
  "id": 1,
  "supplier_id": "OasisLMF",
  "model_id": "PiWind",
  "version_id": "1.28.4",
  "created": "2024-01-24T16:24:16.438134Z",
  "modified": "2024-01-24T16:36:21.791734Z",
  "data_files": [],
  "settings": "http://localhost:8000/v1/models/1/settings/",
  "versions": "http://localhost:8000/v1/models/1/versions/",
  "run_mode": "V1"
}

POST /v1/models/1/settings/
{
  "model_run_mode": "V2",
  "model_settings": {},
  "lookup_settings": {}
}

GET /v2/models/1/
{
  "id": 1,
  "supplier_id": "OasisLMF",
  "model_id": "PiWind",
  "version_id": "1.28.4",
  "created": "2024-01-24T16:24:16.438134Z",
  "modified": "2024-01-24T16:36:21.791734Z",
  "data_files": [],
  "settings": "http://localhost:8000/v2/models/1/settings/",
  "versions": "http://localhost:8000/v2/models/1/versions/",
  "run_mode": "V2"
}
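
The effect of posting settings can be pictured as a small update rule. The sketch below is an illustration only, with a hypothetical function name and plain dictionaries in place of the server's model objects: run_mode changes only when the posted settings contain a model_run_mode key.

```python
def apply_posted_settings(model: dict, settings: dict) -> dict:
    """Return a copy of the model with run_mode synced from posted settings."""
    updated = dict(model)
    if "model_run_mode" in settings:
        # The settings value wins over whatever was stored on the model.
        updated["run_mode"] = settings["model_run_mode"]
    return updated


model_before = {"id": 1, "model_id": "PiWind", "run_mode": "V1"}
posted = {"model_run_mode": "V2", "model_settings": {}, "lookup_settings": {}}
model_after = apply_posted_settings(model_before, posted)
```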

sambles · 2024-01-22T15:28:11Z

Questions & testing:

Should the registration raise an error if run_mode is already set? Check that the task registration doesn't bounce between V1 and V2: multiple v1/v2 deployments under the same model name should not be allowed.
Should resource_file endpoint be removed?
Restore 1 chunk mode in V2 execution?

sambles · 2024-01-24T11:01:54Z

Need to check and raise a validation error if AnalysisModel.run_mode = "V2" and either /v1/analyses/{id}/run/ or /v1/analyses/{id}/generate_inputs/ is called

sambles · 2024-01-25T08:49:49Z

rename OASIS_API_VERSION to OASIS_RUN_MODE matching the new selector in the API

sambles added 4 commits January 22, 2024 10:44

Remove BOTH from model "run_mode"

fafa36c

Update AnalysisModel run_mode if in posted model_settings

451d85f

Same for v1 API

0acde31

Set run_mode based on "model.run_mode" or "param_url" + validation

47933ac

sambles marked this pull request as draft January 22, 2024 15:12

sambles linked an issue Jan 22, 2024 that may be closed by this pull request

Allow 'single instance' execution from v2 api #951

Closed

sambles added 2 commits January 22, 2024 15:18

PEP and clean up

ae30b25

Flake8 cleanup

4bb84d0

sambles self-assigned this Jan 22, 2024

sambles added 2 commits January 22, 2024 15:36

Fix run_mode validation gap

bd7c05f

Fix bad run_mode_override value

293365a

sambles marked this pull request as ready for review January 22, 2024 15:41

sambles added 10 commits January 23, 2024 14:08

Fixes and updated tests

dbefad4

fix typo

697c395

Fixes and test updates

ba5b36f

restore one_chunk mode

f047f3a

Disable sub-path

b96bc48

Split workers to seperate api models

999a764

update compose

e2c5f1f

Revert "Split workers to seperate api models"

15c4d2e

This reverts commit 999a764.

Added check to test if run_mode is set after posting model settings

5ee1cbc

clean up

6a239cc

sambles added 6 commits January 24, 2024 13:07

don't allow non V1 models to attached V1 analysis, and throw bad op 4…

cd0bec3

…00, incase this somehow happened

f

b3931ed

Fix

ffa7b7f

pep & flake

e6bf7b7

retest with ODS-tools merged

09d7a07

Fix typo in error msg

491c759

sambles added the Enhancement and production labels Jan 24, 2024

sambles added 4 commits January 25, 2024 10:56

Rename env var OASIS_API_VERSION -> OASIS_RUN_MODE

c91defc

Fix accidental commit

a95b85c

retest

b8fe3a5

Fix missing ods-tools build step in schema testing

a8c22aa

sambles merged commit 645b2e4 into main Jan 30, 2024
26 checks passed

sambles deleted the feature/951-exec-v1-models-from-new-api branch January 30, 2024 14:46

awsbuild added this to the 2.3.0 milestone Feb 6, 2024

sambles mentioned this pull request Apr 11, 2024

Update and write documentation on auto scaling and platform V2 #1021

Open
