Submit Celery runs based on run_mode instead of API version #956

Merged
merged 28 commits into from
Jan 30, 2024

@sambles sambles commented Jan 22, 2024

Celery run types are now based on run_mode instead of API version

  • Switch the Celery workflow selection to use the model's run_mode field instead of the API version used (/v1/analyses/{id}/run vs /v2/analyses/{id}/run).
  • Renamed the worker environment variable OASIS_API_VERSION to OASIS_RUN_MODE. This selects which run_mode workflow a container runs in (worker images 2.3.0 and up).

Documentation

Celery Workflows

OasisPlatform versions 2.3.0 and above support two types of execution workflow, selected using a new run_mode field (either V1 or V2) in an AnalysisModel.

Example: GET /v2/models/1/

{
  "id": 1,
  "supplier_id": "OasisLMF",
  "model_id": "PiWind",
  "version_id": "1.28.4",
       ...
  "run_mode": "V1"
}

Each model must have this field set before an analysis is created; otherwise the server returns an HTTP 400 - Error: Bad Request

{
  "model": [ "Model pk \"5\" - 'run_mode' must not be null"]
}
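
As a rough client-side illustration of the rule above, the following sketch lists which models still need run_mode set before analyses are created against them. The helper name `models_missing_run_mode` is hypothetical and not part of the platform; it only inspects dictionaries shaped like the GET /v2/models/ response shown earlier.

```python
from typing import Iterable, List


def models_missing_run_mode(models: Iterable[dict]) -> List[int]:
    """Return the ids of models whose run_mode is unset (None or absent)."""
    return [m["id"] for m in models if m.get("run_mode") is None]


# Example payloads shaped like the model JSON above (values are illustrative).
models = [
    {"id": 1, "model_id": "PiWind", "run_mode": "V1"},
    {"id": 5, "model_id": "PiWind", "run_mode": None},
]
print(models_missing_run_mode(models))  # ids that need run_mode set first
```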

Run Mode V1 - Single server execution

When a model is set to "run_mode": "V1", a single task (generate_inputs or generate_losses) is dispatched to a single worker. The worker processes that task and returns the result to a WorkerMonitor, which stores the output and notifies the OasisPlatform API server.

graph TD;
    ApiServer-->Worker_v1;
    Worker_v1-->WorkerMonitor_v1;
    WorkerMonitor_v1-->ApiServer;

This is the same workflow used in all 1.x.x OasisPlatform versions. If a worker image is versioned from a stable branch starting with 1, e.g.

  • 1.15.x
  • 1.23.x
  • 1.26.x
  • 1.27.x
  • 1.28.x

then it can only support the V1 workflow, and the model's run_mode needs to be set accordingly.

Run Mode V2 - Distributed execution

Models set to "run_mode": "V2" use a newer workflow that can scale horizontally to run on n workers in parallel.

WARNING: OasisPlatform 2.3.0 is not compatible with worker versions 2.1.0, 2.1.1, 2.1.2, 2.2.0 and 2.2.1. To use horizontal scaling, the worker version must be 2.3.0 or above.

graph TD;
    ApiServer-->TaskController
    TaskController-->Worker_v2_node-1;
    TaskController-->Worker_v2_node-2;
    TaskController-->Worker_v2_node-n;
    Worker_v2_node-1-->WorkerMonitor_v2;
    Worker_v2_node-2-->WorkerMonitor_v2;
    Worker_v2_node-n-->WorkerMonitor_v2;
    WorkerMonitor_v2-->ApiServer;
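
The worker version rules from both run mode sections can be summarised in a small lookup. This is only a sketch of how I read the description: `parse_version` and `supported_run_modes` are hypothetical helpers, 2.3.0+ workers are taken to support either mode (selected with OASIS_RUN_MODE), and treating every 2.x release below 2.3.0 as unsupported is an assumption, since only 2.1.0 to 2.2.1 are named in the warning.

```python
import re
from typing import Set, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor[.patch]' into integers; a non-numeric patch ('1.28.x') becomes 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    major, minor, patch = (int(p) if p is not None else 0 for p in match.groups())
    return (major, minor, patch)


def supported_run_modes(worker_version: str) -> Set[str]:
    """Run modes a worker image can serve on OasisPlatform 2.3.0+ (illustrative only)."""
    parsed = parse_version(worker_version)
    if parsed[0] == 1:
        return {"V1"}          # 1.x.x workers: single server workflow only
    if parsed >= (2, 3, 0):
        return {"V1", "V2"}    # 2.3.0+: mode chosen per container via OASIS_RUN_MODE
    return set()               # assumption: all 2.x below 2.3.0 treated as incompatible


print(sorted(supported_run_modes("1.28.4")))
print(sorted(supported_run_modes("2.3.0")))
```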

Versioned API (v1 / v2)

There are now two API versions for backwards compatibility, which are also labelled v1 and v2. The endpoints starting with v1 mirror the API specification from the 1.x.x OasisPlatform releases and do not include the newer additions needed for horizontal scaling.

v1 endpoints
Only support "run_mode": "V1".

 /v1/models/    
    ...
 /v1/portfolios/
    ...
 /v1/analyses/
     ...

Validation has been added to ensure the v1 endpoints are 'locked' to models marked as "run_mode": "V1". The /v1/models/ list operation filters out any models marked as V2.

If a model attached to an analysis is switched from "run_mode": "V1" to "run_mode": "V2", then a POST to either /v1/analyses/{id}/generate_inputs/ or /v1/analyses/{id}/run/ will return HTTP 400 - Error: Bad Request

{
  "model": [
    "Model pk 1' - Unsupported Operation, 'run_mode' must be 'V1', not 'V2'"
  ]
}

v2 endpoints
Support both "run_mode": "V1" and "run_mode": "V2". When a request is posted to /v2/analyses/{id}/run/, the API server will check the value stored in the attached model's run_mode field and dispatch the Celery task matching that workflow.

 /v2/models/    
    ...
 /v2/portfolios/
    ...
 /v2/analyses/
     ...
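
The selection step described for the v2 endpoints can be sketched as a lookup keyed on run_mode. This is not the server's actual code: `run_single_worker`, `run_distributed` and `dispatch_run` are placeholder names, and the two workflow functions just return a label where the real server would submit Celery tasks.

```python
from typing import Callable, Dict


def run_single_worker(analysis_id: int) -> str:
    # Placeholder for the V1 workflow: one task sent to one worker.
    return f"analysis {analysis_id}: single server workflow (V1)"


def run_distributed(analysis_id: int) -> str:
    # Placeholder for the V2 workflow: work fanned out across n workers.
    return f"analysis {analysis_id}: distributed workflow (V2)"


WORKFLOWS: Dict[str, Callable[[int], str]] = {
    "V1": run_single_worker,
    "V2": run_distributed,
}


def dispatch_run(analysis_id: int, model: dict) -> str:
    """Pick the workflow from the attached model's run_mode, not from the URL prefix."""
    run_mode = model.get("run_mode")
    if run_mode not in WORKFLOWS:
        raise ValueError(f"Model pk {model.get('id')} has unsupported run_mode {run_mode!r}")
    return WORKFLOWS[run_mode](analysis_id)


print(dispatch_run(10, {"id": 1, "run_mode": "V1"}))
print(dispatch_run(11, {"id": 2, "run_mode": "V2"}))
```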

However, there is one exception: the endpoint /v2/analyses/{id}/generate_and_run/, which executes both input generation and losses in a single call. This is only supported in the distributed workflow, so it is 'locked' to V2 only.

If a request is sent to an analysis linked to a model with "run_mode": "V1", then an HTTP 400 - Error: Bad Request is returned.

{
  "model": [
    "Model pk \"1\" - Unsupported Operation, \"run_mode\" must be \"V2\", not \"V1\""
  ]
}
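
Taken together, the v1 and v2 endpoint rules amount to a table of which run modes each endpoint accepts. The sketch below encodes that table; `ALLOWED_RUN_MODES` and `validate_request` are illustrative names, and the error text only approximates the server messages quoted above.

```python
from typing import Dict, Optional, Set, Tuple

# (API version, action) -> run modes the endpoint accepts, as described in this PR.
ALLOWED_RUN_MODES: Dict[Tuple[str, str], Set[str]] = {
    ("v1", "generate_inputs"): {"V1"},
    ("v1", "run"): {"V1"},
    ("v2", "generate_inputs"): {"V1", "V2"},
    ("v2", "run"): {"V1", "V2"},
    ("v2", "generate_and_run"): {"V2"},
}


def validate_request(api_version: str, action: str, model: dict) -> Optional[dict]:
    """Return None if the request is allowed, else a 400-style error body."""
    allowed = ALLOWED_RUN_MODES[(api_version, action)]
    run_mode = model.get("run_mode")
    if run_mode in allowed:
        return None
    expected = " or ".join(f"'{mode}'" for mode in sorted(allowed))
    message = (
        f"Model pk {model['id']} - Unsupported Operation, "
        f"'run_mode' must be {expected}, not '{run_mode}'"
    )
    return {"model": [message]}


print(validate_request("v2", "run", {"id": 1, "run_mode": "V1"}))
print(validate_request("v2", "generate_and_run", {"id": 1, "run_mode": "V1"}))
```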

How a model's run_mode is set.

1. Directly to model

Both /v1/models/ and /v2/models/ can update the run_mode field, either by POST or PATCH.

{
  "supplier_id": " .. ",
  "model_id": " .. ",
  "version_id": " .. ",
  "run_mode": "V1"
}

2. Auto-registration

If worker containers are set to auto-register, the WorkerMonitor knows which model queue a container is listening on, so it automatically sets run_mode to match.

This works because WorkerMonitor-V2 can only receive registration tasks from workers connected to the priority queue Celery-v2. All worker containers running in distributed mode send their registration task here, so we know run_mode should be V2.

The same is true for WorkerMonitor-V1 and Celery (the non-priority queue), which is the default in all 1.x.x workers. Workers deployed for single server execution send their auto-registration task here instead, so run_mode must be V1.
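
The inference made during auto-registration reduces to a mapping from the queue a registration task arrived on to a run_mode. A minimal sketch, using the queue names as written in this description; `run_mode_for_queue` is a hypothetical helper, not the WorkerMonitor's real function.

```python
# Queue a registration task arrived on -> run_mode to store on the model.
QUEUE_RUN_MODE = {
    "Celery": "V1",     # non-priority queue, default for 1.x.x workers
    "Celery-v2": "V2",  # priority queue used by distributed workers
}


def run_mode_for_queue(queue_name: str) -> str:
    """Infer a model's run_mode from the queue its worker registered on."""
    try:
        return QUEUE_RUN_MODE[queue_name]
    except KeyError:
        raise ValueError(f"No run_mode known for queue {queue_name!r}") from None


print(run_mode_for_queue("Celery"))
print(run_mode_for_queue("Celery-v2"))
```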

3. URL parameter run_mode_override

The endpoints that support both V1 and V2 run modes have a URL parameter, run_mode_override, which forces a value and ignores the field set on an AnalysisModel. This only applies to two endpoints:

  • /v2/analyses/{id}/generate_inputs/?run_mode_override={V1|V2}
  • /v2/analyses/{id}/run/?run_mode_override={V1|V2}

WARNING: Using this will bypass the run_mode validation checks. If no worker containers are set up to process tasks for the selected run_mode, then an analysis will be stuck with the status INPUTS_GENERATION_QUEUED or RUN_QUEUED.
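
For clients, a small URL builder shows how the override parameter attaches to the two endpoints listed above. `analysis_action_url` is a hypothetical helper, and the base URL is whatever your deployment uses (localhost:8000 here is just the example host from this description).

```python
from typing import Optional
from urllib.parse import urlencode

OVERRIDE_ACTIONS = {"generate_inputs", "run"}  # the only endpoints accepting the override
RUN_MODES = {"V1", "V2"}


def analysis_action_url(
    base_url: str,
    analysis_id: int,
    action: str,
    run_mode_override: Optional[str] = None,
) -> str:
    """Build a /v2/analyses/{id}/{action}/ URL, optionally forcing a run_mode."""
    if action not in OVERRIDE_ACTIONS:
        raise ValueError(f"run_mode_override is not supported on {action!r}")
    url = f"{base_url.rstrip('/')}/v2/analyses/{analysis_id}/{action}/"
    if run_mode_override is None:
        return url
    if run_mode_override not in RUN_MODES:
        raise ValueError(f"run_mode_override must be one of {sorted(RUN_MODES)}")
    return f"{url}?{urlencode({'run_mode_override': run_mode_override})}"


print(analysis_action_url("http://localhost:8000", 3, "run", "V1"))
```

Because the override skips validation, a client would be wise to confirm that workers for the forced run_mode are actually deployed before posting to the resulting URL.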

4. Posting model_settings

The PR OasisLMF/ODS_Tools#86 expanded model_settings.json to include a new model_run_mode key.

If this key is included in the settings, then when the data is posted to /v1/models/{id}/settings/ the run_mode value is updated to match the settings data.

Example:

GET /v1/models/1/
{
  "id": 1,
  "supplier_id": "OasisLMF",
  "model_id": "PiWind",
  "version_id": "1.28.4",
  "created": "2024-01-24T16:24:16.438134Z",
  "modified": "2024-01-24T16:36:21.791734Z",
  "data_files": [],
  "settings": "http://localhost:8000/v1/models/1/settings/",
  "versions": "http://localhost:8000/v1/models/1/versions/",
  "run_mode": "V1"
}
POST /v1/models/1/settings/
{
  "model_run_mode": "V2",
  "model_settings": {},
  "lookup_settings": {}
}
GET /v2/models/1/
{
  "id": 1,
  "supplier_id": "OasisLMF",
  "model_id": "PiWind",
  "version_id": "1.28.4",
  "created": "2024-01-24T16:24:16.438134Z",
  "modified": "2024-01-24T16:36:21.791734Z",
  "data_files": [],
  "settings": "http://localhost:8000/v2/models/1/settings/",
  "versions": "http://localhost:8000/v2/models/1/versions/",
  "run_mode": "V2"
}
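
The effect of the settings POST in the example above can be mimicked on plain dictionaries. This is a sketch of the described behaviour rather than the server code, and `apply_model_settings` is a made-up name.

```python
def apply_model_settings(model: dict, settings: dict) -> dict:
    """Return a copy of the model with run_mode synced to settings['model_run_mode'], if present."""
    updated = dict(model)
    if "model_run_mode" in settings:
        updated["run_mode"] = settings["model_run_mode"]
    return updated


model = {"id": 1, "model_id": "PiWind", "run_mode": "V1"}
settings = {"model_run_mode": "V2", "model_settings": {}, "lookup_settings": {}}
print(apply_model_settings(model, settings)["run_mode"])
```

Settings without a model_run_mode key leave the existing run_mode untouched, which matches the "if this key is included" wording above.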

@sambles sambles marked this pull request as draft January 22, 2024 15:12
@sambles sambles linked an issue Jan 22, 2024 that may be closed by this pull request
@sambles sambles self-assigned this Jan 22, 2024

sambles commented Jan 22, 2024

Questions & testing:

  • Should the registration raise an error if run_mode is already set? --- check that the task registration doesn't bounce between V1 and V2 --> multiple v1/v2 deployments under the same model name should not be allowed.
  • Should resource_file endpoint be removed?
  • Restore 1 chunk mode in V2 execution?

@sambles sambles marked this pull request as ready for review January 22, 2024 15:41

sambles commented Jan 24, 2024

Need to check and raise a validation error if AnalysisModel.run_mode = "V2" and either /v1/analyses/{id}/run/ or /v1/analyses/{id}/generate_inputs/ is called

@sambles sambles added Enhancement Small improvement or refinement. production labels Jan 24, 2024

sambles commented Jan 25, 2024

rename OASIS_API_VERSION to OASIS_RUN_MODE matching the new selector in the API

@sambles sambles merged commit 645b2e4 into main Jan 30, 2024
26 checks passed
@sambles sambles deleted the feature/951-exec-v1-models-from-new-api branch January 30, 2024 14:46
@awsbuild awsbuild added this to the 2.3.0 milestone Feb 6, 2024
Labels
Enhancement Small improvement or refinement. production
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Allow 'single instance' execution from v2 api
2 participants