-
Notifications
You must be signed in to change notification settings - Fork 163
feat(BA-3030): Support service-definition.toml override with optional fields(image, arch, resource) #6751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
82f5b7f to
6f777a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for optional fields (image, architecture, resource) in service-definition.toml, allowing these values to override API request parameters. The main changes refactor the deployment creation flow to distinguish between requested specifications (which may have optional values) and final specifications (with all required values resolved).
Key changes:
- Introduced
RequestedModelRevisionSpecandRequestedImageIdentifiertypes to handle optional field values from API requests - Added merge logic to combine service-definition.toml overrides with API request values
- Refactored
DeploymentCreatorintoDeploymentCreatorandFinalDeploymentCreatorto separate the request and final states - Moved
ModelServiceDefinitionfrommodel_serving.typestodeployment.typesfor better organization
Reviewed Changes
Copilot reviewed 31 out of 33 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/manager/services/deployment/actions/test_service_definition_merge.py | Comprehensive test suite for service definition merge scenarios including success cases and validation errors |
| tests/manager/services/deployment/actions/BUILD | Build configuration for new test directory |
| tests/manager/services/deployment/BUILD | Build configuration for deployment service tests |
| src/ai/backend/manager/sokovan/deployment/types.py | Updated import path for RouteStatus |
| src/ai/backend/manager/sokovan/deployment/route/handlers/*.py | Updated import paths for RouteStatus across route handlers |
| src/ai/backend/manager/sokovan/deployment/deployment_controller.py | Refactored to use new DeploymentCreator and generate ModelRevisionSpec from RequestedModelRevisionSpec |
| src/ai/backend/manager/sokovan/deployment/definition_generator/*.py | Updated all generators to accept RequestedModelRevisionSpec and return ModelRevisionSpec |
| src/ai/backend/manager/services/model_serving/services/model_serving.py | Added validation for required fields (image, architecture, resources) before image resolution |
| src/ai/backend/manager/repositories/deployment/types/endpoint.py | Updated imports for EndpointLifecycle and RouteStatus |
| src/ai/backend/manager/repositories/deployment/storage_source/storage_source.py | Updated import path for ModelServiceDefinition |
| src/ai/backend/manager/repositories/deployment/repository.py | Changed to accept FinalDeploymentCreator instead of DeploymentCreator |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Changed to accept FinalDeploymentCreator instead of DeploymentCreator |
| src/ai/backend/manager/models/endpoint.py | Changed to accept FinalDeploymentCreator instead of DeploymentCreator |
| src/ai/backend/manager/data/model_serving/types.py | Removed ModelServiceDefinition (moved to deployment.types), made resources optional in ServiceConfig |
| src/ai/backend/manager/data/model_serving/creator.py | Made image and architecture fields optional in ModelServiceCreator |
| src/ai/backend/manager/data/deployment/types.py | Added ModelServiceDefinition, RequestedModelRevisionSpec, RequestedImageIdentifier, RequestedResourceSpec with merge logic |
| src/ai/backend/manager/data/deployment/creator.py | Split DeploymentCreator into DeploymentCreator and FinalDeploymentCreator |
| src/ai/backend/manager/api/service.py | Made image, architecture, and resources optional in API request model; added validation in dry-run method |
| changes/6751.fix.md | Changelog entry for the fix |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| raise InvalidAPIParameters("Architecture must be specified for CMD runtime variant.") | ||
| if ( | ||
| requested_image_identifier.canonical == "" | ||
| or requested_image_identifier.canonical is None | ||
| ): | ||
| raise InvalidAPIParameters( | ||
| "Canonical image identifier must be specified for CMD runtime variant." | ||
| ) | ||
| requested_resource_spec = self.resource_spec | ||
| if requested_resource_spec.resource_slots is None: | ||
| raise InvalidAPIParameters("Resource slots must be specified for CMD runtime variant.") |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error messages in to_model_revision_spec() incorrectly reference "CMD runtime variant" but this method is used by all runtime variants, not just CMD. The error messages should be generic.
Suggested changes:
- Line 296: "Architecture must be specified."
- Line 301-302: "Canonical image identifier must be specified."
- Line 306: "Resource slots must be specified."
| raise InvalidAPIParameters("Architecture must be specified for CMD runtime variant.") | |
| if ( | |
| requested_image_identifier.canonical == "" | |
| or requested_image_identifier.canonical is None | |
| ): | |
| raise InvalidAPIParameters( | |
| "Canonical image identifier must be specified for CMD runtime variant." | |
| ) | |
| requested_resource_spec = self.resource_spec | |
| if requested_resource_spec.resource_slots is None: | |
| raise InvalidAPIParameters("Resource slots must be specified for CMD runtime variant.") | |
| raise InvalidAPIParameters("Architecture must be specified.") | |
| if ( | |
| requested_image_identifier.canonical == "" | |
| or requested_image_identifier.canonical is None | |
| ): | |
| raise InvalidAPIParameters( | |
| "Canonical image identifier must be specified." | |
| ) | |
| requested_resource_spec = self.resource_spec | |
| if requested_resource_spec.resource_slots is None: | |
| raise InvalidAPIParameters("Resource slots must be specified.") |
| if self.execution.environ: | ||
| self.execution.environ.update(service_definition.environ) | ||
| else: | ||
| self.execution.environ = service_definition.environ | ||
|
|
||
| return ModelRevisionSpec( | ||
| image_identifier=image_identifier, | ||
| resource_spec=resource_spec, | ||
| mounts=self.mounts, | ||
| execution=self.execution, |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The execution.environ dictionary is being mutated in place within to_model_revision_spec_with_service_definition(). This could lead to unexpected side effects since the original RequestedModelRevisionSpec object's execution.environ will be modified.
Consider creating a new dictionary instead:
if service_definition.environ is not None:
if self.execution.environ:
merged_environ = {**self.execution.environ, **service_definition.environ}
self.execution.environ = merged_environ
else:
self.execution.environ = service_definition.environOr better yet, avoid mutating the self.execution object by creating a new ExecutionSpec with the merged environ.
| if self.execution.environ: | |
| self.execution.environ.update(service_definition.environ) | |
| else: | |
| self.execution.environ = service_definition.environ | |
| return ModelRevisionSpec( | |
| image_identifier=image_identifier, | |
| resource_spec=resource_spec, | |
| mounts=self.mounts, | |
| execution=self.execution, | |
| # Merge environments without mutating self.execution.environ | |
| merged_environ = {} | |
| if self.execution.environ: | |
| merged_environ = {**self.execution.environ, **service_definition.environ} | |
| else: | |
| merged_environ = dict(service_definition.environ) | |
| execution = ExecutionSpec( | |
| command=self.execution.command, | |
| args=self.execution.args, | |
| environ=merged_environ, | |
| working_dir=self.execution.working_dir, | |
| entrypoint=self.execution.entrypoint, | |
| user=self.execution.user, | |
| group=self.execution.group, | |
| umask=self.execution.umask, | |
| capabilities=self.execution.capabilities, | |
| privileged=self.execution.privileged, | |
| shm_size=self.execution.shm_size, | |
| tmpfs=self.execution.tmpfs, | |
| devices=self.execution.devices, | |
| sysctls=self.execution.sysctls, | |
| extra_hosts=self.execution.extra_hosts, | |
| security_opt=self.execution.security_opt, | |
| read_only_root_fs=self.execution.read_only_root_fs, | |
| stop_signal=self.execution.stop_signal, | |
| stop_timeout=self.execution.stop_timeout, | |
| healthcheck=self.execution.healthcheck, | |
| init=self.execution.init, | |
| isolation=self.execution.isolation, | |
| runtime=self.execution.runtime, | |
| resources=self.execution.resources, | |
| hostname=self.execution.hostname, | |
| domainname=self.execution.domainname, | |
| mac_address=self.execution.mac_address, | |
| network_mode=self.execution.network_mode, | |
| networks=self.execution.networks, | |
| dns=self.execution.dns, | |
| dns_options=self.execution.dns_options, | |
| dns_search=self.execution.dns_search, | |
| labels=self.execution.labels, | |
| log_config=self.execution.log_config, | |
| restart_policy=self.execution.restart_policy, | |
| syslog=self.execution.syslog, | |
| volumes=self.execution.volumes, | |
| volume_driver=self.execution.volume_driver, | |
| working_dir_mount=self.execution.working_dir_mount, | |
| secrets=self.execution.secrets, | |
| configs=self.execution.configs, | |
| env_files=self.execution.env_files, | |
| extra=self.execution.extra, | |
| ) | |
| else: | |
| execution = self.execution | |
| return ModelRevisionSpec( | |
| image_identifier=image_identifier, | |
| resource_spec=resource_spec, | |
| mounts=self.mounts, | |
| execution=execution, |
| if self.image is None or self.architecture is None or self.config.resources is None: | ||
| raise InvalidAPIParameters( | ||
| "Image, architecture, and resources must be specified for dry-run service creation" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation is wrong. It should only be validated for the overridden model definition after reading the value from model_definition_path.
|
|
||
| @dataclass | ||
| class DeploymentCreator: | ||
| class FinalDeploymentCreator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like this needs to be separated, could you let me know the reason for separating it?
| def to_model_revision_spec_with_service_definition( | ||
| self, service_definition: ModelServiceDefinition | ||
| ) -> ModelRevisionSpec: | ||
| if service_definition.resource_slots is not None: | ||
| resource_spec = ResourceSpec( | ||
| cluster_mode=self.resource_spec.cluster_mode, | ||
| cluster_size=self.resource_spec.cluster_size, | ||
| resource_slots=service_definition.resource_slots, | ||
| resource_opts=self.resource_spec.resource_opts, | ||
| ) | ||
| else: | ||
| if self.resource_spec.resource_slots is None: | ||
| raise InvalidAPIParameters( | ||
| "Resource slots must be specified in either requested model revision or model service definition." | ||
| ) | ||
| resource_spec = ResourceSpec( | ||
| cluster_mode=self.resource_spec.cluster_mode, | ||
| cluster_size=self.resource_spec.cluster_size, | ||
| resource_slots=self.resource_spec.resource_slots, | ||
| resource_opts=self.resource_spec.resource_opts, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something seems to have gone wrong with the implementation right now.
7f2ba22 to
68ce6a9
Compare
Co-authored-by: octodog <[email protected]>
| """Action to create a new legacy deployment(Model Service).""" | ||
|
|
||
| creator: DeploymentCreator | ||
| request: DeploymentCreateRequest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is strange. It looks like a request is going in as an internal action.
| vfolder_id=request.requested_model_revision.mounts.model_vfolder_id, | ||
| model_definition_path=request.requested_model_revision.mounts.model_definition_path, | ||
| ) | ||
| print(model_revision.model_dump_json()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use print in production code.
| @override | ||
| async def validate_revision(self, revision: ModelRevisionSpec) -> None: | ||
| """ | ||
| Validate CUSTOM variant revision by checking model definition. | ||
| """ | ||
| definition_files: DefinitionFiles = ( | ||
| await self._deployment_repository.fetch_definition_files( | ||
| vfolder_id=revision.mounts.model_vfolder_id, | ||
| model_definition_path=revision.mounts.model_definition_path, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I question whether this method is needed.
| def _override_revision( | ||
| self, | ||
| requested_revision: RequestedModelRevisionSpec, | ||
| service_definition: ModelServiceDefinition, | ||
| ) -> ModelRevisionSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to guarantee overriding something within the class structure?
resolves #6733 (BA-3030)
Checklist: (if applicable)
ai.backend.testdocsdirectory📚 Documentation preview 📚: https://sorna--6751.org.readthedocs.build/en/6751/
📚 Documentation preview 📚: https://sorna-ko--6751.org.readthedocs.build/ko/6751/