Skip to content

Conversation

@seedspirit
Copy link
Contributor

@seedspirit seedspirit commented Nov 13, 2025

resolves #6733 (BA-3030)

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--6751.org.readthedocs.build/en/6751/


📚 Documentation preview 📚: https://sorna-ko--6751.org.readthedocs.build/ko/6751/

@github-actions github-actions bot added size:XL 500~ LoC comp:manager Related to Manager component area:docs Documentations labels Nov 13, 2025
@seedspirit seedspirit force-pushed the fix/BA-3030 branch 2 times, most recently from 82f5b7f to 6f777a0 Compare November 13, 2025 09:20
@seedspirit seedspirit marked this pull request as ready for review November 13, 2025 09:21
Copilot AI review requested due to automatic review settings November 13, 2025 09:21
Copilot finished reviewing on behalf of seedspirit November 13, 2025 09:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for optional fields (image, architecture, resource) in service-definition.toml, allowing these values to override API request parameters. The main changes refactor the deployment creation flow to distinguish between requested specifications (which may have optional values) and final specifications (with all required values resolved).

Key changes:

  • Introduced RequestedModelRevisionSpec and RequestedImageIdentifier types to handle optional field values from API requests
  • Added merge logic to combine service-definition.toml overrides with API request values
  • Refactored DeploymentCreator into DeploymentCreator and FinalDeploymentCreator to separate the request and final states
  • Moved ModelServiceDefinition from model_serving.types to deployment.types for better organization

Reviewed Changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/manager/services/deployment/actions/test_service_definition_merge.py Comprehensive test suite for service definition merge scenarios including success cases and validation errors
tests/manager/services/deployment/actions/BUILD Build configuration for new test directory
tests/manager/services/deployment/BUILD Build configuration for deployment service tests
src/ai/backend/manager/sokovan/deployment/types.py Updated import path for RouteStatus
src/ai/backend/manager/sokovan/deployment/route/handlers/*.py Updated import paths for RouteStatus across route handlers
src/ai/backend/manager/sokovan/deployment/deployment_controller.py Refactored to use new DeploymentCreator and generate ModelRevisionSpec from RequestedModelRevisionSpec
src/ai/backend/manager/sokovan/deployment/definition_generator/*.py Updated all generators to accept RequestedModelRevisionSpec and return ModelRevisionSpec
src/ai/backend/manager/services/model_serving/services/model_serving.py Added validation for required fields (image, architecture, resources) before image resolution
src/ai/backend/manager/repositories/deployment/types/endpoint.py Updated imports for EndpointLifecycle and RouteStatus
src/ai/backend/manager/repositories/deployment/storage_source/storage_source.py Updated import path for ModelServiceDefinition
src/ai/backend/manager/repositories/deployment/repository.py Changed to accept FinalDeploymentCreator instead of DeploymentCreator
src/ai/backend/manager/repositories/deployment/db_source/db_source.py Changed to accept FinalDeploymentCreator instead of DeploymentCreator
src/ai/backend/manager/models/endpoint.py Changed to accept FinalDeploymentCreator instead of DeploymentCreator
src/ai/backend/manager/data/model_serving/types.py Removed ModelServiceDefinition (moved to deployment.types), made resources optional in ServiceConfig
src/ai/backend/manager/data/model_serving/creator.py Made image and architecture fields optional in ModelServiceCreator
src/ai/backend/manager/data/deployment/types.py Added ModelServiceDefinition, RequestedModelRevisionSpec, RequestedImageIdentifier, RequestedResourceSpec with merge logic
src/ai/backend/manager/data/deployment/creator.py Split DeploymentCreator into DeploymentCreator and FinalDeploymentCreator
src/ai/backend/manager/api/service.py Made image, architecture, and resources optional in API request model; added validation in dry-run method
changes/6751.fix.md Changelog entry for the fix

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 296 to 306
raise InvalidAPIParameters("Architecture must be specified for CMD runtime variant.")
if (
requested_image_identifier.canonical == ""
or requested_image_identifier.canonical is None
):
raise InvalidAPIParameters(
"Canonical image identifier must be specified for CMD runtime variant."
)
requested_resource_spec = self.resource_spec
if requested_resource_spec.resource_slots is None:
raise InvalidAPIParameters("Resource slots must be specified for CMD runtime variant.")
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error messages in to_model_revision_spec() incorrectly reference "CMD runtime variant" but this method is used by all runtime variants, not just CMD. The error messages should be generic.

Suggested changes:

  • Line 296: "Architecture must be specified."
  • Line 301-302: "Canonical image identifier must be specified."
  • Line 306: "Resource slots must be specified."
Suggested change
raise InvalidAPIParameters("Architecture must be specified for CMD runtime variant.")
if (
requested_image_identifier.canonical == ""
or requested_image_identifier.canonical is None
):
raise InvalidAPIParameters(
"Canonical image identifier must be specified for CMD runtime variant."
)
requested_resource_spec = self.resource_spec
if requested_resource_spec.resource_slots is None:
raise InvalidAPIParameters("Resource slots must be specified for CMD runtime variant.")
raise InvalidAPIParameters("Architecture must be specified.")
if (
requested_image_identifier.canonical == ""
or requested_image_identifier.canonical is None
):
raise InvalidAPIParameters(
"Canonical image identifier must be specified."
)
requested_resource_spec = self.resource_spec
if requested_resource_spec.resource_slots is None:
raise InvalidAPIParameters("Resource slots must be specified.")

Copilot uses AI. Check for mistakes.
Comment on lines 278 to 287
if self.execution.environ:
self.execution.environ.update(service_definition.environ)
else:
self.execution.environ = service_definition.environ

return ModelRevisionSpec(
image_identifier=image_identifier,
resource_spec=resource_spec,
mounts=self.mounts,
execution=self.execution,
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execution.environ dictionary is being mutated in place within to_model_revision_spec_with_service_definition(). This could lead to unexpected side effects since the original RequestedModelRevisionSpec object's execution.environ will be modified.

Consider creating a new dictionary instead:

if service_definition.environ is not None:
    if self.execution.environ:
        merged_environ = {**self.execution.environ, **service_definition.environ}
        self.execution.environ = merged_environ
    else:
        self.execution.environ = service_definition.environ

Or better yet, avoid mutating the self.execution object by creating a new ExecutionSpec with the merged environ.

Suggested change
if self.execution.environ:
self.execution.environ.update(service_definition.environ)
else:
self.execution.environ = service_definition.environ
return ModelRevisionSpec(
image_identifier=image_identifier,
resource_spec=resource_spec,
mounts=self.mounts,
execution=self.execution,
# Merge environments without mutating self.execution.environ
merged_environ = {}
if self.execution.environ:
merged_environ = {**self.execution.environ, **service_definition.environ}
else:
merged_environ = dict(service_definition.environ)
execution = ExecutionSpec(
command=self.execution.command,
args=self.execution.args,
environ=merged_environ,
working_dir=self.execution.working_dir,
entrypoint=self.execution.entrypoint,
user=self.execution.user,
group=self.execution.group,
umask=self.execution.umask,
capabilities=self.execution.capabilities,
privileged=self.execution.privileged,
shm_size=self.execution.shm_size,
tmpfs=self.execution.tmpfs,
devices=self.execution.devices,
sysctls=self.execution.sysctls,
extra_hosts=self.execution.extra_hosts,
security_opt=self.execution.security_opt,
read_only_root_fs=self.execution.read_only_root_fs,
stop_signal=self.execution.stop_signal,
stop_timeout=self.execution.stop_timeout,
healthcheck=self.execution.healthcheck,
init=self.execution.init,
isolation=self.execution.isolation,
runtime=self.execution.runtime,
resources=self.execution.resources,
hostname=self.execution.hostname,
domainname=self.execution.domainname,
mac_address=self.execution.mac_address,
network_mode=self.execution.network_mode,
networks=self.execution.networks,
dns=self.execution.dns,
dns_options=self.execution.dns_options,
dns_search=self.execution.dns_search,
labels=self.execution.labels,
log_config=self.execution.log_config,
restart_policy=self.execution.restart_policy,
syslog=self.execution.syslog,
volumes=self.execution.volumes,
volume_driver=self.execution.volume_driver,
working_dir_mount=self.execution.working_dir_mount,
secrets=self.execution.secrets,
configs=self.execution.configs,
env_files=self.execution.env_files,
extra=self.execution.extra,
)
else:
execution = self.execution
return ModelRevisionSpec(
image_identifier=image_identifier,
resource_spec=resource_spec,
mounts=self.mounts,
execution=execution,

Copilot uses AI. Check for mistakes.
@seedspirit seedspirit changed the title fix(BA-3030): Support service-definition.toml override with optional fields(image, arch, resource) feat(BA-3030): Support service-definition.toml override with optional fields(image, arch, resource) Nov 14, 2025
Comment on lines +529 to +533
if self.image is None or self.architecture is None or self.config.resources is None:
raise InvalidAPIParameters(
"Image, architecture, and resources must be specified for dry-run service creation"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation is wrong. It should only be validated for the overridden model definition after reading the value from model_definition_path.


@dataclass
class DeploymentCreator:
class FinalDeploymentCreator:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like this needs to be separated, could you let me know the reason for separating it?

Comment on lines 233 to 253
def to_model_revision_spec_with_service_definition(
self, service_definition: ModelServiceDefinition
) -> ModelRevisionSpec:
if service_definition.resource_slots is not None:
resource_spec = ResourceSpec(
cluster_mode=self.resource_spec.cluster_mode,
cluster_size=self.resource_spec.cluster_size,
resource_slots=service_definition.resource_slots,
resource_opts=self.resource_spec.resource_opts,
)
else:
if self.resource_spec.resource_slots is None:
raise InvalidAPIParameters(
"Resource slots must be specified in either requested model revision or model service definition."
)
resource_spec = ResourceSpec(
cluster_mode=self.resource_spec.cluster_mode,
cluster_size=self.resource_spec.cluster_size,
resource_slots=self.resource_spec.resource_slots,
resource_opts=self.resource_spec.resource_opts,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something seems to have gone wrong with the implementation right now.

request_user_id: uuid.UUID,
sudo_session_enabled: bool,
) -> DryRunModelServiceAction:
# TODO: Implement service definition override in dry-run

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
"""Action to create a new legacy deployment(Model Service)."""

creator: DeploymentCreator
request: DeploymentCreateRequest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is strange. It looks like a request is going in as an internal action.

vfolder_id=request.requested_model_revision.mounts.model_vfolder_id,
model_definition_path=request.requested_model_revision.mounts.model_definition_path,
)
print(model_revision.model_dump_json())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use print in production code.

Comment on lines +22 to +32
@override
async def validate_revision(self, revision: ModelRevisionSpec) -> None:
"""
Validate CUSTOM variant revision by checking model definition.
"""
definition_files: DefinitionFiles = (
await self._deployment_repository.fetch_definition_files(
vfolder_id=revision.mounts.model_vfolder_id,
model_definition_path=revision.mounts.model_definition_path,
)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I question whether this method is needed.

Comment on lines +162 to +166
def _override_revision(
self,
requested_revision: RequestedModelRevisionSpec,
service_definition: ModelServiceDefinition,
) -> ModelRevisionSpec:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to guarantee overriding something within the class structure?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentations comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make NewServiceRequestModel fields optional to support service-definition.toml override pattern

3 participants