Add an RFC For Job Execution Plugins to Enable Online Custom Scorers #2
mprahl wants to merge 3 commits into mlflow:main
Conversation
@B-Step62 @etirelli @TomeHirata could you please review this?
Force-pushed from 8a013d6 to 511c662
This depends on mlflow#2 and adds safe online scoring for custom scorers. Signed-off-by: mprahl <mprahl@users.noreply.github.com>
Co-authored-by: Humair Khan <HumairAK@users.noreply.github.com> Signed-off-by: mprahl <mprahl@users.noreply.github.com>
Force-pushed from 511c662 to 3524afe
```python
def recover_jobs(self, unfinished_job_ids: list[str]) -> list[JobRecoveryResult]: ...

@property
def scorer_capabilities(self) -> ScorerCapability: ...  # defaults to NONE and participates in backend routing
```
A job is a higher abstraction than a scorer execution job, and it's a bit odd that a job executor has a property for which scorer types are supported. If the intention is to tell whether UDFs are supported by the backend, can we have a boolean flag like `is_udf_supported`, or more generally a `capabilities` property that returns `["UDF"]`? Also, I wonder if we need this property from the beginning. Any job executor should be able to execute any Python function; it just has a different resource isolation level. For local development, users are free to use `SubprocessJobExecutor`, and for the remote tracking server, they can just switch to `DockerJobExecutor`/`K8sJobExecutor`, and this property is not used.
Good point. We don't want to restrict to just "scorer" jobs, so we could make this a generic capabilities property.
The reason why I had this was mostly to be able to automatically block custom scorer code if the job executor did not have isolation capabilities. On second thought, we can just let the admin opt in to custom scorers explicitly with an environment variable and/or mlflow server CLI flag.
I'll make that change but let me know if you have a different idea.
> On second thought, we can just let the admin opt in to custom scorers explicitly with an environment variable and/or `mlflow server` CLI flag.
Good point, I prefer to remove capabilities from the job executor interface altogether and support routing or executor<>job type mapping later if explicitly requested.
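The explicit admin opt-in floated above could be sketched roughly as follows; the environment variable name `MLFLOW_ALLOW_CUSTOM_SCORER_CODE` and both helper names are hypothetical, not existing MLflow settings:

```python
import os

# Hypothetical opt-in gate: MLFLOW_ALLOW_CUSTOM_SCORER_CODE and both helpers
# are illustrative names, not existing MLflow settings.
def custom_scorer_code_allowed() -> bool:
    return os.environ.get("MLFLOW_ALLOW_CUSTOM_SCORER_CODE", "false").lower() in ("1", "true")

def submit_scorer_job(is_custom_code: bool) -> str:
    # Built-in scorers always run; custom code requires the explicit opt-in.
    if is_custom_code and not custom_scorer_code_allowed():
        raise PermissionError("Custom scorer code is disabled; the admin must opt in explicitly")
    return "submitted"
```

The same gate could equally be driven by an `mlflow server` CLI flag that sets the variable.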
```python
tracking_uri: str
gateway_uri: str | None = None  # optional MLflow AI Gateway base URI reachable from the job runtime
```
I thought you could start the AI Gateway separately with `mlflow gateway start`, so this was to account for deployments where the tracking server and the gateway run on separate servers. If that's not common, I can remove it.
Let me know your preference!
`mlflow gateway start` is a legacy gateway product that we no longer promote. The new gateway feature exposes the gateway endpoints on the tracking server directly, so I'd recommend we just ask users to set `tracking_uri` only.
> - `remote_execution` answers whether the job runs through the local direct-store path or through the remote executor contract
>
> This distinction matters for `optimize_prompts_job`. It is not arbitrary custom Python in the same way that a custom
I'd refactor `optimize_prompts_job` rather than defining the job executor interface based on how `optimize_prompts_job` works. By the way, this is also an issue for online scorers that don't use the MLflow gateway.
@TomeHirata, my original thought was to not cause a breaking change for non-gateway users. I also don't think you can do online scoring without the MLflow AI Gateway today, but you can do one-off evaluations through the UI using the direct provider path. I may be wrong about that.
So maybe, if the plugin's `remote_execution()` returns True, we can disallow non-gateway usage. Then for existing users, it's not a breaking change, because they would still use the default subprocess executor backend, which does support local execution.
> I also don't think you can do online scoring without the MLflow AI Gateway today
This is true for the UI path, but I believe users can register judges via the Python API.
> So maybe, if the plugin has `remote_execution()` return True, we can disallow non-gateway usage
Yeah, I can try adding this validation if that's not difficult. Otherwise, we can document the limitation and raise an error at runtime if the direct provider API is used and the secret is missing.
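A minimal sketch of that submission-time validation, assuming a hypothetical helper rather than any existing MLflow API:

```python
def validate_scorer_submission(remote_execution: bool, uses_gateway: bool) -> None:
    """Hypothetical check: with a remote executor, only gateway-backed models
    (gateway:/...) are allowed so provider secrets never leave the server."""
    if remote_execution and not uses_gateway:
        raise ValueError("Remote job executors require a gateway-backed model URI (gateway:/...)")
```

The default subprocess executor would report `remote_execution=False`, so existing non-gateway users pass through unchanged.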
> - **Job row claim**: the worker's conditional `PENDING -> RUNNING` transition that gives one MLflow instance ownership of a queued job row
> - **Exclusivity lock**: the higher-level lock stored in `job_locks`, typically for a key such as an experiment ID
Do we have any concrete use cases for this higher level locking (e.g., experiment id)?
This is to support the `exclusive` argument in the job decorator for `run_online_trace_scorer_job` and `run_online_session_scorer_job`. They don't allow running multiple jobs per experiment ID. By bringing this to a database-level lock, we can replicate the same locking that exists in Huey today, except it would now support multiple MLflow replicas.
Got it. @dbczumar, what was the main motivation to bring the resource-based exclusion to the online scorer?
Concurrent job executions could result in duplicate logs (e.g., MLflow assessments) for jobs that process traces from a particular experiment, such as the online scoring job.
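The database-level exclusivity lock discussed here can be sketched against a `job_locks` table whose primary key is the lock key, so acquisition is a single atomic `INSERT`. This is an illustrative SQLite sketch, not the proposed schema:

```python
import sqlite3

# Toy job_locks table; the primary key constraint makes acquisition atomic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_locks (lock_key TEXT PRIMARY KEY, holder TEXT NOT NULL)")

def try_acquire(lock_key: str, holder: str) -> bool:
    try:
        conn.execute("INSERT INTO job_locks (lock_key, holder) VALUES (?, ?)", (lock_key, holder))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # another replica already holds the exclusivity lock

def release(lock_key: str, holder: str) -> None:
    conn.execute("DELETE FROM job_locks WHERE lock_key = ? AND holder = ?", (lock_key, holder))
    conn.commit()
```

Because the lock lives in the shared tracking database, the same exclusion works across multiple MLflow replicas, unlike Huey's in-process lock helper.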
> - **Job lease**: the short-lived `RUNNING`-job lease tracked by `lease_expires_at`, used to detect stale monitored work
> - **Scheduler lease**: the single-leader discovery lease stored in `scheduler_leases`
>
> `JobLockManager` replaces Huey's lock helper and keeps the existing lock key computation model. Lock acquisition is an
How do we plan to implement the job queue based on the jobs table and the `job_locks` table in a multi-replica setting? Implementing a high-performing multi-process queue is non-trivial, and that's part of why Huey was selected over a database-based job queue implementation.
I was wondering about possibly offloading this capability to a third party. However, Huey's current support for distributed task-based queuing seems limited to mostly Redis, which adds another major dependency. We would like to give users the option to just leverage their existing MLflow DB to reduce deployment overhead, but Huey doesn't seem to have proper support for SQL-based DBs.
We will evaluate the feasibility of leveraging this or something similar for task-based queuing and locking and follow up.
So we looked into other alternatives. We were unable to find a strong existing alternative that cleanly fits all of our requirements: no additional infrastructure beyond the existing MLflow DB, multi-replica execution, SQL-backed durability/coordination, portability across PostgreSQL/MySQL/MSSQL, and preserving a built-in OSS/local experience.
The alternatives we discussed each seem to miss at least one of those constraints. Huey has SQL-backed storage, but not the broader distributed coordination model we need here. Celery would introduce an external broker. Other alternatives are specific to databases like PostgreSQL.
So given the current constraints, I think the proposal is the right direction. I do agree this means we are taking on distributed queue / locking / recovery correctness as MLflow product scope.
@TomeHirata, what specific implementation detail would be most helpful to spell out next so we can move forward with implementation?
I'll add:
- Huey supports SQLite, but I didn't see support for other database types.
- Celery does support a SQLAlchemy plugin, but it would not solve the `exclusive` lock mechanism that MLflow jobs already have on the experiment.
It seems Huey supports other types of SQL-backed durability (https://github.com/coleifer/huey/blob/master/huey/contrib/sql_huey.py); wasn't this enough?
I added a section for a hybrid approach that leverages Huey.
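For reference, the conditional `PENDING -> RUNNING` job row claim that the multi-replica design relies on amounts to a single guarded `UPDATE` against the shared tracking database. A toy SQLite sketch (table and column names are illustrative):

```python
import sqlite3

# Toy jobs table: one PENDING row that multiple workers race to claim.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'PENDING')")
conn.commit()

def claim_job(job_id: str) -> bool:
    # The WHERE clause makes the transition conditional: only one worker's
    # UPDATE matches the PENDING row, so rowcount tells us who won the claim.
    cur = conn.execute(
        "UPDATE jobs SET status = 'RUNNING' WHERE job_id = ? AND status = 'PENDING'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

The same guarded-update pattern extends to lease renewal (`WHERE lease_expires_at < now`) for recovering stale jobs.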
> `JobExecutionContext.workspace`, while executors themselves remain workspace-unaware.
>
> Multi-replica coordination assumes a transactional tracking database such as PostgreSQL, MySQL, or MSSQL. SQLite is acceptable for single-process local use, but it is not a safe foundation for multi-replica lease and lock coordination.
nit: I overall agree with the statement, but note that by default, `mlflow server` spins up multiple uvicorn workers.
> Each job token is granted only the permissions needed for the job that owns it:
>
> - `EDIT` on the target experiment
I wonder if we are creating an attack vector that allows attackers to access resources they are not permitted to through job execution. Also, the current design requires us to list the required permissions for each job type, but identifying all required permissions for complex jobs like prompt optimization is not trivial.
So I wonder if we should just carry the caller's permission. Concretely,
- When a job is submitted, we authenticate the user and generate a short-term token (job token)
- The user ID and the job token are included in the HTTP header when the job executor calls the tracking server
- The tracking server verifies the token and authorizes the request based on the user's permissions
Thanks for calling this out. I agree that identifying the required permissions is not trivial with the current shape of the job, especially for optimize_prompts_job, because some of that dependency resolution still happens inside the job at execution time.
That said, it seems like we can make this workable with a refactor rather than by carrying the caller's full permissions at runtime. In particular, if we move more of the dependency/resource resolution for optimize_prompts_job to the server side at submission time instead of inside the job body, we should be able to determine the required resources up front without causing breaking changes for existing users. For remote job executors, we can also require gateway-backed model usage so the set of required permissions stays explicit and bounded.
I still think we should keep the remote execution path least-privileged. We also should add a check to ensure that the user creating an online scorer already holds the permissions required for the job to run, which should help prevent privilege escalation.
There is still some residual risk here because MLflow permissions are not scoped at the run level today. So a token scoped to an experiment could theoretically read or modify other runs in that experiment that it does not strictly need. That is not ideal, but I think it is an acceptable and explicit limitation for now, and still safer than giving the job the caller's full live permissions.
I'll work on a section in the doc that proposes refactoring `optimize_prompts_job`.
Thanks. Yeah, it's ideal to verify all required permissions at the submission time based on the job's business logic and the caller's permission. I just called out that bringing the user's permission is probably the most efficient way to avoid breaching the caller's permission. If we think we can check the required permission for all job types at the submission time, let's document the decision and what types of refactoring are necessary.
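The submission-time check being converged on here can be sketched as: each job type declares its required permissions up front, and submission fails unless the caller already holds them. All names below are illustrative, not MLflow APIs:

```python
# Illustrative submission-time permission check; job types and permission
# tuples are hypothetical.
REQUIRED_PERMISSIONS: dict[str, set[tuple[str, str]]] = {
    "online_scorer": {("experiment", "EDIT")},
    "prompt_optimization": {("experiment", "EDIT"), ("prompt", "READ")},
}

def check_submission(job_type: str, caller_permissions: set[tuple[str, str]]) -> None:
    # Reject at submission time so the job token never exceeds what the caller holds.
    missing = REQUIRED_PERMISSIONS[job_type] - caller_permissions
    if missing:
        raise PermissionError(f"caller lacks required permissions: {sorted(missing)}")
```

Passing this check would also bound the scope of the short-lived job token to the declared set.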
> This is one of the core security benefits of the remote model. The remote backend gets a scoped token, not broad provider credentials.
>
> `optimize_prompts_job` is excluded from this path by design. It still participates in the common framework, but remains
Ditto, we should be able to use `gateway:/...` in `optimize_prompts_job` too.
> ## Drawbacks
>
> 1. This proposal moves more logic into the core MLflow job framework. Huey previously hid some of that complexity.
This is a bit concerning. cc: @WeichenXu123 who made the decision for huey
> # Open questions
>
> 1. Should `python_env` remain part of the `@job` decorator contract? It is currently unused in practice, and keeping it
IIRC, `python_env` is for installing extra packages required for the job. Don't we still need this if we want to allow users to use extra packages in the remote executor?
@TomeHirata the main reason for the open question is to simplify things by no longer allowing the Python version to be specified. We'd still want the extra packages though.
Got it. How do we handle an edge case where the Python version of the job executor cannot install the required packages for the job?
```python
def start_executor(self) -> None: ...

def stop_executor(self) -> None: ...
```
The intent was to allow daemons/processes to shut down gracefully on server shutdown. I'll add a note in the doc. The main motivation is so that each plugin implementation doesn't have to track the server process state to determine when to clean up.
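A toy illustration of those lifecycle hooks, assuming a thread-backed executor (the actual plugins would manage subprocesses, containers, or remote resources):

```python
import queue
import threading

class ToyJobExecutor:
    """Illustrative executor: stop_executor() drains the worker on server shutdown."""

    def __init__(self) -> None:
        self._tasks: queue.Queue = queue.Queue()
        self._stop = threading.Event()
        self._worker: threading.Thread | None = None

    def start_executor(self) -> None:
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, task) -> None:
        self._tasks.put(task)

    def _run(self) -> None:
        # Poll with a timeout so the stop event is observed promptly on shutdown.
        while not self._stop.is_set():
            try:
                self._tasks.get(timeout=0.1)()
            except queue.Empty:
                continue

    def stop_executor(self) -> None:
        # Called by the server on shutdown; the plugin never has to watch
        # server process state itself.
        self._stop.set()
        if self._worker is not None:
            self._worker.join()
```

The server owning both calls keeps the cleanup contract in one place instead of in every plugin.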
```python
@abstractmethod
def cancel_job(self, job_id: str) -> None: ...

def recover_jobs(self, unfinished_job_ids: list[str]) -> list[JobRecoveryResult]: ...
```
Shouldn't `recover_jobs` also be an abstract method?
Good point, I can't see a reason not to make it one. Thanks!
```python
gateway_uri: str | None = None  # optional MLflow AI Gateway base URI reachable from the job runtime
token: str | None = None  # used by remote executors
workspace: str | None = None
pip_config: PipConfig | None = None  # pip install settings for local or remote runtimes
```
Can we reuse the existing `_PythonEnv` data model?
Is your preference to expand `_PythonEnv` and reuse it here, given it doesn't have the PyPI index configuration that the proposed `PipConfig` does?
These fields in `JobExecutionContext` are immutable; why not just treat them as part of the job params?
If `JobExecutionContext` is used by `JobExecutor`, can we pass `JobExecutionContext` to `JobExecutor`'s constructor instead?
@WeichenXu123 good questions. My intent with JobExecutionContext is to keep framework-owned runtime metadata separate from the job's logical params.
Even if fields like tracking_uri, workspace, and the remote token are immutable for a given run, they are not part of the job function's business input. Some are deployment-derived and some are framework-generated at execution time, so putting them into params would blur the boundary between user/job inputs and runtime/executor metadata.
For the same reason, I don't think they belong on the executor constructor either, since the executor instance is deployment-scoped while this context is per job run. I think submit_job(..., context=...) is still the right shape.
That said, I'll go ahead and update the proposal to extend _PythonEnv rather than introduce a separate PipConfig type.
Signed-off-by: mprahl <mprahl@users.noreply.github.com>
@TomeHirata thanks for the review! I addressed your comments or replied to them. Could you please take another look?
```python
fn_fullname: str,
params: dict[str, Any],
context: JobExecutionContext,
python_env: Any | None = None,
```
We can already configure `python_env` for a given job function; do we need to support configuring it per individual job run?
This is the `python_env` on the job decorator, passed down to the backend executor plugin.
> 4. Land the Remote Executors RFC after the core abstractions are approved, or review it in parallel if it helps make the core contract clearer.
Feel free to file the follow-up Remote Executors RFC. I want to review them together and see if anything in the current RFC needs to improve.
@WeichenXu123 thanks for offering to review! It's at #3.
Force-pushed from 6bc13da to c749254
Signed-off-by: mprahl <mprahl@users.noreply.github.com>
Force-pushed from c749254 to 88059cf
This is the core design. The follow-up for remote scorers supporting custom scoring securely is in #3.