Add an RFC For Job Execution Plugins to Enable Online Custom Scorers #2

Open
mprahl wants to merge 3 commits into mlflow:main from mprahl:online-scoring-plugin

Conversation


@mprahl mprahl commented Mar 20, 2026

This is the core design. The follow-up for remote scorers to support custom scoring securely is in #3.

Author

mprahl commented Mar 20, 2026

@B-Step62 @etirelli @TomeHirata could you please review this?

@mprahl mprahl force-pushed the online-scoring-plugin branch from 8a013d6 to 511c662 Compare March 20, 2026 18:07
mprahl added a commit to mprahl/mlflow-rfcs that referenced this pull request Mar 20, 2026
This depends on mlflow#2 and adds safe online scoring for custom scorers.

Signed-off-by: mprahl <mprahl@users.noreply.github.com>
Co-authored-by: Humair Khan <HumairAK@users.noreply.github.com>
Signed-off-by: mprahl <mprahl@users.noreply.github.com>
@mprahl mprahl force-pushed the online-scoring-plugin branch from 511c662 to 3524afe Compare March 20, 2026 18:11
mprahl added a commit to mprahl/mlflow-rfcs that referenced this pull request Mar 20, 2026
This depends on mlflow#2 and adds safe online scoring for custom scorers.

Signed-off-by: mprahl <mprahl@users.noreply.github.com>
def recover_jobs(self, unfinished_job_ids: list[str]) -> list[JobRecoveryResult]: ...

@property
def scorer_capabilities(self) -> ScorerCapability: ... # defaults to NONE and participates in backend routing
@TomeHirata TomeHirata Mar 23, 2026

A job is a higher abstraction than a scorer execution job, and it's a bit odd that a job executor has a property for what scorer type is supported. If the intention is to tell if UDF is supported by the backend or not, can we have a boolean flag like is_udf_supported, or more generally capabilities property that returns ["UDF"]. Also, I wonder if we need this property from the beginning. Any job executor should be able to execute any Python function, and it just has a different resource isolation level. For local development, users are free to use SubprocessJobExecutor, and for the remote tracking server, they can just switch to DockerJobExecutor/K8sJobExecutor, and this property is not used.

Author

Good point. We don't want to restrict to just "scorer" jobs, so we could make this a generic capabilities property.

The reason why I had this was mostly to be able to automatically block custom scorer code if the job executor did not have isolation capabilities. On second thought, we can just let the admin opt in to custom scorers explicitly with an environment variable and/or mlflow server CLI flag.

I'll make that change but let me know if you have a different idea.
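The admin opt-in mentioned above could be sketched as follows. This is only an illustration: the environment variable name and helper functions are hypothetical, not the flag the RFC will finalize.

```python
import os

# Hypothetical opt-in gate: the variable name and helpers are illustrative,
# not the final flag proposed for `mlflow server`.
ALLOW_CUSTOM_SCORERS_VAR = "MLFLOW_ALLOW_CUSTOM_SCORER_CODE"

def custom_scorer_code_allowed() -> bool:
    """True only when the admin explicitly opted in via the environment."""
    return os.environ.get(ALLOW_CUSTOM_SCORERS_VAR, "false").lower() in ("1", "true")

def validate_scorer_submission(uses_custom_code: bool) -> None:
    # reject custom scorer code unless the deployment explicitly enabled it
    if uses_custom_code and not custom_scorer_code_allowed():
        raise PermissionError(
            f"Custom scorer code is disabled; set {ALLOW_CUSTOM_SCORERS_VAR}=true to enable"
        )
```

The same check could back an `mlflow server` CLI flag that simply sets the environment variable before workers start.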


> On second thought, we can just let the admin opt in to custom scorers explicitly with an environment variable and/or mlflow server CLI flag.

Good point, I prefer to remove capabilities from the job executor interface altogether and support routing or executor<>job type mapping later if explicitly requested.

Comment on lines +183 to +184
tracking_uri: str
gateway_uri: str | None = None # optional MLflow AI Gateway base URI reachable from the job runtime


These two URIs should be identical.

Author

I thought you could start the AI Gateway separately with mlflow gateway start, so this was to account for the case where the tracking server and gateway are deployed on separate servers. If that's not common, I can remove it.

Let me know your preference!


mlflow gateway start is a legacy gateway product that we don't promote anymore. The new gateway feature exposes the gateway endpoints on the tracking server directly, so I'd recommend we just ask users to set tracking_uri only.

- `remote_execution` answers whether the job runs through the local direct-store path or through the remote executor contract

This distinction matters for `optimize_prompts_job`. It is not arbitrary custom Python in the same way that a custom


I'd refactor optimize_prompts_job rather than defining the job executor interface based on how optimize_prompts_job works. Btw, this is also an issue for online scorers that don't use MLflow gateway.

Author

@TomeHirata, my original thought was to not cause a breaking change for non-gateway users. I also don't think you can do online scoring without the MLflow AI Gateway today, but you can do one-off evaluations through the UI using direct. I may be wrong about that.

So maybe, if the plugin has remote_execution() return True, we can disallow non-gateway usage. Then for existing users, it's not a breaking change, because they would still use the default subprocess executor backend which does support local.
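The proposal above could look like the following sketch. The `remote_execution` flag and the `gateway:/` URI scheme come from the discussion; the function name and exact check are illustrative.

```python
# Sketch of the floated validation: when the configured executor reports
# remote execution, reject judge models that are not gateway-backed.
# The function name and message are illustrative, not the RFC's final API.
def validate_judge_model(remote_execution: bool, model_uri: str) -> None:
    if remote_execution and not model_uri.startswith("gateway:/"):
        raise ValueError(
            "Remote job executors require gateway-backed models; "
            f"got {model_uri!r} instead of a 'gateway:/...' URI"
        )
```

With this shape, existing non-gateway users on the default subprocess executor hit the `remote_execution=False` path and see no behavior change.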


> I also don't think you can do online scoring without the MLflow AI Gateway today

This is true for the UI path, but I believe users can register judges via the Python API.

> So maybe, if the plugin has remote_execution() return True, we can disallow non-gateway usage

Yeah, I can try adding this validation if that's not difficult. Otherwise, we can document the limitation and raise an error at runtime if the direct provider API is used and the secret is missing.


- **Job row claim**: the worker's conditional `PENDING -> RUNNING` transition that gives one MLflow instance ownership
of a queued job row
- **Exclusivity lock**: the higher-level lock stored in `job_locks`, typically for a key such as an experiment ID
@TomeHirata TomeHirata Mar 23, 2026

Do we have any concrete use cases for this higher level locking (e.g., experiment id)?

Author

This is to support the exclusive argument in the job decorator for run_online_trace_scorer_job and run_online_session_scorer_job. They don't allow multiple concurrent runs per experiment ID. By bringing this to a database-level lock, we can replicate the locking that exists in Huey today, except it would now support multiple MLflow replicas.
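A database-level exclusivity lock of this kind is commonly built on a unique constraint. The sketch below uses SQLite purely for illustration (the RFC itself notes SQLite is not a safe multi-replica foundation); the `job_locks` schema and function names are assumptions, but the insert-under-unique-key pattern carries over to PostgreSQL/MySQL/MSSQL.

```python
import sqlite3

# Minimal sketch of a database-level exclusivity lock keyed by experiment ID,
# standing in for the proposed job_locks table; the schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_locks (lock_key TEXT PRIMARY KEY, owner TEXT)")

def try_acquire(conn, lock_key, owner):
    try:
        # the PRIMARY KEY constraint makes acquisition atomic across replicas
        conn.execute("INSERT INTO job_locks VALUES (?, ?)", (lock_key, owner))
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # another holder already has the key
        return False

def release(conn, lock_key, owner):
    conn.execute(
        "DELETE FROM job_locks WHERE lock_key = ? AND owner = ?", (lock_key, owner)
    )
    conn.commit()
```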


Got it. @dbczumar, what was the main motivation to bring the resource-based exclusion to the online scorer?


Concurrent job executions could result in duplicate logs (e.g., MLflow assessments) for jobs that process traces from a particular experiment, such as the online scoring job.
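The job row claim itself (the conditional `PENDING -> RUNNING` transition) reduces to a single guarded UPDATE whose affected-row count tells a worker whether it won. SQLite and the table shape below are illustrative only; the pattern is the same on the transactional databases the RFC targets.

```python
import sqlite3

# Sketch of the conditional PENDING -> RUNNING job-row claim: one guarded
# UPDATE so exactly one replica wins the row. The table shape is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT, owner TEXT)")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'PENDING', NULL)")
conn.commit()

def try_claim(conn, job_id, worker):
    cur = conn.execute(
        "UPDATE jobs SET status = 'RUNNING', owner = ? "
        "WHERE job_id = ? AND status = 'PENDING'",
        (worker, job_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 means another worker claimed the row first
```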

- **Job lease**: the short-lived `RUNNING`-job lease tracked by `lease_expires_at`, used to detect stale monitored work
- **Scheduler lease**: the single-leader discovery lease stored in `scheduler_leases`

`JobLockManager` replaces Huey's lock helper and keeps the existing lock key computation model. Lock acquisition is an
@TomeHirata TomeHirata Mar 23, 2026

How do we plan to implement the job queue based on the job table and the job_locks table in a multi-replica setting? Implementing a high-performing multi-process queue is non-trivial, and that's part of why huey was selected instead of a database-based job queue implementation.

@HumairAK HumairAK Mar 23, 2026

I'm wondering about offloading this capability to a third party. However, Huey's current support for distributed task-based queuing seems mostly limited to Redis, which adds another major dependency. We would like to give users the option to leverage their existing MLflow DB to reduce deployment overhead, but Huey doesn't seem to have proper support for SQL-based DBs.

We will evaluate the feasibility of leveraging this or something similar for task-based queuing and locking and follow up.


So we looked into other alternatives. We were unable to find a strong existing alternative that cleanly fits all of our requirements: no additional infrastructure beyond the existing MLflow DB, multi-replica execution, SQL-backed durability/coordination, portability across PostgreSQL/MySQL/MSSQL, and preserving a built-in OSS/local experience.

The alternatives we discussed each seem to miss at least one of those constraints. Huey has SQL-backed storage, but not the broader distributed coordination model we need here. Celery would introduce an external broker. Other alternatives are specific to databases like PostgreSQL.

So given the current constraints, I think the proposal is the right direction. I do agree this means we are taking on distributed queue / locking / recovery correctness as MLflow product scope.

@TomeHirata, what specific implementation detail would be most helpful to spell out next so we can move forward with implementation?

Author

I'll add:

  1. Huey supports SQLite but I didn't see support for other database types.
  2. Celery does support a SQLAlchemy plugin, but it would not provide the per-experiment exclusive lock mechanism that MLflow jobs already have.

@TomeHirata TomeHirata Mar 24, 2026

It seems Huey supports other types of SQL-backed durability (https://github.com/coleifer/huey/blob/master/huey/contrib/sql_huey.py), wasn't this enough?

Author

I added a section for a hybrid approach that leverages Huey.

`JobExecutionContext.workspace`, while executors themselves remain workspace-unaware.

Multi-replica coordination assumes a transactional tracking database such as PostgreSQL, MySQL, or MSSQL. SQLite is
acceptable for single-process local use, but it is not a safe foundation for multi-replica lease and lock coordination.
@TomeHirata TomeHirata Mar 23, 2026

nit: I overall agree with the statement, but note that by default, mlflow server spins up multiple uvicorn workers.


Each job token is granted only the permissions needed for the job that owns it:

- `EDIT` on the target experiment


I wonder if we create an attack vector that allows attackers to access a resource they are not permitted to through job execution. Also, the current design requires us to list required permissions for each job type, but identifying all required permissions for complex jobs like prompt optimization is not trivial.
So I wonder if we should just carry the caller's permission. Concretely,

  1. When a job is submitted, we authenticate the user and generate a short-term token (job token)
  2. The user ID and the job token are included in the HTTP header when the job executor calls the tracking server
  3. The tracking server verifies the token and authorizes the request based on the user's permissions
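Steps 1-3 above can be sketched with a short-lived HMAC-signed token. Everything here is illustrative (the signing key source, claim names, and wire format are assumptions, not the RFC's design), but it shows the issue/verify split between submission time and the tracking server.

```python
import base64, hashlib, hmac, json, time

# Illustrative short-lived job token: an HMAC-signed payload carrying the user
# ID and an expiry. All names and the wire format are a sketch, not the RFC's.
SECRET = b"server-side-signing-key"  # would come from server configuration

def issue_job_token(user_id: str, ttl_seconds: int = 300) -> str:
    payload = json.dumps({"sub": user_id, "exp": time.time() + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify_job_token(token: str) -> str:
    encoded, sig = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(encoded.encode())
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("invalid job token signature")
    claims = json.loads(payload)
    if claims["exp"] < time.time():
        raise PermissionError("job token expired")
    return claims["sub"]  # the user ID the tracking server authorizes against
```

The executor only ever forwards the opaque token; the tracking server resolves it back to a user and applies that user's permissions.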

Author

Thanks for calling this out. I agree that identifying the required permissions is not trivial with the current shape of the job, especially for optimize_prompts_job, because some of that dependency resolution still happens inside the job at execution time.

That said, it seems like we can make this workable with a refactor rather than by carrying the caller's full permissions at runtime. In particular, if we move more of the dependency/resource resolution for optimize_prompts_job to the server side at submission time instead of inside the job body, we should be able to determine the required resources up front without causing breaking changes for existing users. For remote job executors, we can also require gateway-backed model usage so the set of required permissions stays explicit and bounded.

I still think we should keep the remote execution path least-privileged. We also should add a check to ensure that the user creating an online scorer already holds the permissions required for the job to run, which should help prevent privilege escalation.

There is still some residual risk here because MLflow permissions are not scoped at the run level today. So a token scoped to an experiment could theoretically read or modify other runs in that experiment that it does not strictly need. That is not ideal, but I think it is an acceptable and explicit limitation for now, and still safer than giving the job the caller's full live permissions.

I'll work on a section in the doc that proposes refactoring optimize_prompts_job.


Thanks. Yeah, it's ideal to verify all required permissions at the submission time based on the job's business logic and the caller's permission. I just called out that bringing the user's permission is probably the most efficient way to avoid breaching the caller's permission. If we think we can check the required permission for all job types at the submission time, let's document the decision and what types of refactoring are necessary.

This is one of the core security benefits of the remote model. The remote backend gets a scoped token, not broad
provider credentials.

`optimize_prompts_job` is excluded from this path by design. It still participates in the common framework, but remains


ditto, we should be able to use gateway:/... in optimize_prompts_job too.


## Drawbacks

1. This proposal moves more logic into the core MLflow job framework. Huey previously hid some of that complexity.


This is a bit concerning. cc: @WeichenXu123 who made the decision for huey


# Open questions

1. Should `python_env` remain part of the `@job` decorator contract? It is currently unused in practice, and keeping it


IIRC, python_env is for installing extra packages required by the job. Don't we still need this if we want to allow users to use extra packages in the remote executor?

Author

@TomeHirata the main reason for the open question is to simplify things by no longer allowing the Python version to be specified. We'd still want the extra packages though.


Got it. How do we handle an edge case where the Python version of the job executor cannot install the required packages for the job?


def start_executor(self) -> None: ...

def stop_executor(self) -> None: ...


q: when is stop_executor called?

Author

The intent was to call it on server shutdown so that any daemons/processes can shut down gracefully. I'll add a note in the doc. The main motivation is that each plugin implementation doesn't have to track the server process state to determine when to clean up.
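One minimal way to get that shutdown behavior is to register the cleanup at startup, so plugins never watch server state themselves. The class and registration point below are illustrative, not the RFC's final wiring.

```python
import atexit

# Sketch: the framework registers stop_executor at start so plugins need not
# track server process state. Class and wiring are illustrative assumptions.
class SubprocessJobExecutor:
    def __init__(self):
        self.running = False

    def start_executor(self) -> None:
        self.running = True
        # run stop_executor automatically when the server process exits
        atexit.register(self.stop_executor)

    def stop_executor(self) -> None:
        # terminate worker daemons/subprocesses gracefully here
        self.running = False
```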

@abstractmethod
def cancel_job(self, job_id: str) -> None: ...

def recover_jobs(self, unfinished_job_ids: list[str]) -> list[JobRecoveryResult]: ...


Shouldn't recover_jobs also be an abstract method?

Author

Good point, I can't see a reason not to make it one. Thanks!
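Making recover_jobs abstract means incomplete backends fail at instantiation rather than at recovery time. A sketch, with signatures following the RFC excerpt and JobRecoveryResult stubbed as a plain list element for illustration:

```python
from abc import ABC, abstractmethod

# Sketch of the interface with recover_jobs made abstract, per the thread
# above. JobRecoveryResult is stubbed out; only the shape is illustrated.
class JobExecutor(ABC):
    @abstractmethod
    def cancel_job(self, job_id: str) -> None: ...

    @abstractmethod
    def recover_jobs(self, unfinished_job_ids: list) -> list:
        """Each backend must decide how to reconcile unfinished jobs."""

class IncompleteExecutor(JobExecutor):
    # recover_jobs is intentionally missing, so instantiation fails
    def cancel_job(self, job_id: str) -> None:
        pass
```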

gateway_uri: str | None = None # optional MLflow AI Gateway base URI reachable from the job runtime
token: str | None = None # used by remote executors
workspace: str | None = None
pip_config: PipConfig | None = None # pip install settings for local or remote runtimes
@TomeHirata TomeHirata Mar 23, 2026

Can we reuse the existing _PythonEnv data model?

Author

Is your preference to expand _PythonEnv and reuse it here, since it doesn't have the configuration for the PyPI index that the proposed PipConfig does?


These fields in JobExecutionContext are immutable; why not just treat them as part of the job params?


If JobExecutionContext is used by JobExecutor, can we pass JobExecutionContext to JobExecutor's constructor instead?

Author

@WeichenXu123 good questions. My intent with JobExecutionContext is to keep framework-owned runtime metadata separate from the job's logical params.

Even if fields like tracking_uri, workspace, and the remote token are immutable for a given run, they are not part of the job function's business input. Some are deployment-derived and some are framework-generated at execution time, so putting them into params would blur the boundary between user/job inputs and runtime/executor metadata.

For the same reason, I don't think they belong on the executor constructor either, since the executor instance is deployment-scoped while this context is per job run. I think submit_job(..., context=...) is still the right shape.

That said, I'll go ahead and update the proposal to extend _PythonEnv rather than introduce a separate PipConfig type.
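An extension along those lines might look like the sketch below. This is a hypothetical shape: _PythonEnv's real fields are not reproduced here, and the field and method names are assumptions used only to show where pip index settings could live.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical extension of the internal _PythonEnv data model with the pip
# index settings the proposed PipConfig carried; all names are illustrative.
@dataclass
class PythonEnvWithPip:
    dependencies: list = field(default_factory=list)
    index_url: Optional[str] = None          # custom PyPI index
    extra_index_urls: list = field(default_factory=list)
    trusted_hosts: list = field(default_factory=list)

    def pip_args(self) -> list:
        """Render the env as arguments for a `pip install` invocation."""
        args = []
        if self.index_url:
            args += ["--index-url", self.index_url]
        for url in self.extra_index_urls:
            args += ["--extra-index-url", url]
        for host in self.trusted_hosts:
            args += ["--trusted-host", host]
        return args + self.dependencies
```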

Signed-off-by: mprahl <mprahl@users.noreply.github.com>
@mprahl mprahl requested a review from TomeHirata March 23, 2026 21:21

mprahl commented Mar 23, 2026

@TomeHirata thanks for the review! I addressed your comments or replied to them. Could you please take another look?

fn_fullname: str,
params: dict[str, Any],
context: JobExecutionContext,
python_env: Any | None = None,


We can already configure python_env for a given job function; do we need to support configuring python_env for an individual job run?

Author

This is the python_env on the job decorator passed down to the backend executor plugin.

Comment on lines +934 to +935
4. Land the Remote Executors RFC after the core abstractions are approved, or
review it in parallel if it helps make the core contract clearer.


Feel free to file the follow-up Remote Executors RFC. I want to review them together and see if anything in the current RFC needs improvement.

Author

@WeichenXu123 thanks for offering to review! It's at #3.

mprahl added a commit to mprahl/mlflow-rfcs that referenced this pull request Mar 24, 2026
This depends on mlflow#2 and adds safe online scoring for custom scorers.

Signed-off-by: mprahl <mprahl@users.noreply.github.com>
@mprahl mprahl force-pushed the online-scoring-plugin branch from 6bc13da to c749254 Compare March 24, 2026 18:37
@mprahl mprahl requested review from WeichenXu123 and dbczumar March 24, 2026 19:20
Signed-off-by: mprahl <mprahl@users.noreply.github.com>
@mprahl mprahl force-pushed the online-scoring-plugin branch from c749254 to 88059cf Compare March 25, 2026 17:36
