Skip to content

feat: Multi-cluster Kubernetes Task Runner#19433

Open
GWphua wants to merge 16 commits into
apache:masterfrom
GWphua:multiple-k8s-clusters
Open

feat: Multi-cluster Kubernetes Task Runner#19433
GWphua wants to merge 16 commits into
apache:masterfrom
GWphua:multiple-k8s-clusters

Conversation

@GWphua
Copy link
Copy Markdown
Contributor

@GWphua GWphua commented May 8, 2026

Description

The existing Kubernetes task runner schedules all tasks into a single Kubernetes cluster. This PR adds a multik8s task runner mode that can schedule tasks across multiple Kubernetes clusters while keeping a single Overlord-level view of task capacity, task state, task logs, and task reports.

This is useful for Druid deployments that need to spread indexing work across multiple Kubernetes clusters, drain or disable specific clusters, or route tasks with a simple cluster selection strategy.

Added multi-cluster Kubernetes task runner

Added MultipleKubernetesTaskRunner, which wraps multiple per-cluster KubernetesTaskRunner instances and exposes them through the existing TaskRunner and TaskLogStreamer APIs.

The runner supports:

  • round-robin, random, and least-task cluster selection strategies
  • disabled clusters that are not selected for new tasks
  • aggregated running, pending, known task, and task slot views
  • task log and task report streaming routed to the owning cluster runner
  • k8s_cluster task context tag injection for pod-template selection

Added multi-cluster runner configuration

Added MultipleKubernetesTaskRunnerConfig, configured under the existing druid.indexer.runner prefix, with per-cluster settings for cluster name, kubeconfig path, task namespace, Overlord identifier, and disabled state.

The multi-cluster runner reuses the existing Kubernetes runner configuration for common task settings such as capacity, labels, annotations, sidecar support, task cleanup, task timeout, and shared informer behavior.

Built per-cluster Kubernetes runners

Added MultipleKubernetesTaskRunnerFactory, which creates one KubernetesTaskRunner per configured cluster and registers the new multik8s runner type in KubernetesOverlordModule.

Each per-cluster runner gets its own Kubernetes client and task adapter while sharing a single task tracking executor. The factory also honors useK8sSharedInformers by using CachingKubernetesPeonClient when shared informers are enabled.

Shared task runner capacity across clusters

Added AutoscalableThreadPoolExecutor and updated KubernetesTaskRunner so multiple Kubernetes task runners can share one executor. This keeps the configured runner capacity as a global limit across all configured clusters instead of multiplying capacity by the number of clusters.

Release note

Added experimental support for running Kubernetes indexing tasks across multiple Kubernetes clusters. Set druid.indexer.runner.type=multik8s and configure druid.indexer.runner.clusters to schedule tasks across multiple Kubernetes clusters from a single Overlord.


Key changed/added classes in this PR
  • MultipleKubernetesTaskRunner
  • MultipleKubernetesTaskRunnerConfig
  • MultipleKubernetesTaskRunnerFactory
  • MultipleKubernetesTaskRunnerDelegate
  • AutoscalableThreadPoolExecutor
  • KubernetesTaskRunner
  • KubernetesOverlordModule

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@GWphua GWphua force-pushed the multiple-k8s-clusters branch from e3f0e73 to f4504f6 Compare May 8, 2026 10:01
if (ownsExecutor && configManager != null) {
configManager.addListener(
KubernetesTaskRunnerDynamicConfig.CONFIG_KEY,
StringUtils.format(OBSERVER_KEY, Thread.currentThread().getId()),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated Thread.threadId is the replacement, introduced in Java 19. I didn't get to see any PR that drops jdk17 support yet, (correct me if im wrong), so we should still use this deprecated method.

if (ownsExecutor && configManager != null) {
configManager.addListener(
KubernetesTaskRunnerDynamicConfig.CONFIG_KEY,
StringUtils.format(OBSERVER_KEY, Thread.currentThread().getId()),
Copy link
Copy Markdown
Member

@FrankChen021 FrankChen021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.


This is an automated review by Codex GPT-5

@GWphua
Copy link
Copy Markdown
Contributor Author

GWphua commented May 13, 2026

Yet to provide documentation on the feature, I will need some time to make it.

Copy link
Copy Markdown
Member

@FrankChen021 FrankChen021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 1
P2 0
P3 0
Total 1

Reviewed 17 of 17 changed files.


This is an automated review by Codex GPT-5.5

new DynamicConfigPodTemplateSelector(properties, effectiveConfig)
);
} else {
return new SingleContainerTaskAdapter(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Default pod adapter fails for remote clusters

The multik8s factory still defaults to SingleContainerTaskAdapter, and also allows MultiContainerTaskAdapter. Both adapters build jobs by reading the Overlord pod via the configured Kubernetes client and task namespace. In this factory, that client points at the selected target cluster and namespace, where the single Druid Overlord usually does not exist. With the documented multik8s sample, which does not set a custom template adapter, the first task submitted to a remote cluster will fail while trying to copy a non-existent Overlord pod. Either require customTemplateAdapter for multik8s or change these adapters to source the pod spec from the local Overlord cluster instead of the target task cluster.

Copy link
Copy Markdown
Member

@FrankChen021 FrankChen021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 21 of 21 changed files.


This is an automated review by Codex GPT-5.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants