

@lukebaumann lukebaumann commented Jul 8, 2025

Changes the container name for Pathways workers to pathways-worker so that it no longer matches the workload container name. This follows the same convention as pathways-proxy and pathways-rm, and it lets you filter logs by container name alone instead of by both pod name and container name.
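
As a rough illustration of the filtering benefit, the sketch below assumes log records are plain dicts with hypothetical `pod` and `container` fields. The pod names and the helper `filter_by_container` are made up for the example and are not part of this PR. With a distinct container name, a single predicate on `container` is enough to select the worker logs.

```python
# Hypothetical log records; field names and pod names are assumptions for illustration.
records = [
    {"pod": "job-head-0", "container": "pathways-proxy", "msg": "proxy up"},
    {"pod": "job-head-0", "container": "pathways-rm", "msg": "rm up"},
    {"pod": "job-worker-0", "container": "pathways-worker", "msg": "worker up"},
    {"pod": "job-worker-1", "container": "pathways-worker", "msg": "worker up"},
]


def filter_by_container(records, container_name):
    """Return the records emitted by containers with the given name."""
    return [r for r in records if r["container"] == container_name]


# No pod-name condition is needed once the worker container name is unique.
worker_logs = filter_by_container(records, "pathways-worker")
```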

Adds a hostname-based exclusivity constraint for the head pod to work around a bug in which multiple head pods are scheduled on the same host. This is needed because of a k8s limitation: host ports declared by init containers are not respected during scheduling.
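
One generic way to express hostname-level exclusivity in Kubernetes is required pod anti-affinity keyed on the `kubernetes.io/hostname` topology label. The sketch below builds such a fragment as a plain Python dict. It is not necessarily the mechanism this PR uses, and the helper name, label key, and label value are hypothetical.

```python
def head_pod_anti_affinity(label_key, label_value):
    """Build a pod-spec `affinity` fragment that keeps matching pods on distinct hosts."""
    return {
        "podAntiAffinity": {
            # "required" makes this a hard scheduling constraint, not a preference.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Pods carrying this label repel each other...
                    "labelSelector": {"matchLabels": {label_key: label_value}},
                    # ...within the same node, identified by its hostname label.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


# Hypothetical label; a real head pod would need to carry a matching label.
affinity = head_pod_anti_affinity("example.com/role", "pathways-head")
```

The resulting dict would sit under `spec.affinity` of the head pod template. The trade-off is that each head pod then needs a host to itself.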

@lukebaumann lukebaumann requested review from ruomingp, markblee and a team as code owners July 8, 2025 20:16
Contributor

@Ethanlm Ethanlm left a comment


This adds a new docker image and might not be necessary.

Can I ask what's the motivation behind it?

@lukebaumann
Author

Pathways TPU and McJAX TPU have different dependencies. Specifically, jax[tpu] is only needed for McJAX TPU and pathwaysutils is only needed for Pathways TPU. Separating Pathways TPU and McJAX TPU into two packaging extras was the main intent of the PR, and as a consequence I created a new docker image.

Today, Pathways TPU can install jax[tpu] and McJAX TPU can install pathwaysutils, so this PR is not strictly necessary, but it does more accurately denote the dependencies of the different execution modes.

@lukebaumann lukebaumann requested a review from a team as a code owner July 14, 2025 20:00
@Ethanlm
Contributor

Ethanlm commented Jul 17, 2025

> Pathways TPU and McJAX TPU have different dependencies. Specifically jax[tpu] is only needed for McJAX TPU and pathwaysutils is only needed for Pathways TPU. Separating Pathways TPU and McJAX TPU as two packaging extras was the main intent of the PR and as a consequence I created a new docker image.
>
> Today, Pathways TPU can install jax[tpu] and McJAX TPU can install pathwaysutils so this PR is not necessary but it does more accurately denote dependencies of different execution modes.

I would suggest holding off on splitting the dependencies for now, since this will break our internal setup.

Making a separate image doesn't seem necessary and adds maintenance overhead. It is nice to have both Pathways TPU and McJAX TPU working in a single image for easier testing.

We can go ahead with the other changes in the PR.

@lukebaumann
Author

I removed the pathways-tpu image from this PR. I think it is helpful to separate dependencies for McJAX/Pathways in the long term but agree that it is not worth any extra maintenance cost today.

I can put that commit in a draft PR for when the time comes to separate the images.

Ethanlm previously approved these changes Jul 24, 2025
@Ethanlm Ethanlm dismissed their stale review July 24, 2025 18:12

Sorry, I have some questions.

@muyangyuapple
Contributor

> Adds an exclusivity based on hostname for the head pod to avoid a bug seen where multiple head pods are scheduled on the same host.

This is a very good question to discuss. The upside, as I see it, is that it guarantees the head pods exclusive use of the host's resources. But it will also waste CPU node resources and make the head pod harder to schedule. FWIW, I remember we intentionally tuned the head pods' CPU/memory requirements so that two head pods can share a host in RL pipelines, which makes head pods easier to schedule.

@lukebaumann
Author

k8s does not respect host ports for init containers when scheduling, so we have witnessed collisions where job 1's containers connect to job 2 when both head pods are on the same host. Adding this exclusivity works around the issue while it is being fixed.
More details are here

Additionally, for the hero workloads we were scale testing with, the head pod was resource constrained, so we needed to remove the limits and add the exclusivity.

Changing the name of the worker container to pathways-worker

3 participants