-
Notifications
You must be signed in to change notification settings - Fork 369
Pathways Updates #1288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Pathways Updates #1288
Conversation
9e739e3
to
4f3c12f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a new docker image and might not be necessary.
Can I ask what's the motivation behind it?
Pathways TPU and McJAX TPU have different dependencies. Specifically Today, Pathways TPU can install |
I would suggest to hold off splitting the dependencies for now, since this will break our internal setup. Making a separate image doesn't seem necessary and add maintenance overhead. It is nice to have both Pathways TPU and McJAX TPU working in one single image for easier testing. We can go with other changes in the PR |
f9fb40f
to
d2894dc
Compare
I removed the pathways-tpu image from this PR. I think it is helpful to separate dependencies for McJAX/Pathways in the long term but agree that it is not worth any extra maintenance cost today. I can put that commit in a draft PR for when the time comes to separate the images. |
This is a very good question to discuss, I see the pro is to guarantee that the head pods can have exclusive resource usage of the host. But it will also waste CPU node resource making the head node harder to be scheduled. FWIW, I remember we intentionally tune the head pods CPU/Memory requirement so that two head nodes can share a host in RL pipelines to make it easier to schedule head pods. |
k8s does not respect host ports for init containers when scheduling so we have witnessed collisions where job 1 containers connect to job 2 if the head pods were on the same host. Adding this exclusivity worked around the issue while it is being fixed. Additionally, for the hero workloads we were scale testing with, the head pod was resource constrained and removing the limits and adding the exclusivity was needed. |
Changing the name of the worker container to pathways-worker
8c522c7
to
c8bab34
Compare
Changes the container name for Pathways Workers to
pathways-worker
so that the workload container name is not the same. This follows the same convention thatpathways-proxy
andpathways-rm
follow and lets you filter logs based on container name alone instead of needing to filter based on pod name and container name.Adds an exclusivity based on hostname for the head pod to avoid a bug seen where multiple head pods are scheduled on the same host. This is required due to a limitation of k8s where the host ports for init containers are not respected.