
Conversation

@jcyang43 jcyang43 commented Oct 21, 2025

Purpose

Add tpu-inference for the vLLM TPU backend.

Test Plan

E2E tests on v6e.

Test Result

Successful


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@jcyang43 jcyang43 requested a review from NickLucche as a code owner October 21, 2025 17:53
@mergify mergify bot added ci/build v1 tpu Related to Google TPUs labels Oct 21, 2025
@mergify mergify bot mentioned this pull request Oct 21, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request transitions the TPU backend to use the tpu-inference library, updating dependencies and import paths accordingly. The changes correctly adapt vLLM to the new library version. However, I've identified a critical issue with the dependency management in requirements/tpu.txt. By removing torch_xla, the fallback mechanism for when tpu-inference is unavailable becomes non-functional, which could lead to confusing runtime errors. My review includes a suggestion to address this to ensure the system remains robust.

Comment on lines 15 to 16
tpu-inference==0.11.1
numba

critical

Removing torch_xla from this requirements file makes the fallback path in the code fragile. The codebase contains logic to fall back to a torch_xla-based implementation if tpu-inference is not found or fails to import (e.g., in vllm/platforms/tpu.py).

With torch_xla removed from the dependencies, if tpu-inference is installed but fails to import for any reason (like a transitive dependency issue), the fallback will immediately fail with an ImportError for torch_xla. This creates a brittle setup and can cause confusing errors for users.

To ensure the fallback mechanism is robust, please re-add torch_xla to the dependencies. This ensures that even if the primary tpu-inference path fails, the system can gracefully fall back to the torch_xla-based implementation.

tpu-inference==0.11.1
numba

# Install torch_xla for fallback
torch_xla[tpu, pallas]==2.8.0
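
For reference, the fallback the review describes follows an import-guard pattern roughly like the sketch below. This is only an illustration: the flag name USE_TPU_INFERENCE and module names are taken from the excerpts in this thread, and the actual contents of vllm/platforms/tpu.py may differ.

    # Sketch of the import fallback under discussion (not the actual file).
    import importlib.util

    # Prefer the tpu-inference backend when its package can be located.
    USE_TPU_INFERENCE = importlib.util.find_spec("tpu_inference") is not None

    if USE_TPU_INFERENCE:
        try:
            import tpu_inference  # noqa: F401  # primary TPU backend
        except ImportError:
            # Installed but broken (e.g. a bad transitive dependency):
            # fall back to the torch_xla-based implementation.
            USE_TPU_INFERENCE = False

    if not USE_TPU_INFERENCE:
        # Without torch_xla in requirements/tpu.txt, this import raises
        # ImportError, which is the brittleness the review points out.
        import torch_xla.core.xla_model as xm  # noqa: F401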


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

if not USE_TPU_INFERENCE:
    logger.info("tpu_inference not found, using vLLM's TpuCommunicator")

    if current_platform.is_tpu():
        import torch_xla
        import torch_xla.core.xla_model as xm
        import torch_xla.runtime as xr
        from torch_xla._internal import pjrt
        from torch_xla.distributed.xla_multiprocessing import (
            create_optimized_replica_groups,
        )

        if USE_RAY:
            from vllm.executor import ray_utils


class TpuCommunicator(DeviceCommunicatorBase):
    def __init__(
        self,
        cpu_group: ProcessGroup,
        device: torch.device | None = None,
        device_group: ProcessGroup | None = None,
        unique_name: str = "",
    ):
        super().__init__(cpu_group, device, device_group, unique_name)

        # NOTE(woosuk): When using TP > 1 on TPUs, every TPU on the same node
        # must be used together. Therefore, the local rank and world size can
        # be simply calculated as follows.
        global_rank = self.global_rank
        global_world_size = self.global_world_size

        if USE_RAY:
            logger.info("TpuCommunicator initialized with RAY")
            # Calculate how many TPU nodes are in the current deployment. This
            # is the Ray placement group if it is deployed with Ray. Default
            # to the number of TPU nodes in the Ray cluster. The number of TPU
            # nodes is computed by the total number of TPUs divided by the
            # number of TPU accelerators per node, to account for clusters
            # with both CPUs and TPUs.
            num_nodes = ray_utils.get_num_tpu_nodes()
            num_nodes_in_pg = ray_utils.get_num_nodes_in_placement_group()

P1: Import ray_utils when TPU inference is enabled

With tpu_inference installed, USE_TPU_INFERENCE becomes True, but this module now always exposes the local TpuCommunicator instead of aliasing to the implementation in tpu_inference. The conditional import of ray_utils is still wrapped inside if not USE_TPU_INFERENCE, yet __init__ uses ray_utils whenever the distributed executor backend is Ray. In a TPU inference deployment that uses the Ray backend, ray_utils will be undefined and TpuCommunicator.__init__ will raise a NameError before any communication is set up. The Ray utilities need to be imported regardless of USE_TPU_INFERENCE (or the Ray-specific logic needs a similar guard).
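
One minimal way to address this, sketched against the excerpt above (an illustration of the suggested guard, not the actual patch), is to hoist the Ray import out of the tpu-inference check:

    # Hypothetical rearrangement: make ray_utils available regardless of
    # whether the tpu-inference backend is in use.
    if USE_RAY:
        from vllm.executor import ray_utils

    if not USE_TPU_INFERENCE:
        logger.info("tpu_inference not found, using vLLM's TpuCommunicator")
        if current_platform.is_tpu():
            # torch_xla-only fallback imports stay guarded here.
            import torch_xla
            import torch_xla.core.xla_model as xm
            import torch_xla.runtime as xr

Alternatively, the Ray-specific branch in __init__ could carry its own guard, but keeping the import unconditional on USE_TPU_INFERENCE is the smaller change.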

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@jcyang43 jcyang43 marked this pull request as draft October 22, 2025 02:48
Signed-off-by: Johnny Yang <[email protected]>