Bazel is a free software tool for build and test automation. Both TensorFlow and OpenXLA use it, which makes it a good fit for PyTorch/XLA as well.
TensorFlow is a Bazel external dependency of PyTorch/XLA, as can be seen in the `WORKSPACE` file:
```python
http_archive(
    name = "org_tensorflow",
    strip_prefix = "tensorflow-f7759359f8420d3ca7b9fd19493f2a01bd47b4ef",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/f7759359f8420d3ca7b9fd19493f2a01bd47b4ef.tar.gz",
    ],
)
```
The TensorFlow pin can be updated by pointing this repository to a different revision. Patches may be added as needed. Bazel will resolve the dependency, prepare the code, and apply the patches hermetically.
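For illustration, a patched pin might look roughly like the following sketch; the revision and the patch label are placeholders, not real files in the repository:

```python
http_archive(
    name = "org_tensorflow",
    # Hypothetical local patch, applied hermetically on top of the pinned revision.
    patches = ["//tf_patches:example_fix.diff"],
    patch_args = ["-p1"],
    strip_prefix = "tensorflow-<revision>",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/<revision>.tar.gz",
    ],
)
```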
For PyTorch, a different dependency mechanism is used: a local PyTorch checkout, which has to be built from source and ideally installed on the system for version compatibility (e.g. codegen in PyTorch/XLA uses the `torchgen` Python module, which should be installed on the system). The local checkout directory can either be set in `bazel/dependencies.bzl`, or overridden on the command line:
```bash
bazel build --override_repository=org_tensorflow=/path/to/exported/tf_repo //...
bazel build --override_repository=torch=/path/to/exported/and/built/torch_repo //...
```
Please make sure that the overridden repositories are at the appropriate revisions and, in the case of `torch`, that it has been built with `USE_CUDA=0 python setup.py bdist_wheel` so that all expected build objects are present, and ideally installed into the system.
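A typical way to prepare the local PyTorch checkout might look like this sketch (the checkout path is a placeholder):

```bash
# Build PyTorch from source without CUDA, then install the resulting wheel.
cd /path/to/pytorch
USE_CUDA=0 python setup.py bdist_wheel
pip install dist/torch-*.whl
```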
The `WORKSPACE` file wires in the local checkout:

```python
new_local_repository(
    name = "torch",
    build_file = "//bazel:torch.BUILD",
    path = PYTORCH_LOCAL_DIR,
)
```
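`PYTORCH_LOCAL_DIR` comes from `bazel/dependencies.bzl`; a minimal sketch of such a definition, with a placeholder path, could be:

```python
# bazel/dependencies.bzl (illustrative): path to the local PyTorch checkout,
# relative to the workspace root.
PYTORCH_LOCAL_DIR = "../pytorch"
```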
PyTorch headers are sourced directly from the `torch` dependency, i.e. the local checkout of PyTorch. The shared libraries (e.g. `libtorch.so`) are sourced from the same local checkout, where the code has been built and `build/lib/` contains the built objects. For this to work, it is required to pass `-isystem external/torch` to the compiler so that it can find the system-style headers and satisfy them from the local checkout; some headers are included as `<system>` headers and some as `"user"` headers.
Bazel brings in pybind11's embedded Python and links against it to provide `libpython` to the plugin. Python headers are also sourced from there instead of depending on the system version. This is satisfied by the `@pybind11//:pybind11_embed` target, which sets up compiler options for linking with `libpython` transitively.
Building the libraries is simple:

```bash
bazel build //torch_xla/csrc/runtime/...
```
Bazel is configured via `.bazelrc`, but it can also take flags on the command line:

```bash
bazel build --config=remote_cache //torch_xla/csrc/runtime/...
```
The `remote_cache` configuration uses gcloud for caching and is usually faster, but it requires authentication with gcloud. See `.bazelrc` for the configuration.
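For illustration, a GCS-backed remote cache configuration in `.bazelrc` might look roughly like this (the bucket name is a placeholder; consult the actual `.bazelrc` for the real settings):

```
build:remote_cache --remote_cache=https://storage.googleapis.com/<bucket-name>
build:remote_cache --google_default_credentials
```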
Using Bazel makes it easy to express complex dependencies, and there is a lot to gain from having a single build graph with everything expressed in the same way. There is therefore no need to build the XLA libraries separately from the rest of the plugin, as used to be the case; building the whole repository, or the plugin shared object that links everything else in, is enough.
The normal build can be achieved by invoking the standard `python setup.py bdist_wheel`, but the C++ bindings can be built on their own with:

```bash
bazel build //:_XLAC.so
```
This will build the XLA client and the PyTorch plugin and link it all together. It can be useful when testing changes, since compiling the C++ code without building the Python wheel gives faster iteration cycles.
Bazel comes with remote caching built in. There are plenty of cache backends that can be used; we deploy our caching on [GCS](https://bazel.build/remote/caching#cloud-storage). You can see the configuration in `.bazelrc`, under the config name `remote_cache`.
Remote caching is disabled by default, but because it speeds up incremental builds by a huge margin, it is almost always recommended; it is enabled by default in the CI automation and on Cloud Build.
To authenticate on a machine, please ensure that the credentials are present with `gcloud auth application-default login --no-launch-browser` or equivalent.
Using the remote cache set up by the `remote_cache` configuration requires authentication with GCP. There are various ways to authenticate with GCP. For individual developers who have access to the development GCP project, it is enough to pass the `--config=remote_cache` flag to bazel: the default `--google_default_credentials` will be used, and if a gcloud token is present on the machine, it works out of the box, using the logged-in user for authentication. The user needs to have remote build permissions in GCP (add new developers to the Remote Bazel role). In the CI, a service account key is used for authentication and is passed to bazel using `--config=remote_cache --google_credentials=path/to/service.key`.
On Cloud Build, `docker build --network=cloudbuild` is used to pass the authentication from the service account running the cloud build down into the docker image doing the compilation; Application Default Credentials does the work there and authenticates as the service account. All accounts, both user and service accounts, need to have remote cache read/write permissions.
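As a sketch, the relevant Cloud Build docker invocation might look like the following (the image tag is a placeholder):

```bash
# Share the Cloud Build network so Application Default Credentials
# are reachable from inside the build container.
docker build --network=cloudbuild -t pytorch-xla-build .
```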
The remote cache uses cache silos. Each unique machine and build should specify a unique silo key to benefit from consistent caching. The silo key can be passed using a flag: `--remote_default_exec_properties=cache-silo-key=SOME_SILO_KEY`.
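For example, a developer build using the remote cache with a per-machine silo key (the key name here is illustrative) could be invoked as:

```bash
bazel build --config=remote_cache \
  --remote_default_exec_properties=cache-silo-key=my-dev-machine //...
```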
Running the build with remote cache:

```bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel
```
Adding `GCLOUD_SERVICE_KEY_FILE=~/.config/gcloud/application_default_credentials.json` might help too if bazel cannot find the auth token.
`YOUR-USER` here can be the author's username or machine name, a unique name that ensures good cache behavior. Other `setup.py` functionality works as intended too (e.g. `develop`).
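For instance, a development install with the remote cache enabled would be a sketch like:

```bash
# Editable/development install, reusing the same cache silo as the wheel build.
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" python setup.py develop
```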
The first compilation using a new cache key will be slow because everything is compiled from scratch, but incremental compilations will be very fast. After updating the TensorFlow pin, the first compilation per key will once again be a bit slower, and then quite fast again until the next update.
Currently, the C++ code is built and tested by bazel; the Python code will be migrated in the future.
Bazel is a test platform too, making it easy to run tests:

```bash
bazel test //test/cpp:main
```
Of course, the XLA and PJRT configuration has to be present in the environment to run the tests. Not all environment variables are passed into the bazel test environment, to make sure that remote cache misses are not too common (the environment is part of the cache key); see the `.bazelrc` test configuration for which ones are passed in, and add new ones as required.
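To forward an additional variable from the client environment into the test environment for a one-off run, bazel's `--test_env` flag can be used; the variable below is just an example:

```bash
bazel test --test_env=PJRT_DEVICE //test/cpp:main
```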
You can run the tests using the helper script too:

```bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" ./test/cpp/run_tests.sh -R
```
The `xla_client` tests are pure, hermetic tests that can be easily executed. The `torch_xla` plugin tests are more complex: they require `torch` and `torch_xla` to be installed, and they cannot run in parallel, either because they use an XRT server/client on the same port, or because they use a GPU or TPU device and only one is available at a time. For that reason, all tests under `torch_xla/csrc/` are bundled into a single target `:main` that runs them all sequentially.
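A single test case can still be selected from the bundled target by passing a gtest filter through bazel (the test name below is illustrative):

```bash
bazel test //test/cpp:main --test_arg=--gtest_filter='AtenXlaTensorTest.*'
```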
When running tests, it can be useful to calculate code coverage:

```bash
bazel coverage //torch_xla/csrc/runtime/...
```
Coverage can be visualized using `lcov` as described in Bazel's documentation, or in your editor of choice with lcov plugins, e.g. Coverage Gutters for VSCode.
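For example, an HTML report can be generated from the combined coverage data with lcov's `genhtml` tool:

```bash
# Bazel writes the combined coverage data to bazel-out/_coverage/_coverage_report.dat.
genhtml bazel-out/_coverage/_coverage_report.dat -o coverage_report
```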
Bazel can power a language server like clangd, bringing code references, autocompletion, and semantic understanding of the underlying code to your editor of choice. For VSCode, one can use Bazel Stack combined with clangd functionality to bring powerful features that assist code editing.
As always, PyTorch/XLA can be built using Python `distutils`:

```bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel
```