Bazel is a free software tool for build and test automation. Both TensorFlow and OpenXLA use it, which makes it a good fit for PyTorch/XLA as well.
TensorFlow is a Bazel external dependency of PyTorch/XLA, as can be seen in the `WORKSPACE` file:
```python
http_archive(
    name = "org_tensorflow",
    strip_prefix = "tensorflow-f7759359f8420d3ca7b9fd19493f2a01bd47b4ef",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/f7759359f8420d3ca7b9fd19493f2a01bd47b4ef.tar.gz",
    ],
)
```
The TensorFlow pin can be updated by pointing this repository to a different revision. Patches may be added as needed. Bazel will resolve the dependency, prepare the code, and apply the patches hermetically.
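For illustration, a patched pin might look roughly like the following sketch; the revision and the patch label are placeholders, not real files in the repository:

```python
http_archive(
    name = "org_tensorflow",
    # Hypothetical local patch, applied hermetically on top of the pinned revision.
    patches = ["//tf_patches:example_fix.diff"],
    patch_args = ["-p1"],
    strip_prefix = "tensorflow-<revision>",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/<revision>.tar.gz",
    ],
)
```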
For PyTorch, a different dependency mechanism is used: a local PyTorch checkout, which has to be built from source and ideally installed on the system for version compatibility (e.g. codegen in PyTorch/XLA uses the `torchgen` Python module, which should be installed on the system). The local checkout directory can either be set in `bazel/dependencies.bzl`, or overridden on the command line:
```bash
bazel build --override_repository=org_tensorflow=/path/to/exported/tf_repo //...
bazel build --override_repository=torch=/path/to/exported/and/built/torch_repo //...
```
Please make sure that the overridden repositories are at the appropriate revisions and, in the case of `torch`, that it has been built with `USE_CUDA=0 python setup.py bdist_wheel` so that all expected build objects are present, and ideally installed into the system.
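A typical way to prepare the local PyTorch checkout might look like this sketch (the checkout path is a placeholder):

```bash
# Build PyTorch from source without CUDA, then install the resulting wheel.
cd /path/to/pytorch
USE_CUDA=0 python setup.py bdist_wheel
pip install dist/torch-*.whl
```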
The `WORKSPACE` file wires in the local checkout:

```python
new_local_repository(
    name = "torch",
    build_file = "//bazel:torch.BUILD",
    path = PYTORCH_LOCAL_DIR,
)
```
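`PYTORCH_LOCAL_DIR` comes from `bazel/dependencies.bzl`; a minimal sketch of such a definition, with a placeholder path, could be:

```python
# bazel/dependencies.bzl (illustrative): path to the local PyTorch checkout,
# relative to the workspace root.
PYTORCH_LOCAL_DIR = "../pytorch"
```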
PyTorch headers are sourced directly from the `torch` dependency, i.e. the local checkout of PyTorch. The shared libraries (e.g. `libtorch.so`) are sourced from the same local checkout, where the code has been built and `build/lib/` contains the built objects. For this to work, it is required to pass `-isystem external/torch` to the compiler so that it can find the system-style headers and satisfy them from the local checkout; some headers are included as `<system>` headers and some as `"user"` headers.
Bazel brings in pybind11's embedded Python and links against it to provide `libpython` to the plugin. Python headers are also sourced from there instead of depending on the system version. This is satisfied by the `@pybind11//:pybind11_embed` target, which sets up compiler options for linking with `libpython` transitively.
Building the libraries is simple:

```bash
bazel build //torch_xla/csrc/runtime/...
```
Bazel is configured via `.bazelrc`, but it can also take flags on the command line:

```bash
bazel build --config=remote_cache //torch_xla/csrc/runtime/...
```
The `remote_cache` configuration uses gcloud for caching and is usually faster, but it requires authentication with gcloud. See `.bazelrc` for the configuration.
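For illustration, a GCS-backed remote cache configuration in `.bazelrc` might look roughly like this (the bucket name is a placeholder; consult the actual `.bazelrc` for the real settings):

```
build:remote_cache --remote_cache=https://storage.googleapis.com/<bucket-name>
build:remote_cache --google_default_credentials
```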
Using Bazel makes it easy to express complex dependencies, and there is a lot to gain from having a single build graph with everything expressed in the same way. There is therefore no need to build the XLA libraries separately from the rest of the plugin, as used to be the case; building the whole repository, or the plugin shared object that links everything else in, is enough.
The normal build can be achieved by invoking the standard `python setup.py bdist_wheel`, but the C++ bindings can be built on their own with:

```bash
bazel build //:_XLAC.so
```
This will build the XLA client and the PyTorch plugin and link it all together. It can be useful when testing changes, since compiling the C++ code without building the Python wheel gives faster iteration cycles.
Bazel comes with remote caching built in. There are plenty of cache backends that can be used; we deploy our caching on [GCS](https://bazel.build/remote/caching#cloud-storage). You can see the configuration in `.bazelrc`, under the config name `remote_cache`.
Remote caching is disabled by default, but because it speeds up incremental builds by a huge margin, it is almost always recommended; it is enabled by default in the CI automation and on Cloud Build.
To authenticate on a machine, please ensure that the credentials are present with `gcloud auth application-default login --no-launch-browser` or equivalent.
Using the remote cache set up by the `remote_cache` configuration requires authentication with GCP. There are various ways to authenticate with GCP. For individual developers who have access to the development GCP project, it is enough to pass the `--config=remote_cache` flag to bazel: the default `--google_default_credentials` will be used, and if a gcloud token is present on the machine, it works out of the box, using the logged-in user for authentication. The user needs to have remote build permissions in GCP (add new developers to the Remote Bazel role). In the CI, a service account key is used for authentication and is passed to bazel using `--config=remote_cache --google_credentials=path/to/service.key`.
On Cloud Build, `docker build --network=cloudbuild` is used to pass the authentication from the service account running the cloud build down into the docker image doing the compilation; Application Default Credentials does the work there and authenticates as the service account. All accounts, both user and service accounts, need to have remote cache read/write permissions.
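As a sketch, the relevant Cloud Build docker invocation might look like the following (the image tag is a placeholder):

```bash
# Share the Cloud Build network so Application Default Credentials
# are reachable from inside the build container.
docker build --network=cloudbuild -t pytorch-xla-build .
```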
The remote cache uses cache silos. Each unique machine and build should specify a unique silo key to benefit from consistent caching. The silo key can be passed using a flag: `--remote_default_exec_properties=cache-silo-key=SOME_SILO_KEY`.
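For example, a developer build using the remote cache with a per-machine silo key (the key name here is illustrative) could be invoked as:

```bash
bazel build --config=remote_cache \
  --remote_default_exec_properties=cache-silo-key=my-dev-machine //...
```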
Running the build with remote cache:

```bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel
```
Adding `GCLOUD_SERVICE_KEY_FILE=~/.config/gcloud/application_default_credentials.json` might help too if bazel cannot find the auth token.
`YOUR-USER` here can be the author's username or machine name, a unique name that ensures good cache behavior. Other `setup.py` functionality works as intended too (e.g. `develop`).
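For instance, a development install with the remote cache enabled would be a sketch like:

```bash
# Editable/development install, reusing the same cache silo as the wheel build.
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" python setup.py develop
```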
The first compilation using a new cache key will be slow because everything is compiled from scratch, but incremental compilations will be very fast. After updating the TensorFlow pin, the first compilation per key will once again be a bit slower, and then quite fast again until the next update.
Currently, the C++ code is built and tested by bazel; the Python code will be migrated in the future.
Bazel is a test platform too, making it easy to run tests:

```bash
bazel test //test/cpp:main
```
Of course, the XLA and PJRT configuration has to be present in the environment to run the tests. Not all environment variables are passed into the bazel test environment, to make sure that remote cache misses are not too common (the environment is part of the cache key); see the `.bazelrc` test configuration for which ones are passed in, and add new ones as required.
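To forward an additional variable from the client environment into the test environment for a one-off run, bazel's `--test_env` flag can be used; the variable below is just an example:

```bash
bazel test --test_env=PJRT_DEVICE //test/cpp:main
```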
You can run the tests using the helper script too:

```bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" ./test/cpp/run_tests.sh -R
```
The `xla_client` tests are pure, hermetic tests that can be easily executed. The `torch_xla` plugin tests are more complex: they require `torch` and `torch_xla` to be installed, and they cannot run in parallel, either because they use an XRT server/client on the same port, or because they use a GPU or TPU device and only one is available at a time. For that reason, all tests under `torch_xla/csrc/` are bundled into a single target `:main` that runs them all sequentially.
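A single test case can still be selected from the bundled target by passing a gtest filter through bazel (the test name below is illustrative):

```bash
bazel test //test/cpp:main --test_arg=--gtest_filter='AtenXlaTensorTest.*'
```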
When running tests, it can be useful to calculate code coverage:

```bash
bazel coverage //torch_xla/csrc/runtime/...
```
Coverage can be visualized using `lcov` as described in Bazel's documentation, or in your editor of choice with lcov plugins, e.g. Coverage Gutters for VSCode.
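For example, an HTML report can be generated from the combined coverage data with lcov's `genhtml` tool:

```bash
# Bazel writes the combined coverage data to bazel-out/_coverage/_coverage_report.dat.
genhtml bazel-out/_coverage/_coverage_report.dat -o coverage_report
```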
Bazel can power a language server like clangd, bringing code references, autocompletion, and semantic understanding of the underlying code to your editor of choice. For VSCode, one can use Bazel Stack combined with clangd functionality to bring powerful features that assist code editing.
As always, PyTorch/XLA can be built using Python `distutils`:

```bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel
```