
bazel coverage fails when using certain Python packages #2575

Closed
BurnzZ opened this issue Jan 23, 2025 · 18 comments · Fixed by #2607

Comments

@BurnzZ
Contributor

BurnzZ commented Jan 23, 2025

🐞 bug report

Affected Rule

py_test

Is this a regression?

Not that I'm aware of.

Description

bazel coverage fails without any useful messages when certain Python packages are imported. Some examples:

  • torchvision
  • transformers.models.distilbert.DistilBertModel

🔬 Minimal Reproduction

See https://github.com/BurnzZ/bazel-python-coverage-issue for the full code example.

  1. Clone the repo:

     git clone https://github.com/BurnzZ/bazel-python-coverage-issue
     cd bazel-python-coverage-issue

  2. Open test.py and uncomment either of these lines:

     # import torchvision
     # from transformers.models.distilbert import DistilBertModel

  3. Run the following:

     bazel coverage --combined_report=lcov :test --nocache_test_results --test_output=all
     lcov --list "$(bazel info output_path)/_coverage/_coverage_report.dat"

  4. Notice that test coverage fails without useful context or error messages.

  5. Run the following and notice that the code runs successfully when tested:

     bazel test :test --nocache_test_results --test_output=all

Try commenting out the imports above and running bazel coverage again; it should work. I'm curious why it doesn't work with some packages like torchvision or some parts of transformers.

🔥 Exception or Error

$ bazel coverage --combined_report=lcov :test --nocache_test_results --test_output=all

INFO: Using default value for --instrumentation_filter: "^//".
INFO: Override the above default with --instrumentation_filter
INFO: Analyzed target //:test (0 packages loaded, 0 targets configured).
FAIL: //:test (Exit 1) (see /private/var/tmp/_bazel_user/c97d0f59e3791eddf9709b879355cbf5/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/test/test.log)
INFO: From Testing //:test:
==================== Test output for //:test:
--
Coverage runner: Not collecting coverage for failed test.
The following commands failed with status 1
/private/var/tmp/_bazel_user/c97d0f59e3791eddf9709b879355cbf5/sandbox/darwin-sandbox/7/execroot/_main/bazel-out/darwin_arm64-fastbuild/bin/test.runfiles/_main/test
================================================================================
INFO: LCOV coverage report is located at /private/var/tmp/_bazel_user/c97d0f59e3791eddf9709b879355cbf5/execroot/_main/bazel-out/_coverage/_coverage_report.dat
 and execpath is bazel-out/_coverage/_coverage_report.dat
INFO: From Coverage report generation:
Jan. 23, 2025 4:55:51 PM com.google.devtools.coverageoutputgenerator.Main getTracefiles
INFO: Found 1 tracefiles.
Jan. 23, 2025 4:55:51 PM com.google.devtools.coverageoutputgenerator.Main parseFilesSequentially
INFO: Parsing file bazel-out/darwin_arm64-fastbuild/testlogs/test/coverage.dat
Jan. 23, 2025 4:55:51 PM com.google.devtools.coverageoutputgenerator.Main getGcovInfoFiles
INFO: No gcov info file found.
Jan. 23, 2025 4:55:51 PM com.google.devtools.coverageoutputgenerator.Main getGcovJsonInfoFiles
INFO: No gcov json file found.
Jan. 23, 2025 4:55:51 PM com.google.devtools.coverageoutputgenerator.Main getProfdataFileOrNull
INFO: No .profdata file found.
Jan. 23, 2025 4:55:51 PM com.google.devtools.coverageoutputgenerator.Main runWithArgs
WARNING: There was no coverage found.
INFO: Found 1 test target...
Target //:test up-to-date:
  bazel-bin/test
INFO: Elapsed time: 14.578s, Critical Path: 14.22s
INFO: 3 processes: 2 action cache hit, 3 darwin-sandbox.
INFO: Build completed, 1 test FAILED, 3 total actions
//:test                                                                  FAILED in 13.7s
  /private/var/tmp/_bazel_user/c97d0f59e3791eddf9709b879355cbf5/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/test/test.log

Executed 1 out of 1 test: 1 fails locally.
$ lcov --list "$(bazel info output_path)/_coverage/_coverage_report.dat"

lcov: ERROR: (empty) no valid records found in tracefile /private/var/tmp/_bazel_user/c97d0f59e3791eddf9709b879355cbf5/execroot/_main/bazel-out/_coverage/_coverage_report.dat
        (use "lcov --ignore-errors empty ..." to bypass this error)

🌍 Your Environment

Operating System:

macOS Sequoia 15.2

Output of bazel version:

8.0.1

Also tried 7.4.1

rules_python version:

1.1.0
@groodt
Collaborator

groodt commented Jan 29, 2025

Thanks for the report.

I've spent a bit of time reproducing this and I am a bit stumped...

I'm able to reproduce the failure, but I'm not finding any clues, even if I disable sandboxing and enable debugging.

Do you have any other ideas or have you noticed anything specific about the failures? I notice that these are more "complex" packages that may require a GPU. Is it all native code packages where this occurs? Or is it only these specific torch packages?

I imagine we don't have the same issue with pure python packages?

@BurnzZ
Contributor Author

BurnzZ commented Jan 30, 2025

Hi @groodt, thanks for looking into this.

even if I disable sandboxing and enable debugging.

Yeah, that's the tricky part. There aren't enough debugging messages in this part of the python_bootstrap_template.txt. I'll need to look into this further, though I'm still familiarizing myself with Bazel's internals.

Do you have any other ideas or have you noticed anything specific about the failures? I notice that these are more "complex" packages that may require a GPU. Is it all native code packages where this occurs? Or is it only these specific torch packages?

From my latest investigation, it would seem that the issue boils down to these modules being imported (e.g. they're imported when something like transformers.models.distilbert.DistilBertModel is used):

from torchvision import utils
from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    RowwiseParallel,
)

But looking at those files, there's nothing peculiar about them (apart from maybe some CUDA stuff). I also tried recreating the modules locally and importing them, but to no avail.

I imagine we don't have the same issue with pure python packages?

Yes it would seem so.

It appears PyTorch uses Bazel and even supports code coverage: https://pytorch.org/xla/master/contribute/bazel.html#code-coverage. However, it looks like it has some special handling for C++ code in its BUILD definitions.

@BurnzZ
Contributor Author

BurnzZ commented Jan 30, 2025

Probably related: pytorch/pytorch#112903

@groodt
Collaborator

groodt commented Jan 30, 2025

Can we try to reproduce this using some other native code? Perhaps pydantic?

rules_python does not work well with packages that assume a site-packages layout, and at $dayjob, we are carrying some patches for torch to work in rules_python. I'm wondering, but can't confirm, if this is related.

It would be good to know the following:

  • Issue does not exist with pure Python deps
  • Issue does not exist with "simpler" native code deps (e.g. pydantic)
  • Issue only exists with torch and transformers or anything that transitively depends on nvidia deps (which require site-packages layout and/or patches)

@groodt
Collaborator

groodt commented Jan 30, 2025

Probably related: pytorch/pytorch#112903

No, I don't think it's related at all. Your reproduction in the repo is using torch as a PyPI dep, not as a bazel dep.

@BurnzZ
Contributor Author

BurnzZ commented Jan 30, 2025

Can confirm that these packages are all okay: pydantic, numpy, pandas, matplotlib.

Issue only exists with torch and transformers or anything that transitively depends on nvidia deps (which require site-packages layout and/or patches)

It would seem it has something to do with PyTorch's Dynamo. The other failing cases, like importing torchvision or transformers.models.distilbert.DistilBertModel, all end up importing this module during init.

For instance, if we have the following in test.py, then bazel coverage runs well:

import torch

def add(x, y):
    return x + y

add(torch.randn(10), torch.randn(10))

However, if we add @torch.compile, which invokes Dynamo behind the scenes, then bazel coverage errors out:

import torch

@torch.compile
def add(x, y):
    return x + y

Moreover, we can also force the error by simply having:

import torch._dynamo

Seems like we're getting closer to the issue. Will keep digging.

@groodt
Collaborator

groodt commented Jan 30, 2025

Very interesting!

Starts to smell a bit funky. Perhaps there's some weirdness between how coverage.py works (using trace functions) and the way that torch.compile works. See the torch compiler deep dive and the PEP 523 Frame Evaluation API.
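For illustration, here's a minimal, self-contained sketch of the trace-function mechanism that coverage.py relies on (it registers a tracer via sys.settrace); torch.compile instead hooks frame evaluation at the C level per PEP 523, so the two mechanisms could plausibly interfere. The tracer below is a toy stand-in, not coverage.py's actual implementation:

import sys

# Toy line tracer, standing in for the one coverage.py installs.
def tracer(frame, event, arg):
    if event == "line":
        print(f"executed {frame.f_code.co_filename}:{frame.f_lineno}")
    return tracer  # returning the tracer keeps line-tracing active in new frames

def add(x, y):
    return x + y

sys.settrace(tracer)  # every frame created after this point gets traced
add(1, 2)
sys.settrace(None)    # stop tracing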

@groodt
Collaborator

groodt commented Jan 30, 2025

If we can't figure it out, it does seem possible to provide exclusions in coverage.py: https://coverage.readthedocs.io/en/7.6.10/excluding.html#

or using # pragma: no cover in first-party code...
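For example, omit patterns can also be passed straight on the command line (the pattern below is purely illustrative, not a confirmed fix):

coverage run --branch --omit='*/torch/_dynamo/*' test.py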

In fact, why is it attempting coverage on third-party code...

I'm very surprised I can't find anything relevant in a Google search. So I'm guessing this problem doesn't exist when code is run outside Bazel?

@BurnzZ
Contributor Author

BurnzZ commented Jan 30, 2025

If we can't figure it out, it does seem possible to provide exclusions in coverage.py: https://coverage.readthedocs.io/en/7.6.10/excluding.html#
or using # pragma: no cover in first-party code...

I've tried this, but it seems that even if we try to turn off coverage.py for the affected imports, the error is still there:

import torchvision  # pragma: no cover

Though I'm not sure if coverage.py is still being used here for some reason; I'll need to double check.

In fact, why is it attempting coverage on third-party code...

I think it has something to do with coverage.py's behavior (ref):

When running your code, the coverage run command will by default measure all code, unless it is part of the Python standard library.

...

Modules named as sources may be imported twice, once by coverage.py to find their location, then again by your own code or test suite. Usually this isn’t a problem, but could cause trouble if a module has side-effects at import time.

I'm very surprised I can't find anything relevant in a Google search. So I'm guessing this problem doesn't exist when code is run outside Bazel?

Yeah, it's quite Bazel-specific 😆 except for PyTorch's own way of building itself in Bazel: https://github.com/pytorch/pytorch/blob/main/BUILD.bazel

@groodt
Collaborator

groodt commented Jan 30, 2025

Well, this does hint at it being a possible issue: “Usually this isn’t a problem, but could cause trouble if a module has side-effects at import time.”

@groodt
Collaborator

groodt commented Jan 30, 2025

I don't see anything relevant to coverage in the pytorch repo? Can you point out what you mean? There's nothing to indicate to me that they are running bazel coverage anywhere.

@BurnzZ
Contributor Author

BurnzZ commented Jan 30, 2025

Sorry, it seems they have their own way of grabbing the coverage: https://github.com/pytorch/pytorch/tree/58cc6693cb4a3f63af7d05ccdae08588752f7cf0/tools/code_coverage.

@groodt
Collaborator

groodt commented Jan 30, 2025

Yes, they seem to have a plugin. And then in the torch XLA repo, they seem to have a bazel coverage setup, but that will be for their first-party code, I think, not code pulled in as a third-party dep? You could still try a similar "instrumentation_filter" etc. I'm not sure that's the issue. https://github.com/pytorch/xla/blob/93a2ba6be67c9d22e81a2026b6cb35c993ead705/.bazelrc#L131


bazel coverage //torch_xla/csrc/runtime/...

The torch_xla repo seems to exclude //tests from coverage, and that's the only place I can find any usage of torch.compile. You could see if the tests fail in their repo too if you remove the exclusions. I'm not sure.

@groodt
Collaborator

groodt commented Jan 30, 2025

We should also try to see if numba works. That's another package I know of that does JIT. I think it might not use the same mechanism, but it would be good to check.

@BurnzZ
Contributor Author

BurnzZ commented Feb 1, 2025

So after a bit more digging and tracing the code, it would seem that it's an issue with coverage.py (I can reproduce it outside of Bazel; coverage==7.6.10).

There are two steps Bazel takes to produce the lcov coverage data file for Python. They can be distilled into these commands:

  1. coverage run --append --branch <test files...> (in this code)
    • This produces a coverage.py-specific artifact: .coverage.
  2. coverage lcov -o pylcov.dat (in this code)
    • This reads the .coverage artifact and transforms it into an lcov file: pylcov.dat.

I've confirmed that Step 1 returns a status code of 0.

However, Step 2 returns a status code of 1 with no clear error message. You need to pass --test_env=VERBOSE_COVERAGE=true to your bazel coverage run to see an error like:

No source for code: '/private/var/folders/lh/yz1yvyvx5nd2vqq6t9t3qbmw0000gn/T/tmpchdogtk0/_remote_module_non_scriptable.py'.

The cause of this issue is not clear; even coverage.py's author is not aware of it (reference). However, it was recommended to append the --ignore-errors flag, and adding it to Step 2 results in a successful run: coverage lcov -o pylcov.dat --ignore-errors.
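To make that concrete, the two steps reproduced outside of Bazel look roughly like this (a sketch, run against the repro repo's test.py):

coverage run --append --branch test.py         # Step 1: exits 0, writes .coverage
coverage lcov -o pylcov.dat                    # Step 2: exits 1, "No source for code: ..."
coverage lcov -o pylcov.dat --ignore-errors    # Step 2 now succeeds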

I'll file another issue with coverage.py for this specific problem (see nedbat/coveragepy#1921).

In the meantime, I'm proposing an escape-hatch solution in: #2597

@groodt
Collaborator

groodt commented Feb 2, 2025

It makes sense to me that coverage can't find the source for the JIT'ed code.

That reference to _remote_module_non_scriptable.py comes from the torch JIT. See pytorch/pytorch#81622

@rickeylev
Collaborator

This sounds similar to a problem I ran into when doing some prototyping with venvs. If you look at how we invoke coverage, there are a couple of paths that get ignored (/dev, among others).

https://github.com/bazelbuild/rules_python/blob/main/python/private/stage2_bootstrap_template.py#L304

I was tempted to enable ignoring errors, but since it was well known which paths needed to be ignored, I did that instead.

Maybe we should just have ignore-errors enabled by default? This is at least the second, probably more like the third, time coverage has broken because of something unrelated, where ignoring the errors would have been fine.

Adding tempfile.gettempdir() to the omitted paths also makes sense IMHO
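A minimal sketch of that suggestion (the variable name is hypothetical, not the actual stage2 bootstrap code):

import tempfile

# Paths coverage should not attempt to report on. "/dev/*" stands in for the
# patterns already ignored today; adding the system temp dir would also cover
# JIT-generated files like _remote_module_non_scriptable.py.
omit_patterns = [
    "/dev/*",
    tempfile.gettempdir() + "/*",
]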

@BurnzZ
Contributor Author

BurnzZ commented Feb 3, 2025

I was tempted to enable ignoring errors, but since it was well known which paths needed to be ignored, I did that instead.

That's a better approach, I think, since it would prevent other possible issues that --ignore-errors might mask.

I tried this alternative but it doesn't seem to work inside Bazel: #2599 (comment)

BurnzZ added a commit to BurnzZ/rules_python that referenced this issue Feb 3, 2025
github-merge-queue bot pushed a commit that referenced this issue Mar 11, 2025
…ests (#2607)

This ensures that un-executed files (i.e. files that aren't tested)
are included in the coverage report. The current behavior is that
coverage.py excludes them by default.

This PR configures source files via the auto-generated `.coveragerc`
file.

See https://coverage.readthedocs.io/en/7.6.10/source.html#execution:

> If the source option is specified, only code in those locations will be
> measured. Specifying the source option also enables coverage.py to report
> on un-executed files, since it can search the source tree for files that
> haven’t been measured at all.

Closes #2599
Closes #2597
Fixes #2575

---------

Co-authored-by: Ignas Anikevicius <[email protected]>
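For illustration, an auto-generated .coveragerc of the kind described in the merged PR might look roughly like this (hypothetical contents; the actual file rules_python generates will differ):

# .coveragerc (hypothetical sketch)
[run]
relative_files = True
branch = True
source =
    mypackage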