Conversation

@anasashb commented Nov 30, 2025

Reference Issues/PRs

Fixes #1990.

What does this implement/fix? Explain your changes.

This pull request adds a use_efficient_attention boolean argument to the TimeXer model (both the v1 and v2 versions). When set to True, the FullAttention class (again, both v1 and v2) switches to a faster and more memory-efficient attention implementation based on torch.nn.functional.scaled_dot_product_attention() instead of the torch.einsum() solution.

The newly introduced argument currently defaults to False, keeping the new feature completely backwards compatible.
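For context, a minimal sketch showing that the two backends compute the same quantity (the einsum subscripts follow the tslib convention, but the snippet itself is illustrative, not the PR code):

import torch
import torch.nn.functional as F

# Illustrative shapes: batch B, query length L, key length S, heads H, head dim E.
B, L, S, H, E = 2, 32, 32, 4, 16
q = torch.randn(B, L, H, E)
k = torch.randn(B, S, H, E)
v = torch.randn(B, S, H, E)

# tslib-style einsum path: materializes the full (B, H, L, S) score matrix.
scores = torch.einsum("blhe,bshe->bhls", q, k) * E**-0.5
out_einsum = torch.einsum("bhls,bshe->blhe", scores.softmax(dim=-1), v)

# Fused path: torch dispatches to a flash / memory-efficient kernel when available.
out_sdpa = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

assert torch.allclose(out_einsum, out_sdpa, atol=1e-5)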

Additionally, there is a very minor bugfix in the PositionalEmbedding class (both v1 and v2 versions), where a bug carried over from tslib used to define:

pe.require_grad = False

In torch, the attribute that controls whether a tensor requires gradients is called .requires_grad, so the misspelled assignment was a silent no-op. This bug has also been fixed.
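For illustration, a tiny stand-alone demonstration of why the typo was silent (the tensor here is a stand-in, not the actual embedding):

import torch

pe = torch.zeros(5, 4)

# The tslib typo: Python simply attaches a new, meaningless attribute,
# so this line does nothing for autograd.
pe.require_grad = False
print(pe.requires_grad)  # still False, but only because fresh tensors default to False

# The correct spellings, which actually control gradient tracking:
pe.requires_grad = False   # attribute assignment
pe.requires_grad_(False)   # in-place setter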

What should a reviewer concentrate their feedback on?

Reviewers should focus on the implementations of _einsum_attention() and _efficient_attention(), the new private methods that forward() of the FullAttention class dispatches to.
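A simplified skeleton of that dispatch (illustrative only, not the exact PR code; the real FullAttention also handles masks, dropout, and output_attention):

import torch
import torch.nn as nn
import torch.nn.functional as F


class FullAttentionSketch(nn.Module):
    """Illustrative skeleton of the backend dispatch."""

    def __init__(self, use_efficient_attention: bool = False):
        super().__init__()
        self.use_efficient_attention = use_efficient_attention

    def forward(self, queries, keys, values):
        # forward() only routes; the math lives in the private backends.
        if self.use_efficient_attention:
            return self._efficient_attention(queries, keys, values)
        return self._einsum_attention(queries, keys, values)

    def _einsum_attention(self, q, k, v):
        # Explicit (B, H, L, S) score matrix, as in the tslib original.
        scores = torch.einsum("blhe,bshe->bhls", q, k) * q.shape[-1] ** -0.5
        return torch.einsum("bhls,bshe->blhe", scores.softmax(dim=-1), v)

    def _efficient_attention(self, q, k, v):
        # Fused kernel; needs the (B, H, L, E) layout and back.
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        )
        return out.transpose(1, 2)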

I did not make any other changes to the code, but if it works for you I could also:

  • Remove some unused args such as tau, delta, factor scattered across the tslib code carried over here
  • Add type annotations to the PyTorch code from tslib

Or if you'd be OK with those changes too, I can also open a separate PR.

Did you add any tests for the change?

Yes, in both tests/test_models/test_timexer.py and tests/test_models/test_timexer_v2.py. These include new assertions in the initialization tests, as well as parameterization of use_efficient_attention in the integration tests.
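The actual tests live in the files above; the pattern is roughly this (hypothetical stand-in module and test names):

import pytest
import torch.nn as nn


class TinyAttention(nn.Module):
    """Stand-in for FullAttention, used only to illustrate the test pattern."""

    def __init__(self, use_efficient_attention: bool = False):
        super().__init__()
        self.use_efficient_attention = use_efficient_attention


@pytest.mark.parametrize("use_efficient_attention", [False, True])
def test_init_stores_flag(use_efficient_attention):
    # Mirrors the new initialization assertions: the flag is stored as passed,
    # and False stays the backward-compatible default.
    module = TinyAttention(use_efficient_attention=use_efficient_attention)
    assert module.use_efficient_attention is use_efficient_attention
    assert TinyAttention().use_efficient_attention is False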

Any other comments?

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@anasashb (Author) commented Nov 30, 2025

Some more notes:

  1. The CI/CD pipeline seems to partially fail, but it looks like all of the errors are due to the weight-loading code for the Temporal Fusion Transformer and are not related to the feature in this PR
  2. As I said in the PR description:

I did not make any other changes to the code, but if it works for you I could also:

  • Remove some unused args such as tau, delta, factor scattered across the tslib code carried over here
  • Add type annotations to the PyTorch code from tslib

If your team is OK with me addressing these, I could do so in this PR or open a new one, whichever works for your team's code review process.

P.S. If anyone wants to benchmark the speed and memory consumption of the old and new attention backends, I can provide a script for that. I was not entirely sure whether such a script made sense to include as part of the package itself.
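For reference, a rough sketch of what such a benchmark script could look like (illustrative names, assumes a CUDA device):

import time

import torch
import torch.nn.functional as F


def einsum_attention(q, k, v):
    # tslib-style path: materializes the full (B, H, L, S) score matrix.
    scores = torch.einsum("blhe,bshe->bhls", q, k) * q.shape[-1] ** -0.5
    return torch.einsum("bhls,bshe->blhe", scores.softmax(dim=-1), v)


def sdpa_attention(q, k, v):
    # Fused path: torch picks a flash / memory-efficient kernel when it can.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    )
    return out.transpose(1, 2)


def bench(fn, q, repeat=20):
    fn(q, q, q)  # warm-up so kernel selection is not measured
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeat):
        fn(q, q, q)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeat, torch.cuda.max_memory_allocated()


if __name__ == "__main__":
    q = torch.randn(8, 512, 8, 64, device="cuda")  # (B, L, H, E)
    for name, fn in [("einsum", einsum_attention), ("sdpa", sdpa_attention)]:
        seconds, peak = bench(fn, q)
        print(f"{name}: {seconds * 1e3:.2f} ms/call, peak {peak / 2**20:.1f} MiB")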

@phoeenniixx (Member) commented:

The CI/CD pipeline seems to partially fail, but it looks like all of the errors are due to the weight-loading code for the Temporal Fusion Transformer and are not related to the feature in this PR

Yes, the issue is known, although we still don't know its exact source (see #1998 and the discussion in the Discord thread here).

If your team is OK with me addressing these, I could do so in this PR or open a new one, whichever works for your team's code review process.

I'd prefer a new PR (stacked on this PR, or maybe after this PR is merged) to keep the "responsibilities" of the two PRs separate.
FYI @PranavBhatP @fkiraly @agobbifbk

@fkiraly (Collaborator) commented Nov 30, 2025

If anyone wants to benchmark the speed and memory consumption of the old and new attention backends, I can provide a script for that. I was not entirely sure whether such a script made sense to include as part of the package itself.

hm, that feels extremely useful! Could you put that into utils, in a separate PR? We could use that for performance monitoring in the CI.

codecov bot commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 89.74359% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@aae13ba).

Files with missing lines                                 Patch %   Lines
...h_forecasting/layers/_attention/_full_attention.py    88.88%    2 Missing ⚠️
pytorch_forecasting/models/timexer/sub_modules.py        89.47%    2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1997   +/-   ##
=======================================
  Coverage        ?   86.99%           
=======================================
  Files           ?      160           
  Lines           ?     9494           
  Branches        ?        0           
=======================================
  Hits            ?     8259           
  Misses          ?     1235           
  Partials        ?        0           
Flag      Coverage Δ
cpu       86.99% <89.74%> (?)
pytest    86.99% <89.74%> (?)

Flags with carried forward coverage won't be shown.


@phoeenniixx (Member) left a comment:
Thanks a lot for this PR @anasashb! This is really great.
I have added some comments:

  • Related to the docstrings: I think this is from the older part of the code, which is why it still doesn't use numpydoc style. I would really appreciate it if you could use numpydoc-style docstrings here. We still need to update the docstrings across the whole codebase :)
  • Can you also add some fixtures in _timexer_pkg and _timexer_pkg_v2? We are moving from standalone tests to a unified test framework, so now just adding test fixtures and some configs is enough to test whole models. You can see examples in the fixtures already present in the above files, and you can also look at any other model; all models now have this pkg class, which is used to test them. All you need to do is update get_base_test_params (for the v1 _timexer_pkg) and get_test_train_params (for the v2 _timexer_pkg_v2) so the fixtures cover the new attention mechanism, as in the sketch below.
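A hypothetical sketch of such a fixture entry (the real schema must follow the entries already present in _timexer_pkg / _timexer_pkg_v2):

class _TimeXerPkgSketch:
    """Illustrative only; real fixtures live in the pkg classes named above."""

    @classmethod
    def get_base_test_params(cls):
        # One config per attention backend, so the unified test framework
        # exercises both code paths (hypothetical entries).
        return [
            {"use_efficient_attention": False},
            {"use_efficient_attention": True},
        ]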

attention_dropout (float): Dropout rate for attention scores.
output_attention (bool): Whether to output attention weights."""
output_attention (bool): Whether to output attention weights.
efficient_attention (bool): Whether to use torch's native efficient

@phoeenniixx (Member) commented:

Should it be use_efficient_attention here?
Also, please explain what "efficient attention" means here and how it differs from einsum_attention.

Also, please use numpydoc-style docstrings. I think this part is from the older code, which is why it's still not updated. Updating to numpydoc style would be greatly appreciated!
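For reference, one possible numpydoc rendering of the quoted fragment (descriptions paraphrased; the note about attention weights reflects a general property of the fused kernel and is an assumption about how the PR documents it):

import torch.nn as nn


class FullAttention(nn.Module):
    """Full (vanilla) attention with an optional fused backend.

    Parameters
    ----------
    attention_dropout : float
        Dropout rate applied to the attention scores.
    output_attention : bool
        Whether to also return the attention weights.
    use_efficient_attention : bool, default=False
        If True, compute attention with
        ``torch.nn.functional.scaled_dot_product_attention`` instead of the
        explicit ``einsum`` implementation; this is faster and more
        memory-efficient, but the fused kernel does not expose attention
        weights.
    """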

attention_dropout (float): Dropout rate for attention scores.
output_attention (bool): Whether to output attention weights."""
output_attention (bool): Whether to output attention weights.
efficient_attention (bool): Whether to use torch's native efficient

@phoeenniixx (Member) commented:

same comment as above!

