
[BUG] Invalidate trace cache warning #6985

Closed
leachim opened this issue Jan 30, 2025 · 1 comment · Fixed by #7039
Labels: bug (Something isn't working), training
leachim commented Jan 30, 2025

Describe the bug
During training I receive the following warning multiple times per epoch:
Invalidate trace cache @ step 1 and module 1: cache has only 1 modules

This is a bit of an odd message. I have done some digging, and trace cache warnings seem to be a commonly reported issue here, but none of the reports match this one. Do the developers have any input on what could be at the root of it? I spent quite some time trying to figure it out, but couldn't really work it out. Any help would be appreciated!

Also, it would be really nice to have an option to suppress this warning in case it's not relevant. The way it's currently coded, that is quite difficult.
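Until such an option exists, one generic workaround is to drop the message at the Python logging level, assuming the warning is emitted through Python's `logging` module. This is a sketch, not DeepSpeed's own API; the logger name `"DeepSpeed"` is an assumption based on DeepSpeed's logging setup and may need adjusting to whichever logger actually emits the warning in your environment:

```python
import logging

class TraceCacheFilter(logging.Filter):
    """Drop 'Invalidate trace cache' messages; pass everything else through."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record.
        return "Invalidate trace cache" not in record.getMessage()

# Attach the filter to the logger that emits the warning.
# The logger name "DeepSpeed" is an assumption; adjust if needed.
logging.getLogger("DeepSpeed").addFilter(TraceCacheFilter())
```

If the message is printed directly to stdout rather than via `logging`, a filter like this will not catch it, so verify how it is emitted in your version first.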

ds_report output

[2025-01-30 13:15:38,204] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /cluster/home/michaes/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
 [WARNING]  gds requires the dev libaio .so object and headers but these were not found.
 [WARNING]  gds: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
 [WARNING]  using untested triton version (3.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/cluster/home/michaes/.miniforge/envs/hyperion/lib/python3.11/site-packages/torch']
torch version .................... 2.5.0+cu121
deepspeed install path ........... ['/cluster/home/michaes/.miniforge/envs/hyperion/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.16.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 250.73 GB
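Unrelated to the trace cache warning, the ds_report output above also warns that the Triton autotune cache sits on NFS. As that message itself suggests, pointing `TRITON_CACHE_DIR` at local storage avoids potential hangs on exit. The path below is purely illustrative; use any non-NFS directory on your system:

```shell
# Point the Triton autotune cache at node-local storage before launching training.
# The path is illustrative; any non-NFS directory works.
export TRITON_CACHE_DIR="/tmp/$USER/triton_cache"
mkdir -p "$TRITON_CACHE_DIR"
```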
leachim added the bug (Something isn't working) and training labels on Jan 30, 2025

xuu416 commented Feb 8, 2025

Same

tjruwase self-assigned this on Feb 13, 2025
github-merge-queue bot pushed a commit that referenced this issue Feb 18, 2025
Make trace cache warnings configurable, and disabled by default. 

Fix #6985, #4081, #5033, #5006, #5662

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
The same commit was subsequently pushed to other branches and forks referencing this issue:

- Yejing-Lai (Yejing-Lai/DeepSpeed), Feb 24, 2025
- gyou2021 (gyou2021/DeepSpeed), Feb 28, 2025 (co-signed by gyou2021)
- tohtana, Feb 28, 2025 (co-signed by Masahiro Tanaka)
- ys950902 (ys950902/DeepSpeed), Mar 6, 2025 (co-signed by yisheng)