
Update range of gpu arch #23309

Merged: 4 commits from yifanl/update_arch into main on Jan 14, 2025

Conversation

@yf711 (Contributor) commented Jan 9, 2025

Description

* Remove deprecated gpu archs to control nuget/python package size (the latest TRT supports sm75 Turing and newer archs)
* Add 90 to support the Blackwell series in the next release (86;89 not included, as adding them would rapidly increase package size)

| arch_range | Python-cuda12 | Nuget-cuda12 |
| -------------- | --------------------------- | --------------------------- |
| 60;61;70;75;80 | Linux: 279MB Win: 267MB | Linux: 247MB Win: 235MB |
| 75;80 | Linux: 174MB Win: 162MB | Linux: 168MB Win: 156MB |
| **75;80;90** | **Linux: 299MB Win: 277MB** | **Linux: 294MB Win: 271MB** |
| 75;80;86;89 | [Linux: MB Win: 390MB](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=647457&view=results) | Linux: 416MB Win: 383MB |
| 75;80;86;89;90 | [Linux: MB Win: 505MB](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=646536&view=results) | Linux: 541MB Win: 498MB |
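
For context, here is a minimal sketch of how an arch range such as `75;80;90` is expressed through CMake's native CUDA architecture handling (`CMAKE_CUDA_ARCHITECTURES` is the real variable; the tiny project around it is only illustrative, not ORT's actual build files):

```cmake
# Illustrative stand-in for the real build, assuming CMake >= 3.18.
cmake_minimum_required(VERSION 3.18)
project(arch_size_demo LANGUAGES CXX CUDA)

# Semicolon-separated compute capabilities. Each plain entry makes nvcc
# embed both SASS (sm_XX) and PTX (compute_XX) for that arch, which is
# why every extra entry grows the fat binary and hence the package.
set(CMAKE_CUDA_ARCHITECTURES 75 80 90)

add_library(demo_kernels STATIC kernels.cu)  # hypothetical source file
```

The same list can be passed on the command line via `-DCMAKE_CUDA_ARCHITECTURES="75;80;90"`, which is how a packaging pipeline would typically switch between the rows of the table above.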

Motivation and Context

Callout: while adding sm90 support, the cuda11.8+cudnn8 build will be dropped in the coming ORT release, as that build has issues with Blackwell (mentioned in the comments) and demand for CUDA 11 is minor, according to the internal ort-cuda11 repo.

snnn previously approved these changes Jan 9, 2025
@tianleiwu (Contributor):

If we drop the older archs, shall we also drop the ORT package for CUDA 11.8 in the next release?

@snnn (Member) commented Jan 9, 2025

> If we drop the older archs, shall we also drop the ORT package for CUDA 11.8 in the next release?

I highly recommend doing so. We now have only two people working on the build pipelines, so we should focus on the main targets.

@yf711 requested a review from jywu-msft, January 10, 2025 00:32
@snnn (Member) commented Jan 10, 2025

/azp run Win_TRT_Minimal_CUDA_Test_CI

Azure Pipelines successfully started running 1 pipeline(s).

snnn previously approved these changes Jan 10, 2025
@yf711 (Contributor, Author) commented Jan 11, 2025

After testing, adding sm90 to the build arch list causes issues for the cuda11.8+cudnn8 alternative package build on Windows, likely because cudnn8 is deprecated on Blackwell. The cuda12 package build is not affected.

To support sm90, we can either support cuda12 only, or update the current cuda11.8 environment to cudnn9.

@snnn (Member) commented Jan 11, 2025

CUDA 11.8 with cudnn9 doesn't work; I tried. I hit the following compilation error when compiling cudnn_flash_attention.cu:

```text
/build/Release/_deps/cudnn_frontend-src/include/cudnn_frontend/graph_interface.h:519:27:   required from here
/build/Release/_deps/cudnn_frontend-src/include/cudnn_frontend/thirdparty/nlohmann/json.hpp:9132:68: error: static assertion failed: Missing/invalid function: bool boolean(bool)
 9132 |     static_assert(is_detected_exact<bool, boolean_function_t, SAX>::value,
```

Therefore, I suggest giving up on that.

@yf711 merged commit 5c3c764 into main on Jan 14, 2025 (130 of 133 checks passed)
@yf711 deleted the yifanl/update_arch branch on January 14, 2025 at 22:27
@tianleiwu (Contributor):

BTW, Blackwell GPUs have compute capabilities 10.0 (B100 and B200) and 12.0 (B40, RTX 5090, etc.). To support the RTX 5090 and friends, we can add 120 to CMAKE_CUDA_ARCHITECTURES and upgrade CUDA to 12.8 for the CUDA EP.
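
A minimal sketch of that suggestion in CMake, assuming CMake >= 3.18 (`CMAKE_CUDA_COMPILER_VERSION` is standard CMake; the gating logic itself is only illustrative):

```cmake
# sm_120 (consumer Blackwell) is only understood by nvcc from CUDA 12.8
# onward, so gate the new entry on the detected toolkit version.
if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.8")
  list(APPEND CMAKE_CUDA_ARCHITECTURES 120)
endif()
```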

@jywu-msft (Member) commented Jan 24, 2025

> BTW, Blackwell GPUs have compute capabilities 10.0 (B100 and B200) and 12.0 (B40, RTX 5090, etc.). To support the RTX 5090 and friends, we can add 120 to CMAKE_CUDA_ARCHITECTURES and upgrade CUDA to 12.8 for the CUDA EP.

CUDA 12.8 would require a new driver, right? Not sure how easy that would be to update soon, @snnn? And if we update CUDA, we would need to update TensorRT as well for Blackwell support.

@gedoensmax (Contributor):

cuDNN (9.7), TensorRT (10.8), and CUDA (12.8) are the first versions to support Blackwell. All of these went live this week and are now publicly available. Driver requirements are the same as for any other CUDA 12 release thanks to minor version compatibility, which has been a feature since CUDA 11.

I would say targets older than Turing (75) can certainly be dropped, or we could support them by e.g. shipping PTX for an old architecture, though I am not sure how long JIT compilation in the driver would take for all the ORT kernels. Besides that, would it make sense to differentiate between Windows and Linux? Are there known sm80 (A100) and sm90 (H100) customers on Windows? Otherwise ORT could trim the Windows package to the consumer archs (75, 86, 89, 120, and maybe PTX for 120 as forward compatibility). More details on Blackwell + CUDA are in the migration docs.
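
A hedged sketch of that trimming idea in CMake terms (the arch list is the one suggested above; the `-real` suffix meaning SASS-only is standard CMake >= 3.18 behavior, and a plain `120` entry adds PTX alongside the SASS):

```cmake
# Consumer-focused Windows package: SASS only (-real) for shipped archs,
# plus plain 120 so compute_120 PTX is embedded as forward compatibility
# for architectures newer than Blackwell.
set(CMAKE_CUDA_ARCHITECTURES 75-real 86-real 89-real 120)
```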

@snnn (Member) commented Jan 24, 2025

@jywu-msft, how about we do it for Linux first? 80% of the onnxruntime-gpu downloads are from Linux.

I have a draft change that moves the Linux GPU tests from A10 to A100 (which uses the official CUDA driver): https://github.com/microsoft/onnxruntime/tree/snnn/replace_pool. But it hit some test failures; I just asked @tianleiwu for help.

Then:

1. Upgrade all our pipelines to CUDA 12.4 (not 12.8 yet). Our Windows machines use Nvidia driver 550, which should be good for CUDA 12.4. We also have a cudnn frontend issue that needs to be addressed when upgrading cudnn: #23244 (comment).
2. Upgrade Visual Studio to the latest.
3. Upgrade the cudnn frontend to the latest, which needs the latest Visual Studio.
4. Continue the upgrade to CUDA 12.8 for the Linux build.

@yf711 (Contributor, Author) commented Jan 30, 2025

Hi @gedoensmax, do you have any suggestions for reducing the ORT package size?

| arch_range | Python-cuda12 | Nuget-cuda12 |
| -------------- | ----------------------- | ----------------------- |
| 75;80 | Linux: 174MB Win: 162MB | Linux: 168MB Win: 156MB |
| 75;80;90 | Linux: 299MB Win: 277MB | Linux: 294MB Win: 271MB |
| 75;80;86;89 | Linux: MB Win: 390MB | Linux: 416MB Win: 383MB |
| 75;80;86;89;90 | Linux: MB Win: 505MB | Linux: 541MB Win: 498MB |

Comparing the 1st & 2nd rows, simply adding sm90 increases the package size by roughly 70% (174 MB to 299 MB for the Linux wheel), and we are hitting the size limit set by nuget.org.

@gedoensmax (Contributor):

@yf711 CUDA binary size is a general problem that is not easily solved. There is a reason cuDNN now generates/compiles kernels dynamically: it allows optimizations you usually only get at compile time while also reducing binary size.

Training kernels are not included in these packages, right? Did you conduct any analysis of which kernels take how much space? Nothing comes to mind that would easily reduce binary size; I can take a closer look tomorrow or early next week if needed.

@gedoensmax (Contributor):

Quick update after reading up a little more. You can expect SASS code built for 8.6 to run on 8.9; see this doc. That even goes for 80: it will run on 86 and 89, but since 86 and 89 have higher fp32 throughput, you will not be able to exploit that extra throughput. Because of this, 75;80;90 will work on all architectures from Turing onward. Does the 90 entry (or another) include PTX? Without it, neither Blackwell datacenter nor consumer GPUs will work.
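
For reference, whether PTX gets embedded depends on how each entry is spelled in `CMAKE_CUDA_ARCHITECTURES`; a quick illustrative sketch of the standard CMake (>= 3.18) semantics:

```cmake
# What each spelling of "90" makes nvcc embed:
set(CMAKE_CUDA_ARCHITECTURES 90)            # SASS for sm_90 *and* PTX for compute_90
# set(CMAKE_CUDA_ARCHITECTURES 90-real)     # SASS only: smallest, no forward compatibility
# set(CMAKE_CUDA_ARCHITECTURES 90-virtual)  # PTX only: newer GPUs JIT-compile it at load time
```

So a plain `90` in the list carries compute_90 PTX that newer architectures can generally JIT, at the cost of the extra embedded PTX.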

yf711 added a commit that referenced this pull request Feb 21, 2025
### Description
Action items:

* ~~Add LTO support when CUDA 12.8 & Relocatable Device Code (RDC)/separate_compilation are enabled, to reduce potential perf regression.~~ LTO needs further testing.

* Reduce nuget/whl package size by selecting devices & their CUDA binary/PTX assembly during the ORT build:
  * make sure the ORT nuget package is < 250 MB and the python wheel is < 300 MB

* Suggest creating an internal repo to publish a pre-built package with Blackwell sm100/120 SASS and sm120 PTX, like [onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell), since the package size will be much larger than the nuget/pypi repo limits.

* Considering the most popular datacenter/consumer GPUs, here is the cuda_arch list for Linux/Windows. With this change, perf on the next ORT release is optimal on Linux with Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py whl), and H100 (sm90); on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX 2080 (sm75), RTX 3090 (sm86), and RTX 4090 (sm89). Other newer-architecture GPUs are compatible.

| OS | cmake_cuda_architecture | package size |
| ------------- | -------------------------------------------------- | ------------ |
| Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
| Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
| Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
| Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |

* [TODO] Validate on the Windows CUDA CI pipeline with cu128

### Motivation and Context
Address the discussed topics in #23562 and #23309.

#### Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |

### Reference
- https://developer.nvidia.com/cuda-gpus
- [Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
- [PTX Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
- [Application Compatibility on the NVIDIA Ada GPU Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
- [Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA 12.8, PyTorch, TensorRT, and Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

### Tracking some failed/unfinished experiments to control package size
1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` doesn't help much with package size.
2. ORT packaging uses 7z to pack the package, which can only use zip's deflate compression. In that format, setting the compression ratio to ultra (`-mx=9`) doesn't help much to control size (7z's LZMA compression is much better, but not supported by nuget/pypi).
3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library size by ~50% (perf not tested yet). This needs further validation; see the sketch below.
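
For reference, a minimal sketch of what the `lto_xx` experiment looks like when driven through CMake (the target name matches ORT's CUDA EP library from the tables above; the exact flag set ORT would need is an assumption, since device LTO also requires separable compilation):

```cmake
# Device LTO experiment: compile CUDA sources to NVVM IR (-dlto) with
# relocatable device code, then optimize across translation units at
# the device-link step. Requires CUDA 11.2+ and CMake >= 3.18.
set_target_properties(onnxruntime_providers_cuda PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON)
target_compile_options(onnxruntime_providers_cuda PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:-dlto>)
target_link_options(onnxruntime_providers_cuda PRIVATE
    $<DEVICE_LINK:-dlto>)
```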
guschmue pushed a commit that referenced this pull request Mar 6, 2025
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025