
Commit fae5d02

add poetry support for tensorflow maskrcnn; update tensorflow resnet50v1_5 inference (#2679)
* add poetry support for tensorflow maskrcnn; update tensorflow resnet50v1_5 inference
* fix: correct database file path in IP leak detection workflow
* fix: update database entries for IP leak detection workflow
* add smoke tests for Mask R-CNN and ResNet50 models
* remove unused smoke.yaml files for Mask R-CNN and ResNet50 models
* fix: update Dockerfiles to set up poetry environment and add to PATH
* refactor: streamline Python dependency installation in Dockerfiles and setup.sh
* refactor: consolidate Python dependency installation scripts across Dockerfiles and setup.sh files
* docker tf flex maskrcnn (#2682)
* update tf maskrcnn gpu
* modify poetry files
* update poetry
* remove maskrcnn flex gpu workloads
* remove compose references
* refactor: remove unnecessary activation of poetry environment in dependency installation script
* tf rn50 training gpu update (#2685)
* update dockerfile
* Update tf-max-series-resnet50v1-5-training.Dockerfile
* TF RN50 Inference (#2686)
* update dockerfile
* update tf rn50 inference
* update tests and doc
* Update smoke.yaml
* Update tests.yaml
* Update tf-max-series-resnet50v1-5-training.Dockerfile
* Update smoke.yaml
* Update tests.yaml
* Update docker-compose.yml
* update smoke and tests
* Revert "update smoke and tests"
  This reverts commit a06b59f24eb42a2c7455e067025aab59a2832420.
* refine dockerfile maskrcnn training (#2684)
* refine dockerfile
* Update pyproject.toml
* update poetry files and add smoke test
* add pythonpath
* update tests
* add pybind
* fix service name
* remove rn50 inference
* remove rn50 tf inf smoke file
* update bert_large build

---------

Co-authored-by: Srikanth Ramakrishna <[email protected]>
1 parent e9a2a31 commit fae5d02

30 files changed (+3860 -1327 lines)

README.md (+1)

@@ -139,6 +139,7 @@ For best performance on Intel® Data Center GPU Max Series, please check the [li
 | [DLRM v2](https://arxiv.org/abs/1906.00091) | PyTorch | Training | Max Series | [FP32 TF32 BF16](/models_v2/pytorch/torchrec_dlrm/training/gpu/README.md)
 | [3D-Unet](https://arxiv.org/pdf/1606.06650.pdf) | PyTorch | Inference | Max Series | [FP16 INT8 FP32](/models_v2/pytorch/3d_unet/inference/gpu/README.md) |
 | [3D-Unet](https://arxiv.org/pdf/1606.06650.pdf) | TensorFlow | Training | Max Series | [BFloat16 FP32](/models_v2/tensorflow/3d_unet/training/gpu/README.md) |
+| [Stable Diffusion](https://arxiv.org/pdf/2112.10752.pdf) | PyTorch | Inference | Max Series, Arc Series | [FP16 FP32](/models_v2/pytorch/stable_diffusion/inference/gpu/README.md) |
 | [Mask R-CNN](https://arxiv.org/pdf/1703.06870.pdf) | TensorFlow | Training | Max Series | [FP32 BFloat16](/models_v2/tensorflow/maskrcnn/training/gpu/README.md) |
 | [RNN-T](https://arxiv.org/abs/1211.3711) | PyTorch | Inference | Max Series | [FP16 BF16 FP32](/models_v2/pytorch/rnnt/inference/gpu/README.md) |
 | [RNN-T](https://arxiv.org/abs/1211.3711) | PyTorch | Training | Max Series | [FP32 BF16 TF32](/models_v2/pytorch/rnnt/training/gpu/README.md) |

docker/tensorflow/bert_large/training/gpu/tf-max-series-bert-large-training.Dockerfile (+2 -4)

@@ -36,10 +36,8 @@ WORKDIR /workspace/tf-max-series-bert-large-training/models
 
 COPY models_v2/tensorflow/bert_large/training/gpu .
 
-RUN git clone https://github.com/titipata/pubmed_parser && \
-    pip install ./pubmed_parser
-
-RUN python -m pip install --no-cache-dir -r requirements.txt
+COPY models_v2/common/install-python-dependencies.sh .
+RUN ./install-python-dependencies.sh
 
 RUN git clone https://github.com/NVIDIA/DeepLearningExamples.git && \
     cd DeepLearningExamples/TensorFlow2/LanguageModeling/BERT && \
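
The same COPY-and-run pattern appears in the Mask R-CNN and ResNet50 Dockerfiles below. It relies on the script carrying its execute bit through the build context (Docker's COPY preserves the source file's mode); if that bit were ever lost in the repository, invoking the script through the interpreter would be a safe equivalent:

    # Equivalent invocation that does not depend on the execute bit
    bash ./install-python-dependencies.sh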

docker/tensorflow/docker-compose.yml (+1 -1)

@@ -34,7 +34,7 @@ services:
     image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-image-recognition-tf-max-gpu-resnet50v1-5-training
     cap_drop:
       - NET_RAW
-  bert-large-training-gpu:
+  bert_large-training-gpu:
     build:
       dockerfile: docker/tensorflow/bert_large/training/gpu/tf-max-series-bert-large-training.Dockerfile
     extends: resnet50v1_5-training-gpu
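
For reference, a hypothetical way to exercise the renamed service locally, assuming REGISTRY (no default in the compose file) and optionally GITHUB_RUN_NUMBER are exported as the image tag interpolation expects:

    export REGISTRY=my-registry.example.com   # assumption: any registry prefix you use for local builds
    export GITHUB_RUN_NUMBER=0                # optional; the compose file falls back to 0
    # Build only the renamed bert_large service from the repository root
    docker compose -f docker/tensorflow/docker-compose.yml build bert_large-training-gpu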
docker/tensorflow/maskrcnn/training/gpu/smoke.yaml (new file, +20)

@@ -0,0 +1,20 @@
+single-tile-bf16-training:
+  img: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-image-segmentation-tf-max-gpu-maskrcnn-training
+  ipc: host
+  cmd: mpirun -np 1 -prepend-rank -ppn 1 bash run_model.sh
+  device: ["/dev/dri"]
+  env:
+    PRECISION: bfloat16
+    EPOCHS: '1'
+    STEPS_PER_EPOCH: '20'
+    BATCH_SIZE: '4'
+    MULTI_TILE: 'False'
+    OUTPUT_DIR: /tmp
+    DATASET_DIR: /tf_dataset/dataset/coco_dataset/COCO2017_training_data/
+  volumes:
+    - src: /tf_dataset/dataset/coco_dataset/COCO2017_training_data/
+      dst: /tf_dataset/dataset/coco_dataset/COCO2017_training_data/
+    - src: /dev/dri/by-path
+      dst: /dev/dri/by-path
+    - src: /tmp
+      dst: /tmp
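
How the CI harness consumes these fields is not shown in this commit; assuming it maps them more or less directly onto a container run, the smoke entry would correspond roughly to something like the following (an assumed mapping, not the actual harness):

    docker run --rm --ipc=host --device /dev/dri \
      -v /tf_dataset/dataset/coco_dataset/COCO2017_training_data/:/tf_dataset/dataset/coco_dataset/COCO2017_training_data/ \
      -v /dev/dri/by-path:/dev/dri/by-path \
      -v /tmp:/tmp \
      -e PRECISION=bfloat16 -e EPOCHS=1 -e STEPS_PER_EPOCH=20 -e BATCH_SIZE=4 \
      -e MULTI_TILE=False -e OUTPUT_DIR=/tmp \
      -e DATASET_DIR=/tf_dataset/dataset/coco_dataset/COCO2017_training_data/ \
      ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-image-segmentation-tf-max-gpu-maskrcnn-training \
      mpirun -np 1 -prepend-rank -ppn 1 bash run_model.sh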

docker/tensorflow/maskrcnn/training/gpu/tests.yaml (+4 -4)

@@ -2,7 +2,7 @@ single-tile-bf16-training:
   img: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-image-segmentation-tf-max-gpu-maskrcnn-training
   ipc: host
   cmd: mpirun -np 1 -prepend-rank -ppn 1 bash run_model.sh
-  device: /dev/dri
+  device: ["/dev/dri"]
   env:
     PRECISION: bfloat16
     EPOCHS: '1'
@@ -21,7 +21,7 @@ single-tile-bf16-training:
 multi-tile-bf16-training:
   img: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-image-segmentation-tf-max-gpu-maskrcnn-training
   cmd: bash run_model.sh
-  device: /dev/dri
+  device: ["/dev/dri"]
   env:
     PRECISION: bfloat16
     EPOCHS: '1'
@@ -41,7 +41,7 @@ single-tile-fp32-training:
   img: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-image-segmentation-tf-max-gpu-maskrcnn-training
   ipc: host
   cmd: mpirun -np 1 -prepend-rank -ppn 1 bash run_model.sh
-  device: /dev/dri
+  device: ["/dev/dri"]
   env:
     PRECISION: fp32
     EPOCHS: '1'
@@ -60,7 +60,7 @@ single-tile-fp32-training:
 multi-tile-fp32-training:
   img: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-image-segmentation-tf-max-gpu-maskrcnn-training
   cmd: bash run_model.sh
-  device: /dev/dri
+  device: ["/dev/dri"]
   env:
     PRECISION: fp32
     EPOCHS: '1'

docker/tensorflow/maskrcnn/training/gpu/tf-max-series-maskrcnn-training.Dockerfile (+4 -4)

@@ -39,10 +39,9 @@ WORKDIR /workspace/tf-max-series-maskrcnn-inference-training/models
 
 COPY models_v2/tensorflow/maskrcnn/training/gpu .
 
-RUN python -m pip install opencv-python-headless \
-    pybind11 \
-    pycocotools \
-    -e "git+https://github.com/NVIDIA/dllogger#egg=dllogger"
+COPY models_v2/common/install-python-dependencies.sh .
+
+RUN ./install-python-dependencies.sh
 
 RUN git clone https://github.com/NVIDIA/DeepLearningExamples.git && \
     cd DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN && \
@@ -54,6 +53,7 @@ ENV PATH=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/bin:/opt/intel/oneapi/m
 ENV CCL_ROOT=/opt/intel/oneapi/ccl/2021.13
 ENV I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.13
 ENV FI_PROVIDER_PATH=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/lib/prov:/usr/lib/x86_64-linux-gnu/libfabric
+ENV PYTHONPATH=/root/.cache/pypoetry/virtualenvs/tf-maskrcnn-trn-gpu-3khfSOyS-py3.10/lib/python3.10/site-packages:$PYTHONPATH
 
 COPY LICENSE licenses/LICENSE
 COPY third_party licenses/third_party
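
The PYTHONPATH entries added here and in the ResNet50 Dockerfile below hard-code Poetry's hashed virtualenv directory. If that hash ever changes (for example after renaming the project in pyproject.toml), one way to recheck it from inside the built image is Poetry's own query command (a hypothetical verification step, assuming poetry is on PATH and the project directory is the WORKDIR):

    # Print the virtualenv Poetry created for the project in the current directory;
    # its lib/python3.10/site-packages subdirectory is what PYTHONPATH needs to point at.
    poetry env info --path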

docker/tensorflow/resnet50v1_5/training/gpu/tf-max-series-resnet50v1-5-training.Dockerfile (+3 -4)

@@ -31,14 +31,12 @@ RUN apt-get update && \
     intel-oneapi-ccl=${CCL_VER} && \
     rm -rf /var/lib/apt/lists/*
 
-RUN curl -sSL https://install.python-poetry.org | python3 -
 WORKDIR /workspace/tf-max-series-resnet50v1-5-training/models
 
 COPY models_v2/tensorflow/resnet50v1_5/training/gpu .
 
-RUN /root/.local/bin/poetry install
-
-RUN /root/.local/bin/poetry add intel-optimization-for-horovod
+COPY models_v2/common/install-python-dependencies.sh .
+RUN ./install-python-dependencies.sh
 
 RUN mkdir -p resnet50 && \
     cd resnet50 && \
@@ -57,6 +55,7 @@ ENV PATH=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/bin:/opt/intel/oneapi/m
 ENV CCL_ROOT=/opt/intel/oneapi/ccl/2021.13
 ENV I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.13
 ENV FI_PROVIDER_PATH=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/lib/prov:/usr/lib/x86_64-linux-gnu/libfabric
+ENV PYTHONPATH=/root/.cache/pypoetry/virtualenvs/models-v2-tensorflow-resnet50v1-5-training-WKRPeTfh-py3.10/lib/python3.10/site-packages:$PYTHONPATH
 
 COPY LICENSE licenses/LICENSE
 COPY third_party licenses/third_party
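
The explicit `poetry add intel-optimization-for-horovod` step is gone, which implies the package now lives in the project's pyproject.toml and is pulled in by the shared install script. A hypothetical sanity check from inside the built image, assuming poetry is on PATH and the project directory is the WORKDIR:

    # Confirm the Horovod optimization package was resolved by `poetry install`
    poetry show intel-optimization-for-horovod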
models_v2/common/install-python-dependencies.sh (new file, +11)

@@ -0,0 +1,11 @@
+#!/bin/bash
+
+if [ -f "./pyproject.toml" ]; then
+    # Download and run the Poetry installation script
+    curl -sSL https://install.python-poetry.org | python3 -
+    export PATH="~/.local/bin:$PATH"
+    # Install the pypi dependencies using poetry
+    poetry install
+else
+    echo "No pypi dependencies defined with poetry."
+fi
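
One detail of the script worth flagging: bash does not perform tilde expansion inside double quotes, so the PATH entry above is the literal string ~/.local/bin rather than the invoking user's home directory. The script still works wherever poetry is already reachable through the image's existing PATH (the commit message mentions Dockerfile PATH updates); the portable spelling would be:

    # Portable form: let the shell expand the home directory instead of a quoted tilde
    export PATH="$HOME/.local/bin:$PATH"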
