|[bert-large-uncased-qa](https://catalog.ngc.nvidia.com/orgs/nvidia/models/bert_pyt_ckpt_large_qa_squad11_amp/files)| Large model fine-tuned on SQuAD v1.1 |
|[bert-large-uncased-sst2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_large_ft_sst2_amp)|Large model fine-tuned on GLUE SST-2 |
|[bert-large-uncased-pretrained](https://catalog.ngc.nvidia.com/orgs/nvidia/models/bert_pyt_ckpt_large_pretraining_amp_lamb/files?version=20.03.0)| Large model pretrained checkpoint on generic corpora such as Wikipedia |
|[bert-base-uncased-qa](https://catalog.ngc.nvidia.com/orgs/nvidia/models/bert_pyt_ckpt_base_qa_squad11_amp/files)| Base model fine-tuned on SQuAD v1.1 |
|[bert-base-uncased-sst2](https://catalog.ngc.nvidia.com/orgs/nvidia/models/bert_pyt_ckpt_base_ft_sst2_amp_128/files)| Base model fine-tuned on GLUE SST-2 |
|[bert-base-uncased-pretrained](https://catalog.ngc.nvidia.com/orgs/nvidia/models/bert_pyt_ckpt_base_pretraining_amp_lamb/files)| Base model pretrained checkpoint on generic corpora such as Wikipedia |
|[bert-dist-4L-288D-uncased-qa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_4l_288d_qa_squad11_amp/files)| 4-layer distilled model fine-tuned on SQuAD v1.1 |
|[bert-dist-4L-288D-uncased-sst2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_4l_288d_ft_sst2_amp/files)| 4-layer distilled model fine-tuned on GLUE SST-2 |
|[bert-dist-4L-288D-uncased-pretrained](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_4l_288d_pretraining_amp/files)| 4-layer distilled model pretrained checkpoint on generic corpora such as Wikipedia |
|[bert-dist-6L-768D-uncased-qa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_6l_768d_qa_squad11_amp/files)| 6-layer distilled model fine-tuned on SQuAD v1.1 |
|[bert-dist-6L-768D-uncased-sst2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_6l_768d_ft_sst2_amp/files)| 6-layer distilled model fine-tuned on GLUE SST-2 |
|[bert-dist-6L-768D-uncased-pretrained](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_6l_768d_pretraining_amp/files)| 6-layer distilled model pretrained checkpoint on generic corpora such as Wikipedia |
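
A checkpoint can be pulled with the NGC CLI, as in the minimal sketch below (the `:1` version tag is an assumption; check the model page for available versions):

```bash
# Minimal sketch: download the 4-layer SQuAD checkpoint with the NGC CLI.
# The ":1" version tag is an assumption; list versions on the model page first.
ngc registry model download-version "nvidia/dle/bert_pyt_ckpt_distilled_4l_288d_qa_squad11_amp:1" \
    --dest ./checkpoints
```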
PyTorch/LanguageModeling/BERT/distillation/README.md:

To run end-to-end distillation: `bash run_e2e_distillation.sh`
`run_e2e_distillation.sh` contains 8 command lines to obtain fully distilled BERT models for SQuAD v1.1 and SST-2. The distilled BERT model has the config (N=4, D=312, Di=1200, H=12). To distill knowledge into models of different sizes, a new `BERT_4L_312D/config.json` can be created (see the sketch below) and passed as a starting point in `run_e2e_distillation.sh`.
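
A minimal sketch of creating such a config, assuming the standard BERT `config.json` schema; values other than N, D, Di, and H are illustrative BERT defaults:

```bash
# Minimal sketch: write a student config for N=4 layers, D=312 hidden,
# Di=1200 intermediate, H=12 attention heads. Non-architecture values
# (dropout, vocab size, etc.) are illustrative BERT defaults.
mkdir -p BERT_4L_312D
cat > BERT_4L_312D/config.json <<'EOF'
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
EOF
```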
`run_e2e_distillation.sh` contains the following:
- Phase 1 distillation: generic distillation on the Wikipedia dataset with a maximum sequence length of 128. `--input_dir` needs to be updated accordingly.
- Phase 2 distillation: generic distillation on the Wikipedia dataset with a maximum sequence length of 512. `--input_dir` needs to be updated accordingly.
*Task-specific distillation: SQuAD v1.1* (maximum sequence length 384)
Note: Task-specific distillation for SST-2 uses the output checkpoint of phase 1 distillation as its starting point, whereas task-specific distillation for SQuAD v1.1 starts from the output checkpoint of phase 2 distillation, as sketched below.
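
A minimal sketch of that hand-off, with hypothetical output directories (substitute the paths your runs actually produce):

```bash
# Hypothetical output locations from the two generic distillation phases.
PHASE1_CKPT=results/phase1_seq128   # seq-len 128 output: starting point for SST-2
PHASE2_CKPT=results/phase2_seq512   # seq-len 512 output: starting point for SQuAD v1.1
```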
One can download the general and task-specific distilled checkpoints from NGC:
|[bert-dist-4L-288D-uncased-qa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_4l_288d_qa_squad11_amp/files)| 4-layer distilled model fine-tuned on SQuAD v1.1 |
|[bert-dist-4L-288D-uncased-sst2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_4l_288d_ft_sst2_amp/files)| 4-layer distilled model fine-tuned on GLUE SST-2 |
|[bert-dist-4L-288D-uncased-pretrained](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_4l_288d_pretraining_amp/files)| 4-layer distilled model pretrained checkpoint on generic corpora such as Wikipedia |
|[bert-dist-6L-768D-uncased-qa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_6l_768d_qa_squad11_amp/files)| 6-layer distilled model fine-tuned on SQuAD v1.1 |
|[bert-dist-6L-768D-uncased-sst2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_6l_768d_ft_sst2_amp/files)| 6-layer distilled model fine-tuned on GLUE SST-2 |
|[bert-dist-6L-768D-uncased-pretrained](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_pyt_ckpt_distilled_6l_768d_pretraining_amp/files)| 6-layer distilled model pretrained checkpoint on generic corpora such as Wikipedia |
The following results were obtained on an NVIDIA DGX-1 with 32GB GPUs, using the pytorch:20.12-py3 NGC container.
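
A minimal sketch of launching that container (the mount point is an assumption):

```bash
# Launch the NGC PyTorch 20.12 container with all GPUs visible and the
# current checkout mounted at /workspace/bert (mount path is illustrative).
docker run --gpus all -it --rm \
  -v "$PWD":/workspace/bert \
  nvcr.io/nvidia/pytorch:20.12-py3
```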
*Accuracy achieved and end-to-end time to train on NVIDIA DGX-1 with 32GB:*
| Student | Task | SubTask | Time (hrs) | Total Time (hrs) | Accuracy | BERT Base Accuracy |
*[NVIDIA Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/), [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/), or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU