From c7e3b7c11fb94131be9b48a8e3d510589addc3ce Mon Sep 17 00:00:00 2001
From: Haodong Duan <34324155+kennymckormick@users.noreply.github.com>
Date: Mon, 24 Aug 2020 22:08:47 +0800
Subject: [PATCH] Add models for OmniSource (#208)

---
 MODEL_ZOO.md | 42 ++++++++++++++++++++++++++----------------
 README.md    | 16 +++++++++-------
 2 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/MODEL_ZOO.md b/MODEL_ZOO.md
index 09ca4719..9458b83a 100644
--- a/MODEL_ZOO.md
+++ b/MODEL_ZOO.md
@@ -2,15 +2,19 @@

 ## Action Recognition

-For action recognition, unless specified, models are trained on Kinetics-400. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. For TSN, we also train it on UCF-101, initialized with ImageNet pretrained weights. We also provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models with * are converted from other repos(including [VMZ](https://github.com/facebookresearch/VMZ) and [kinetics_i3d](https://github.com/deepmind/kinetics-i3d)), others are trained by ourselves. If you can not reproduce our testing results due to dataset unalignment, please submit a request at [get validation data](https://forms.gle/jmBiCDJButrLwpgc9).
+For action recognition, unless specified, models are trained on Kinetics-400. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. For TSN, we also train it on UCF-101, initialized with ImageNet pretrained weights. We also provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repos (including [VMZ](https://github.com/facebookresearch/VMZ) and [kinetics_i3d](https://github.com/deepmind/kinetics-i3d)); the others are trained by ourselves.
+
+For data preprocessing, we find that resizing the short edge of videos to 256px is generally a better choice than resizing them to a fixed 340x256 resolution, since the aspect ratio is preserved. Most of our Kinetics-400 models are trained on videos whose short edge is resized to 256px. However, some legacy Kinetics-400 models were trained on videos with a fixed 340x256 resolution; we use the mark $^{340\times256}$ to indicate such legacy models.
+
+If you cannot reproduce our testing results due to dataset misalignment, please submit a request at [get validation data](https://forms.gle/jmBiCDJButrLwpgc9).

 ### TSN

 #### Kinetics

 | Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
-| :------: | :--------: | :---------: | :--------: | :------------------------------------: | :------------------------------------: | -------------------------------------- |
-| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth) |
+| :------: | :--------: | :---------: | :--------: | :------------------------------------: | :------------------------------------: | :------------------------------------: |
+| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth)$^{340\times256}$ |

 #### UCF101

@@ -44,7 +48,7 @@ For action recognition, unless specified, models are trained on Kinetics-400. Th
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| :--------: | :--------: | :----------: | :---: | :---: | :---: | :----------------------------------------------------------: |
| RGB | ImageNet | Inception-V1 | 64x1 | 71.1 | 89.3 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics400_se_rgb_inception_v1_seg1_f64s1_imagenet_deepmind-9b8e02b3.pth)* |
-| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics_rgb_r50_c3d_inflated3x1x1_seg1_f32s2_f32s2-b93cc877.pth) |
+| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics_rgb_r50_c3d_inflated3x1x1_seg1_f32s2_f32s2-b93cc877.pth)$^{340\times256}$ |
| Flow | ImageNet | Inception-V1 | 64x1 | 63.4 | 84.9 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics_flow_inception_v1_seg1_f64s1_imagenet_deepmind-92059771.pth)* |
| Two-Stream | ImageNet | Inception-V1 | 64x1 | 74.2 | 91.3 | / |

@@ -70,12 +74,12 @@ For action recognition, unless specified, models are trained on Kinetics-400. Th
| RGB | ImageNet | ResNet50 | 4x16 | 75.9 | 92.3 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/slowfast_kinetics400_se_rgb_r50_4x16_finetune-4623cf03.pth) |

### R(2+1)D
-| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
-| :------: | :--------: | :------: | :---: | :---: | :---: | :----------------------------------------------------------: |
-| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_scratch-1f576444.pth) |
-| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_finetune-c3abbbfc.pth) |
-| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_scratch-97f56158.pth) |
-| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_finetune-9baa39ea.pth) |
+| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
+|:--------:|:----------:|:--------:|:-----:|:-----:|:-----:|:---------------------------------------------------------------------------------------------------------------------------------------------------:|
+| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_scratch-1f576444.pth) |
+| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_finetune-c3abbbfc.pth) |
+| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_scratch-97f56158.pth) |
+| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_finetune-9baa39ea.pth) |

### CSN
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
@@ -83,16 +87,22 @@ For action recognition, unless specified, models are trained on Kinetics-400. Th
| RGB | IG-65M | irCSN-152 | 32x2 | 82.6 | 95.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/ircsn_kinetics400_se_rgb_r152_f32s2_ig65m_fbai-9d6ed879.pth)* |
| RGB | IG-65M | ipCSN-152 | 32x2 | 82.7 | 95.6 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/ipcsn_kinetics400_se_rgb_r152_f32s2_ig65m_fbai-ef39b9e3.pth)* |

-* Converted from [VMZ in Caffe2](https://github.com/facebookresearch/VMZ).
+### OmniSource
+| Modality | Pretrained | Backbone | Input | Top-1 (Baseline / OmniSource ($\Delta$)) | Top-5 (Baseline / OmniSource ($\Delta$)) | Download |
+| :------: | :--------: | :-------: | :---: | :--------------------------------------: | :--------------------------------------: | :----------------------------------------------------------: |
+| RGB | ImageNet | ResNet50 | 3seg | 70.6 / 73.6 (+ 3.0) | 89.4 / 91.0 (+ 1.6) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth)$^{340\times256}$ / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/tsn_OmniSource_kinetics400_se_rgb_r50_seg3_f1s1_imagenet-4066cb7e.pth)$^{340\times256}$ |
+| RGB | IG-1B | ResNet50 | 3seg | 73.1 / 75.7 (+ 2.6) | 90.4 / 91.9 (+ 1.5) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn_kinetics400_se_rgb_r50_seg3_f1s1_IG1B-d4bc58ba.pth) / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/tsn_OmniSource_kinetics400_se_rgb_r50_seg3_f1s1_IG1B-25fc136b.pth) |
+| RGB | Scratch | ResNet50 | 4x16 | 72.9 / 76.8 (+ 3.9) | 90.9 / 92.5 (+ 1.6) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/slowonly_kinetics400_se_rgb_r50_seg1_4x16_scratch_epoch256-594abd88.pth) / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/slowonly_OmniSource_kinetics400_se_rgb_r50_seg1_4x16_scratch-71f7b8ee.pth) |
+| RGB | Scratch | ResNet101 | 8x8 | 76.5 / 80.4 (+ 3.9) | 92.7 / 94.4 (+ 1.7) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/slowonly_kinetics400_se_rgb_r101_8x8_scratch-8de47237.pth) / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/slowonly_OmniSource_kinetics400_se_rgb_r101_seg1_8x8_scratch-2f838cb0.pth) |

### Transfer Learning

-| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 | Download (split1) |
-| ----- | :-------: | :--------: | :------: | :---: | :----: | :----: | :----------------------------------------------------------: |
-| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_rgb_f64s1_kinetics400ft-36201298.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_rgb_f64s1_kinetics400ft-1ffcf11f.pth) |
-| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_flow_f64s1_kinetics400ft-93ed9ecd.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_flow_f64s1_kinetics400ft-2981c797.pth) |
-| I3D | TwoStream | Kinetics | I3D | 64x1 | 97.8 | 80.8 | / |
+| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 | Download (split1) |
+|-------|:---------:|:----------:|:--------:|:-----:|:------:|:------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
+| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_rgb_f64s1_kinetics400ft-36201298.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_rgb_f64s1_kinetics400ft-1ffcf11f.pth) |
+| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_flow_f64s1_kinetics400ft-93ed9ecd.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_flow_f64s1_kinetics400ft-2981c797.pth) |
+| I3D | TwoStream | Kinetics | I3D | 64x1 | 97.8 | 80.8 | / |

 ## Action Detection

diff --git a/README.md b/README.md
index 541ad424..ec2782b8 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ It is a part of the [open-mmlab](https://github.com/open-mmlab) project develope

   - action recognition from trimmed videos
   - temporal action detection (also known as action localization) in untrimmed videos
-  - spatial-temporal action detection in untrimmed videos.
+  - spatial-temporal action detection in untrimmed videos.

- Support for various datasets

@@ -22,27 +22,29 @@ It is a part of the [open-mmlab](https://github.com/open-mmlab) project develope

   MMAction implements popular frameworks for action understanding:

   - For action recognition, various algorithms are implemented, including TSN, I3D, SlowFast, R(2+1)D, CSN.
-  - For temporal action detection, we implement SSN.
+  - For temporal action detection, we implement SSN.
   - For spatial temporal atomic action detection, a Fast-RCNN baseline is provided.

- Modular design

   The tasks in human action understanding share some common aspects such as backbones, and long-term and short-term sampling schemes.
-  Also, tasks can benefit from each other. For example, a better backbone for action recognition will bring performance gain for action detection.
+  Also, tasks can benefit from each other. For example, a better backbone for action recognition will bring performance gain for action detection.
   Modular design enables us to view action understanding in a more integrated perspective.

 ## License

 The project is release under the [Apache 2.0 license](https://github.com/open-mmlab/mmaction/blob/master/LICENSE).

 ## Updates
-
-v0.1.0 (19/06/2019)
-- MMAction is online!
+[OmniSource](https://arxiv.org/abs/2003.13042) Model Release (22/08/2020)
+- We release several models from our work [OmniSource](https://arxiv.org/abs/2003.13042). These models are jointly trained on
+Kinetics-400 and the OmniSource web dataset. They achieve strong performance (Top-1 accuracy: **75.7%** for 3-segment TSN and **80.4%** for SlowOnly on Kinetics-400 val), and the learned representations transfer well to other tasks.

 v0.2.0 (15/03/2020)
 - We build a diversified modelzoo for action recognition, which include popular algorithms (TSN, I3D, SlowFast, R(2+1)D, CSN). The performance is aligned with or better than the original papers.

+v0.1.0 (19/06/2019)
+- MMAction is online!
+
 ## Model zoo

 Results and reference models are available in the [model zoo](https://github.com/open-mmlab/mmaction/blob/master/MODEL_ZOO.md).
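-- 

The preprocessing paragraph in the MODEL_ZOO changes above resizes the short edge of each video frame to 256px while keeping the aspect ratio, rather than warping frames to a fixed 340x256. Below is a minimal sketch of that resizing step with OpenCV; the helper name and the frame-level interface are illustrative, not MMAction API:

```python
import cv2


def resize_short_edge(frame, short_edge=256):
    """Resize a frame so its short edge equals `short_edge`, preserving aspect ratio."""
    h, w = frame.shape[:2]
    scale = short_edge / min(h, w)
    new_size = (int(round(w * scale)), int(round(h * scale)))  # cv2 wants (width, height)
    return cv2.resize(frame, new_size, interpolation=cv2.INTER_LINEAR)
```

For example, a 1280x720 frame becomes 455x256 under this scheme, whereas fixed-size resizing would warp it to 340x256 and distort the aspect ratio.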
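Each checkpoint filename in the tables carries a short hash suffix (e.g. `-b702e12f`). A hedged sketch of fetching and inspecting one of these checkpoints with plain PyTorch follows; `torch.hub.load_state_dict_from_url` is standard PyTorch API, and `check_hash=True` assumes the suffix is a prefix of the file's SHA256 (PyTorch's naming convention), which is worth verifying before relying on it:

```python
from torch.hub import load_state_dict_from_url

# TSN baseline URL copied from the Kinetics table in the patch above.
URL = ('https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/'
       'kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth')

# map_location='cpu' lets the checkpoint load on machines without a GPU.
ckpt = load_state_dict_from_url(URL, map_location='cpu', check_hash=True)
state_dict = ckpt.get('state_dict', ckpt)  # some checkpoints wrap the weights
print(f'{len(state_dict)} parameter tensors in checkpoint')
```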
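The Top-1 / Top-5 columns report top-k accuracy on the validation set, and the OmniSource $\Delta$ values are simply the difference between the OmniSource and baseline runs. For reference, a small self-contained sketch of the metric (NumPy; the score and label arrays are illustrative):

```python
import numpy as np


def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (num_samples, num_classes) array of class scores.
    labels: (num_samples,) array of ground-truth class indices.
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]     # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)  # is the true label among them?
    return hits.mean()


scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.2, 0.3]])
labels = np.array([1, 2])
print(top_k_accuracy(scores, labels, k=1))  # 0.5: only the first sample's argmax is correct
print(top_k_accuracy(scores, labels, k=2))  # 1.0: both true labels fall in the top 2
```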