From c7e3b7c11fb94131be9b48a8e3d510589addc3ce Mon Sep 17 00:00:00 2001
From: Haodong Duan <34324155+kennymckormick@users.noreply.github.com>
Date: Mon, 24 Aug 2020 22:08:47 +0800
Subject: [PATCH] Add models for OmniSource (#208)

---
 MODEL_ZOO.md | 42 ++++++++++++++++++++++++++----------------
 README.md    | 16 +++++++++-------
 2 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/MODEL_ZOO.md b/MODEL_ZOO.md
index 09ca4719..9458b83a 100644
--- a/MODEL_ZOO.md
+++ b/MODEL_ZOO.md
@@ -2,15 +2,19 @@

 ## Action Recognition

-For action recognition, unless specified, models are trained on Kinetics-400. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. For TSN, we also train it on UCF-101, initialized with ImageNet pretrained weights. We also provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models with * are converted from other repos(including [VMZ](https://github.com/facebookresearch/VMZ) and [kinetics_i3d](https://github.com/deepmind/kinetics-i3d)), others are trained by ourselves. If you can not reproduce our testing results due to dataset unalignment, please submit a request at [get validation data](https://forms.gle/jmBiCDJButrLwpgc9).
+For action recognition, unless specified, models are trained on Kinetics-400. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. For TSN, we also train it on UCF-101, initialized with ImageNet pretrained weights. We also provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repos (including [VMZ](https://github.com/facebookresearch/VMZ) and [kinetics_i3d](https://github.com/deepmind/kinetics-i3d)); the others are trained by ourselves.
+
+For data preprocessing, we find that resizing the short edge of videos to 256px is generally a better choice than resizing them to a fixed 340x256 resolution, since the aspect ratio is preserved. Most of our Kinetics-400 models are trained on videos whose short edge is resized to 256px. However, some legacy Kinetics-400 models were trained on videos with a fixed 340x256 resolution; we use the mark $^{340\times256}$ to indicate such legacy models.
+
+If you cannot reproduce our testing results due to dataset misalignment, please submit a request at [get validation data](https://forms.gle/jmBiCDJButrLwpgc9).

 ### TSN

 #### Kinetics

 | Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
-| :------: | :--------: | :---------: | :--------: | :------------------------------------: | :------------------------------------: | -------------------------------------- |
-| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth) |
+| :------: | :--------: | :---------: | :--------: | :------------------------------------: | :------------------------------------: | :------------------------------------: |
+| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth)$^{340\times256}$ |

 #### UCF101

@@ -44,7 +48,7 @@ For action recognition, unless specified, models are trained on Kinetics-400. Th
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| :--------: | :--------: | :----------: | :---: | :---: | :---: | :----------------------------------------------------------: |
| RGB | ImageNet | Inception-V1 | 64x1 | 71.1 | 89.3 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics400_se_rgb_inception_v1_seg1_f64s1_imagenet_deepmind-9b8e02b3.pth)* |
-| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics_rgb_r50_c3d_inflated3x1x1_seg1_f32s2_f32s2-b93cc877.pth) |
+| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics_rgb_r50_c3d_inflated3x1x1_seg1_f32s2_f32s2-b93cc877.pth)$^{340\times256}$ |
| Flow | ImageNet | Inception-V1 | 64x1 | 63.4 | 84.9 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/i3d_kinetics_flow_inception_v1_seg1_f64s1_imagenet_deepmind-92059771.pth)* |
| Two-Stream | ImageNet | Inception-V1 | 64x1 | 74.2 | 91.3 | / |

@@ -70,12 +74,12 @@ For action recognition, unless specified, models are trained on Kinetics-400. Th
| RGB | ImageNet | ResNet50 | 4x16 | 75.9 | 92.3 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/slowfast_kinetics400_se_rgb_r50_4x16_finetune-4623cf03.pth) |

### R(2+1)D
-| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
-| :------: | :--------: | :------: | :---: | :---: | :---: | :----------------------------------------------------------: |
-| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_scratch-1f576444.pth) |
-| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_finetune-c3abbbfc.pth) |
-| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_scratch-97f56158.pth) |
-| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_finetune-9baa39ea.pth) |
+| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
+|:--------:|:----------:|:--------:|:-----:|:-----:|:-----:|:---------------------------------------------------------------------------------------------------------------------------------------------------:|
+| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_scratch-1f576444.pth) |
+| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f8s8_finetune-c3abbbfc.pth) |
+| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_scratch-97f56158.pth) |
+| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/r2plus1d_kinetics400_se_rgb_r34_f32s2_finetune-9baa39ea.pth) |

### CSN
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
@@ -83,16 +87,22 @@ For action recognition, unless specified, models are trained on Kinetics-400. Th
| RGB | IG-65M | irCSN-152 | 32x2 | 82.6 | 95.7 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/ircsn_kinetics400_se_rgb_r152_f32s2_ig65m_fbai-9d6ed879.pth)* |
| RGB | IG-65M | ipCSN-152 | 32x2 | 82.7 | 95.6 | [model](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/ipcsn_kinetics400_se_rgb_r152_f32s2_ig65m_fbai-ef39b9e3.pth)* |

-* Converted from [VMZ in Caffe2](https://github.com/facebookresearch/VMZ).
+### OmniSource
+| Modality | Pretrained | Backbone | Input | Top-1 (Baseline / OmniSource ($\Delta$)) | Top-5 (Baseline / OmniSource ($\Delta$)) | Download |
+| :------: | :--------: | :-------: | :---: | :--------------------------------------: | :--------------------------------------: | :----------------------------------------------------------: |
+| RGB | ImageNet | ResNet50 | 3seg | 70.6 / 73.6 (+ 3.0) | 89.4 / 91.0 (+ 1.6) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth)$^{340\times256}$ / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/tsn_OmniSource_kinetics400_se_rgb_r50_seg3_f1s1_imagenet-4066cb7e.pth)$^{340\times256}$ |
+| RGB | IG-1B | ResNet50 | 3seg | 73.1 / 75.7 (+ 2.6) | 90.4 / 91.9 (+ 1.5) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/tsn_kinetics400_se_rgb_r50_seg3_f1s1_IG1B-d4bc58ba.pth) / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/tsn_OmniSource_kinetics400_se_rgb_r50_seg3_f1s1_IG1B-25fc136b.pth) |
+| RGB | Scratch | ResNet50 | 4x16 | 72.9 / 76.8 (+ 3.9) | 90.9 / 92.5 (+ 1.6) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/slowonly_kinetics400_se_rgb_r50_seg1_4x16_scratch_epoch256-594abd88.pth) / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/slowonly_OmniSource_kinetics400_se_rgb_r50_seg1_4x16_scratch-71f7b8ee.pth) |
+| RGB | Scratch | ResNet101 | 8x8 | 76.5 / 80.4 (+ 3.9) | 92.7 / 94.4 (+ 1.7) | [Baseline](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/slowonly_kinetics400_se_rgb_r101_8x8_scratch-8de47237.pth) / [OmniSource](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/kinetics400/omnisource/slowonly_OmniSource_kinetics400_se_rgb_r101_seg1_8x8_scratch-2f838cb0.pth) |

### Transfer Learning

-| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 | Download (split1) |
-| ----- | :-------: | :--------: | :------: | :---: | :----: | :----: | :----------------------------------------------------------: |
-| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_rgb_f64s1_kinetics400ft-36201298.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_rgb_f64s1_kinetics400ft-1ffcf11f.pth) |
-| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_flow_f64s1_kinetics400ft-93ed9ecd.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_flow_f64s1_kinetics400ft-2981c797.pth) |
-| I3D | TwoStream | Kinetics | I3D | 64x1 | 97.8 | 80.8 | / |
+| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 | Download (split1) |
+|-------|:---------:|:----------:|:--------:|:-----:|:------:|:------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
+| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_rgb_f64s1_kinetics400ft-36201298.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_rgb_f64s1_kinetics400ft-1ffcf11f.pth) |
+| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 | [UCF101](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_ucf101_split1_flow_f64s1_kinetics400ft-93ed9ecd.pth) / [HMDB51](https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/ucf101/i3d_hmdb51_split1_flow_f64s1_kinetics400ft-2981c797.pth) |
+| I3D | TwoStream | Kinetics | I3D | 64x1 | 97.8 | 80.8 | / |

 ## Action Detection

diff --git a/README.md b/README.md
index 541ad424..ec2782b8 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ It is a part of the [open-mmlab](https://github.com/open-mmlab) project develope

   - action recognition from trimmed videos
   - temporal action detection (also known as action localization) in untrimmed videos
-  - spatial-temporal action detection in untrimmed videos.
+  - spatial-temporal action detection in untrimmed videos.

- Support for various datasets

@@ -22,27 +22,29 @@ It is a part of the [open-mmlab](https://github.com/open-mmlab) project develope

   MMAction implements popular frameworks for action understanding:

   - For action recognition, various algorithms are implemented, including TSN, I3D, SlowFast, R(2+1)D, CSN.
-  - For temporal action detection, we implement SSN.
+  - For temporal action detection, we implement SSN.
   - For spatial temporal atomic action detection, a Fast-RCNN baseline is provided.

- Modular design

   The tasks in human action understanding share some common aspects such as backbones, and long-term and short-term sampling schemes.
-  Also, tasks can benefit from each other. For example, a better backbone for action recognition will bring performance gain for action detection.
+  Also, tasks can benefit from each other. For example, a better backbone for action recognition will bring performance gain for action detection.
   Modular design enables us to view action understanding in a more integrated perspective.

 ## License

 The project is release under the [Apache 2.0 license](https://github.com/open-mmlab/mmaction/blob/master/LICENSE).

 ## Updates
-
-v0.1.0 (19/06/2019)
-- MMAction is online!
+[OmniSource](https://arxiv.org/abs/2003.13042) Model Release (22/08/2020)
+- We release several models from our work [OmniSource](https://arxiv.org/abs/2003.13042). These models are jointly trained on
+Kinetics-400 and the OmniSource web dataset. They achieve strong performance (Top-1 accuracy: **75.7%** for 3-segment TSN and **80.4%** for SlowOnly on Kinetics-400 val), and the learned representations transfer well to other tasks.

 v0.2.0 (15/03/2020)
 - We build a diversified modelzoo for action recognition, which include popular algorithms (TSN, I3D, SlowFast, R(2+1)D, CSN). The performance is aligned with or better than the original papers.

+v0.1.0 (19/06/2019)
+- MMAction is online!
+
 ## Model zoo

 Results and reference models are available in the [model zoo](https://github.com/open-mmlab/mmaction/blob/master/MODEL_ZOO.md).
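-- 

The preprocessing paragraph in the MODEL_ZOO changes above resizes the short edge of each video frame to 256px while keeping the aspect ratio, rather than warping frames to a fixed 340x256. Below is a minimal sketch of that resizing step with OpenCV; the helper name and the frame-level interface are illustrative, not MMAction API:

```python
import cv2


def resize_short_edge(frame, short_edge=256):
    """Resize a frame so its short edge equals `short_edge`, preserving aspect ratio."""
    h, w = frame.shape[:2]
    scale = short_edge / min(h, w)
    new_size = (int(round(w * scale)), int(round(h * scale)))  # cv2 wants (width, height)
    return cv2.resize(frame, new_size, interpolation=cv2.INTER_LINEAR)
```

For example, a 1280x720 frame becomes 455x256 under this scheme, whereas fixed-size resizing would warp it to 340x256 and distort the aspect ratio.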
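Each checkpoint filename in the tables carries a short hash suffix (e.g. `-b702e12f`). A hedged sketch of fetching and inspecting one of these checkpoints with plain PyTorch follows; `torch.hub.load_state_dict_from_url` is standard PyTorch API, and `check_hash=True` assumes the suffix is a prefix of the file's SHA256 (PyTorch's naming convention), which is worth verifying before relying on it:

```python
from torch.hub import load_state_dict_from_url

# TSN baseline URL copied from the Kinetics table in the patch above.
URL = ('https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/models/'
       'kinetics400/tsn2d_kinetics400_rgb_r50_seg3_f1s1-b702e12f.pth')

# map_location='cpu' lets the checkpoint load on machines without a GPU.
ckpt = load_state_dict_from_url(URL, map_location='cpu', check_hash=True)
state_dict = ckpt.get('state_dict', ckpt)  # some checkpoints wrap the weights
print(f'{len(state_dict)} parameter tensors in checkpoint')
```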
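The Top-1 / Top-5 columns report top-k accuracy on the validation set, and the OmniSource $\Delta$ values are simply the difference between the OmniSource and baseline runs. For reference, a small self-contained sketch of the metric (NumPy; the score and label arrays are illustrative):

```python
import numpy as np


def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (num_samples, num_classes) array of class scores.
    labels: (num_samples,) array of ground-truth class indices.
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]     # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)  # is the true label among them?
    return hits.mean()


scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.2, 0.3]])
labels = np.array([1, 2])
print(top_k_accuracy(scores, labels, k=1))  # 0.5: only the first sample's argmax is correct
print(top_k_accuracy(scores, labels, k=2))  # 1.0: both true labels fall in the top 2
```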