This code is a re-implementation of the video classification experiments in the paper Non-local Neural Networks. The code is developed on top of the MindSpore framework.
A non-local operation is a flexible building block and can easily be used together with convolutional/recurrent layers. Unlike fully-connected (fc) layers, which are usually placed at the end of a network, it can be added into the earlier parts of deep neural networks. This allows us to build a richer hierarchy that combines both non-local and local information.

Table 1 of the paper shows the C2D baseline under a ResNet-50 backbone. In this repository, we use an Inflated 3D ConvNet (I3D) under a ResNet-50 backbone. The C2D model in Table 1 can be turned into a 3D convolutional counterpart by "inflating" the kernels: for example, a 2D k×k kernel can be inflated into a 3D t×k×k kernel that spans t frames (see the sketch below). We add 5 non-local blocks (3 to res4 and 2 to res3, to every other residual block). For more information, please read the paper.
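As a rough illustration, here is a minimal NumPy sketch of the inflation trick; the repository's actual implementation lives in src/models/layers/inflate_conv3d.py and may differ in details:

```python
import numpy as np

def inflate_2d_kernel(weight_2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a 2D conv kernel (C_out, C_in, k, k) into a 3D kernel
    (C_out, C_in, t, k, k): repeat it t times along a new temporal axis
    and divide by t, so the inflated filter gives the same response on a
    "video" of t identical frames as the 2D filter gave on one frame."""
    weight_3d = np.repeat(weight_2d[:, :, np.newaxis, :, :], t, axis=2)
    return weight_3d / t

# e.g. the 7x7 stem kernel of a 2D ResNet-50 becomes a 5x7x7 kernel
w2d = np.random.randn(64, 3, 7, 7).astype(np.float32)
w3d = inflate_2d_kernel(w2d, t=5)
print(w3d.shape)  # (64, 3, 5, 7, 7)
```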
Dataset used: Kinetics400

- Description: Kinetics-400 is a commonly used benchmark dataset in the video field. For details, please refer to its official website, Kinetics. To download it, please refer to the official ActivityNet repository and use the download script provided there.
- Dataset size:

| Category | Number of videos |
|---|---|
| Training set | 238797 |
| Validation set | 19877 |
Because some YouTube links have expired, the sizes of different copies of the Kinetics dataset may differ.
Dataset used in the paper Non-local Neural Networks:
Kinetics contains ∼246k training videos and 20k validation videos. It is a classification task involving 400 human action categories. They train all models on the training set and test on the validation set.
The directory structure of the Kinetics-400 dataset looks like this:
.
|-kinetic-400
|-- train
| |-- ___qijXy2f0_000011_000021.mp4 // video file
| |-- ___dTOdxzXY_000022_000032.mp4 // video file
| ...
|-- test
| |-- __Zh0xijkrw_000042_000052.mp4 // video file
| |-- __zVSUyXzd8_000070_000080.mp4 // video file
|-- val
| |-- __wsytoYy3Q_000055_000065.mp4 // video file
| |-- __vzEs2wzdQ_000026_000036.mp4 // video file
| ...
|-- kinetics-400_train.csv // training dataset label file.
|-- kinetics-400_test.csv // testing dataset label file.
|-- kinetics-400_val.csv // validation dataset label file.
...
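As a quick way to confirm that a local copy matches this layout, one can count the video files and peek at the CSV headers. The root path below is only an example (it mirrors the path used in nonlocal.yaml); adjust it to wherever your copy of the dataset lives.

```python
# Sanity-check the dataset layout described above (root path is hypothetical).
from pathlib import Path

root = Path("/data/kinetics-dataset/kinetic-400")

for split in ("train", "test", "val"):
    num_videos = sum(1 for _ in (root / split).glob("*.mp4"))
    print(f"{split}: {num_videos} video files")

for csv_name in ("kinetics-400_train.csv", "kinetics-400_test.csv", "kinetics-400_val.csv"):
    with open(root / csv_name) as f:
        print(csv_name, "->", f.readline().strip())  # print the CSV header row
```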
To run the Python scripts in this repository, you need to prepare the environment as follows:
- Python and dependencies
- python 3.7.5
- decord 0.6.0
- mindspore-gpu 1.6.1
- ml-collections 0.1.1
- numpy 1.21.5
- Pillow 9.0.1
- PyYAML 6.0
- Hardware
- Prepare a hardware environment with an NVIDIA GPU.
- Framework
- MindSpore
- For more information about MindSpore, please check the official MindSpore documentation.
Some packages in requirements.txt need the Cython package to be installed first, so install the dependencies with the following commands:
pip install Cython
pip install -r requirements.txt
In this repository, the Nonlocal model is trained and validated on the Kinetics400 dataset.
Our non-local model, which was migrated from the PyTorch pretrained model i3d_nl_dot_product_r50, is fine-tuned on the Kinetics400 dataset for 1 epoch. It can be downloaded here: [nonlocal_kinetics400_mindspore.ckpt]
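The checkpoint can be restored with MindSpore's standard checkpoint utilities. The sketch below assumes that src/models/nonlocal3d.py exposes a nonlocal3d constructor whose arguments mirror the model section of nonlocal.yaml; the real entry points are train.py and eval.py.

```python
# Minimal sketch: restore the released checkpoint into the nonlocal3d network.
# The constructor name/arguments are assumptions based on nonlocal.yaml;
# train.py / eval.py show how the repository actually builds the model.
import mindspore as ms
from src.models.nonlocal3d import nonlocal3d  # assumed import path

net = nonlocal3d(in_d=32, in_h=224, in_w=224, num_classes=400, keep_prob=0.5)
param_dict = ms.load_checkpoint("nonlocal_kinetics400_mindspore.ckpt")
ms.load_param_into_net(net, param_dict)
net.set_train(False)  # switch to inference/evaluation mode
```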
To train or fine-tune the model, you can run the following scripts:
cd scripts/
# run training example
bash train_standalone.sh [PROJECT_PATH] [DATA_PATH]
# run distributed training example
bash train_distribute.sh [PROJECT_PATH] [DATA_PATH]
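For example, if the repository is cloned to /home/user/nonlocal and the dataset is stored under /data/kinetics-dataset (both paths are only examples), a single-GPU run would be:
bash train_standalone.sh /home/user/nonlocal /data/kinetics-dataset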
To validate the model, you can run the following script:
cd scripts/
# run evaluation example
bash eval_standalone.sh [PROJECT_PATH] [DATA_PATH]
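For example, with the same example paths as above:
bash eval_standalone.sh /home/user/nonlocal /data/kinetics-dataset
The overall structure of the repository is shown below.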
.
│ eval.py // eval script
│ README.md // descriptions about Nonlocal
│ train.py // training script
└─scripts
| eval_standalone.sh //eval standalone script
| train_distribute.sh //train distribute script
| train_standalone.sh //train standalone script
└─src
├─config
│ nonlocal.yaml // Nonlocal parameter configuration
├─data
│ │ builder.py // build data
│ │ download.py // download dataset
│ │ generator.py // generate video dataset
│ │ images.py // process image
│ │ kinetics400.py // kinetics400 dataset
│ │ meta.py // public API for dataset
│ │ path.py // IO path
│ │ video_dataset.py // video dataset
│ │
│ └─transforms
│ builder.py // build transforms
│ video_center_crop.py // center crop
│ video_normalize.py // normalize
│ video_random_crop.py // random crop
│ video_random_horizontal_flip.py // random horizontal flip
│ video_reorder.py // reorder
│ video_rescale.py // rescale
│ video_short_edge_resize.py // short edge resize
│
├─example
│ nonlocal_kinetics400_eval.py // eval nonlocal model
│ nonlocal_kinetics400_train.py // train nonlocal model
│
├─loss
│ builder.py // build loss
│
├─models
│ │ builder.py // build model
│ │ nonlocal3d.py // nonlocal model
│ │
│ └─layers
│ adaptiveavgpool3d.py // adaptive average pooling 3D.
│ dropout_dense.py // dense head
│ inflate_conv3d.py // inflate conv3d block
| maxpool3d.py // 3D max pooling
| maxpool3dwithpad.py // 3D max pooling with padding operation
│ resnet3d.py // resnet backbone
│ unit3d.py // unit3d module
│
├─optim
│ builder.py // build optimizer
│
├─schedule
│ builder.py // build learning rate schedule
│ lr_schedule.py // learning rate schedule
│
└─utils
callbacks.py // eval loss monitor
check_param.py // check parameters
class_factory.py // class register
config.py // parameter configuration
six_padding.py // convert padding list into tuple
Parameters for both training and evaluation can be set in nonlocal.yaml
- config for Nonlocal, Kinetics400 dataset
```yaml
# ==============================================================================
# model architecture
model_name: "nonlocal"

# The dataset sink mode.
dataset_sink_mode: False

# Context settings.
context:
    mode: 0            # 0--Graph Mode; 1--Pynative Mode
    device_target: "GPU"

# model settings of each part
model:
    type: nonlocal3d
    in_d: 32
    in_h: 224
    in_w: 224
    num_classes: 400
    keep_prob: 0.5

# learning rate for training process
learning_rate:
    lr_scheduler: "cosine_annealing"
    lr: 0.0003
    lr_epochs: [2, 4]
    lr_gamma: 0.1
    eta_min: 0.0
    t_max: 100
    max_epoch: 5
    warmup_epochs: 1

# optimizer for training process
optimizer:
    type: 'SGD'
    momentum: 0.9
    weight_decay: 0.0001

loss:
    type: SoftmaxCrossEntropyWithLogits
    sparse: True
    reduction: "mean"

train:
    pre_trained: True
    pretrained_model: "./ms_nonlocal_dot_kinetics400_finetune.ckpt"
    ckpt_path: "./output/"
    epochs: 5
    save_checkpoint_epochs: 5
    save_checkpoint_steps: 4975
    keep_checkpoint_max: 10

eval:
    pretrained_model: "./nonlocal-1_4975.ckpt"

infer:
    pretrained_model: "./nonlocal-1_4975.ckpt"
    batch_size: 1
    image_path: ""
    normalize: True
    output_dir: "./infer_output"

# Kinetics400 dataset config
data_loader:
    train:
        dataset:
            type: Kinetic400
            path: "/data/kinetics-dataset"
            split: 'train'
            seq: 32
            seq_mode: 'interval'
            num_parallel_workers: 1
            shuffle: True
            batch_size: 6
            frame_interval: 6
        map:
            operations:
                - type: VideoShortEdgeResize
                  size: 256
                  interpolation: 'bicubic'
                - type: VideoRandomCrop
                  size: [224, 224]
                - type: VideoRandomHorizontalFlip
                  prob: 0.5
                - type: VideoRescale
                - type: VideoReOrder
                  order: [3, 0, 1, 2]
                - type: VideoNormalize
                  mean: [0.485, 0.456, 0.406]
                  std: [0.229, 0.224, 0.255]
            input_columns: ["video"]

    eval:
        dataset:
            type: Kinetic400
            path: "/data/kinetics-dataset"
            split: 'val'
            seq: 32
            seq_mode: 'interval'
            num_parallel_workers: 1
            shuffle: False
            batch_size: 1
            frame_interval: 6
        map:
            operations:
                - type: VideoShortEdgeResize
                  size: 256
                  interpolation: 'bicubic'
                - type: VideoCenterCrop
                  size: [256, 256]
                - type: VideoRescale
                - type: VideoReOrder
                  order: [3, 0, 1, 2]
                - type: VideoNormalize
                  mean: [0.485, 0.456, 0.406]
                  std: [0.229, 0.224, 0.255]
            input_columns: ["video"]

    group_size: 1
# ==============================================================================
```
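The repository reads this file through src/utils/config.py. Purely as an illustration of the structure above (not the project's actual loader), the file can be parsed with PyYAML and ml-collections, both of which are in the dependency list:

```python
# Minimal sketch: read nonlocal.yaml with PyYAML and wrap it in an
# ml_collections.ConfigDict for attribute-style access. This is NOT the
# repository's loader (see src/utils/config.py); it only shows how the
# fields above are nested.
import yaml
from ml_collections import ConfigDict

with open("src/config/nonlocal.yaml") as f:
    cfg = ConfigDict(yaml.safe_load(f))

print(cfg.model.type, cfg.model.num_classes)                  # nonlocal3d 400
print(cfg.data_loader.train.dataset.batch_size)               # 6
print(cfg.learning_rate.lr_scheduler, cfg.learning_rate.lr)   # cosine_annealing 0.0003
```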
- train_distributed.log for Kinetics400
epoch: 1 step: 4975, loss is 0.44932037591934204
epoch: 1 step: 4975, loss is 0.3773573338985443
epoch: 1 step: 4975, loss is 0.19342052936553955
epoch: 1 step: 4975, loss is 0.5734817385673523
epoch: 1 step: 4975, loss is 0.09291025996208191
epoch: 1 step: 4975, loss is 0.5412027835845947
epoch: 1 step: 4975, loss is 0.08211661130189896
epoch: 1 step: 4975, loss is 0.9573349356651306
epoch time: 18000 s, per step time: 2064 ms
epoch time: 18000 s, per step time: 2063 ms
epoch time: 18000 s, per step time: 2064 ms
epoch time: 18000 s, per step time: 2064 ms
epoch time: 18001 s, per step time: 2065 ms
epoch time: 18001 s, per step time: 2065 ms
epoch time: 18001 s, per step time: 2065 ms
epoch time: 18002 s, per step time: 2066 ms
...
- eval.log for Kinetics400
[Start eval `nonlocal`]
eval: 1/19877
eval: 2/19877
eval: 3/19877
eval: 4/19877
eval: 5/19877
eval: 6/19877
eval: 7/19877
eval: 8/19877
eval: 9/19877
eval: 10/19877
...
eval: 19874/19877
eval: 19875/19877
eval: 19876/19877
eval: 19877/19877
{'Top_1_Accuracy': 0.7248, 'Top_5_Accuracy': 0.9072}
Kinetics400 contains ∼246k training videos and 20k validation videos. It is a classification task involving 400 human action categories. We train the model on the training set and test on the validation set. Under the same settings, we compared the accuracy of the model under three frameworks (Caffe2, PyTorch, and MindSpore).
| type | input frames | non-local? | top-1 (%) | top-5 (%) | model |
|---|---|---|---|---|---|
| i3d_nlnet_origin_caffe | 32 | Yes | 74.90 | 91.60 | link |
| i3d_nlnet_pytorch | 32 | Yes | 73.92 | 91.59 | link |
| i3d_nlnet_mindspore | 32 | Yes | 72.48 | 90.72 | link |
The first row (i3d_nlnet_origin_caffe) corresponds to the accuracy of the model reported in the source paper.
We have visualized some of the model's classification results. The following is a sample visualization.
@inproceedings{NonLocal2018,
  author    = {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
  title     = {Non-local Neural Networks},
  booktitle = {CVPR},
  year      = {2018},
  doi       = {10.1109/CVPR.2018.00813},
}
@misc{2020mmaction2,
  title        = {OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark},
  author       = {MMAction2 Contributors},
  howpublished = {\url{https://github.com/open-mmlab/mmaction2}},
  year         = {2020}
}