PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery
Jiaojiao Li, Penghao Tian, Rui Song, Yunsong Li, Haitao Xu, and Qian Du
Updates | Introduction | Results & Models | Usage | Citation Details | Acknowledgements
This branch contains the official PyTorch implementation, models, and test results for PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery [TGRS'24].
The code for PCViT has been released. The weights and logs will be uploaded soon.
Fig. 1: **The structure of the baseline of the proposed PCViT.** The proposed backbone constitutes a multiscale pyramid with three scale stages. The first two stages consist of convolutional blocks, and the final stage consists of transformer blocks; each transformer block is refined with the PCM and LGKA modules. The multiscale features derived from the backbone are then fed into the FRPN neck to facilitate contextual information interaction before being directed to the detection head.

Fig. 2: **The pipeline of the proposed MPP.** During pretraining, K masked perspectives of each image are randomly sampled in a mini-batch with the MPM. They are then fed to the encoder and the decoder to reconstruct the invisible regions against the targets.

Fig. 3: **Local/Global k-NN Attention.** In each group of three transformer sub-blocks, the first two layers use local attention, reducing computational complexity via 16 × 16 window attention; the third layer uses global attention to propagate information across windows.

The models are pretrained on 4 × RTX 3090 GPUs with 32 images per GPU, for a total batch size of 128.
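The layer schedule in Fig. 3 can be sketched as follows; the helper names below are hypothetical illustrations, not the repo's actual code:

```python
import torch

def layer_uses_global_attention(layer_idx: int) -> bool:
    # Fig. 3: in each group of three transformer sub-blocks, the first two
    # layers use 16x16 window (local) attention and the third uses global
    # attention to propagate information across windows.
    return (layer_idx + 1) % 3 == 0

def window_partition(x: torch.Tensor, win: int = 16) -> torch.Tensor:
    # Split a (B, H, W, C) feature map into non-overlapping win x win windows
    # of shape (num_windows * B, win * win, C), so attention runs per window.
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

print(["global" if layer_uses_global_attention(i) else "local" for i in range(6)])
# ['local', 'local', 'global', 'local', 'local', 'global']
```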
Pretrain | Backbone | Input size | Params (M) | Pretrained model |
---|---|---|---|---|
MPP | PCViT | 224 × 224 | 112 | Weights; Baidu Cloud |
The detection models are fine-tuned on 2 × RTX 3090 GPUs with 2 images per GPU, for a total batch size of 4.
Model | Pretrain | Machine | Framework | Box mAP@50 | Config | Log | Weight |
---|---|---|---|---|---|---|---|
PCViT | MPP | GPU | Faster R-CNN | 80.25 | config | log | Weights; Baidu Cloud |
- Python 3.8.5
- PyTorch 1.9.0+cu111
- torchvision 0.10.0+cu111
- timm 0.4.12
- mmcv-full 1.3.9
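The dependencies above can be installed with pip; a minimal sketch, assuming CUDA 11.1 and the official PyTorch wheel index (mmcv-full is built from source in the Usage section below):

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install timm==0.4.12 einops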
Preparing MillionAID: download the MillionAID dataset. Users can easily record the image names and revise the corresponding data-loading code, e.g., with a sketch like the one below.
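A minimal sketch of recording the image names (the millionaid/images directory layout is an assumption; adjust it to your local setup and the repo's data-loading code):

```python
import os

# Hypothetical helper: write MillionAID image names to a text file that the
# data-loading code can read. The flat directory layout is an assumption.
root = "millionaid/images"
with open("millionaid_train.txt", "w") as f:
    for name in sorted(os.listdir(root)):
        if name.lower().endswith((".jpg", ".png", ".tif")):
            f.write(name + "\n")
```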
Pretrain
To pretrain PCViT with distributed training, run the following on one node with 4 GPUs (only a 75% mask ratio is supported; batch size: 128 = 4 × 32):
python -m torch.distributed.launch --nproc_per_node 4 main_pretrain.py \
--batch_size 32 --model fastconvmae_convvitae_base_patch16 \
--norm_pix_loss --mask_ratio 0.75 --epochs 100 \
--warmup_epochs 20 --blr 6.0e-4 --weight_decay 0.05
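If main_pretrain.py follows the MAE-style linear learning-rate scaling (an assumption, given the MAE-like arguments above), the effective learning rate is blr × total batch size / 256 = 6.0e-4 × 128 / 256 = 3.0e-4.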
Note: pad the 1 × 1 convolutional kernels of the PCM in the pretrained PCViT to 3 × 3 with convertK1toK3.py before fine-tuning.
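The conversion works because a 3 × 3 kernel that is zero everywhere except its center is functionally identical to the original 1 × 1 kernel (when the 3 × 3 conv uses padding=1). A minimal sketch of the idea, not necessarily the exact convertK1toK3.py implementation:

```python
import torch
import torch.nn.functional as F

def pad_k1_to_k3(weight_1x1: torch.Tensor) -> torch.Tensor:
    # (out_ch, in_ch, 1, 1) -> (out_ch, in_ch, 3, 3): zero-pad spatially so the
    # original value sits at the kernel center; with padding=1 the padded conv
    # computes the same output as the original 1x1 conv.
    return F.pad(weight_1x1, (1, 1, 1, 1))

w = torch.randn(64, 64, 1, 1)
print(pad_k1_to_k3(w).shape)  # torch.Size([64, 64, 3, 3])
```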
We use PyTorch 1.9.0 (or the NGC Docker image 21.06) and mmcv 1.3.9 for the experiments.
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/andytianph/TGRS_PCViT.git
cd TGRS_PCViT/finetune
pip install -v -e .
After installing the two repos, install timm and einops:
pip install timm==0.4.9 einops
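You can sanity-check the environment afterwards with:

python -c "import mmcv, timm, einops; print(mmcv.__version__, timm.__version__)"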
Download the pretrained models from MAE, ViTAE, or PCViT, and then run the experiments with:
# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH>
# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch
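For example, a single-machine run on 2 GPUs might look like the following (the config and checkpoint paths are placeholders, not the repo's exact file names):

bash tools/dist_train.sh configs/pcvit/faster_rcnn_pcvit.py 2 --cfg-options model.pretrained=pretrained/pcvit_mpp_k3.pth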
If you find this code helpful, please kindly cite:
@ARTICLE{10417056,
  author={Li, Jiaojiao and Tian, Penghao and Song, Rui and Xu, Haitao and Li, Yunsong and Du, Qian},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery},
  year={2024},
  volume={62},
  number={},
  pages={1-15},
  keywords={Transformers;Feature extraction;Task analysis;Object detection;Detectors;Nickel;Semantics;Convolutional neural network (CNN);feature pyramid network (FPN);multiscale object detection;remote-sensing images (RSIs);vision transformer (ViT)},
  doi={10.1109/TGRS.2024.3360456}
}
We acknowledge the excellent implementations of mmdetection, MAE, and Remote-Sensing-RVSA.