Skip to content

This project is the official implementation of the paper "Hierarchical Chat-Based Strategies with MLLMs For Spatio-Temporal Action Detection".

License

Notifications You must be signed in to change notification settings

TristanAlkaid/HCBS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Hierarchical Chat-Based Strategies with MLLMs for Spatio-Temporal Action Detection

Paper Dataset LICENSE

This project is the official implementation of the paper "Hierarchical Chat-Based Strategies with MLLMs For Spatio-Temporal Action Detection". [Paper][Free access before April 08, 2025]

🏈 Key Features

  • Hierarchical Chat-Based Strategy (HCBS)

    A progressive dialogue protocol guiding Multimodal Large Language Models(MLLMs) to generate structured action descriptions:

    • Entity Localization- Identify potential actors

    • Trajectory Prediction- Analyze motion patterns

    • Micro-action Parsing- Capture subtle motion details

  • Football Description Dataset

    Contains 712 football match clip descriptions focusing on:

    • 94.47% small-target actions

    • 40% multi-participator overlapping scenarios

🛠️ Setup

Code Integration

  1. Clone base framework:
git clone https://github.com/MCG-NJU/MOC-Detector.git
  1. Overwrite core modules:
cp -r Project/src/ MOC-Detector/src/

🚀 Quick Start

Please follow the instruction of https://github.com/MCG-NJU/MOC-Detector.

📊 Benchmark Results

Performance on Multisports dataset:

Method [email protected] (%) [email protected] (%)
ROAD (ICCV, 2017) 3.90 0.00
SlowFast (ICCV, 2019) 0.00 0.00
MOC-Detector (ECCV, 2023) 6.40 0.00
MOC+ConvFormer (MICCAI, 2023) 5.10 0.00
MOC+DilateFormer (TMM, 2023) 5.97 0.04
Ours with VideoLLaMA2 7.21 0.00
Ours with LLaVA-NeXT 7.46 0.06
Ours with LongVA 7.30 0.02
Ours with LLaVA 8.23 0.11

Performance on J-HMDB dataset:

Method [email protected] (%) [email protected] (%)
ROAD (ICCV, 2017) 71.10 72.00
MOC-Detector (ECCV, 2020) 81.06 77.20
Tad-TR (TIP, 2022) 68.70 78.90
HIT (WACV, 2023) 88.10 83.80
MOC+ConvFormer (MICCAI, 2023) 79.65 67.01
MOC+DilateFormer (TMM, 2023) 10.25 5.85
Ours with VideoLLaMA2 98.91 100.00
Ours with LLaVA-NeXT 99.32 100.00
Ours with LongVA 98.82 100.00
Ours with LLaVA 98.44 100.00

Performance on UCF101-24 dataset;

Method [email protected] (%) [email protected] (%)
ROAD (ICCV, 2017) 43.30 46.30
MOC-Detector (ECCV, 2020) 98.49 53.80
YOWOv2 (ARXIV,2023) 87.00 52.80
HIT (WACV, 2023) 84.80 74.30
MOC+ConvFormer (MICCAI, 2023) 97.00 88.49
MOC+DilateFormer (TMM, 2023) 92.16 86.85
Ours with VideoLLaMA2 98.39 91.69
Ours with LLaVA-NeXT 97.94 89.36
Ours with LongVA 97.82 90.80
Ours with LLaVA 98.88 92.20

Performance using Prompts Pooling:

MLLMs [email protected] (%) [email protected] (%) [email protected] (%) [email protected] (%) [email protected] (%) [email protected] (%)
LLaVA-NeXT 99.99 99.58 73.38 100.00 100.00 86.79
LLaVA 99.56 98.02 61.69 100.00 100.00 79.79
VideoLLaMA2 99.58 98.98 60.42 100.00 100.00 75.49
LongVA 99.72 97.66 64.52 100.00 100.00 67.01

📜 Citation

If you use this work, please cite:

@article{HCBS,
    title = {Hierarchical chat-based strategies with MLLMs for Spatio-temporal action detection},
    journal = {Information Processing & Management},
    author = {Xuyang Zhou and Ye Wang and Fei Tao and Hong Yu and Qun Liu},
    year = {2025},
    volume = {62},
    number = {4},
    pages = {104094},
    issn = {0306-4573},
    doi = {https://doi.org/10.1016/j.ipm.2025.104094}
}

License

This project is released under the MIT License. See LICENSE for details.

About

This project is the official implementation of the paper "Hierarchical Chat-Based Strategies with MLLMs For Spatio-Temporal Action Detection".

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published