Uncertain Missing Modality Audio-Visual Classification Framework

📣 Accepted at Summer Annual Conference of IEIE, 2025.


🚀 Overview

This repository contains an Audio-Visual Classification Framework designed to handle Uncertain Missing Modality scenarios where missing modalities are unpredictable at test time. Our approach integrates Prompt Learning at both the Input Level and Attention Level, allowing the model to dynamically adapt to missing or noisy modalities.

🔥 Key Contributions

  • ✅ End-to-End Framework for Uncertain Missing Modality
    • Designed to handle unpredictable modality loss by training across multiple missing-modality scenarios.
  • 🎯 Prompt Learning for Robustness
    • Introduces Input-Level and Attention-Level Prompts that strengthen adaptation to missing or noisy modalities.
  • 💡 Efficient Training
    • Reduces memory usage by 82.3% and training time by 96%, making it highly scalable.
  • 📈 Performance Improvement
    • Outperforms Fine-Tuning in noisy and missing-modality environments by up to 10 percentage points (+0.10 accuracy).

βš™οΈ Framework Overview

[Figure: Overall framework architecture]

The framework addresses Uncertain Missing Modality scenarios through a robust integration of learnable prompt tokens at both the input and attention levels. This design allows the model to adaptively handle incomplete or noisy data across modalities while maintaining computational efficiency.

1️⃣ Input-Level Prompt Integration

[Figure: Input-level prompt integration]

At the input stage, learnable prompt tokens are concatenated directly with the input features of each modality (audio and visual). This mechanism embeds prior knowledge about modality-specific patterns (e.g., noise or missing data) into the input representation.

  • Key Benefits:
    • Prompts encode modality-specific signals like noise patterns or missing data indicators.
    • Each modality's encoder processes enriched inputs with context about the data's state.
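
The snippet below is a minimal sketch of this idea, assuming a transformer-style token interface: a small set of learnable tokens is prepended to each modality's embeddings before its encoder. Class names, prompt count, and embedding size are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class InputLevelPrompt(nn.Module):
    """Prepends learnable prompt tokens to a modality's token sequence (sketch)."""

    def __init__(self, num_prompts: int = 4, embed_dim: int = 768):
        super().__init__()
        # A small set of learnable tokens that encode prior knowledge about the
        # modality's state (e.g., clean, noisy, or missing).
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, embed_dim) patch/frame embeddings of one modality
        batch = tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate prompts in front of the original tokens before the encoder.
        return torch.cat([prompts, tokens], dim=1)

# Usage: enrich audio and visual token sequences before their respective encoders.
audio_tokens = torch.randn(2, 128, 768)   # e.g., spectrogram patch embeddings
visual_tokens = torch.randn(2, 64, 768)   # e.g., video frame patch embeddings
audio_in = InputLevelPrompt()(audio_tokens)    # shape: (2, 132, 768)
visual_in = InputLevelPrompt()(visual_tokens)  # shape: (2, 68, 768)
```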

2️⃣ Attention-Level Prompt Integration

[Figure: Attention-level prompt integration]

Prompts from the Input Level stage are used as Key and Value inputs in the Cross-Attention mechanism during the fusion phase. This enables enhanced interaction between audio and visual modalities by leveraging learnable tokens and modality-specific embeddings.

  • Key Benefits:
    • Prompts ensure missing or noisy modality information is supplemented by the complementary modality.
    • Facilitate robust feature alignment, strengthening shared representations even when one modality is compromised.
  • Mechanism Highlights:
    • Query: Originates from one modality's embeddings (e.g., audio for Audio-to-Visual attention).
    • Key & Value: Combines corresponding modality embeddings and learnable prompt tokens.
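
A minimal sketch of this mechanism is shown below, assuming PyTorch's nn.MultiheadAttention: learnable prompt tokens are appended to the complementary modality's embeddings and serve as extra Key/Value entries. All names and dimensions are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn

class PromptedCrossAttention(nn.Module):
    """Cross-attention whose Key/Value sequence includes learnable prompt tokens (sketch)."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8, num_prompts: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.kv_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, query_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens:   embeddings of the querying modality (e.g., audio)
        # context_tokens: embeddings of the complementary modality (e.g., visual)
        batch = query_tokens.size(0)
        prompts = self.kv_prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prompts join the complementary modality as extra Key/Value tokens, so the
        # query can still attend to learned signals when that modality is noisy or missing.
        kv = torch.cat([context_tokens, prompts], dim=1)
        out, _ = self.attn(query=query_tokens, key=kv, value=kv)
        return out

# Audio-to-Visual attention: audio embeddings query visual embeddings plus prompts.
audio = torch.randn(2, 128, 768)
visual = torch.randn(2, 64, 768)
attended_audio = PromptedCrossAttention()(audio, visual)  # shape: (2, 128, 768)
```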

3️⃣ Fusion Module

[Figure: Fusion module with bidirectional cross-attention]

The Fusion Module introduces Cross-Attention layers to balance contributions from both modalities. This module resolves imbalances caused by varying sequence lengths and noise levels in audio and visual data.

  • Key Benefits:
    • Aligns features from different modalities for mutual reinforcement.
    • Handles discrepancies in sequence lengths (e.g., longer audio vs. shorter visual sequences).
  • Structure:
    • Visual-to-Audio Attention: Visual embeddings query audio embeddings and associated prompts.
    • Audio-to-Visual Attention: Audio embeddings query visual embeddings and their prompts.
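
The following sketch wires the two attention directions together under assumed design choices (mean pooling and a linear classifier head); it illustrates the structure described above rather than the repository's exact module.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Bidirectional cross-attention fusion (sketch of the assumed structure)."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8, num_classes: int = 10):
        super().__init__()
        self.v2a = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)  # Visual-to-Audio
        self.a2v = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)  # Audio-to-Visual
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # The two sequences may differ in length (e.g., longer audio, shorter visual);
        # cross-attention absorbs the mismatch because attention is length-agnostic.
        v_reads_a, _ = self.v2a(query=visual_tokens, key=audio_tokens, value=audio_tokens)
        a_reads_v, _ = self.a2v(query=audio_tokens, key=visual_tokens, value=visual_tokens)
        # Mean-pool each direction and concatenate for classification (an assumed choice).
        fused = torch.cat([v_reads_a.mean(dim=1), a_reads_v.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# The token sequences passed in would already contain the input-level prompt tokens.
logits = FusionModule()(torch.randn(2, 132, 768), torch.randn(2, 68, 768))  # shape: (2, 10)
```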

4️⃣ Prompt Token Integration (Input + Attention Combination)

[Figure: Combined input-level and attention-level prompt integration]

This unified approach combines strengths from both Input-Level and Attention-Level Integration:

  • At the Input Level, learnable tokens enhance input representations with prior knowledge about noise and modality-specific characteristics.
  • At the Attention Level, these tokens guide cross-modal interactions as Key and Value inputs in the Fusion Module.

This combination ensures robust multimodal processing under uncertain conditions, such as noisy or missing modalities.


📊 Datasets

Pre-Training Datasets

  • AudioSet: 1.7M videos, 632 classes.
  • VGGSound: 200K+ videos, 300 classes.

Fine-Tuning Dataset

  • UrbanSound8K-AV: 8,732 samples, 10 classes (audio + visual).

🎯 Training & Evaluation

Training: Case-Wise Training

[Figure: Case-wise training with independent prompts per scenario]

  • Independent prompts are trained for each of the 4 cases (Complete, Vision-Only, Audio-Only, Both-Noise); a sketch of this loop follows the command below.
  • 4 Training Scenarios:
    • ✅ Complete (Audio + Visual)
    • 🎥 Vision Only (Noisy Audio)
    • 🎵 Audio Only (Noisy Visual)
    • ❌ Noise to Both (Noisy Audio + Visual)
python train.py --dataset UrbanSound8K-AV --epochs 50 --batch_size 16 --lr 1e-4
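
The sketch below illustrates what case-wise training might look like: the pre-trained backbone is frozen and one prompt set is optimized per scenario. model, loader, prompt_banks, and corrupt() are hypothetical placeholders, not the actual API of train.py.

```python
import torch
import torch.nn.functional as F

CASES = ["complete", "vision_only", "audio_only", "noise_to_both"]

def train_case_wise(model, loader, prompt_banks, corrupt, epochs=50, lr=1e-4):
    """Train one independent prompt set per case while the backbone stays frozen (sketch)."""
    for p in model.parameters():
        p.requires_grad_(False)            # backbone frozen: only prompt tokens are updated
    for case in CASES:
        prompts = prompt_banks[case]       # nn.Parameter of prompt tokens for this case
        optimizer = torch.optim.Adam([prompts], lr=lr)
        for _ in range(epochs):
            for audio, video, labels in loader:
                audio, video = corrupt(audio, video, case)   # simulate the scenario
                logits = model(audio, video, prompts)        # forward pass with case prompts
                loss = F.cross_entropy(logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```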

Evaluation: Unified Evaluation

[Figure: Unified evaluation with concatenated prompts]

  • All case-specific prompts learned during training are concatenated to handle Uncertain Missing Modality.
  • The combined prompt set is used for inference, giving robust performance in noisy and missing-modality conditions without needing to know in advance which case occurred (see the sketch below the command).
python evaluation.py --dataset UrbanSound8K-AV --case noise_to_both
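
A hedged sketch of unified evaluation follows: the prompt tokens learned for every case are concatenated into a single set, so inference does not need to know which modality is degraded. Names and shapes are illustrative assumptions, not the actual API of evaluation.py.

```python
import torch

@torch.no_grad()
def evaluate_unified(model, loader, prompt_banks):
    """Evaluate with all case-specific prompt tokens concatenated into one set (sketch)."""
    # prompt_banks: dict mapping case name -> learned prompt tokens, shape (num_prompts, dim)
    unified_prompts = torch.cat(list(prompt_banks.values()), dim=0)
    correct, total = 0, 0
    for audio, video, labels in loader:
        logits = model(audio, video, unified_prompts)
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```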

📈 Results

Performance Comparison

| Case                     | Fine-Tuning (FT) | FT + Prompt Learning (PL) | Improvement |
|--------------------------|------------------|---------------------------|-------------|
| ✅ Complete              | 0.99             | 0.99                      | -           |
| 🎥 Vision Only (Noisy A) | 0.69             | 0.79                      | +0.10       |
| 🎵 Audio Only (Noisy V)  | 0.83             | 0.86                      | +0.03       |
| ❌ Noise to Both         | 0.71             | 0.80                      | +0.09       |

Key Insights:

  1. ✅ Complete Case:

    • Both Fine-Tuning (FT) and Prompt Learning (PL) achieve near-perfect performance.
    • Indicates that Prompt Learning does not degrade performance in ideal conditions despite being computationally more efficient.
  2. 🎥 Vision Only (Noisy Audio):

    • PL demonstrates significant improvement (+0.10) over FT by leveraging visual features more effectively through cross-attention and prompts.
    • Highlights the robustness of PL in compensating for noisy audio data by emphasizing the complementary modality.
  3. 🎵 Audio Only (Noisy Visual):

    • Improvement is smaller (+0.03) but still notable.
    • Suggests that the audio modality alone already carries most of the discriminative signal when the visual stream is noisy, so prompts add robustness without depending heavily on visual data.
  4. ❌ Noise to Both:

    • PL provides a substantial gain (+0.09) in the most challenging scenario.
    • Demonstrates the ability of prompts to optimize cross-modal interactions, ensuring stable performance even under severe noise.

Resource Efficiency

| Method          | Total Memory (GiB) | Training Memory (GiB) | Memory Saving | Time per Epoch |
|-----------------|--------------------|-----------------------|---------------|----------------|
| Fine-Tuning     | 95.12              | 93.89                 | -             | 1 min          |
| Prompt Learning | 17.85              | 13.62                 | 82.3%         | 2.4 sec        |

Key Insights:

  1. 💾 Memory Usage:

    • PL significantly reduces total memory usage by 82.3%, lowering computational demands.
    • This is achieved by learning only a small set of prompt parameters, unlike FT, which updates the entire model.
  2. 📱 Training Memory:

    • PL uses 13.62 GiB compared to 93.89 GiB in FT.
    • Such drastic memory savings make PL scalable for larger datasets and models, particularly in resource-constrained environments.
  3. ⏰ Training Time:

    • PL requires only 2.4 seconds per epoch, a 96% reduction compared to FT (1 minute per epoch).
    • This efficiency is particularly critical for large-scale or real-time applications where training time is a bottleneck.
