📣 Accepted at the Summer Annual Conference of IEIE, 2025.
This repository contains an Audio-Visual Classification Framework designed to handle Uncertain Missing Modality scenarios where missing modalities are unpredictable at test time. Our approach integrates Prompt Learning at both the Input Level and Attention Level, allowing the model to dynamically adapt to missing or noisy modalities.
- ✅ End-to-End Framework for Uncertain Missing Modality
- Designed to handle unpredictable modality loss by training across multiple missing modality scenarios.
- 🎯 Prompt Learning for Robustness
- Introduces Input-Level and Attention-Level Prompts to reinforce missing modality adaptation.
- 💡 Efficient Training
- Reduces memory usage by 82.3% and training time by 96%, making it highly scalable.
- 📈 Performance Improvement
- Outperforms Fine-Tuning by up to 0.10 (absolute) in noisy and missing-modality conditions.
The framework addresses Uncertain Missing Modality scenarios through a robust integration of learnable prompt tokens at both the input and attention levels. This design allows the model to adaptively handle incomplete or noisy data across modalities while maintaining computational efficiency.
At the input stage, learnable prompt tokens are concatenated directly with the input features of each modality (audio and visual). This mechanism embeds prior knowledge about modality-specific patterns (e.g., noise or missing data) into the input representation.
- Key Benefits:
- Prompts encode modality-specific signals like noise patterns or missing data indicators.
- Each modality's encoder processes enriched inputs with context about the data's state.
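Below is a minimal PyTorch sketch of this idea: learnable prompt tokens are prepended to a modality's token sequence before its encoder. Module and parameter names (`InputLevelPrompts`, `num_prompts`, `embed_dim`) are illustrative placeholders, not the repository's actual API.

```python
import torch
import torch.nn as nn

class InputLevelPrompts(nn.Module):
    """Prepend learnable prompt tokens to a modality's input tokens (sketch)."""

    def __init__(self, num_prompts: int = 8, embed_dim: int = 768):
        super().__init__()
        # Small set of learnable tokens intended to encode the modality's state
        # (e.g., clean, noisy, missing); shapes are assumptions.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, embed_dim), e.g. audio or visual embeddings
        batch = tokens.size(0)
        prompt_tokens = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate prompts in front of the modality tokens before encoding.
        return torch.cat([prompt_tokens, tokens], dim=1)

# Example: audio tokens (B, 128, 768) -> (B, 136, 768) with 8 prompts prepended
audio_prompter = InputLevelPrompts(num_prompts=8, embed_dim=768)
print(audio_prompter(torch.randn(2, 128, 768)).shape)  # torch.Size([2, 136, 768])
```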
Prompts from the Input Level stage are used as Key and Value inputs in the Cross-Attention mechanism during the fusion phase. This enables enhanced interaction between audio and visual modalities by leveraging learnable tokens and modality-specific embeddings.
- Key Benefits:
- Prompts ensure missing or noisy modality information is supplemented by the complementary modality.
- Facilitate robust feature alignment, strengthening shared representations even when one modality is compromised.
- Mechanism Highlights:
- Query: Originates from one modality's embeddings (e.g., audio for Audio-to-Visual attention).
- Key & Value: Combines corresponding modality embeddings and learnable prompt tokens.
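A minimal sketch of this mechanism, assuming a standard `nn.MultiheadAttention` layer: the learnable prompts are appended to the complementary modality's key/value sequence, so the query modality can still attend to useful tokens when the other modality is degraded. Names and shapes are illustrative only.

```python
import torch
import torch.nn as nn

class PromptedCrossAttention(nn.Module):
    """Cross-attention whose keys/values combine modality tokens and prompts (sketch)."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8, num_prompts: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.kv_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, query_tokens: torch.Tensor, kv_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens: one modality's embeddings (e.g., audio for Audio-to-Visual attention)
        # kv_tokens:    the complementary modality's embeddings (e.g., visual)
        batch = kv_tokens.size(0)
        prompts = self.kv_prompts.unsqueeze(0).expand(batch, -1, -1)
        kv = torch.cat([kv_tokens, prompts], dim=1)  # keys/values = modality tokens + prompts
        out, _ = self.attn(query_tokens, kv, kv)
        return out

# Example: audio (B, 128, D) queries visual tokens (B, 64, D) plus 8 prompt tokens
a2v = PromptedCrossAttention()
print(a2v(torch.randn(2, 128, 768), torch.randn(2, 64, 768)).shape)  # torch.Size([2, 128, 768])
```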
The Fusion Module introduces Cross-Attention layers to balance contributions from both modalities. This module resolves imbalances caused by varying sequence lengths and noise levels in audio and visual data.
- Key Benefits:
- Aligns features from different modalities for mutual reinforcement.
- Handles discrepancies in sequence lengths (e.g., longer audio vs. shorter visual sequences).
- Structure:
- Visual-to-Audio Attention: Visual embeddings query audio embeddings and associated prompts.
- Audio-to-Visual Attention: Audio embeddings query visual embeddings and their prompts.
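The bidirectional structure can be sketched as follows; the prompt tokens are omitted here for brevity (in the fusion module they are part of the key/value sequences, as described above), and the mean-pooling head is an assumption for illustration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Bidirectional cross-attention between audio and visual streams (sketch)."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)  # visual queries audio
        self.a2v = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)  # audio queries visual

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Attention pools over the key/value axis, so unequal sequence lengths
        # (longer audio vs. shorter visual) are handled naturally.
        v_fused, _ = self.v2a(visual, audio, audio)   # Visual-to-Audio attention
        a_fused, _ = self.a2v(audio, visual, visual)  # Audio-to-Visual attention
        # Pool each stream and concatenate for a downstream classifier head.
        return torch.cat([a_fused.mean(dim=1), v_fused.mean(dim=1)], dim=-1)

fusion = CrossModalFusion()
print(fusion(torch.randn(2, 128, 768), torch.randn(2, 64, 768)).shape)  # torch.Size([2, 1536])
```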
This unified approach combines strengths from both Input-Level and Attention-Level Integration:
- At the Input Level, learnable tokens enhance input representations with prior knowledge about noise and modality-specific characteristics.
- At the Attention Level, these tokens guide cross-modal interactions as Key and Value inputs in the Fusion Module.
This combination ensures robust multimodal processing under uncertain conditions, such as noisy or missing modalities.
- AudioSet: 1.7M videos, 632 classes.
- VGGSound: 200K+ videos, 300 classes.
- UrbanSound8K-AV: 8,732 samples, 10 classes (audio + visual).
- Independent prompts are trained for each of the 4 cases (Complete, Vision Only, Audio Only, Noise to Both); a minimal sketch follows the scenario list below.
- 4 Training Scenarios:
- ✅ Complete (Audio + Visual)
- 🎥 Vision Only (Noisy Audio)
- 🎵 Audio Only (Noisy Visual)
- ❌ Noise to Both (Noisy Audio + Visual)
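A hypothetical sketch of per-scenario prompt banks, using an `nn.ParameterDict` keyed by scenario name; the actual training code may organize and name these differently.

```python
import torch
import torch.nn as nn

SCENARIOS = ["complete", "vision_only", "audio_only", "noise_to_both"]  # assumed names

class ScenarioPromptBank(nn.Module):
    """One independent set of learnable prompt tokens per training scenario (sketch)."""

    def __init__(self, num_prompts: int = 8, embed_dim: int = 768):
        super().__init__()
        self.banks = nn.ParameterDict({
            name: nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
            for name in SCENARIOS
        })

    def forward(self, scenario: str, batch_size: int) -> torch.Tensor:
        # Pick the prompts matching the corruption applied to the current batch.
        return self.banks[scenario].unsqueeze(0).expand(batch_size, -1, -1)

bank = ScenarioPromptBank()
print(bank("noise_to_both", batch_size=4).shape)  # torch.Size([4, 8, 768])
```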
`python train.py --dataset UrbanSound8K-AV --epochs 50 --batch_size 16 --lr 1e-4`

- Uses all learned prompts concatenated to handle Uncertain Missing Modality.
- All learned prompts are combined for inference, ensuring robust performance in noisy and missing modality conditions.
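The test-time combination could look roughly like the self-contained toy example below; the number of prompts per scenario and the concatenation order are assumptions.

```python
import torch
import torch.nn as nn

num_prompts, embed_dim, batch_size = 8, 768, 4
# Prompts learned independently for each training scenario (toy stand-ins).
banks = nn.ParameterDict({
    name: nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
    for name in ["complete", "vision_only", "audio_only", "noise_to_both"]
})
# At inference, concatenate every bank so the model needs no knowledge of
# which modality is missing or noisy.
combined = torch.cat(list(banks.values()), dim=0)            # (4 * num_prompts, embed_dim)
combined = combined.unsqueeze(0).expand(batch_size, -1, -1)  # broadcast over the batch
print(combined.shape)                                         # torch.Size([4, 32, 768])
```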
`python evaluation.py --dataset UrbanSound8K-AV --case noise_to_both`

| Case | Fine-Tuning (FT) | FT + Prompt Learning (PL) | Improvement |
|---|---|---|---|
| ✅ Complete | 0.99 | 0.99 | - |
| 🎥 Vision Only (Noisy A) | 0.69 | 0.79 | +0.10 |
| 🎵 Audio Only (Noisy V) | 0.83 | 0.86 | +0.03 |
| ❌ Noise to Both | 0.71 | 0.80 | +0.09 |
- ✅ Complete Case:
- Both Fine-Tuning (FT) and Prompt Learning (PL) achieve near-perfect performance.
- Indicates that Prompt Learning does not degrade performance in ideal conditions despite being computationally more efficient.
- 🎥 Vision Only (Noisy Audio):
- PL demonstrates significant improvement (+0.10) over FT by leveraging visual features more effectively through cross-attention and prompts.
- Highlights the robustness of PL in compensating for noisy audio data by emphasizing the complementary modality.
- 🎵 Audio Only (Noisy Visual):
- Improvement is smaller (+0.03) but still notable.
- Suggests the audio modality is less sensitive to the corrupted visual input (FT already reaches 0.83), so prompts add robustness without depending heavily on visual data.
- ❌ Noise to Both:
- PL provides a substantial gain (+0.09) in the most challenging scenario.
- Demonstrates the ability of prompts to optimize cross-modal interactions, ensuring stable performance even under severe noise.
| Method | Total Memory (GiB) | Training Memory (GiB) | Memory Saving | Time per Epoch |
|---|---|---|---|---|
| Fine-Tuning | 95.12 | 93.89 | - | 1 min |
| Prompt Learning | 17.85 | 13.62 | 82.3% | 2.4 sec |
- 💾 Memory Usage:
- PL significantly reduces total memory usage by 82.3%, lowering computational demands.
- This is achieved by learning only a small set of prompt parameters, unlike FT, which updates the entire model.
- 🌱 Training Memory:
- PL uses 13.62 GiB compared to 93.89 GiB in FT.
- Such drastic memory savings make PL scalable for larger datasets and models, particularly in resource-constrained environments.
- ⏰ Training Time:
- PL requires only 2.4 seconds per epoch, a 96% reduction compared to FT (1 minute per epoch).
- This efficiency is particularly critical for large-scale or real-time applications where training time is a bottleneck.
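For intuition on where these savings come from, here is a generic sketch of freezing a pretrained backbone and optimizing only the prompt parameters; the Transformer encoder below is a stand-in, not the repository's actual model or hyperparameters.

```python
import torch
import torch.nn as nn

def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Stand-in backbone (roughly ViT-Base-sized) and a small prompt tensor.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=12
)
prompts = nn.Parameter(torch.randn(8, 768) * 0.02)

# Freeze the backbone: only the prompts receive gradients, which is what keeps
# training memory and time far below full fine-tuning.
for p in encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW([prompts], lr=1e-4)  # updates prompt parameters only
print(count_trainable(encoder), prompts.numel())   # 0 trainable backbone params vs. 8 * 768 = 6144
```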