Temporal Modeling for Depression Detection: A Controlled Ablation Study

A rigorous investigation of temporal architecture choices for depression severity regression from facial video, using identical WebFace260M texture features.

Python PyTorch Colab

Main Paper: *Automatic Depression Recognition with an Ensemble of Multimodal Spatio-Temporal Routing Features* (Read Paper Here)


Motivation & Scope

Why This Study?

Depression detection from facial video requires modeling long temporal sequences (1000-2000 frames) under weak supervision (a single BDI-II label per video). Recent work, including the LDBM architecture from the paper above, proposes complex temporal modeling strategies (patch-based attention, multi-scale merging). However, it remains unclear:

Does architectural complexity actually improve generalization, or does the bottleneck lie elsewhere?

What This Study Is

A controlled ablation study isolating temporal modeling strategies on identical features
A debugging journey documenting feature extraction pitfalls and fixes

What This Study Is Not

Not a claim that "simple models are always better"
Not a final depression detection system

Problem Definition

| Component | Specification |
|---|---|
| Task | Depression severity regression (BDI-II, 0-63 scale) |
| Dataset | AVEC2014 (50 train / 50 dev / 50 test subjects) |
| Input | Aligned facial frames (112×112 PNG) |
| Features | WebFace260M ResNet-50 embeddings (512-D, L2-normalized) |
| Sequence Length | 2000 frames (zero-padded or truncated) |
| Supervision | Weak: single BDI-II label per 2000-frame video |
| Evaluation Metrics | MAE ↓, RMSE ↓, PCC ↑, σ_ratio (prediction variance / label variance) |
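The fixed-length step above (zero-pad or truncate to 2000 frames) can be sketched as follows. This is a minimal NumPy illustration; `fix_length` and the `(T, 512)` array layout are assumptions for the example, not the repo's actual function:

```python
import numpy as np

def fix_length(feats: np.ndarray, target: int = 2000) -> np.ndarray:
    """Zero-pad or truncate a (T, 512) per-frame feature array to exactly
    `target` frames, matching the fixed 2000-frame protocol."""
    T, D = feats.shape
    if T >= target:
        return feats[:target]          # truncate long sequences
    out = np.zeros((target, D), dtype=feats.dtype)
    out[:T] = feats                    # zero-pad short sequences
    return out
```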

Why These Choices?

  • WebFace260M: State-of-the-art face recognition encoder; tests whether identity-focused features transfer to affective tasks
  • 2000 frames: Matches LDBM protocol from Paper for fair comparison
  • Weak supervision: Reflects real-world clinical labeling constraints (one assessment per session)
  • σ_ratio diagnostic: Detects model collapse (σ_ratio ≈ 0) vs. meaningful variance (σ_ratio > 0.5)
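As a concrete illustration of the σ_ratio diagnostic, a hypothetical helper (not from the repo) might look like:

```python
import numpy as np

def sigma_ratio(preds, labels):
    """Collapse diagnostic: std of predictions over std of labels.
    A value near 0 means the model outputs a near-constant score."""
    return float(np.std(preds) / np.std(labels))
```

A model that always predicts the label mean yields σ_ratio = 0; predictions with label-like spread yield σ_ratio near 1.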

Controlled Experimental Protocol

Variables Held Constant (Fair Comparison)

| Variable | Value | Rationale |
|---|---|---|
| Encoder | WebFace260M ResNet-50 (frozen) | Isolate temporal modeling from feature learning |
| Feature Dim | 512-D per frame | Matches LDBM input specification |
| Sequence Length | 2000 frames | Standardized temporal context |
| Normalization | L2-normalized embeddings + z-score labels | Training stability |
| Loss Function | MAE (L1) | Robust to outliers in clinical scores |
| Optimizer | Adam (lr=3e-4, weight_decay=1e-4) | Standard choice for transformers |
| Batch Size | 32 (train), 64 (eval) | Memory-efficient on Colab T4 |
| Evaluation | Subject-level aggregation + window-level diagnostics | Realistic deployment metric |

Variables Ablated (What We Test)

| Variable | Values Tested | Research Question |
|---|---|---|
| Temporal Architecture | Mean pooling, Patch Attention (LDBM), Full Attention (Transformer), Sequential SSM (MambaGRU) | Does architectural complexity improve generalization? |
| Patch Size (LDBM) | 4, 8, 16 | Does finer temporal granularity help? |
| Prediction Variance | σ_ratio diagnostic | Is the model learning signal or collapsing to a constant? |

Evaluation Protocol

```text
for each 512-frame window:
    predict normalized BDI-II score
    compute MAE, PCC vs. normalized label

for each subject:
    aggregate window predictions (mean)
    denormalize to original BDI-II scale
    compute MAE, RMSE, PCC vs. ground truth
```
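The subject-level half of this protocol can be sketched in NumPy. Names such as `aggregate_subjects` and the z-score denormalization (`pred * std + mean`) are assumptions for illustration, not the repo's exact code:

```python
import numpy as np

def aggregate_subjects(window_preds, subject_ids, y_true, label_mean, label_std):
    """Subject-level evaluation: mean over each subject's window predictions,
    denormalized back to the BDI-II scale, then MAE/RMSE/PCC vs. ground truth."""
    by_subj = {}
    for p, s in zip(window_preds, subject_ids):
        by_subj.setdefault(s, []).append(p)
    subjects = sorted(by_subj)
    y_hat = np.array([np.mean(by_subj[s]) * label_std + label_mean for s in subjects])
    y = np.array([y_true[s] for s in subjects])
    mae = np.abs(y_hat - y).mean()
    rmse = np.sqrt(((y_hat - y) ** 2).mean())
    pcc = np.corrcoef(y_hat, y)[0, 1]
    return mae, rmse, pcc
```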

Models Compared

Architecture Overview

| Model | Paradigm | Key Mechanism | Params | Temporal Complexity |
|---|---|---|---|---|
| TemporalMeanMLP | No temporal modeling | Global mean pooling + MLP | 131K | O(1) |
| LDBM | Patch-based attention | Multi-scale merging + structured attention | 1.35M | O(N·log N) |
| VanillaTransformer | Full self-attention | Standard transformer encoder | ~500K | O(N²) |
| MambaGRU | Sequential state-space | Causal GRU (SSM approximation) | 528K | O(N) |

Note: The LDBM implementation follows the paper, with modifications for a controlled comparison (NIE, FAPs, and rPPG modules removed).

Implementation Details

TemporalMeanMLP (Baseline)

```python
class TemporalMeanMLP(nn.Module):
    def forward(self, x):              # x: (B, 512, T) frame features
        x = x.mean(dim=2)              # global mean pooling over time -> (B, 512)
        return self.mlp(x).squeeze(1)  # MLP regressor -> (B,)
```

LDBM (Patch-based Transformer)

```python
class LDBM(nn.Module):
    def forward(self, x):                        # x: (B, 512, T)
        patches = unfold(x, patch_size, stride)  # split sequence into temporal patches
        z = self.proj(patches)                   # project patches to model dim
        z = self.pos_enc(z)                      # add positional encoding
        for layer in self.encoder_layers:
            z = layer(z)                         # structured attention within patches
            z = self.merge_patches(z)            # multi-scale merging between layers
        return self.regressor(z.mean(dim=1))     # pool patches -> scalar score
```
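For reference, non-overlapping temporal patching can be done with PyTorch's `Tensor.unfold`; this is an illustrative stand-in for the `unfold` call above, not necessarily the repo's implementation:

```python
import torch

x = torch.randn(2, 512, 2000)                    # (B, D, T) frame features
patches = x.unfold(dimension=2, size=8, step=8)  # -> (B, D, num_patches, patch_size)
# With T=2000 and patch size 8, this yields 250 non-overlapping patches.
```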

VanillaTransformer

```python
class VanillaTemporalTransformer(nn.Module):
    def forward(self, x):                             # x: (B, 512, T)
        x = x.permute(0, 2, 1)                        # -> (B, T, 512)
        T = x.size(1)
        x = self.input_proj(x) + self.pos_emb[:, :T]  # learned positional embedding
        x = self.transformer(x)                       # full O(T^2) self-attention
        return self.regressor(x.mean(dim=1))          # mean-pool tokens -> score
```

MambaGRU (Sequential SSM Approximation)

```python
class MambaGRU(nn.Module):
    def forward(self, x):                             # x: (B, 512, T)
        x = x.permute(0, 2, 1)                        # -> (B, T, 512)
        T = x.size(1)
        x = self.input_proj(x) + self.pos_emb[:, :T]
        _, hidden = self.gru(x)                       # causal recurrence over T steps
        return self.regressor(hidden[-1])             # last hidden state -> score
```

Results

Final Comparison Table (All Metrics on Original BDI-II Scale)

| Model | Paradigm | Params | Dev MAE ↓ | Dev PCC ↑ | Test MAE ↓ | Test PCC ↑ | σ_ratio |
|---|---|---|---|---|---|---|---|
| TemporalMeanMLP | Mean pooling | 131K | 8.645 | 0.426 | 10.426 | 0.101 | 0.53 |
| LDBM | Patch Attention | 1.35M | 9.061 | 0.485 | 10.529 | 0.267 | 0.93 |
| VanillaTransformer | Full Attention | ~500K | 8.272 | 0.484 | 10.428 | 0.186 | 0.79 |
| MambaGRU | Sequential SSM | 528K | 8.885 | 0.508 | 10.907 | 0.181 | 0.80 |

Performance Statistics

```text
Test MAE:     mean=10.573, std=0.195, range=[10.426, 10.907]
Test PCC:     mean=0.184,  std=0.068, range=[0.101, 0.267]
Dev→Test Gap: +2.0 MAE, -0.25 PCC (consistent across architectures)
```

Key Observations

All Models Learn Meaningful Signal

  • Development PCC > 0.4 for all architectures
  • σ_ratio > 0.5 indicates non-collapsed predictions
  • Conclusion: WebFace260M features contain some depression-relevant signal

Significant Generalization Gap

| Metric | Development | Testing | Delta |
|---|---|---|---|
| MAE | 8.3-9.1 | 10.4-10.9 | +2.0 |
| PCC | 0.42-0.51 | 0.10-0.27 | -0.25 |

Interpretation: Weak supervision (single label per video) creates noisy training signal that models overfit to, limiting cross-subject generalization.

Architecture Choice Has Minimal Impact on Test Performance

  • Test MAE variance across 4 architectures: std = 0.195 (negligible)
  • No architecture clearly dominates generalization
  • Conclusion: Temporal modeling capacity is not the primary bottleneck

LDBM Shows Interesting Trade-offs

  • Highest Test PCC (0.267) → patch-based attention may capture different temporal patterns
  • Highest σ_ratio (0.93) → produces most varied predictions (useful for uncertainty estimation)
  • But: 10× more parameters than baseline, higher Dev MAE
  • Trade-off: Worth it if you need prediction variance; not for pure accuracy

Key Scientific Findings

Finding 1: Feature Representation Is the Primary Bottleneck

Evidence:
• All architectures achieve similar Test MAE (10.4-10.9)
• All show identical Dev→Test generalization gap
• WebFace features (identity-focused) contain limited affective signal

Implication:
• Prioritize affect-specific encoders (FER2013-finetuned, expression-aware)
• Consider multi-modal fusion (audio + visual + physiological)
• Self-supervised temporal pretraining on unlabeled facial video may help

Finding 2: Weak Supervision Limits Cross-Subject Generalization

Evidence:
• Single BDI-II label per 2000-frame video → high label noise
• Models learn dataset-specific patterns that don't transfer
• Dev PCC = 0.42-0.51 vs. Test PCC = 0.10-0.27

Implication:
• Explore temporal label smoothing or multi-instance learning
• Consider subject-adaptive fine-tuning for deployment
• Collect denser temporal annotations if possible
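One way to frame the multi-instance suggestion above, as a minimal sketch: windows are instances, the video is the bag, and only the aggregated bag prediction is supervised. The `mil_loss` helper is hypothetical, not part of this repo:

```python
import torch
import torch.nn as nn

def mil_loss(window_scores: torch.Tensor, bag_label: torch.Tensor) -> torch.Tensor:
    """Mean-aggregation multi-instance loss: window_scores is (B, W) per-window
    predictions; the single video-level label supervises only the bag mean."""
    bag_pred = window_scores.mean(dim=1)           # (B, W) -> (B,)
    return nn.functional.l1_loss(bag_pred, bag_label)
```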

Finding 3: Simple Baselines Are Strong Competitors

Evidence:
• TemporalMeanMLP achieves Test MAE within 0.1 of complex models
• No architecture shows statistically significant improvement
• Training time: MLP (5 min) vs. LDBM (45 min) on Colab T4

Implication:
• Start with mean-pooling baseline before investing in complex architectures
• Use σ_ratio diagnostic to detect collapse early
• Reserve complex models for cases where prediction variance matters

Finding 4: Feature Pipeline Debugging Is Critical

Bug Discovered:
• buffalo_l model returned zeros for pre-aligned faces
• All features were 100% zero → all models collapsed (σ_ratio=0.00)

Fix Implemented:
• Direct ONNX runtime inference with manual preprocessing
• L2 normalization verification + per-frame variance checks
• Random initialization fallback (not zeros) for failed detections

Lesson:
• Always verify feature statistics before training
• Include variance diagnostics in evaluation pipeline
• Document extraction fixes for reproducibility
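The sanity checks described under Finding 4 can be sketched as below. The `check_features` helper and its thresholds are illustrative assumptions, not the repo's actual pipeline:

```python
import numpy as np

def check_features(feats: np.ndarray, eps: float = 1e-6):
    """Pre-training sanity checks on (N, 512) embeddings: flag all-zero
    frames and verify L2 normalization before any model sees the data."""
    zero_frac = float((np.abs(feats).sum(axis=1) < eps).mean())
    norms = np.linalg.norm(feats, axis=1)
    if zero_frac > 0.01:
        raise ValueError(f"{zero_frac:.0%} of frames are all-zero: extractor is broken")
    if not np.allclose(norms[norms > eps], 1.0, atol=1e-3):
        raise ValueError("embeddings are not L2-normalized")
    return zero_frac, float(norms.mean())
```

Running this on the buggy buffalo_l output described above would fail immediately on the all-zero check, instead of surfacing later as σ_ratio = 0.00 after training.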

Implications for Future Architecture Design

Design Principles

Based on this ablation study:

  1. Start simple: Use mean-pooling baseline as lower bound
  2. Prioritize features: Invest in affect-specific encoders before temporal complexity
  3. Handle weak supervision: Incorporate temporal smoothing or multi-instance learning
  4. Monitor variance: Use σ_ratio diagnostic during development
  5. Justify complexity: Only add architectural components that show clear generalization benefit

Specific Architecture Considerations

| Component | Decision | Rationale |
|---|---|---|
| Temporal Modeling | Hybrid: mean pooling + lightweight attention | Balance efficiency and expressivity |
| Feature Encoder | Affect-specific (FER2013-finetuned) + WebFace fusion | Address identity/affect mismatch |
| Supervision | Temporal label smoothing + subject-adaptive head | Reduce weak supervision noise |
| Uncertainty | Monte Carlo dropout + σ_ratio monitoring | Enable trustworthy predictions |
| Deployment | Export to ONNX + quantization support | Ensure edge-device compatibility |
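The uncertainty row above could be realized with Monte Carlo dropout along these lines. This is a generic sketch, not the repo's code: keep dropout active at inference and average several stochastic passes:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n: int = 20):
    """Monte Carlo dropout: run n stochastic forward passes with dropout
    enabled; the mean is the prediction, the std an uncertainty estimate."""
    model.train()  # keeps dropout layers active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n)])
    return preds.mean(dim=0), preds.std(dim=0)
```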

Success Metrics

| Metric | Target | Rationale |
|---|---|---|
| Test MAE | < 9.5 | Improve over ablation baseline (10.4) |
| Test PCC | > 0.30 | Exceed the best ablation result (0.267) |
| Dev→Test Gap | < 1.0 MAE | Reduce generalization gap by 50% |
| Inference Time | < 100ms/frame | Enable real-time deployment |
| Model Size | < 5M params | Edge-device compatibility |

Limitations

  1. Single Dataset: Results are specific to AVEC2014; generalization to other datasets requires validation
  2. Single Feature Type: Only WebFace260M texture features tested; audio or multi-modal features may change conclusions
  3. Weak Supervision: Single BDI-II label per video creates inherent noise; denser annotations may improve results
  4. Computational Constraints: Experiments run on Colab T4; larger models may require more resources
  5. Clinical Translation: This is a research prototype; not validated for clinical deployment

Contributing

This is an independent research project. Contributions are welcome!

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/your-idea)
  3. Make your changes (follow existing code style)
  4. Add tests for new functionality
  5. Submit a pull request with clear description

Contribution Guidelines

  • Code Style: Follow PEP 8; use type hints; document public functions
  • Testing: Add unit tests for new modules; ensure existing tests pass
  • Documentation: Update README/docs for user-facing changes
  • Reproducibility: Include config files and seeds for experiments

Issues & Discussions

  • Bug Reports: Use GitHub Issues with minimal reproducible example
  • Feature Requests: Discuss in GitHub Discussions before implementing
  • Questions: Check README first, then ask in Discussions

Built with ❤️ for reproducible affective computing research
