A rigorous investigation of temporal architecture choices for depression severity regression from facial video, using identical WebFace260M texture features.
Main Paper: Automatic Depression Recognition with an Ensemble of Multimodal Spatio-Temporal Routing Features — Read Paper Here
- Motivation & Scope
- Problem Definition
- Controlled Experimental Protocol
- Models Compared
- Results
- Key Scientific Findings
- Implications for Future Architecture Design
- Limitations
- Contributing
Depression detection from facial video requires modeling long temporal sequences (1000-2000 frames) under weak supervision (a single BDI-II label per video). Recent work, including the LDBM architecture from the paper above, proposes complex temporal modeling strategies (patch-based attention, multi-scale merging). However, it remains unclear:
Does architectural complexity actually improve generalization, or does the bottleneck lie elsewhere?
- A controlled ablation study isolating temporal modeling strategies on identical features
- A debugging journey documenting feature extraction pitfalls and fixes
- Not a claim that "simple models are always better"
- Not a final depression detection system
| Component | Specification |
|---|---|
| Task | Depression severity regression (BDI-II, 0-63 scale) |
| Dataset | AVEC2014 (50 train / 50 dev / 50 test subjects) |
| Input | Aligned facial frames (112×112 PNG) |
| Features | WebFace260M ResNet-50 embeddings (512-D, L2-normalized) |
| Sequence Length | 2000 frames (zero-padded or truncated) |
| Supervision | Weak: single BDI-II label per 2000-frame video |
| Evaluation Metrics | MAE ↓, RMSE ↓, PCC ↑, σ_ratio (prediction variance / label variance) |
- WebFace260M: State-of-the-art face recognition encoder; tests whether identity-focused features transfer to affective tasks
- 2000 frames: Matches LDBM protocol from Paper for fair comparison
- Weak supervision: Reflects real-world clinical labeling constraints (one assessment per session)
- σ_ratio diagnostic: Detects model collapse (σ_ratio ≈ 0) vs. meaningful variance (σ_ratio > 0.5)
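Under these definitions, the four metrics can be computed in a few lines of NumPy (a minimal sketch; the function name and structure are ours, not from the released code):

```python
import numpy as np

def evaluate(preds: np.ndarray, labels: np.ndarray) -> dict:
    """Compute MAE, RMSE, PCC, and the sigma_ratio collapse diagnostic."""
    mae = float(np.mean(np.abs(preds - labels)))
    rmse = float(np.sqrt(np.mean((preds - labels) ** 2)))
    pcc = float(np.corrcoef(preds, labels)[0, 1])     # Pearson correlation
    # sigma_ratio near 0 means the model predicts a near-constant value (collapse)
    sigma_ratio = float(preds.std() / labels.std())
    return {"MAE": mae, "RMSE": rmse, "PCC": pcc, "sigma_ratio": sigma_ratio}
```

A model that collapses to predicting the label mean can score a deceptively low MAE while having σ_ratio ≈ 0, which is why the diagnostic is reported alongside the error metrics.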
| Variable | Value | Rationale |
|---|---|---|
| Encoder | WebFace260M ResNet-50 (frozen) | Isolate temporal modeling from feature learning |
| Feature Dim | 512-D per frame | Matches LDBM input specification |
| Sequence Length | 2000 frames | Standardized temporal context |
| Normalization | L2-normalized embeddings + z-score labels | Training stability |
| Loss Function | MAE (L1) | Robust to outliers in clinical scores |
| Optimizer | Adam (lr=3e-4, weight_decay=1e-4) | Standard choice for transformers |
| Batch Size | 32 (train), 64 (eval) | Memory-efficient on Colab T4 |
| Evaluation | Subject-level aggregation + window-level diagnostics | Realistic deployment metric |
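The normalization and sequence-length choices above can be sketched as follows (an illustration only; the function names are ours, and `SEQ_LEN` mirrors the 2000-frame protocol):

```python
import numpy as np

SEQ_LEN = 2000  # standardized temporal context (zero-padded or truncated)

def preprocess_features(feats: np.ndarray) -> np.ndarray:
    """L2-normalize per-frame embeddings and pad/truncate to SEQ_LEN frames."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(norms, 1e-8, None)          # L2 normalization
    if len(feats) >= SEQ_LEN:
        return feats[:SEQ_LEN]                          # truncate long videos
    pad = np.zeros((SEQ_LEN - len(feats), feats.shape[1]), dtype=feats.dtype)
    return np.concatenate([feats, pad], axis=0)         # zero-pad short videos

def zscore_labels(y: np.ndarray):
    """Z-score BDI-II labels; keep train-set stats to denormalize predictions."""
    mu, sigma = float(y.mean()), float(y.std())
    return (y - mu) / sigma, mu, sigma
```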
| Variable | Values Tested | Research Question |
|---|---|---|
| Temporal Architecture | Mean pooling, Patch Attention (LDBM), Full Attention (Transformer), Sequential SSM (MambaGRU) | Does architectural complexity improve generalization? |
| Patch Size (LDBM) | 4, 8, 16 | Does finer temporal granularity help? |
| Prediction Variance | σ_ratio diagnostic | Is the model learning signal or collapsing to constant? |
```
for each 512-frame window:
    predict normalized BDI-II score
    compute MAE, PCC vs. normalized label

for each subject:
    aggregate window predictions (mean)
    denormalize to original BDI-II scale
    compute MAE, RMSE, PCC vs. ground truth
```

| Model | Paradigm | Key Mechanism | Params | Temporal Complexity |
|---|---|---|---|---|
| TemporalMeanMLP | No temporal modeling | Global mean pooling + MLP | 131K | O(1) |
| LDBM | Patch-based attention | Multi-scale merging + structured attention | 1.35M | O(N·log N) |
| VanillaTransformer | Full self-attention | Standard transformer encoder | ~500K | O(N²) |
| MambaGRU | Sequential state-space | Causal GRU (SSM approximation) | 528K | O(N) |
Note: The LDBM implementation follows the paper, with modifications for a controlled comparison (NIE, FAPs, and rPPG modules removed).
```python
class TemporalMeanMLP(nn.Module):
    def forward(self, x):                        # x: (B, 512, T) frame features
        x = x.mean(dim=2)                        # global temporal mean pooling
        return self.mlp(x).squeeze(1)
```

```python
class LDBM(nn.Module):
    def forward(self, x):
        patches = unfold(x, patch_size, stride)  # split sequence into temporal patches
        z = self.proj(patches)
        z = self.pos_enc(z)
        for layer in self.encoder_layers:        # structured patch attention
            z = layer(z)
        z = self.merge_patches(z)                # multi-scale merging
        return self.regressor(z.mean(dim=1))
```

```python
class VanillaTemporalTransformer(nn.Module):
    def forward(self, x):
        x = x.permute(0, 2, 1)                   # (B, 512, T) -> (B, T, 512)
        T = x.size(1)
        x = self.input_proj(x) + self.pos_emb[:, :T]
        x = self.transformer(x)                  # full O(T^2) self-attention
        return self.regressor(x.mean(dim=1))
```

```python
class MambaGRU(nn.Module):
    def forward(self, x):
        x = x.permute(0, 2, 1)                   # (B, 512, T) -> (B, T, 512)
        T = x.size(1)
        x = self.input_proj(x) + self.pos_emb[:, :T]
        _, hidden = self.gru(x)                  # causal sequential state
        return self.regressor(hidden[-1])
```

| Model | Paradigm | Params | Dev MAE ↓ | Dev PCC ↑ | Test MAE ↓ | Test PCC ↑ | σ_ratio |
|---|---|---|---|---|---|---|---|
| TemporalMeanMLP | Mean pooling | 131K | 8.645 | 0.426 | 10.426 | 0.101 | 0.53 |
| LDBM | Patch Attention | 1.35M | 9.061 | 0.485 | 10.529 | 0.267 | 0.93 |
| VanillaTransformer | Full Attention | ~500K | 8.272 | 0.484 | 10.428 | 0.186 | 0.79 |
| MambaGRU | Sequential SSM | 528K | 8.885 | 0.508 | 10.907 | 0.181 | 0.80 |
Test MAE: mean=10.573, std=0.195, range=[10.426, 10.907]
Test PCC: mean=0.184, std=0.068, range=[0.101, 0.267]
Dev→Test Gap: +2.0 MAE, -0.25 PCC (consistent across architectures)
- Development PCC > 0.4 for all architectures
- σ_ratio > 0.5 indicates non-collapsed predictions
- Conclusion: WebFace260M features contain some depression-relevant signal
| Metric | Development | Testing | Delta |
|---|---|---|---|
| MAE | 8.3-9.1 | 10.4-10.9 | +2.0 |
| PCC | 0.42-0.51 | 0.10-0.27 | -0.25 |
Interpretation: Weak supervision (a single label per video) creates a noisy training signal that models overfit, limiting cross-subject generalization.
- Test MAE variance across 4 architectures: std = 0.195 (negligible)
- No architecture clearly dominates generalization
- Conclusion: Temporal modeling capacity is not the primary bottleneck
- Highest Test PCC (0.267) → patch-based attention may capture different temporal patterns
- Highest σ_ratio (0.93) → produces most varied predictions (useful for uncertainty estimation)
- But: 10× more parameters than baseline, higher Dev MAE
- Trade-off: Worth it if you need prediction variance; not for pure accuracy
Evidence:
• All architectures achieve similar Test MAE (10.4-10.9)
• All show identical Dev→Test generalization gap
• WebFace features (identity-focused) contain limited affective signal
Implication:
• Prioritize affect-specific encoders (FER2013-finetuned, expression-aware)
• Consider multi-modal fusion (audio + visual + physiological)
• Self-supervised temporal pretraining on unlabeled facial video may help
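As one concrete reading of the multi-modal suggestion, a late-fusion head could concatenate per-modality embeddings before regression (a hypothetical sketch; the class name and dimensions are ours, not part of this repo):

```python
import torch
import torch.nn as nn

class LateFusionRegressor(nn.Module):
    """Hypothetical late fusion: concatenate visual and audio embeddings."""
    def __init__(self, dims=(512, 128), hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual, audio):            # (B, 512), (B, 128)
        fused = torch.cat([visual, audio], dim=-1)
        return self.mlp(fused).squeeze(-1)       # (B,) severity prediction
```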
Evidence:
• Single BDI-II label per 2000-frame video → high label noise
• Models learn dataset-specific patterns that don't transfer
• Dev PCC = 0.42-0.51 vs. Test PCC = 0.10-0.27
Implication:
• Explore temporal label smoothing or multi-instance learning
• Consider subject-adaptive fine-tuning for deployment
• Collect denser temporal annotations if possible
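The multi-instance-learning suggestion could, for example, replace mean aggregation of window predictions with learned attention over window embeddings (a sketch in the style of gated-attention MIL; the class name is ours):

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-weighted pooling over window embeddings (one bag per video)."""
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, windows):                        # (B, N, D) window embeddings
        w = torch.softmax(self.score(windows), dim=1)  # (B, N, 1), weights sum to 1
        return (w * windows).sum(dim=1)                # (B, D) video-level embedding
```

Under a single video-level label, this lets the model down-weight uninformative windows instead of averaging them in.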
Evidence:
• TemporalMeanMLP achieves Test MAE within 0.1 of complex models
• No architecture shows statistically significant improvement
• Training time: MLP (5 min) vs. LDBM (45 min) on Colab T4
Implication:
• Start with mean-pooling baseline before investing in complex architectures
• Use σ_ratio diagnostic to detect collapse early
• Reserve complex models for cases where prediction variance matters
Bug Discovered:
• buffalo_l model returned zeros for pre-aligned faces
• All features were 100% zero → all models collapsed (σ_ratio=0.00)
Fix Implemented:
• Direct ONNX runtime inference with manual preprocessing
• L2 normalization verification + per-frame variance checks
• Random initialization fallback (not zeros) for failed detections
Lesson:
• Always verify feature statistics before training
• Include variance diagnostics in evaluation pipeline
• Document extraction fixes for reproducibility
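The lesson generalizes into a pre-training sanity check that would have caught the all-zero bug immediately (a sketch; the thresholds are illustrative, not from the released code):

```python
import numpy as np

def check_features(feats: np.ndarray, name: str = "features") -> None:
    """Fail fast if an extracted feature matrix looks degenerate."""
    zero_frac = np.mean(np.all(feats == 0, axis=1))   # fraction of all-zero frames
    assert zero_frac < 0.5, f"{name}: {zero_frac:.0%} of frames are all-zero"
    assert feats.std() > 1e-6, f"{name}: near-zero variance (collapsed features)"
    norms = np.linalg.norm(feats, axis=1)
    assert np.allclose(norms[norms > 0], 1.0, atol=1e-3), \
        f"{name}: embeddings are not L2-normalized"
```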
Based on this ablation study:
- Start simple: Use mean-pooling baseline as lower bound
- Prioritize features: Invest in affect-specific encoders before temporal complexity
- Handle weak supervision: Incorporate temporal smoothing or multi-instance learning
- Monitor variance: Use σ_ratio diagnostic during development
- Justify complexity: Only add architectural components that show clear generalization benefit
| Component | Decision | Rationale |
|---|---|---|
| Temporal Modeling | Hybrid: mean pooling + lightweight attention | Balance efficiency and expressivity |
| Feature Encoder | Affect-specific (FER2013-finetuned) + WebFace fusion | Address identity/affect mismatch |
| Supervision | Temporal label smoothing + subject-adaptive head | Reduce weak supervision noise |
| Uncertainty | Monte Carlo dropout + σ_ratio monitoring | Enable trustworthy predictions |
| Deployment | Export to ONNX + quantization support | Ensure edge-device compatibility |
| Metric | Target | Rationale |
|---|---|---|
| Test MAE | < 9.5 | Improve over ablation baseline (10.4) |
| Test PCC | > 0.30 | Double the best ablation result (0.267) |
| Dev→Test Gap | < 1.0 MAE | Reduce generalization gap by 50% |
| Inference Time | < 100ms/frame | Enable real-time deployment |
| Model Size | < 5M params | Edge-device compatibility |
- Single Dataset: Results are specific to AVEC2014; generalization to other datasets requires validation
- Single Feature Type: Only WebFace260M texture features tested; audio or multi-modal features may change conclusions
- Weak Supervision: Single BDI-II label per video creates inherent noise; denser annotations may improve results
- Computational Constraints: Experiments run on Colab T4; larger models may require more resources
- Clinical Translation: This is a research prototype; not validated for clinical deployment
This is an independent research project. Contributions are welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/your-idea`)
- Make your changes (follow existing code style)
- Add tests for new functionality
- Submit a pull request with clear description
- Code Style: Follow PEP 8; use type hints; document public functions
- Testing: Add unit tests for new modules; ensure existing tests pass
- Documentation: Update README/docs for user-facing changes
- Reproducibility: Include config files and seeds for experiments
- Bug Reports: Use GitHub Issues with minimal reproducible example
- Feature Requests: Discuss in GitHub Discussions before implementing
- Questions: Check README first, then ask in Discussions
Built with ❤️ for reproducible affective computing research