A rigorous investigation of temporal architecture choices for depression severity regression from facial video, using identical WebFace260M texture features.
Main Paper: Automatic Depression Recognition with an Ensemble of Multimodal Spatio-Temporal Routing Features — Read Paper Here
- Motivation & Scope
- Problem Definition
- Controlled Experimental Protocol
- Models Compared
- Results
- Key Scientific Findings
- Implications for Future Architecture Design
- Limitations
- Contributing
Depression detection from facial video requires modeling long temporal sequences (1000-2000 frames) under weak supervision (a single BDI-II label per video). Recent work, including the LDBM architecture from the paper above, proposes complex temporal modeling strategies (patch-based attention, multi-scale merging). However, it remains unclear:
Does architectural complexity actually improve generalization, or does the bottleneck lie elsewhere?
- A controlled ablation study isolating temporal modeling strategies on identical features
- A debugging journey documenting feature extraction pitfalls and fixes
- Not a claim that "simple models are always better"
- Not a final depression detection system
| Component | Specification |
|---|---|
| Task | Depression severity regression (BDI-II, 0-63 scale) |
| Dataset | AVEC2014 (50 train / 50 dev / 50 test subjects) |
| Input | Aligned facial frames (112×112 PNG) |
| Features | WebFace260M ResNet-50 embeddings (512-D, L2-normalized) |
| Sequence Length | 2000 frames (zero-padded or truncated) |
| Supervision | Weak: single BDI-II label per 2000-frame video |
| Evaluation Metrics | MAE ↓, RMSE ↓, PCC ↑, σ_ratio (prediction variance / label variance) |
- WebFace260M: State-of-the-art face recognition encoder; tests whether identity-focused features transfer to affective tasks
- 2000 frames: Matches LDBM protocol from Paper for fair comparison
- Weak supervision: Reflects real-world clinical labeling constraints (one assessment per session)
- σ_ratio diagnostic: Detects model collapse (σ_ratio ≈ 0) vs. meaningful variance (σ_ratio > 0.5)
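Under these definitions, the four metrics can be computed in a few lines of NumPy (a minimal sketch; the function name and structure are ours, not from the released code):

```python
import numpy as np

def evaluate(preds: np.ndarray, labels: np.ndarray) -> dict:
    """Compute MAE, RMSE, PCC, and the sigma_ratio collapse diagnostic."""
    mae = float(np.mean(np.abs(preds - labels)))
    rmse = float(np.sqrt(np.mean((preds - labels) ** 2)))
    pcc = float(np.corrcoef(preds, labels)[0, 1])     # Pearson correlation
    # sigma_ratio near 0 means the model predicts a near-constant value (collapse)
    sigma_ratio = float(preds.std() / labels.std())
    return {"MAE": mae, "RMSE": rmse, "PCC": pcc, "sigma_ratio": sigma_ratio}
```

A model that collapses to predicting the label mean can score a deceptively low MAE while having σ_ratio ≈ 0, which is why the diagnostic is reported alongside the error metrics.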
| Variable | Value | Rationale |
|---|---|---|
| Encoder | WebFace260M ResNet-50 (frozen) | Isolate temporal modeling from feature learning |
| Feature Dim | 512-D per frame | Matches LDBM input specification |
| Sequence Length | 2000 frames | Standardized temporal context |
| Normalization | L2-normalized embeddings + z-score labels | Training stability |
| Loss Function | MAE (L1) | Robust to outliers in clinical scores |
| Optimizer | Adam (lr=3e-4, weight_decay=1e-4) | Standard choice for transformers |
| Batch Size | 32 (train), 64 (eval) | Memory-efficient on Colab T4 |
| Evaluation | Subject-level aggregation + window-level diagnostics | Realistic deployment metric |
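The normalization and sequence-length choices above can be sketched as follows (an illustration only; the function names are ours, and `SEQ_LEN` mirrors the 2000-frame protocol):

```python
import numpy as np

SEQ_LEN = 2000  # standardized temporal context (zero-padded or truncated)

def preprocess_features(feats: np.ndarray) -> np.ndarray:
    """L2-normalize per-frame embeddings and pad/truncate to SEQ_LEN frames."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(norms, 1e-8, None)          # L2 normalization
    if len(feats) >= SEQ_LEN:
        return feats[:SEQ_LEN]                          # truncate long videos
    pad = np.zeros((SEQ_LEN - len(feats), feats.shape[1]), dtype=feats.dtype)
    return np.concatenate([feats, pad], axis=0)         # zero-pad short videos

def zscore_labels(y: np.ndarray):
    """Z-score BDI-II labels; keep train-set stats to denormalize predictions."""
    mu, sigma = float(y.mean()), float(y.std())
    return (y - mu) / sigma, mu, sigma
```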
| Variable | Values Tested | Research Question |
|---|---|---|
| Temporal Architecture | Mean pooling, Patch Attention (LDBM), Full Attention (Transformer), Sequential SSM (MambaGRU) | Does architectural complexity improve generalization? |
| Patch Size (LDBM) | 4, 8, 16 | Does finer temporal granularity help? |
| Prediction Variance | σ_ratio diagnostic | Is the model learning signal or collapsing to constant? |
```
for each 512-frame window:
    predict normalized BDI-II score
    compute MAE, PCC vs. normalized label

for each subject:
    aggregate window predictions (mean)
    denormalize to original BDI-II scale
    compute MAE, RMSE, PCC vs. ground truth
```

| Model | Paradigm | Key Mechanism | Params | Temporal Complexity |
|---|---|---|---|---|
| TemporalMeanMLP | No temporal modeling | Global mean pooling + MLP | 131K | O(1) |
| LDBM | Patch-based attention | Multi-scale merging + structured attention | 1.35M | O(N·log N) |
| VanillaTransformer | Full self-attention | Standard transformer encoder | ~500K | O(N²) |
| MambaGRU | Sequential state-space | Causal GRU (SSM approximation) | 528K | O(N) |
Note: The LDBM implementation follows the paper, with modifications for a controlled comparison (NIE, FAPs, and rPPG modules removed).
```python
class TemporalMeanMLP(nn.Module):
    def forward(self, x):                        # x: (B, 512, T) frame features
        x = x.mean(dim=2)                        # global temporal mean pooling
        return self.mlp(x).squeeze(1)
```

```python
class LDBM(nn.Module):
    def forward(self, x):
        patches = unfold(x, patch_size, stride)  # split sequence into temporal patches
        z = self.proj(patches)
        z = self.pos_enc(z)
        for layer in self.encoder_layers:        # structured patch attention
            z = layer(z)
        z = self.merge_patches(z)                # multi-scale merging
        return self.regressor(z.mean(dim=1))
```

```python
class VanillaTemporalTransformer(nn.Module):
    def forward(self, x):
        x = x.permute(0, 2, 1)                   # (B, 512, T) -> (B, T, 512)
        T = x.size(1)
        x = self.input_proj(x) + self.pos_emb[:, :T]
        x = self.transformer(x)                  # full O(T^2) self-attention
        return self.regressor(x.mean(dim=1))
```

```python
class MambaGRU(nn.Module):
    def forward(self, x):
        x = x.permute(0, 2, 1)                   # (B, 512, T) -> (B, T, 512)
        T = x.size(1)
        x = self.input_proj(x) + self.pos_emb[:, :T]
        _, hidden = self.gru(x)                  # causal sequential state
        return self.regressor(hidden[-1])
```

| Model | Paradigm | Params | Dev MAE ↓ | Dev PCC ↑ | Test MAE ↓ | Test PCC ↑ | σ_ratio |
|---|---|---|---|---|---|---|---|
| TemporalMeanMLP | Mean pooling | 131K | 8.645 | 0.426 | 10.426 | 0.101 | 0.53 |
| LDBM | Patch Attention | 1.35M | 9.061 | 0.485 | 10.529 | 0.267 | 0.93 |
| VanillaTransformer | Full Attention | ~500K | 8.272 | 0.484 | 10.428 | 0.186 | 0.79 |
| MambaGRU | Sequential SSM | 528K | 8.885 | 0.508 | 10.907 | 0.181 | 0.80 |
Test MAE: mean=10.573, std=0.195, range=[10.426, 10.907]
Test PCC: mean=0.184, std=0.068, range=[0.101, 0.267]
Dev→Test Gap: +2.0 MAE, -0.25 PCC (consistent across architectures)
- Development PCC > 0.4 for all architectures
- σ_ratio > 0.5 indicates non-collapsed predictions
- Conclusion: WebFace260M features contain some depression-relevant signal
| Metric | Development | Testing | Delta |
|---|---|---|---|
| MAE | 8.3-9.1 | 10.4-10.9 | +2.0 |
| PCC | 0.42-0.51 | 0.10-0.27 | -0.25 |
Interpretation: Weak supervision (a single label per video) creates a noisy training signal that models overfit, limiting cross-subject generalization.
- Test MAE variance across 4 architectures: std = 0.195 (negligible)
- No architecture clearly dominates generalization
- Conclusion: Temporal modeling capacity is not the primary bottleneck
- Highest Test PCC (0.267) → patch-based attention may capture different temporal patterns
- Highest σ_ratio (0.93) → produces most varied predictions (useful for uncertainty estimation)
- But: 10× more parameters than baseline, higher Dev MAE
- Trade-off: Worth it if you need prediction variance; not for pure accuracy
Evidence:
• All architectures achieve similar Test MAE (10.4-10.9)
• All show identical Dev→Test generalization gap
• WebFace features (identity-focused) contain limited affective signal
Implication:
• Prioritize affect-specific encoders (FER2013-finetuned, expression-aware)
• Consider multi-modal fusion (audio + visual + physiological)
• Self-supervised temporal pretraining on unlabeled facial video may help
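As one concrete reading of the multi-modal suggestion, a late-fusion head could concatenate per-modality embeddings before regression (a hypothetical sketch; the class name and dimensions are ours, not part of this repo):

```python
import torch
import torch.nn as nn

class LateFusionRegressor(nn.Module):
    """Hypothetical late fusion: concatenate visual and audio embeddings."""
    def __init__(self, dims=(512, 128), hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual, audio):            # (B, 512), (B, 128)
        fused = torch.cat([visual, audio], dim=-1)
        return self.mlp(fused).squeeze(-1)       # (B,) severity prediction
```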
Evidence:
• Single BDI-II label per 2000-frame video → high label noise
• Models learn dataset-specific patterns that don't transfer
• Dev PCC = 0.42-0.51 vs. Test PCC = 0.10-0.27
Implication:
• Explore temporal label smoothing or multi-instance learning
• Consider subject-adaptive fine-tuning for deployment
• Collect denser temporal annotations if possible
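The multi-instance-learning suggestion could, for example, replace mean aggregation of window predictions with learned attention over window embeddings (a sketch in the style of gated-attention MIL; the class name is ours):

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-weighted pooling over window embeddings (one bag per video)."""
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, windows):                        # (B, N, D) window embeddings
        w = torch.softmax(self.score(windows), dim=1)  # (B, N, 1), weights sum to 1
        return (w * windows).sum(dim=1)                # (B, D) video-level embedding
```

Under a single video-level label, this lets the model down-weight uninformative windows instead of averaging them in.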
Evidence:
• TemporalMeanMLP achieves Test MAE within 0.1 of complex models
• No architecture shows statistically significant improvement
• Training time: MLP (5 min) vs. LDBM (45 min) on Colab T4
Implication:
• Start with mean-pooling baseline before investing in complex architectures
• Use σ_ratio diagnostic to detect collapse early
• Reserve complex models for cases where prediction variance matters
Bug Discovered:
• buffalo_l model returned zeros for pre-aligned faces
• All features were 100% zero → all models collapsed (σ_ratio=0.00)
Fix Implemented:
• Direct ONNX runtime inference with manual preprocessing
• L2 normalization verification + per-frame variance checks
• Random initialization fallback (not zeros) for failed detections
Lesson:
• Always verify feature statistics before training
• Include variance diagnostics in evaluation pipeline
• Document extraction fixes for reproducibility
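The lesson generalizes into a pre-training sanity check that would have caught the all-zero bug immediately (a sketch; the thresholds are illustrative, not from the released code):

```python
import numpy as np

def check_features(feats: np.ndarray, name: str = "features") -> None:
    """Fail fast if an extracted feature matrix looks degenerate."""
    zero_frac = np.mean(np.all(feats == 0, axis=1))   # fraction of all-zero frames
    assert zero_frac < 0.5, f"{name}: {zero_frac:.0%} of frames are all-zero"
    assert feats.std() > 1e-6, f"{name}: near-zero variance (collapsed features)"
    norms = np.linalg.norm(feats, axis=1)
    assert np.allclose(norms[norms > 0], 1.0, atol=1e-3), \
        f"{name}: embeddings are not L2-normalized"
```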
Based on this ablation study:
- Start simple: Use mean-pooling baseline as lower bound
- Prioritize features: Invest in affect-specific encoders before temporal complexity
- Handle weak supervision: Incorporate temporal smoothing or multi-instance learning
- Monitor variance: Use σ_ratio diagnostic during development
- Justify complexity: Only add architectural components that show clear generalization benefit
| Component | Decision | Rationale |
|---|---|---|
| Temporal Modeling | Hybrid: mean pooling + lightweight attention | Balance efficiency and expressivity |
| Feature Encoder | Affect-specific (FER2013-finetuned) + WebFace fusion | Address identity/affect mismatch |
| Supervision | Temporal label smoothing + subject-adaptive head | Reduce weak supervision noise |
| Uncertainty | Monte Carlo dropout + σ_ratio monitoring | Enable trustworthy predictions |
| Deployment | Export to ONNX + quantization support | Ensure edge-device compatibility |
| Metric | Target | Rationale |
|---|---|---|
| Test MAE | < 9.5 | Improve over ablation baseline (10.4) |
| Test PCC | > 0.30 | Double the best ablation result (0.267) |
| Dev→Test Gap | < 1.0 MAE | Reduce generalization gap by 50% |
| Inference Time | < 100ms/frame | Enable real-time deployment |
| Model Size | < 5M params | Edge-device compatibility |
- Single Dataset: Results are specific to AVEC2014; generalization to other datasets requires validation
- Single Feature Type: Only WebFace260M texture features tested; audio or multi-modal features may change conclusions
- Weak Supervision: Single BDI-II label per video creates inherent noise; denser annotations may improve results
- Computational Constraints: Experiments run on Colab T4; larger models may require more resources
- Clinical Translation: This is a research prototype; not validated for clinical deployment
This is an independent research project. Contributions are welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/your-idea`)
- Make your changes (follow existing code style)
- Add tests for new functionality
- Submit a pull request with clear description
- Code Style: Follow PEP 8; use type hints; document public functions
- Testing: Add unit tests for new modules; ensure existing tests pass
- Documentation: Update README/docs for user-facing changes
- Reproducibility: Include config files and seeds for experiments
- Bug Reports: Use GitHub Issues with minimal reproducible example
- Feature Requests: Discuss in GitHub Discussions before implementing
- Questions: Check README first, then ask in Discussions
Built with ❤️ for reproducible affective computing research