
feat(blacksmith-cache): add explicit durability flush after unmount (BLA-3202) #19

Merged
adityamaru merged 8 commits into main from devin/1770147855-bla-3202-durability-flush
Feb 4, 2026

Conversation

@adityamaru
Contributor

@adityamaru adityamaru commented Feb 3, 2026

Summary

Adds a blockdev --flushbufs operation after unmounting the git-mirror sticky disk to ensure data durability before the Ceph RBD snapshot is taken. This is part of BLA-3202, which adds explicit durability fences across the Blacksmith infrastructure.

Changes:

  • Add getDeviceFromMount() to detect block device from mount point (tries findmnt, falls back to parsing mount output)
  • Add flushBlockDevice() to execute flush with I/O stats logging before/after
  • Call flush after successful unmount in cleanup(), before commit
  • Support ENABLE_DURABILITY_FLUSH env var for gradual rollout (default: enabled)

The flush is best-effort—failures are logged as warnings but don't break the cleanup flow to maintain backward compatibility.
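
The rollout flag and the mount-output fallback listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `isDurabilityFlushEnabled` and `parseDeviceFromMountOutput` are hypothetical, and the set of values that disable the flag ("false" or "0") is an assumption, since the description only states that the default is enabled.

```typescript
// Hypothetical helper: the PR does not show how ENABLE_DURABILITY_FLUSH is parsed.
// Assumption: only an explicit "false" or "0" disables the flush.
function isDurabilityFlushEnabled(
  env: Record<string, string | undefined>,
): boolean {
  const raw = env["ENABLE_DURABILITY_FLUSH"];
  if (raw === undefined) return true; // default: enabled
  const value = raw.trim().toLowerCase();
  return !(value === "false" || value === "0");
}

// Hypothetical fallback parser for `mount` output, whose lines look like
// "<device> on <mountpoint> type <fstype> (<options>)".
// Returns the last matching /dev/* source, or null if none is found.
function parseDeviceFromMountOutput(
  mountOutput: string,
  mountPoint: string,
): string | null {
  let found: string | null = null;
  for (const line of mountOutput.split("\n")) {
    const match = /^(\S+) on (.+) type \S+ \(.*\)$/.exec(line.trim());
    const device = match?.[1];
    const target = match?.[2];
    if (
      device !== undefined &&
      target === mountPoint &&
      device.startsWith("/dev/")
    ) {
      found = device; // later entries shadow earlier ones at the same path
    }
  }
  return found;
}
```

In a Node process the flag check could be called as `isDurabilityFlushEnabled(process.env)`. The primary path through findmnt (for example `findmnt -n -o SOURCE --target <mountPoint>`) is not shown here; the parser only covers the fallback.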

Review & Testing Checklist for Human

  • Verify the flush sequence is correct: device path captured before unmount → unmount → flush (device still mapped even though unmounted) → commit
  • Confirm 30-second timeout for flush is appropriate for git-mirror workloads
  • Review error handling: all failures log warnings but don't fail cleanup—is this the desired behavior?
  • Test in staging environment with actual RBD devices to verify flush executes and logs correctly
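
To make the ordering in the first checklist item concrete, here is a sketch of the cleanup sequence with the side effects injected. It is an assumption-laden outline rather than the merged implementation: `CleanupOps` and `cleanupStickyDisk` are hypothetical names, the operations are synchronous for brevity, and the rule that a failed unmount skips the commit reflects the behavior added later in this conversation.

```typescript
// Hypothetical interface; the real functions in the PR may have different signatures.
interface CleanupOps {
  getDeviceFromMount(mountPoint: string): string | null;
  unmount(mountPoint: string): boolean; // true on success
  flushBlockDevice(device: string): void; // may throw on failure or timeout
  commit(): void;
  warn(message: string): void;
}

// Returns true if the sticky disk was committed.
function cleanupStickyDisk(
  mountPoint: string,
  ops: CleanupOps,
  flushEnabled: boolean,
): boolean {
  // 1. Capture the device while the filesystem is still mounted;
  //    after unmount the mount table no longer maps the path to a device.
  const device = ops.getDeviceFromMount(mountPoint);

  // 2. Unmount. A failed unmount means the on-disk state is not trusted.
  if (!ops.unmount(mountPoint)) {
    ops.warn(`unmount of ${mountPoint} failed; skipping commit`);
    return false;
  }

  // 3. Best-effort flush: failures are logged and never block the commit.
  if (flushEnabled) {
    if (device === null) {
      ops.warn(`no block device found for ${mountPoint}; skipping flush`);
    } else {
      try {
        ops.flushBlockDevice(device);
      } catch (err) {
        ops.warn(`flush of ${device} failed: ${String(err)}`);
      }
    }
  }

  // 4. Commit only after the unmount (and the attempted flush).
  ops.commit();
  return true;
}
```

Injecting the operations keeps the ordering testable without a real RBD device, which is one way to check the sequence before the staging run described below.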

Recommended test plan: Deploy to staging, run a workflow that uses git-mirror sticky disk, and verify logs show [git-mirror] guest flush duration: Xms, device: /dev/rbdX, before_stats: ..., after_stats: ...

Notes

This is part of a multi-repo change (BLA-3202). Related PRs:

Link to Devin run: https://app.devin.ai/sessions/611301f918674712b016558ddc2fba0e
Requested by: @adityamaru


Note

Medium Risk
Changes the sticky-disk cleanup/commit sequence (removes fsck, adds unmount retry and post-unmount device flush), which can affect cache durability and whether a disk is committed; failures are mostly best-effort but unmount behavior now gates commits.

Overview
Adds a durability fence to git-mirror sticky disk cleanup by capturing the backing block device for the mount point and running a best-effort blockdev --flushbufs after unmount (with before/after /sys/block/*/stat logging) and before commitStickyDisk.
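
The before/after stats mentioned here come from the kernel's per-device stat file, a single line of whitespace-separated counters. A rough sketch of reading it is below; `sysfsStatPath` and `parseBlockStat` are hypothetical names, the path mapping assumes a whole-disk device such as /dev/rbd0 rather than a partition, and which counters the PR actually logs is not shown in this conversation.

```typescript
// Subset of the counters in /sys/block/<dev>/stat (field order per the kernel's
// block stat documentation: read I/Os, read merges, read sectors, read ticks,
// write I/Os, write merges, write sectors, write ticks, in_flight, ...).
interface BlockStat {
  readIos: number;
  readSectors: number;
  writeIos: number;
  writeSectors: number;
  inFlight: number;
}

// Assumption: the device is a whole disk, so /dev/rbd0 maps to /sys/block/rbd0/stat.
function sysfsStatPath(device: string): string {
  const name = device.replace(/^\/dev\//, "");
  return `/sys/block/${name}/stat`;
}

// Parses the contents of a stat file; returns null if the text is malformed.
function parseBlockStat(text: string): BlockStat | null {
  const fields = text.trim().split(/\s+/).map(Number);
  if (fields.length < 11 || fields.some((n) => !Number.isFinite(n))) {
    return null;
  }
  // Pick fields 0, 2, 4, 6 and 8; the skipped slots are merges and ticks.
  const [readIos = 0, , readSectors = 0, , writeIos = 0, , writeSectors = 0, , inFlight = 0] =
    fields;
  return { readIos, readSectors, writeIos, writeSectors, inFlight };
}
```

Comparing `writeIos` and `inFlight` from a snapshot taken before the flush with one taken after gives a quick signal of whether the flush pushed any outstanding writes to the device.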

Reworks cleanup to remove the git fsck integrity gate and its metric reporting, and makes unmount more robust via timeout-wrapped retries with exponential backoff plus lsof/fuser diagnostics; unmount failure now prevents committing the sticky disk.
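
The retry behavior summarized above can be illustrated with a small generic helper. This is a sketch under assumptions, not the merged code: `backoffDelayMs` and `retryWithBackoff` are hypothetical names, the base delay is a free parameter because the PR does not state one, and the diagnostics callback stands in for the lsof/fuser output.

```typescript
// Delay before the next try after a failed attempt (1-based): base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * 2 ** (attempt - 1);
}

// Runs tryOnce up to maxAttempts times, sleeping with exponential backoff
// between failures. Calls onFinalFailure (e.g. to collect diagnostics about
// blocking processes) once all attempts are exhausted.
async function retryWithBackoff(
  tryOnce: () => Promise<boolean>,
  maxAttempts: number,
  baseMs: number,
  sleep: (ms: number) => Promise<void>,
  onFinalFailure: () => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await tryOnce()) return true;
    if (attempt < maxAttempts) {
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
  await onFinalFailure();
  return false;
}
```

For the unmount case, `tryOnce` would wrap a single timeout-bounded umount call and `maxAttempts` would be 3, matching the attempt count mentioned later in this conversation. The caller treats a `false` result as a reason not to commit.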

Written by Cursor Bugbot for commit 06d605f.

…BLA-3202)

Add blockdev --flushbufs operation after unmounting sticky disk to ensure
data durability before Ceph RBD snapshot is taken.

Changes:
- Add FLUSH_TIMEOUT_SECS constant (30 seconds)
- Add getDeviceFromMount() function to detect block device from mount point
- Add flushBlockDevice() function with I/O stats logging
- Call flush after successful unmount in cleanup()
- Support ENABLE_DURABILITY_FLUSH env var for gradual rollout (default: enabled)

The flush operation is best-effort - failures are logged but don't break
the cleanup flow to maintain backward compatibility.

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
@linear

linear bot commented Feb 3, 2026

@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
@adityamaru
Contributor Author

@devin can you:

  • remove the git fsck check, as it's timing out
  • add a retry around the umount: retry it a few times with some backoff, then print diagnostic info on which process is preventing the unmount

- Remove git fsck check that was timing out
- Add retry logic with exponential backoff for umount (3 attempts)
- Add diagnostic info (lsof/fuser) when unmount fails to show blocking processes

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
@adityamaru
Contributor Author

@devin if umount fails as well, we should not commit. Is that already the case?

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
@devin-ai-integration
Contributor

Good catch! I've added that logic - if unmount fails after all 3 retry attempts, shouldCommit and vmHydratedGitMirror are now set to false to prevent committing potentially corrupted state. See commit 07fa9a3.

…nup hang

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@adityamaru
Contributor Author

@devin address comments

…unt point

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
@adityamaru
Contributor Author

@devin can you shorten the unmount timeout to 10s?

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
@devin-ai-integration
Contributor

Done - reduced the umount timeout to 10s. See commit 06d605f.

@adityamaru adityamaru merged commit 2f7f0dc into main Feb 4, 2026
14 checks passed