feat(blacksmith-cache): add explicit durability flush after unmount (BLA-3202)#19
Conversation
…BLA-3202) Add blockdev --flushbufs operation after unmounting sticky disk to ensure data durability before Ceph RBD snapshot is taken. Changes: - Add FLUSH_TIMEOUT_SECS constant (30 seconds) - Add getDeviceFromMount() function to detect block device from mount point - Add flushBlockDevice() function with I/O stats logging - Call flush after successful unmount in cleanup() - Support ENABLE_DURABILITY_FLUSH env var for gradual rollout (default: enabled) The flush operation is best-effort - failures are logged but don't break the cleanup flow to maintain backward compatibility. Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
|
@devin can you:
|
- Remove git fsck check that was timing out - Add retry logic with exponential backoff for umount (3 attempts) - Add diagnostic info (lsof/fuser) when unmount fails to show blocking processes Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
|
@devin if umount fails as well we should not commit, is that already the case? |
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
|
Good catch! I've added that logic - if unmount fails after all 3 retry attempts, |
…nup hang Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
|
@devin address comments |
…unt point Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
|
@devin can you shorten the timout of unmount to 10s |
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
|
Done - reduced the umount timeout to 10s. See commit 06d605f. |
Summary
Adds
blockdev --flushbufsoperation after unmounting the git-mirror sticky disk to ensure data durability before the Ceph RBD snapshot is taken. This is part of BLA-3202 which adds explicit durability fences across the Blacksmith infrastructure.Changes:
getDeviceFromMount()to detect block device from mount point (triesfindmnt, falls back to parsingmountoutput)flushBlockDevice()to execute flush with I/O stats logging before/aftercleanup(), before commitENABLE_DURABILITY_FLUSHenv var for gradual rollout (default: enabled)The flush is best-effort—failures are logged as warnings but don't break the cleanup flow to maintain backward compatibility.
Review & Testing Checklist for Human
Recommended test plan: Deploy to staging, run a workflow that uses git-mirror sticky disk, and verify logs show
[git-mirror] guest flush duration: Xms, device: /dev/rbdX, before_stats: ..., after_stats: ...Notes
This is part of a multi-repo change (BLA-3202). Related PRs:
Link to Devin run: https://app.devin.ai/sessions/611301f918674712b016558ddc2fba0e
Requested by: @adityamaru
Note
Medium Risk
Changes the sticky-disk cleanup/commit sequence (removes
fsck, adds unmount retry and post-unmount device flush), which can affect cache durability and whether a disk is committed; failures are mostly best-effort but unmount behavior now gates commits.Overview
Adds a durability fence to git-mirror sticky disk cleanup by capturing the backing block device for the mount point and running a best-effort
blockdev --flushbufsafter unmount (with before/after/sys/block/*/statlogging) and beforecommitStickyDisk.Reworks cleanup to remove the
git fsckintegrity gate and its metric reporting, and makes unmount more robust viatimeout-wrapped retries with exponential backoff pluslsof/fuserdiagnostics; unmount failure now prevents committing the sticky disk.Written by Cursor Bugbot for commit 06d605f. This will update automatically on new commits. Configure here.