
Add file system mounting support (FUSE, Cloud Files, NFS, FileProvider)#296

Open
chipsenkbeil wants to merge 197 commits into master from feature/file-mount

Conversation

@chipsenkbeil
Owner

Summary

  • Consolidate 6 file I/O protocol variants into 2 (FileRead/FileWrite) with ReadFileOptions and WriteFileOptions, adding range read/write support needed by mount backends
  • Add distant-mount crate with shared mount infrastructure: InodeTable (bidirectional inode↔path mapping with ref counting + LRU eviction), TTL caches (attr, dir, read), write-back buffers, and RemoteFs translation layer
  • Implement 4 platform-native mount backends:
    • FUSE (Linux, FreeBSD, macOS) via fuser — default on Unix
    • Windows Cloud Files (Windows 10+) via cloud-filter — native File Explorer placeholders
    • NFS (all Unix; fallback where FUSE is unavailable, e.g. OpenBSD, NetBSD) via nfsserve — localhost NFSv3 server
    • macOS FileProvider via objc2-file-provider — native Finder integration with NSFileProviderReplicatedExtension
  • Add CLI distant mount <MOUNT_POINT> and distant unmount <MOUNT_POINT> commands

Closes #145

Test plan

  • Verify cargo test --all-features -p distant-core -p distant-host passes (protocol consolidation)
  • Verify cargo test -p distant-mount --no-default-features passes (54 unit tests for inode table, caches, write buffers)
  • Verify cargo clippy -p distant-mount --no-default-features --features nfs passes
  • Verify cargo clippy -p distant-mount --no-default-features --features macos-file-provider passes
  • Manual smoke test: distant mount /tmp/test on a Linux/macOS host with FUSE installed
  • Verify Windows Cloud Files backend compiles on Windows CI
  • Verify NFS backend compiles and is clippy-clean with --features nfs

Reduce 6 file I/O request variants (FileRead, FileReadText, FileWrite,
FileWriteText, FileAppend, FileAppendText) to 2 (FileRead, FileWrite)
with options structs. This is groundwork for the file mount feature
which needs offset-based reads and writes.

- Add ReadFileOptions (offset, len) and WriteFileOptions (offset, append)
- Consolidate Api trait from 6 methods to 2 with options parameters
- Update all three backends (host, ssh, docker) with options handling
- ChannelExt keeps convenience wrappers (read_file_text, append_file, etc.)
- Update all tests across the workspace
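To make the consolidation concrete, here is a minimal sketch of what the two option structs and a ranged read might look like. Only the field names (`offset`, `len`, `append`) come from the commit message; everything else is illustrative, operating on an in-memory buffer the way a mount backend needs page-sized reads.

```rust
/// Hypothetical shape of the consolidated option structs.
#[derive(Debug, Default, Clone, Copy)]
pub struct ReadFileOptions {
    /// Byte offset to start reading from (None = start of file).
    pub offset: Option<u64>,
    /// Maximum number of bytes to read (None = to end of file).
    pub len: Option<u64>,
}

#[derive(Debug, Default, Clone, Copy)]
pub struct WriteFileOptions {
    /// Byte offset to write at (None = start of file).
    pub offset: Option<u64>,
    /// Append to the end instead of writing at an offset.
    pub append: bool,
}

/// Apply a ranged read to a byte buffer, clamping out-of-range requests.
pub fn ranged_read(data: &[u8], opts: ReadFileOptions) -> Vec<u8> {
    let start = opts.offset.unwrap_or(0).min(data.len() as u64) as usize;
    let end = match opts.len {
        Some(len) => ((start as u64 + len).min(data.len() as u64)) as usize,
        None => data.len(),
    };
    data[start..end].to_vec()
}
```

The `Default` impls preserve the old whole-file semantics, so the six former variants reduce to the two consolidated requests plus option values.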

New crate providing the shared infrastructure for mounting remote
filesystems locally. This includes:

- InodeTable: bidirectional inode<->RemotePath mapping with ref counting
  and LRU eviction (54 unit tests)
- Cache layer: AttrCache, DirCache, ReadCache with TTL + LRU eviction
- WriteBuffer: per-file write-back buffer with dirty range tracking
- RemoteFs: translation layer from filesystem ops to ChannelExt calls
- MountConfig/MountHandle: configuration and lifecycle management
- Backend module skeleton for platform-specific implementations

The mount feature is opt-in (not in default features) since it requires
platform-specific libraries (macFUSE/libfuse).
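The core of the crate is the bidirectional inode table. The following is a minimal sketch under assumptions: the real InodeTable also does LRU eviction (omitted here), and the method names below (`lookup`, `forget`, `path_of`) are illustrative, chosen to mirror FUSE's lookup/forget lifecycle.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Bidirectional inode <-> path mapping with ref counting.
pub struct InodeTable {
    next_ino: u64,
    by_ino: HashMap<u64, (PathBuf, u64)>, // ino -> (path, refcount)
    by_path: HashMap<PathBuf, u64>,
}

impl InodeTable {
    pub fn new() -> Self {
        // Inode 1 is conventionally reserved for the root.
        Self { next_ino: 2, by_ino: HashMap::new(), by_path: HashMap::new() }
    }

    /// Return the existing inode for `path`, or allocate one; bumps refcount.
    pub fn lookup(&mut self, path: PathBuf) -> u64 {
        if let Some(&ino) = self.by_path.get(&path) {
            self.by_ino.get_mut(&ino).unwrap().1 += 1;
            return ino;
        }
        let ino = self.next_ino;
        self.next_ino += 1;
        self.by_ino.insert(ino, (path.clone(), 1));
        self.by_path.insert(path, ino);
        ino
    }

    /// FUSE-style forget: drop `n` references, freeing the entry at zero.
    pub fn forget(&mut self, ino: u64, n: u64) {
        let gone = match self.by_ino.get_mut(&ino) {
            Some((_, count)) => {
                *count = count.saturating_sub(n);
                *count == 0
            }
            None => false,
        };
        if gone {
            if let Some((path, _)) = self.by_ino.remove(&ino) {
                self.by_path.remove(&path);
            }
        }
    }

    pub fn path_of(&self, ino: u64) -> Option<&PathBuf> {
        self.by_ino.get(&ino).map(|(p, _)| p)
    }
}
```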

Implements the FUSE backend for Linux, FreeBSD, and macOS using fuser:

- FuseHandler implements fuser::Filesystem, delegating all callbacks
  to RemoteFs (lookup, getattr, read, write, readdir, create, mkdir,
  unlink, rmdir, rename, flush, fsync, release, forget)
- Public mount() entry point creates RemoteFs and spawns background
  FUSE session
- CLI `distant mount <MOUNT_POINT>` command with options for
  remote-root, readonly, cache TTLs
- CLI `distant unmount <MOUNT_POINT>` for clean unmount

The mount feature is opt-in and requires macFUSE (macOS), libfuse
(Linux), or FUSE (FreeBSD) to be installed.

Implements the Cloud Filter API backend for native File Explorer
integration on Windows 10+ using the cloud-filter crate:

- CloudFilesHandler implements SyncFilter with fetch_data (on-demand
  hydration), fetch_placeholders (lazy directory population),
  deleted, and renamed callbacks
- Sync root registration/deregistration lifecycle
- Placeholder files appear natively in File Explorer

The windows-cloud-files feature requires Windows 10+ and the
cloud-filter crate. Exact API compatibility will be verified when
first compiled on Windows.

Implements a localhost NFSv3 server using the nfsserve crate that
translates NFS operations to distant API calls:

- NfsHandler implements NFSFileSystem trait (lookup, getattr, read,
  write, readdir, create, remove, rename, mkdir, setattr)
- Server binds to 127.0.0.1 on a random port
- OS-native mount_nfs command attaches the server
- Platform-specific mount commands for OpenBSD, NetBSD, Linux,
  macOS, and FreeBSD

The nfs feature is available on all Unix platforms but the mount()
entry point is gated to OpenBSD/NetBSD where FUSE is unavailable.
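A sketch of how the per-platform mount command might be assembled. The commit message does not show the actual flags distant-mount passes, so the option strings below (`vers=3,tcp,port=...`) are assumptions based on typical localhost NFSv3 setups, not confirmed code.

```rust
/// Build an OS-native NFS mount command line for a localhost NFSv3
/// server listening on `port`. Flag sets are illustrative assumptions.
pub fn mount_command(os: &str, port: u16, mount_point: &str) -> Vec<String> {
    let target = String::from("127.0.0.1:/");
    match os {
        // macOS ships a dedicated mount_nfs binary.
        "macos" => vec![
            "mount_nfs".into(),
            "-o".into(),
            format!("nolocks,vers=3,tcp,port={port},mountport={port}"),
            target,
            mount_point.into(),
        ],
        // Linux and the BSDs go through mount(8) with an nfs type.
        _ => vec![
            "mount".into(),
            "-t".into(),
            "nfs".into(),
            "-o".into(),
            format!("nolock,vers=3,tcp,port={port},mountport={port}"),
            target,
            mount_point.into(),
        ],
    }
}
```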

Documents the architecture for the future FileProvider backend which
requires a .appex inside a .app bundle (hard Apple requirement):

- DistantMount.app: headless container app (LSBackgroundOnly=true)
- DistantFileProvider.appex: NSFileProviderReplicatedExtension
- IPC via App Group shared container for connection credentials
- Build via shell script (not Xcode), code signed with codesign

The macos-file-provider feature flag is defined but has no
dependencies yet. Implementation requires the objc2 ecosystem
crates and the .app bundle infrastructure.

Full implementation of NSFileProviderReplicatedExtension using
objc2 and objc2-file-provider for native Finder integration on
macOS 12+:

- DistantFileProvider: implements NSFileProviderReplicatedExtension
  with initWithDomain, invalidate, itemForIdentifier,
  fetchContents, createItem, modifyItem, deleteItem
- DistantFileProviderItem: implements NSFileProviderItemProtocol
  with itemIdentifier, parentItemIdentifier, filename
- DistantFileProviderEnumerator: implements NSFileProviderEnumerator
  with invalidate and enumerateItemsForObserver
- Global RemoteFs access via OnceLock for cross-process state
- mount_file_provider() public API for CLI --backend selection
- Added get_path/get_ino_for_path helpers to RemoteFs

Requires .appex inside .app bundle (Apple requirement). The
macos-file-provider feature enables the objc2 dependency chain.

…nd macOS bundle

Replace the single `mount` feature with per-backend features (mount-fuse,
mount-nfs, mount-windows-cloud-files, mount-macos-file-provider) defaulting
to mount-nfs since nfsserve is pure Rust and compiles everywhere.

Add MountBackend enum with cfg-gated variants, Default/Display/FromStr
impls, and unified mount() dispatch. The CLI exposes --backend to select
a backend explicitly; the default auto-selects based on platform and
whether the binary is running inside a .app bundle.
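The enum described above might look like the following sketch. The real enum cfg-gates each variant behind its mount-* feature; this version compiles all variants unconditionally for illustration, and the string names are assumed from the feature names.

```rust
use std::fmt;
use std::str::FromStr;

/// Sketch of the MountBackend dispatch enum (variants assumed from the
/// per-backend feature names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MountBackend {
    Fuse,
    Nfs,
    WindowsCloudFiles,
    MacosFileProvider,
}

impl Default for MountBackend {
    fn default() -> Self {
        // nfsserve is pure Rust and compiles everywhere.
        MountBackend::Nfs
    }
}

impl fmt::Display for MountBackend {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            MountBackend::Fuse => "fuse",
            MountBackend::Nfs => "nfs",
            MountBackend::WindowsCloudFiles => "windows-cloud-files",
            MountBackend::MacosFileProvider => "macos-file-provider",
        })
    }
}

impl FromStr for MountBackend {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fuse" => Ok(MountBackend::Fuse),
            "nfs" => Ok(MountBackend::Nfs),
            "windows-cloud-files" => Ok(MountBackend::WindowsCloudFiles),
            "macos-file-provider" => Ok(MountBackend::MacosFileProvider),
            other => Err(format!("unknown backend: {other}")),
        }
    }
}
```

With cfg-gated variants, a function like `available_backends()` can return only the compiled-in names for the CLI's `--backend` value list.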

Add watch-based cache invalidation in RemoteFs: Arc-wrap caches, spawn a
best-effort watch task that invalidates attr/dir/read caches on remote
filesystem changes, falling back to TTL-only when watch is unsupported.

Add macOS FileProvider .app bundle infrastructure: Info.plist files,
entitlements (production + dev), build-macos-bundle.sh script, extension
entry point with NSBundle-based detection, and App Group-aware socket
path in constants.rs.

Include mount-fuse, mount-nfs, mount-windows-cloud-files, and
mount-macos-file-provider in the LONG_VERSION feature list shown
by --version.

App Sandbox with ad-hoc codesigning causes macOS to kill the process
on launch. Remove sandbox and network.client entitlements (not needed
for local dev), keep testing-mode for FileProvider, and add
get-task-allow for debugger support.

Ad-hoc codesigning with entitlements produces signatures that
Gatekeeper rejects in /Applications. Default to no entitlements
for ad-hoc (local dev); set ENTITLEMENTS env var for distribution
builds with a real signing identity.

Use PossibleValuesParser with MountBackend::available_backends()
so clap lists the compiled-in backends as possible values for
the --backend flag.

RemoteFs::new() uses rt.block_on(system_info()) to resolve the
default remote root, which panics when called from within a tokio
runtime. Resolve system_info in the async CLI handler and pass the
result via MountConfig::remote_root so block_on is never needed.

Add needs_foreground flag to MountHandle. NFS/FUSE/Cloud Files
backends need a foreground process (server stays alive), so the
CLI blocks on Ctrl+C for those. FileProvider registers a persistent
domain and exits immediately — macOS manages the .appex lifecycle.

When macOS launches the .appex (cold boot, Finder access), it calls
initWithDomain: on DistantFileProvider. The extension now reads domain
metadata (connection_id, destination) from NSUserDefaults, calls a
resolver callback to get a Channel from the distant manager, and
creates a RemoteFs for all subsequent file operations.

- distant-mount: add init(), bootstrap(), ChannelResolver type, and
  TOKIO_HANDLE/CHANNEL_RESOLVER globals; modify initWithDomain to
  call bootstrap(); register_domain now persists metadata in
  NSUserDefaults and calls addDomain
- binary crate: run_extension() creates tokio runtime, inits file
  logging, registers channel resolver; resolve_connection() tries
  stored ID then falls back to destination search; connect_headless()
  uses DummyAuthHandler with exponential backoff
- MountConfig gains extra: Map field for backend-specific data

The .appex was invisible to fileproviderd due to missing Info.plist
keys and broken code signing.

Extension-Info.plist:
- Add CFBundlePackageType (XPC!), CFBundleInfoDictionaryVersion,
  CFBundleShortVersionString, CFBundleSupportedPlatforms,
  CFBundleDisplayName — required for pluginkit discovery
- Add NSExtensionFileProviderDocumentGroup — required for
  fileproviderd to associate the extension with its group container

Entitlements (split app vs appex):
- distant-dev.entitlements: application-groups, network.client,
  get-task-allow (app needs group access for NSUserDefaults write)
- distant-appex-dev.entitlements: same plus app-sandbox and
  fileprovider.testing-mode (appex requires sandbox for launch)
- Note: restricted entitlements require a development certificate;
  ad-hoc signing will register the domain but AMFI blocks the
  appex launch. Sign with CODESIGN_IDENTITY="Apple Development".

Build script:
- Use hardlink instead of symlink (codesign rejects symlinked
  executables)
- Separate ENTITLEMENTS and APPEX_ENTITLEMENTS variables
- Default to dev entitlements instead of none
- Remove stale domain (same ID) before addDomain to handle re-mounts
- Include connection_id in display name so multiple connections to the
  same host get unique CloudStorage folder paths
- Add APP_PROFILE / APPEX_PROFILE support to build script for
  embedding provisioning profiles
- Remove fileprovider.testing-mode from appex entitlements (requires
  separate Apple approval; not needed with proper provisioning)

…support

Replace metadata-file-based domain enumeration with macOS
getDomainsWithCompletionHandler and removeAllDomainsWithCompletionHandler
APIs so orphaned domains (whose metadata was lost) are properly cleaned up.
Remove symlink mount_point support — macOS manages the CloudStorage folder
path, so FileProvider mounts now reject a mount_point argument.

…nection info using json, and add log_level as option for macos provider via connection info with trace logging for implementation

Replace the broken ThreadedRemoteFs with a new Runtime struct that
bridges sync backend callbacks (FUSE, FileProvider) to async RemoteFs
operations via tokio Handle + OnceCell + watch channel.

- Runtime::new() for lazy init (.appex bootstrap path)
- Runtime::with_fs() for eager init (normal mount path)
- Runtime::spawn() dispatches async work, waits for init

Update FUSE backend for fuser 0.17 API (INodeNo, Errno, FileHandle,
Generation, etc.) and dispatch all callbacks through Runtime::spawn().

Update FileProvider backend with UnsafeSendable<T> wrapper for !Send
Apple types, async dispatch via Runtime::spawn(), and proper module
imports across provider/enumerator/item hierarchy.

Fix binary crate integration: missing .await on mount(), refactor
macos_file_provider.rs into macos_appex.rs, fix PossibleValuesParser.

…uff appropriately; fix warnings about dead code

The FileProvider framework validates items at runtime and aborts if
itemVersion is missing. Add itemVersion (mtime-based), capabilities
(read/write/delete/rename/reparent), and enumerator sync anchor stubs
to suppress the degraded-performance warnings. Also add logs-appex.sh
script for quick appex log inspection.

- Handle NSFileProviderWorkingSetContainerItemIdentifier (primary
  blocker for "Loading..." in Finder) and trash container by returning
  empty results immediately
- Fix root container handling in itemForIdentifier: use framework
  constant as item identifier with "/" filename (empty filename
  caused CRIT: missing filename crash)
- Fix parent identifiers everywhere: use root constant when parent
  ino=1 instead of hardcoded "1" (enumerator, fetchContents,
  modifyItem, createItem)
- Add resolve_parent_identifier helper for consistent parent ID
  resolution across all handlers
- Fix createItem to read file content from URL and write to remote
  (was creating empty files), use conformsToType for dir detection
- Signal bootstrap failures to Finder via NSFileProviderErrorCode::
  ServerUnreachable instead of returning empty results
- Add make_fp_error for proper NSFileProviderErrorDomain errors
- Add "Distant — " prefix to domain display names in Finder sidebar
- Clean up /tmp/distant_fp_* temp files on unmount
- Fix logs-appex.sh to check App Group container path
- Add channel resolver outcome logging in appex entry point
- Complete structured logging for enumerateChanges

Lists registered FileProvider domains with their metadata: domain
identifier, display name, metadata file presence, and destination.
Supports --format shell (table) and --format json output.

Exposes list_file_provider_domains() and DomainInfo from distant-mount
for the binary crate to use.

Skip remove_domain_blocking when domain doesn't exist — calling
removeDomain on a non-existent domain causes fileproviderd to unregister
the extension entirely, preventing the appex from launching. Now checks
get_all_domains() first and only removes if the domain ID is found.

Result: appex now launches and bootstraps successfully. 20/37 FP tests
pass (up from 11). Remaining 17 failures are "No such file or directory"
(enumeration timing) rather than "Operation timed out" (connectivity).

- signal_enumerator_for_domain: async function calls
  NSFileProviderManager.signalEnumerator after bootstrap + cache warm,
  telling macOS to enumerate now that the appex is connected
- wait_for_fp_mount_ready: polls read_dir until mount is accessible
  (restores implicit wait removed with discover_cloud_storage_entry)
- enumerateChanges returns SyncAnchorExpired to force fresh enumeration
  on every access (remote FS has no change tracking)

20/37 FP tests pass. Remaining 17 fail because tests seed data after
mount and macOS caching prevents immediate visibility — requires the
disconnect/reconnect architecture (TODO #9) for proper fix.

Production:
- Signal enumerator targets working set (was root — only working set works)
- Working set enumerator returns root items (was empty)
- enumerateChanges returns syncAnchorExpired to force fresh enumeration
- Configurable poll_interval via MountConfig.extra (default 5s)
- FP mount spawns background task that signals working set periodically
- Mount CLI gets --extra key=value flag for backend-specific config

Test infrastructure:
- FP mounts use poll_interval=0.05 (50ms) for fast refresh
- wait_for_fp_path helper polls local path until visible (10s timeout)

21/37 FP tests pass. Remaining 16 need wait_for_fp_path calls in tests.

Added mount::wait_for_path(mount_backend, path) — polls local path for
FP mounts (no-op for NFS/FUSE/WCF). FileProvider refreshes directory
listings asynchronously via the manager's 50ms poll_interval in tests.
Tests now wait for the FP refresh before reading through the mount.

37/37 FP tests pass. Combined with NFS/FUSE/Docker: 228/228 total.
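The polling idea generalizes to a small helper like the one below. This is a backend-free sketch, not the actual `wait_for_path` implementation: the condition is injected as a closure, whereas the real helper checks the local mount path and no-ops for NFS/FUSE/WCF.

```rust
use std::time::{Duration, Instant};

/// Poll `ready` until it returns true or `timeout` elapses, sleeping
/// `interval` between attempts. Returns whether the condition was met.
pub fn wait_until(
    mut ready: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(interval);
    }
}
```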

…plan

Move docs/mount-tests-PRD.md → PRD.md and docs/mount-tests-progress.md
→ PROGRESS.md so they live alongside README.md as canonical project
references. Correct the stale "9 FP failures remain" status — the FP
suite has been at 37/37 since 86d794d, total at 228/228.

Embed the full Network Resilience + Mount Health plan into PRD.md as a
new "Plan: Network Resilience + Mount Health" section so the plan
survives context compaction. PROGRESS.md gains an "Active plan"
pointer at the top plus a Phase 0 checklist (0a–0j) tracking the
incorporation of PR #288 and a Phase 1–6 checklist for the mount
health work that follows.

Update .claude/commands/mount-test-loop.md to point at the new
top-level paths.

…epalive

Move socket2 from unix-only to cross-platform dependencies and configure
SO_KEEPALIVE (15s idle, 5s probe interval) on every TCP stream owned by
distant: client connect, transport reconnect, and listener accept.

Cherry-picked from #288 (commit 61e48c0). Address
review comment 2933801998 ("Does this one function need to be pulled
up to be available?") by exposing keepalive through the public
TcpTransport surface instead of a `pub(crate) use` re-export of a free
helper:

- TcpTransport gains a `set_keepalive(&self) -> io::Result<()>`
  method that does the SO_KEEPALIVE configuration via socket2::SockRef.
- TcpTransport gains `from_accepted(stream, peer_addr) -> Self` which
  the listener uses to wrap accepted streams; it sets keepalive
  internally.
- TcpListener::accept stops reaching into a private helper and just
  delegates to TcpTransport::from_accepted.
- TcpTransport::connect / Reconnectable::reconnect call the new
  set_keepalive method internally so callers get keepalive for free.

Keepalive failures log a warning but do not fail the connection.

Add max_heartbeat_failures (default 3) to ServerConfig. The connection
loop now counts consecutive non-WouldBlock heartbeat write errors and
terminates the connection when the threshold is reached. The counter
resets on any successful write (heartbeat or response). Setting the
value to 0 disables the feature. Backward-compatible via serde default.

Cherry-picked from #288 (commit fa40953).
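The counter's semantics can be sketched in isolation. This is an illustrative model of the logic described above (count consecutive failures, reset on any success, 0 disables), not the actual connection-loop code.

```rust
/// Consecutive heartbeat-failure counter.
pub struct HeartbeatCounter {
    max_failures: u32,
    consecutive: u32,
}

impl HeartbeatCounter {
    pub fn new(max_failures: u32) -> Self {
        Self { max_failures, consecutive: 0 }
    }

    /// Record a failed heartbeat write; returns true when the
    /// connection should be terminated.
    pub fn record_failure(&mut self) -> bool {
        if self.max_failures == 0 {
            return false; // 0 disables the feature
        }
        self.consecutive += 1;
        self.consecutive >= self.max_failures
    }

    /// Any successful write (heartbeat or response) resets the counter.
    pub fn record_success(&mut self) {
        self.consecutive = 0;
    }
}
```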

start_file_provider and get_or_start_file_provider in singleton.rs call
crate::mount::install_test_app(), but the mount module is gated behind
the mount feature. Building distant-test-harness without the mount
feature (as any --all-features test of a subset of distant-core's
plugin deps does) errored out with "could not find mount in the crate root".

Add `feature = "mount"` to the existing `target_os = "macos"` cfg on
both functions so they only compile when the dependency is actually
available. All callers (in mount.rs) are already gated on the same
feature.

Extend the Plugin trait with default reconnect() (returns Unsupported)
and reconnect_strategy() (returns Fail). All three plugins override
both: Host, SSH, and Docker delegate reconnect to connect and return
ExponentialBackoff with backend-appropriate parameters:

- Host: 3 retries / 2s base / 30s max / 60s timeout
- SSH:  5 retries / 2s base / 30s max / 30s timeout
- Docker: 10 retries / 1s base / 60s max / 30s timeout (slow daemon
  restarts deserve more patience)

Cherry-picked from #288 (commit 3660e62). Strip
the new separator-style test section comments per review comments
2915971580 / 2933755107 / 2933823312 (CLAUDE.md anti-pattern #11) and
rename per-crate test functions to disambiguate them across crates
(docker_reconnect_strategy_returns_*, ssh_reconnect_strategy_returns_*).
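To make the per-backend parameters concrete, here is a sketch of the backoff schedule they imply: base delay doubling per attempt, clamped to the max. The real ReconnectStrategy API is not shown in the commit message, so the function shape is an assumption.

```rust
use std::time::Duration;

/// Compute the sleep before each retry attempt: base * 2^attempt,
/// clamped to `max`. Illustrative only; timeout handling is omitted.
pub fn backoff_schedule(retries: u32, base: Duration, max: Duration) -> Vec<Duration> {
    (0..retries)
        .map(|attempt| base.checked_mul(1u32 << attempt).unwrap_or(max).min(max))
        .collect()
}
```

For the Host parameters (3 retries, 2s base, 30s max) this yields 2s, 4s, 8s; Docker's 10 retries with a 1s base hit the 60s cap on later attempts, matching the "slow daemon restarts deserve more patience" intent.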

…ination

Add ShutdownSender type and ServerRef::shutdown_sender() for lightweight
shutdown signaling. The SSH backend spawns a health monitor that polls
api.is_session_closed() every 2s. The Docker backend polls daemon ping
and container state every 5s. When the backend dies, the health
monitor triggers server shutdown, dropping the in-memory transport so
the manager can detect the disconnection.

Add ApiServerHandler::from_arc(Arc<T>) so the in-memory server can
share its API instance with a health-monitor task that needs to query
backend liveness.

ChannelPool::is_closed and SshApi::is_session_closed are async on
file-mount because the russh handle lives behind a tokio Mutex (added
for tcpip_forward, see russh#658). The poll loop awaits both.

Cherry-picked from #288 (commit 993ed8d). Resolved
conflicts in distant-ssh/src/lib.rs (file-mount's tunnel state +
SshApi 5-arg constructor needed to coexist with PR #288's
Arc<SshApi> wrapping and ssh_health_monitor). Also added the
`options` field to two test cases in distant-core/src/api.rs that the
file-mount branch added since the cherry-pick base.

ManagerConnection::spawn() now clones the UntypedClient's
ConnectionWatcher and optionally spawns a monitor task that sends the
connection ID through a death notification channel when the connection
transitions to Disconnected. ManagerServer wires this up for all
connections created via connect().

For now the death-handling task in ManagerServer::new just logs the
disconnect — full reconnection orchestration arrives in step 0h
(adapted from PR #288 commit aa035a8). Step 0f only plumbs the
watcher through.

Cherry-picked from #288 (commit 594c3ca). Resolved
conflicts in distant-core/src/net/manager/server.rs to keep
file-mount's mount/tunnel struct fields and ManagerServer::new
constructor alongside the new death_tx field.

…ations

Adds the manager-side push protocol that PR #288 began but reshapes
it per the review thread on #288 (comments
2933812110, 2933814790, 2933821911, 2933826601). Instead of bespoke
SubscribeConnectionEvents / SubscribedConnectionEvents /
ConnectionStateChanged / Reconnect / ReconnectInitiated variants,
the protocol now exposes a generic three-piece API:

  ManagerRequest::Subscribe { topics: Vec<EventTopic> }
  ManagerRequest::Unsubscribe
  ManagerRequest::Reconnect { id: ConnectionId }

  ManagerResponse::Subscribed
  ManagerResponse::Unsubscribed
  ManagerResponse::Event { event: Event }
  ManagerResponse::ReconnectInitiated { id: ConnectionId }

A new `distant-core/src/net/manager/data/event.rs` module defines:

- `EventTopic { All, Connection, Mount }` — subscribers filter on
  topics; `All` matches every variant present and future. `Mount` is
  reserved (no producers yet — Phase 1 of the mount-health work
  ships `Event::MountState` together with the typed `MountStatus`
  enum).
- `Event { ConnectionState { id, state } }` — a tagged event enum.
  Future variants (mount, tunnel, server status) plug in here
  without protocol additions.
- `Event::topic(&self) -> EventTopic` — used by the dispatcher in
  step 0h to filter pushed events for clients that subscribed to
  specific topics.

Wire shape (JSON):
  {"type":"subscribe","topics":["connection","mount"]}
  {"type":"event","event":{"type":"connection_state","id":7,"state":"reconnecting"}}

To make this protocol layer fully functional in step 0h:

- ConnectionState gains `Serialize`/`Deserialize` (snake_case) so
  it can ride the wire as part of `Event::ConnectionState`.
- ReconnectStrategy::initial_sleep_duration and adjust_sleep are
  promoted from private to `pub` so the orchestration in 0h can
  drive its own retry loop.

Stub handlers in `ManagerServer` return `Error` responses for the
new request variants; step 0h replaces them with the real
broadcast::channel + handle_reconnection wiring.
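The topic-filtering contract described above can be sketched as follows. The enum names come from the commit message; the `matches` helper and the `String` state field are simplifications for illustration (the real event carries a typed ConnectionState).

```rust
/// Subscribers filter on topics; All matches every variant,
/// present and future.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventTopic {
    All,
    Connection,
    Mount, // reserved until Phase 1 ships Event::MountState
}

/// Tagged event enum; future variants plug in without protocol changes.
#[derive(Debug, Clone)]
pub enum Event {
    ConnectionState { id: u64, state: String },
}

impl Event {
    /// Map an event to its topic, used by the dispatcher to filter
    /// pushed events per subscription.
    pub fn topic(&self) -> EventTopic {
        match self {
            Event::ConnectionState { .. } => EventTopic::Connection,
        }
    }
}

/// True when a subscriber with `topics` should receive `event`.
pub fn matches(topics: &[EventTopic], event: &Event) -> bool {
    topics.iter().any(|t| *t == EventTopic::All || *t == event.topic())
}
```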

Add handle_reconnection() to orchestrate plugin reconnection when a
connection dies. The death loop in ManagerServer::new now drives this
function instead of just logging the disconnect:

1. Read the connection's destination + options under a brief read lock.
2. Look up the plugin by scheme.
3. If reconnect_strategy() is Fail, broadcast Disconnected and stop.
4. Honor the `no_reconnect` option from the CLI flag (added in 0i).
5. Broadcast Reconnecting and enter the retry loop, sleeping per the
   plugin's strategy and timing each attempt against strategy.timeout().
6. On success, hot-swap the connection via ManagerConnection::replace_client
   and broadcast Connected. On exhaustion, broadcast Disconnected.

Add ManagerConnection::replace_client which aborts old request /
response / monitor tasks, mints a fresh action task with a new
request_tx, and spawns a new connection monitor with the death_tx.
Existing channels are invalidated by design — callers must re-open
them after replacement.

Add NonInteractiveAuthenticator: a no-prompt Authenticator used during
background reconnection. challenge() fails with PermissionDenied
(callers using key-file or ssh-agent auth never invoke it); verify()
auto-accepts host verification because the host was verified on the
original connect.

Wire the protocol stubs from step 0g into the real implementations:

- Subscribe { topics } now spawns a forwarder task that drains the
  broadcast bus, filters events by the requested topics
  (EventTopic::All matches everything), and pushes
  ManagerResponse::Event { event } back through the channel reply.
- Unsubscribe acks immediately. The forwarder task tied to the channel
  exits naturally when the reply stream closes; per-channel teardown
  while keeping the channel open is a future refinement.
- Reconnect { id } verifies the connection exists, then spawns
  handle_reconnection in the background and returns ReconnectInitiated.
  State transitions arrive later as Event::ConnectionState pushes.

Replace the placeholder publish helper from PR #288 with
publish_connection_state, which sends Event::ConnectionState into the
broadcast::Sender<Event> bus. ManagerServer gains an event_tx field;
broadcast::channel<Event> capacity is 16.

Cherry-picked from #288 (commit aa035a8) and
adapted to:

- Use the generic Subscribe/Event protocol from step 0g instead of
  the bespoke SubscribeConnectionEvents/ConnectionStateChanged.
- Coexist with the file-mount branch's mount + tunnel struct fields
  and request handlers.
- Match the existing reply_err helper and the file-mount tunnel +
  mount handler ordering inside the request match.

Imports the new ConnectionState serde tests and ReconnectStrategy
{initial_sleep_duration, adjust_sleep} unit tests from the upstream
commit, with separator-style comments stripped per the review thread.

Cherry-picked from #288 (commit c40c543), adapted
to use the generic Subscribe/Event protocol from step 0g instead of
the bespoke SubscribeConnectionEvents helpers PR #288 originally
shipped.

CLI helper changes:
- src/cli/common/client.rs: replace
  subscribe_and_display_connection_events(client, format) with
  subscribe_and_display_events(client, topics, format) accepting a
  Vec<EventTopic>. Long-running CLI commands (Shell, Api, Spawn,
  Ssh) now subscribe with [Connection, Mount] so a backgrounded
  mount drop surfaces in the same stderr/JSON stream as connection
  drops. JSON shape mirrors the wire format:
  {"type":"event","event":{"type":"connection_state",...}}.
- A new display_event() helper renders each Event variant in both
  Format::Shell and Format::Json, ready to be extended for the
  Event::MountState variant that lands in Phase 1.

ManagerClient API:
- ManagerClient::subscribe(topics) → io::Result<Mailbox<...>>:
  sends Subscribe { topics }, waits for the Subscribed ack, then
  returns the mailbox so callers don't see the ack mixed with
  events.
- ManagerClient::unsubscribe() → io::Result<()>: best-effort hint.
- ManagerClient::reconnect(id) → io::Result<()>: sends
  Reconnect { id }, waits for ReconnectInitiated. The actual state
  transitions arrive later as Event::ConnectionState pushes on any
  open subscription.

CLI command:
- distant client reconnect <id> uses ManagerClient::reconnect.
  Format::Json prints {"type":"reconnect_initiated","id":<id>};
  Format::Shell prints a Ui::success line.

Cherry-picked from #288 (commit a12a240).

Add --no-reconnect to Connect, Launch, and Ssh client subcommands
to disable automatic reconnection on connection loss. The flag is
plumbed through the options Map (`no_reconnect=true`) into the
manager's reconnection orchestration, where handle_reconnection
checks for it before doing any work and broadcasts Disconnected
straight away.

Add --heartbeat-interval and --max-heartbeat-failures to the
server Listen subcommand for configuring the heartbeat counter
introduced in step 0c.

Renamed notify_state_change → publish_connection_state at one
follow-on call site that came in with this commit's no_reconnect
check (the rest were renamed in step 0h).

Replace MountInfo.status: String with a typed MountStatus enum:

  pub enum MountStatus {
      Active,
      Reconnecting,
      Disconnected,
      Failed { reason: String },
  }

The state machine is documented inline. `Failed` is terminal — the
only exit is to unmount and remount. The vocabulary is deliberately
distinct from net::client::ConnectionState (Connected/Reconnecting/
Disconnected, no Failed) so the user can tell at a glance which
subsystem they're looking at and so mount-side terminal failures are
distinguishable from transient connection drops.

`#[serde(tag = "state", rename_all = "snake_case")]` keeps the wire
shape stable across the inner and outer state representations:

  {"state":"active"}
  {"state":"reconnecting"}
  {"state":"disconnected"}
  {"state":"failed","reason":"fuse session ended"}

Add Event::MountState { id, state: MountStatus } to the generic
event bus, with Event::topic() returning EventTopic::Mount. The
producer wires up in Phase 3 (per-mount monitor task) — Phase 1
just establishes the wire shape and the CLI rendering so the rest
of the work can plug in cleanly.

CLI changes:
- src/cli/commands/client.rs gains a format_mount_status helper
  for the shell rendering of `distant status --show mount`.
  Active/Reconnecting/Disconnected render as their lowercase
  variant names; Failed renders as `failed: <reason>` so the
  failure cause is visible in the same row as the mount.
- src/cli/common/client.rs::display_event learns the
  Event::MountState variant. Shell prints
  `[distant] mount N: failed (<reason>)`; JSON nests the
  serialized MountStatus inside the {"type":"event","event":...}
  envelope.

Round-trip tests cover both the new MountStatus serde shape and
the Event::MountState topic mapping.
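A minimal sketch of the enum and the shell rendering described above, with derive attributes and serde omitted (the enum definition itself is quoted from the commit message; the helper body is assumed from the rendering rules: lowercase variant names, `failed: <reason>` carrying the cause).

```rust
/// Mount status state machine; Failed is terminal.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MountStatus {
    Active,
    Reconnecting,
    Disconnected,
    Failed { reason: String },
}

/// Shell rendering for `distant status --show mount`.
pub fn format_mount_status(status: &MountStatus) -> String {
    match status {
        MountStatus::Active => "active".to_string(),
        MountStatus::Reconnecting => "reconnecting".to_string(),
        MountStatus::Disconnected => "disconnected".to_string(),
        MountStatus::Failed { reason } => format!("failed: {reason}"),
    }
}
```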

Add a `probe(&self) -> MountProbe` default method to the MountHandle
trait so the manager's per-mount monitor task (Phase 3) can poll
each backend for liveness without coupling to backend-specific
internals.

  pub enum MountProbe {
      Healthy,
      Degraded(String),
      Failed(String),
  }

The default impl returns `Healthy` so existing backends continue to
work without changes — Phase 4 wires up the real per-backend probes
(NFS server task alive, FUSE BackgroundSession alive, FP domain
registered + appex bootstrap, WCF watcher thread alive).

`probe` is `&self` (no `&mut`) so it can be called concurrently
with `unmount` and other operations: the monitor task locks an
Arc<Mutex<Option<Box<dyn MountHandle>>>> read-only.

Re-exported from distant_core::plugin alongside MountHandle and
MountPlugin.
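The default-method pattern can be sketched as below. The enum and the `&self` default returning `Healthy` follow the commit message; the probe-to-status mapping is an assumption about the later `probe_to_status` helper, shown here as strings rather than the real MountStatus type.

```rust
/// Backend liveness probe result.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MountProbe {
    Healthy,
    Degraded(String),
    Failed(String),
}

pub trait MountHandle {
    /// `&self` (not `&mut`) so the monitor task can probe concurrently
    /// with unmount and other operations.
    fn probe(&self) -> MountProbe {
        MountProbe::Healthy
    }
}

/// Assumed mapping used by the monitor task (probe_to_status).
pub fn probe_to_status(probe: MountProbe) -> &'static str {
    match probe {
        MountProbe::Healthy => "active",
        MountProbe::Degraded(_) => "reconnecting",
        MountProbe::Failed(_) => "failed",
    }
}

// An existing backend needs no changes: the default probe applies.
pub struct LegacyBackend;
impl MountHandle for LegacyBackend {}
```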

Wire the generic event bus from Step 0 into mount lifecycle. Each
managed mount now has a dedicated monitor task that polls the
backend's MountHandle::probe (Phase 2) and reacts to
Event::ConnectionState pushes for its underlying connection,
publishing Event::MountState transitions through the broadcast bus.

ManagedMount restructure:
- info: Arc<RwLock<MountInfo>> so the monitor can update status
  without blocking the outer self.mounts write lock.
- handle: Arc<Mutex<Option<Box<dyn MountHandle>>>> so the monitor
  can call probe(&self) under a brief read lock while the unmount
  path retains exclusive access via .lock().await.take().
- monitor: tokio::task::JoinHandle<()> aborted on unmount and on
  connection kill.

monitor_mount task body (top of server.rs after publish helpers):
- 5s tokio::time::interval (configurable via
  Config::mount_health_interval — defaults to DEFAULT_MOUNT_HEALTH_INTERVAL).
- tokio::select! between the ticker (calls probe) and an
  event_rx.subscribe() receiver that drains the broadcast bus
  filtered for Event::ConnectionState matching the monitor's
  connection_id.
- Two pure helpers map probe → status and connection state →
  status without holding any locks: probe_to_status,
  connection_state_to_mount_status.
- Failed status is terminal — the monitor logs and exits without
  publishing further events.
- Lagged broadcast receiver warnings are surfaced; closed bus
  causes the monitor to exit cleanly.
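The first of those pure helpers might look like the following std-only sketch; the enum shapes and exact signature are assumptions mirroring the commit text, not the real code in net/manager/server.rs:

```rust
#[derive(Debug, Clone, PartialEq)]
enum MountStatus {
    Active,
    Reconnecting,
    Disconnected,
    Failed { reason: String },
}

#[derive(Debug, Clone, PartialEq)]
enum MountProbe {
    Healthy,
    Degraded(String),
    Failed(String),
}

// Pure helper: maps a probe result to the next status to publish,
// or None when no transition should occur. No locks held.
fn probe_to_status(probe: &MountProbe, current: &MountStatus) -> Option<MountStatus> {
    match (probe, current) {
        // Healthy restores Active from transient states...
        (MountProbe::Healthy, MountStatus::Reconnecting)
        | (MountProbe::Healthy, MountStatus::Disconnected) => Some(MountStatus::Active),
        // ...but is a no-op when already Active and won't revive Failed.
        (MountProbe::Healthy, _) => None,
        // Degraded never changes state (only logged).
        (MountProbe::Degraded(_), _) => None,
        // Failed is terminal: don't re-trigger on an already-Failed mount.
        (MountProbe::Failed(_), MountStatus::Failed { .. }) => None,
        (MountProbe::Failed(reason), _) => {
            Some(MountStatus::Failed { reason: reason.clone() })
        }
    }
}

fn main() {
    assert_eq!(
        probe_to_status(&MountProbe::Healthy, &MountStatus::Reconnecting),
        Some(MountStatus::Active)
    );
    assert_eq!(probe_to_status(&MountProbe::Healthy, &MountStatus::Active), None);
    // Failed is terminal: Healthy won't revive it.
    assert_eq!(
        probe_to_status(&MountProbe::Healthy, &MountStatus::Failed { reason: "x".into() }),
        None
    );
    println!("ok");
}
```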

publish_mount_state helper sits next to publish_connection_state.

Mount handler now wraps info/handle in the new types and spawns
the monitor before inserting the ManagedMount into self.mounts.

Unmount handler aborts the monitor first, then takes the handle
out of the Mutex via .lock().await.take() before calling
handle.unmount(). If the monitor or another caller already took
the handle, it logs and continues.

List handler now snapshots Arc<RwLock<MountInfo>> values under the
outer read lock, then locks each individual info to clone it,
avoiding holding the outer lock across .await.

Latent kill-leak fix: ManagerServer::kill(id) now tears down every
mount whose connection_id matches. Previously, killing an
SSH/Host/Docker connection that had mounts on it would orphan the
mounts in the map with stale Active status — the kill code
followed the tunnel-cleanup pattern but missed mounts entirely.

Config gains mount_health_interval: Duration (default 5s) so the
monitor poll interval is tunable per-manager.

The cfg(test) test_config helper now delegates to Config::default
so future Config additions don't break tests.

All 2265 distant-core lib tests pass; mount integration smoke
test (status_should_show_active_mount on host_nfs) confirms
nothing regressed.
Wire the per-mount monitor task (Phase 3) up to real backend
liveness signals.

Add `is_alive()` to the concrete `core::MountHandle` (pub(crate)
on the wrapper side, used only by the trait impl). It returns
`true` while the outer background task has not yet completed:
this catches panics and premature exits, but not finer-grained
inner failures (a FUSE thread dying, or the NFS server task
panicking without the outer wrapper noticing). The doc comment
says so explicitly so future contributors don't expect more from it.

Wire `MountHandleWrapper::probe` in distant-mount/src/plugin.rs:

- For all backends: returns `Failed("mount task ended")` if the
  outer task has ended.

- For FileProvider: additionally calls
  `list_file_provider_domains()` and returns
  `Failed("FileProvider domain ... no longer registered")` if the
  domain has disappeared from the OS-side list (e.g. user toggled
  the File Providers setting). If the listing call itself errors,
  returns `Degraded(...)` rather than failing the mount — the
  OS API may be temporarily unavailable.

Granular per-backend signals (lifting NFS `server_task` into an
`AtomicBool`, watching `BackgroundSession.guard.is_finished()`,
checking the WCF watcher thread, etc.) are deferred. The current
coverage gives the monitor enough to react to wholesale mount
death; finer-grained probes can be added incrementally without
changing the monitor or the event bus shape.
Unit test layer for the mount health subsystem (distant-core):

- protocol::mount::mount_status_tests covers MountStatus serde:
  default is Active, every variant round-trips through JSON,
  Failed { reason } requires the reason field, and bogus state
  values fail to parse rather than silently default.

- net::manager::server::tests grows coverage for the per-mount
  monitor's pure helper functions:
  - probe_to_status (8 cases): Healthy is no-op when already
    Active, restores Active from Reconnecting/Disconnected, won't
    revive Failed; Degraded never changes state; Failed
    transitions Active to Failed with reason but doesn't
    re-trigger on already-Failed.
  - connection_state_to_mount_status (6 cases): Connected
    restores Reconnecting/Disconnected to Active and is no-op
    when already Active; Reconnecting only transitions Active;
    Disconnected transitions Active and Reconnecting but not
    Failed.
  - publish_mount_state happy path through the broadcast bus.

- Three end-to-end monitor_mount tests with a scripted test-double
  MountHandle (`ScriptedMountHandle`) that pops MountProbe values
  off a shared queue:
  - monitor_mount_publishes_failed_event_when_probe_returns_failed:
    a single Failed probe causes both an Event::MountState
    publish and an info status update to Failed.
  - monitor_mount_reacts_to_connection_state_event: publishing
    Event::ConnectionState::Reconnecting on the bus causes the
    monitor to transition the mount to Reconnecting and publish a
    matching Event::MountState.
  - monitor_mount_ignores_connection_state_for_other_connection:
    a ConnectionState event for a DIFFERENT connection_id leaves
    the mount untouched.
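A scripted test double of this kind can be built with just std; the name follows the commit text, but the queue-backed shape below is a sketch (including the fallback-to-Healthy behavior once the script runs out, which is an assumption):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
enum MountProbe {
    Healthy,
    Degraded(String),
    Failed(String),
}

// Test double: pops pre-scripted probe results off a shared queue.
// A Mutex provides the interior mutability needed because probe
// takes &self, matching the trait.
struct ScriptedMountHandle {
    script: Mutex<VecDeque<MountProbe>>,
}

impl ScriptedMountHandle {
    fn new(probes: Vec<MountProbe>) -> Self {
        Self { script: Mutex::new(probes.into_iter().collect()) }
    }

    fn probe(&self) -> MountProbe {
        self.script
            .lock()
            .unwrap()
            .pop_front()
            // Assumption: once the script is exhausted, report Healthy.
            .unwrap_or(MountProbe::Healthy)
    }
}

fn main() {
    let handle = ScriptedMountHandle::new(vec![
        MountProbe::Healthy,
        MountProbe::Failed("scripted failure".into()),
    ]);
    assert_eq!(handle.probe(), MountProbe::Healthy);
    assert_eq!(handle.probe(), MountProbe::Failed("scripted failure".into()));
    // Script exhausted: falls back to Healthy.
    assert_eq!(handle.probe(), MountProbe::Healthy);
    println!("ok");
}
```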

CLI integration test (HLT-05):

- tests/cli/mount/health.rs adds
  kill_should_remove_mounts_owned_by_connection. The test starts
  an isolated host manager, finds the connection_id via
  `distant status --show connection --format json`, mounts NFS on
  it, kills the connection via `distant kill <id>`, then polls
  `distant status --show mount` for up to 10s and asserts the
  mount is gone. Without the kill-leak fix in Phase 3 the
  manager's self.mounts map would still contain the mount with
  stale Active status — this is a regression test for that bug.

Total: 2291 distant-core lib tests pass; HLT-05 passes against a
fresh isolated manager. EVT-* and HLT-01..04 (which require
killing sshd / connection drops in the singleton harness) are
deferred to a follow-up.
Mark all 0a–0j and Phases 1–5 boxes as complete in PROGRESS.md
with the commit hashes for each. Phase 6 (this commit) is the
docs roll-up itself.

Update PRD.md status section to reflect 228/228 mount tests +
2291 distant-core lib tests passing, with the highlight bullet
list of what landed (generic Subscribe/Event protocol,
MountStatus enum, per-mount monitor, kill-leak fix, network
resilience stack from PR #288, CLI flags).

Document the deferred items (HLT-01..04, EVT-01..02, granular
per-backend probes, process audit, Windows VM testing) so the
next session has a clear list to work from.

docs/CHANGELOG.md gains an Unreleased section listing every
user-facing addition and the two breaking changes
(`MountInfo.status` enum, `kill(id)` cleans mounts).
Capture the friction observed during the Network Resilience +
Mount Health rollout (Phases 0–6, commits eb0747b6de03b1) and
turn each incident into a concrete next-slice phase.

PRD.md gains two new sections:

1. "Lessons from Phase 0–6 implementation (2026-04-07)" — the
   post-mortem inventory. Each subsection documents one
   incident, its root cause, and the phase that addresses it:
   - Stale singleton state was the #1 friction source (the
     wire-format change in Phase 1 caused silent "No mounts
     found" failures across every FP test until I manually
     pkill'd the singleton)
   - "No mounts found" panic messages were uninformative
   - Test harness compilation was fragile under feature subsets
   - Cherry-pick conflict resolution was lossy (separator
     comments, missed renames, wire-format field updates)
   - Tests didn't catch the orphan-mount latent bug
   - Background tasks vs foreground tasks vs timeouts
   - Build cycle was 10–30s of latency between commits
   - Test author boilerplate was too high (HLT-05 had two
     CLI subcommand typos on first attempt)
   - Flakes are masked by retries

2. "Plan: Test Quality & Stability" — Phases E–K with goals,
   agent usage, per-phase deliverables, and acceptance criteria.
   Phases are ordered by dependency:
   - Phase E: state hygiene (cleanup script, build-hash
     validation in singleton meta files, FP domain bulk reset)
   - Phase F: diagnostics (assert_mount_status! macro, singleton
     diagnostic dump, inline log tail in panic hook)
   - Phase G: test isolation (Owned-singleton scope, PID-locked
     sentinels, RAII tempdirs)
   - Phase H: coverage (wire-format fixtures, HLT-01..04 +
     EVT-01..02, cross-version compatibility, soak tests,
     per-backend probes, proptest round-trips)
   - Phase I: simplification (typed DistantCmd builder,
     fixture set, mock handles in test-harness, dev-fast
     profile + linker docs)
   - Phase J: CI (nextest profile tweaks, preflight script,
     test result triage)
   - Phase K: documentation (TESTING.md additions, CLAUDE.md
     test author checklist)

Each phase ties back to a specific session incident, draws from
industry practices (gitoxide pack format snapshots, tokio
proptest codec testing, nextest retry policies), and never
removes existing coverage.

PROGRESS.md gains the corresponding checklist under
"Phases E–K — Test Quality & Stability (next slice)" with one
checkbox per sub-phase, cross-referenced to the PRD section.

Phase 6 remains the final commit of the previous slice; this
docs update is the bridge into the next slice.
After 30+ minutes of dedicated research into nextest internals,
ctor/dtor crates, process supervision patterns, and the actual
harness code (3 parallel agents producing 3000+ lines of
research notes archived under ~/.claude/plans/), the conclusion
is that the singleton-for-everything model is the bug.

**The earlier draft of Phases E–K is REPLACED with a smaller,
sharper plan** built around two architectural changes:

1. **Per-test ephemeral fixtures for the 80% case.** 37 of 39
   mount tests don't need a singleton — they only got one
   because spawning a fresh manager+server per test was slow.
   With command-group + pdeathsig (Linux) + kqueue NOTE_EXIT
   (macOS) + the existing tempfile RAII, per-test cost is
   ~100ms and SIGKILL handling is automatic.

2. **A tiny Ryuk-style sidecar reaper for the FP appex** (the
   one true singleton — macOS allows one File Provider
   extension instance per bundle ID per machine). Connection
   lease lifecycle, schema-hash in socket path, self-heals
   stale state on startup.

3. **Schema-hash baked into all singleton paths** so binaries
   from different wire formats automatically use different
   paths. Wire-format mismatch becomes structurally impossible
   — no silent "No mounts found" failures, no manual cleanup.
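The path-derivation idea can be sketched with std's hasher. Everything here is illustrative: the plan bakes a compile-time WIRE_SCHEMA_HASH constant into the path, whereas this sketch hashes a schema description string at runtime (and DefaultHasher is only stable within one compiler version, so a real implementation would use a fixed hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

// Stand-in for the compile-time constant: derived from a description
// of the wire format so it changes whenever the wire format does.
fn wire_schema_hash(schema_description: &str) -> u64 {
    let mut h = DefaultHasher::new();
    schema_description.hash(&mut h);
    h.finish()
}

// Bake the hash into the singleton socket path so binaries built
// against different wire formats can never share a socket.
fn singleton_socket_path(base: &str, schema_description: &str) -> PathBuf {
    PathBuf::from(format!(
        "{}/distant-{:016x}.sock",
        base,
        wire_schema_hash(schema_description)
    ))
}

fn main() {
    let v1 = singleton_socket_path("/tmp", "MountInfo{status:String}");
    let v2 = singleton_socket_path("/tmp", "MountInfo{status:MountStatus}");
    // Different schemas -> different paths: mismatch becomes structural.
    assert_ne!(v1, v2);
    // Same schema -> stable path within the same build.
    assert_eq!(v1, singleton_socket_path("/tmp", "MountInfo{status:String}"));
    println!("{}", v1.display());
}
```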

**PRD.md gains:**

- "Test Architecture Today (2026-04-07)" section with 6 ASCII
  diagrams: how a mount test runs today, where it breaks (10
  failure modes inventoried), the orphaned-process tree on a
  typical run, the test inventory by mount-source pattern, the
  proposed architecture, and what changes per test category.
- "Plan: Test Quality & Stability (revised 2026-04-07)" that
  REPLACES the earlier draft. New phases:
  - E (50 LOC, 1 day) — `#[serde(other)]` fallback variants on
    every wire enum + compile-time WIRE_SCHEMA_HASH constant
  - F (30 LOC, 1 day) — schema-hash baked into singleton paths
  - G (400 LOC, 2 days) — distant-test-reaper sidecar binary +
    FpFixtureLease test-side struct
  - H (600 LOC, 3 days) — MountedHost/MountedSsh/MountedDocker
    fixtures using command-group + --watch-parent flag on
    distant manager/server (Linux pdeathsig, macOS kqueue)
  - I (500 LOC, 2 days) — DistantCmd builder, assert_mount_status!
    macro, mock MountHandle in test-harness, dev-fast profile,
    panic-hook log dump
  - J (800 LOC, 4 days) — wire-format frozen fixtures, HLT-01..04
    + EVT-01..02, soak tests, per-backend probe tests, proptest
    round-trips
  - K (150 LOC, 1 day) — tighter nextest profile, optional test
    report
  - L (200 LOC, 1 day) — TESTING.md + CLAUDE.md updates
- Explicit "What we DROP from the previous draft" reconciliation
  table explaining why each old item is replaced or obviated.
- Validation checklist: 5 end-to-end scenarios that must pass
  without manual intervention before the refactor is complete
  (idempotent reruns, SIGKILL recovery, wire-format mismatch
  isolation, cargo test parity, SIGINT cleanup).

**PROGRESS.md gains** the corresponding revised checklist
under "Phases E–L", including an explicit "DROPPED from the
previous draft" section listing the eight items that don't
carry forward (cleanup scripts, build-hash sentinels,
owned-singleton opt-in, PID-locked sentinels, cross-version
compat tests, FP domain bulk reset, MountTempDir panic-hook,
preflight script).

Total estimated LOC for the revised plan: ~2730 (delta against
current harness ~+2000 net, since the per-test fixtures replace
~700 LOC of singleton machinery).
Msg<T> wraps every request/response on the wire. It was previously
derived with #[serde(untagged)], which meant any failure inside
T::deserialize (unknown variant, unknown field, wrong type) got
collapsed to the generic "data did not match any variant of untagged
enum Msg" error, hiding the real cause.

Replace the derived Deserialize with a hand-written impl that
dispatches via deserialize_any + Visitor: visit_seq -> Msg::Batch,
visit_map -> Msg::Single. When T::deserialize fails, the real inner
error propagates unchanged.

Narrows Msg<T> to map/seq payloads (the only shapes used in
production: Msg<Request> and Msg<Response> are internally-tagged
struct enums that always serialize as maps). Two existing tests that
round-tripped Msg<String> scalar payloads are updated to use a struct
fixture.

New failure_paths submodule adds 8 regression tests covering the
real production type (Msg<protocol::Request>), deny_unknown_fields
interaction, batch-element failures, and round-trip preservation
across JSON and MessagePack.

The on-wire bytes are unchanged - Serialize is still derived with
#[serde(untagged)].
deserialize_from_slice previously returned just
"Deserialize failed: <raw rmp_serde error>" with no indication of
which type was being decoded or how large the payload was. When the
underlying error was an untagged-enum collapse or a terse serde
message, there was no way to locate the failing call site from logs.

Enrich the io::Error to include std::any::type_name::<T>() and the
slice length, producing messages like:
  "Failed to deserialize <fully::qualified::Type> from 1234 bytes: <e>"

Every caller that forwards the io::Error upward (UntypedRequest /
UntypedResponse decode paths, FramedTransport::read_frame_as, the
authentication macro, all packet::*::from_slice helpers) automatically
inherits the enriched context. Combined with sub-phase 1's custom
Msg<T> Deserialize, a failing decode now tells you both which type
and which specific variant failed.

Also adds a doc comment with an Errors section on the helper, closing
a pre-existing docs gap.
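The enrichment step amounts to wrapping the inner error with the target type's name and the payload size. A std-only sketch (the helper name and exact wording are assumptions; the real code sits inside deserialize_from_slice):

```rust
use std::any::type_name;
use std::fmt::Display;
use std::io;

// Hypothetical helper: attach the target type and payload length to a
// deserialize failure so logs identify the failing call site.
fn deserialize_error<T>(len: usize, inner: impl Display) -> io::Error {
    io::Error::new(
        io::ErrorKind::InvalidData,
        format!(
            "Failed to deserialize {} from {} bytes: {}",
            type_name::<T>(),
            len,
            inner
        ),
    )
}

fn main() {
    let err = deserialize_error::<Vec<u8>>(1234, "invalid marker byte");
    // The message now carries both the type and the payload size.
    assert!(err.to_string().contains("Vec<u8>"));
    assert!(err.to_string().contains("1234 bytes"));
    println!("{err}");
}
```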
The server receive loop used to log decode failures with
String::from_utf8_lossy over the raw MessagePack bytes, gated behind
log_enabled!(Debug) so it only appeared with --log-level debug. At
info level, the failure surfaced as just "Invalid request: <terse
serde error>" with no way to correlate with the raw payload.

Add a hex_preview helper in net/common/utils that renders the first
64 bytes of a slice as lowercase hex via hex::encode, appending "..."
when truncated. Safe for binary data, no lossy UTF-8.

Rewrite both decode-error arms of the server receive loop to:
  - always fire at error! level (no log_enabled! gate)
  - include the byte length of the payload
  - include a hex preview via utils::hex_preview
  - drop the lossy String::from_utf8_lossy dump

The remaining log_enabled!(Debug) gate on the happy-path "New request"
log and the log_enabled!(Trace) gate on the heartbeat loop are
untouched - they are happy-path diagnostics, not errors.
Mirror the server-side rewrite from the previous commit for the client
receive path in map_to_typed_mailbox. The old code gated the raw-
payload dump behind log_enabled!(Trace) and used lossy UTF-8 over
binary MessagePack; the "always-on" error! line carried only the
target type name and the terse serde error.

Replace with a single error! call that always fires and includes
target type, byte length, a hex preview of the payload via
utils::hex_preview, and the real inner deserialize error. After this
commit, both ends of the wire produce the same information-rich
decode-error format at info log level.
Records the four-commit slice that rescoped Phase E+F into a targeted
fix for buried deserialize errors in the wire protocol. Documents the
Msg<T> custom Deserialize, enriched deserialize_from_slice, hex_preview
helper, and rewritten server/client receive logging.
Successfully merging this pull request may close these issues.

Support user-level file system mounting