ZFS 2.1.6 on Monterey 100% CPU usage when eSATA disks drop out #124

Open
kneutron opened this issue Nov 10, 2023 · 1 comment

@kneutron

System information

Type                  Version/Name
Distribution Name     macOS Monterey
Distribution Version  12.7.1
macOS Kernel          21.6.0
Architecture          AMD64 / Mac Pro ("trashcan") late 2013, 3.5 GHz 6-core Intel Xeon E5, 32 GB RAM
ZFS Version           zfs-macOS-2.1.6-1 / zfs-kmod-2.1.6-1
SPL Version           n/a

Describe the problem you're observing

PROBOX 4-bay enclosure (a several-years-old revision), connected via eSATA through an expansion dock over a Thunderbolt 2 cable, holding 3 disks: 2x 4 TB Seagate NAS and 1x 6 TB Toshiba NAS, all 3.5-inch spinners.

The disks dropped out during a large chown operation on a dataset, caused by a bad eSATA cable. All CPU cores went to 100%. The zed log had entries but no timestamps; one entry documented the disks dropping out:
[ Excerpt from zed log ]
Invoking "all-syslog.sh" eid=29 pid=768
Invoking "snapshot_mount.sh" eid=29 pid=769
Diagnosis Engine: discarding 'resource.fs.zfs.snapshot_mount for vdev 0
Finished "all-syslog.sh" eid=29 pid=768 time=0.007513s exit=0
Finished "snapshot_mount.sh" eid=29 pid=769 time=0.022611s exit=127
Invoking "all-syslog.sh" eid=30 pid=777

Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11771030784008253581
Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Retire Agent: matched vdev 11771030784008253581

Finished "all-syslog.sh" eid=30 pid=777 time=0.008458s exit=0
Invoking "all-syslog.sh" eid=31 pid=779
Invoking "data-notify.sh" eid=31 pid=780
Finished "data-notify.sh" eid=31 pid=780 time=0.007834s exit=3
Invoking "all-syslog.sh" eid=32 pid=783
Invoking "data-notify.sh" eid=32 pid=784
Finished "all-syslog.sh" eid=31 pid=779 time=0.014180s exit=0
Finished "data-notify.sh" eid=32 pid=784 time=0.008389s exit=3
Invoking "all-syslog.sh" eid=33 pid=786
Invoking "data-notify.sh" eid=33 pid=788
Finished "all-syslog.sh" eid=32 pid=783 time=0.014474s exit=0
Finished "data-notify.sh" eid=33 pid=788 time=0.007542s exit=3
Missed 1 events

All 3 zpools (each a single-disk pool, no mirrors or raidz) went into the suspended state, but zfs/zpool commands still responded. The system was able to shut down cleanly from the command line without a hard reset, recovered/rebooted nicely with a replacement eSATA cable, and the chown completed OK. The system has run normally with the good cable; I have not run a scrub yet.
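For reference, the usual recovery sequence after a transient device drop suspends a pool looks roughly like the following (a sketch only; "tank" is a placeholder pool name, and zpool clear may block or fail until the missing disks are actually back):

# See which pools are unhealthy and why
sudo zpool status -x

# After reseating/replacing the cable, clear the suspended/error state
sudo zpool clear tank

# Then verify on-disk integrity
sudo zpool scrub tank
sudo zpool status -v tank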

Describe how to reproduce the problem

I was going to toss the bad eSATA cable, but I can hold onto it and ship it for testing if you like. Whether you want the Probox as well is up to you.

Include any warnings/errors/backtraces from the system logs

Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11771030784008253581
Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Retire Agent: matched vdev 11771030784008253581

The zfs zed error log is currently ~15 MB, which is why I posted an excerpt. I can provide the full log if necessary.

Expected behavior: timestamps in the zed log, entries documenting when the disk(s) dropped out and any corrective action the zed daemon took, and CPU cores not redlining at 100%.
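As a partial workaround for the missing timestamps, zpool events keeps a timestamped history of the same events zed consumes, so the drop-out time should be recoverable from it (assuming the kernel event queue had not overflowed; note the "Missed 1 events" line above):

# Timestamped list of recent ZFS events, including vdev removals
sudo zpool events

# Verbose form with full event payloads
sudo zpool events -v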

@lundman (Contributor) commented Nov 10, 2023

The best thing would be to run sudo spindump while it is happening; it will show which thread(s) are busy and what they are doing.
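For anyone reproducing this, a minimal invocation per the macOS spindump man page might look like the following (<pid> is a placeholder; with no target, spindump samples all processes and prints the report path when it finishes):

# Sample everything for the default duration
sudo spindump

# Or target one busy process by pid for 10 seconds
sudo spindump <pid> 10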
