ZFS 2.1.6 on Monterey 100% CPU usage when eSATA disks drop out #124

Open
kneutron opened this issue Nov 10, 2023 · 1 comment

@kneutron

System information

Type                  Version/Name
Distribution Name     macOS Monterey
Distribution Version  12.7.1
macOS Kernel          21.6.0
Architecture          AMD64 / Mac Pro ("trashcan") late 2013, 3.5 GHz 6-core Intel Xeon E5, 32 GB RAM
ZFS Version           zfs-macOS-2.1.6-1 / zfs-kmod-2.1.6-1
SPL Version           n/a

Describe the problem you're observing

PROBOX 4-bay enclosure (a several-years-old revision), connected via eSATA through an expansion dock over a Thunderbolt 2 cable, holding 3 disks: 2x 4 TB Seagate NAS and 1x 6 TB Toshiba NAS, all 3.5-inch spinners.

The disks dropped out during a large chown operation on a dataset, caused by a bad eSATA cable. All CPU cores went to 100%. The zed log had entries but no timestamps; one entry documented the disks dropping out:
[ Excerpt from zed log ]
Invoking "all-syslog.sh" eid=29 pid=768
Invoking "snapshot_mount.sh" eid=29 pid=769
Diagnosis Engine: discarding 'resource.fs.zfs.snapshot_mount for vdev 0
Finished "all-syslog.sh" eid=29 pid=768 time=0.007513s exit=0
Finished "snapshot_mount.sh" eid=29 pid=769 time=0.022611s exit=127
Invoking "all-syslog.sh" eid=30 pid=777

Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11771030784008253581
Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Retire Agent: matched vdev 11771030784008253581

Finished "all-syslog.sh" eid=30 pid=777 time=0.008458s exit=0
Invoking "all-syslog.sh" eid=31 pid=779
Invoking "data-notify.sh" eid=31 pid=780
Finished "data-notify.sh" eid=31 pid=780 time=0.007834s exit=3
Invoking "all-syslog.sh" eid=32 pid=783
Invoking "data-notify.sh" eid=32 pid=784
Finished "all-syslog.sh" eid=31 pid=779 time=0.014180s exit=0
Finished "data-notify.sh" eid=32 pid=784 time=0.008389s exit=3
Invoking "all-syslog.sh" eid=33 pid=786
Invoking "data-notify.sh" eid=33 pid=788
Finished "all-syslog.sh" eid=32 pid=783 time=0.014474s exit=0
Finished "data-notify.sh" eid=33 pid=788 time=0.007542s exit=3
Missed 1 events

All 3 zpools (each a single-disk pool, no mirrors or raidz) went into the suspended state, but zfs/zpool commands still responded. The system was able to shut down cleanly from the command line without a hard reset, recovered/rebooted nicely with a replacement eSATA cable, and the chown completed OK. The system has run normally with the good cable; I have not run a scrub yet.
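For reference, the usual recovery sequence after a transient device drop suspends a pool looks roughly like the following (a sketch only; "tank" is a placeholder pool name, and zpool clear may block or fail until the missing disks are actually back):

# See which pools are unhealthy and why
sudo zpool status -x

# After reseating/replacing the cable, clear the suspended/error state
sudo zpool clear tank

# Then verify on-disk integrity
sudo zpool scrub tank
sudo zpool status -v tank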

Describe how to reproduce the problem

I was going to toss the bad eSATA cable, but I can hold onto it and ship it for testing if you like. Whether you want the Probox as well is up to you.

Include any warnings/errors/backtraces from the system logs

Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11771030784008253581
Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Retire Agent: matched vdev 11771030784008253581

The zfs zed error log is currently ~15 MB, which is why I posted an excerpt. I can provide the full log if necessary.

Expected behavior: timestamps in the zed log, entries documenting when the disk(s) dropped out and any corrective action the zed daemon took, and CPU cores not redlining at 100%.
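As a partial workaround for the missing timestamps, zpool events keeps a timestamped history of the same events zed consumes, so the drop-out time should be recoverable from it (assuming the kernel event queue had not overflowed; note the "Missed 1 events" line above):

# Timestamped list of recent ZFS events, including vdev removals
sudo zpool events

# Verbose form with full event payloads
sudo zpool events -v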

@lundman (Contributor) commented Nov 10, 2023

The best thing would be to run sudo spindump while it is happening; it will show which thread(s) are busy and what they are doing.
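For anyone reproducing this, a minimal invocation per the macOS spindump man page might look like the following (<pid> is a placeholder; with no target, spindump samples all processes and prints the report path when it finishes):

# Sample everything for the default duration
sudo spindump

# Or target one busy process by pid for 10 seconds
sudo spindump <pid> 10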
