hddtemp_smartctl: configure warning and critical temps per device #1560

Open
wants to merge 1 commit into
base: master
44 changes: 32 additions & 12 deletions plugins/node.d/hddtemp_smartctl
@@ -15,16 +15,20 @@ the harddrive devices.

The following environment variables are used

smartctl - path to smartctl executable
drives - List drives to monitor. E.g. "env.drives hda hdc".
type_$dev - device type for one drive, e.g. "env.type_sda 3ware,0"
or more typically "env.type_sda ata" if sda is a SATA disk.
args_$dev - additional arguments to smartctl for one drive,
e.g. "env.args_hda -v 194,10xCelsius". Use this to make
the plugin use the --all or -a option if your disk will
not return its temperature when only the -A option is
used.
dev_$dev - monitoring device for one drive, e.g. twe0
smartctl - path to smartctl executable
drives - List drives to monitor. E.g. "env.drives hda hdc".
type_$dev - device type for one drive, e.g. "env.type_sda 3ware,0"
or more typically "env.type_sda ata" if sda is a SATA disk.
args_$dev - additional arguments to smartctl for one drive,
e.g. "env.args_hda -v 194,10xCelsius". Use this to make
the plugin use the --all or -a option if your disk will
not return its temperature when only the -A option is
used.
dev_$dev - monitoring device for one drive, e.g. twe0
$dev.warning - set warning temperature for $dev, default 57 (°C)
e.g. "env.nvme0n1.warning 70"
$dev.critical - set critical temperature for $dev, default 60 (°C),
e.g. "env.nvme0n1.critical 80"

If the "smartctl" environment variable is not set the plugin will
search your $PATH, /usr/bin, /usr/sbin, /usr/local/bin and
@@ -46,6 +50,9 @@ All rights reserved.
2016-08-27, Gabriele Pohl ([email protected])
Fix for github issue #690

2023-07-23, Andreas Perhab, WT-IO-IT GmbH
enable configuring warning and critical temps

=head1 LICENSE

Redistribution and use in source and binary forms, with or without
@@ -94,6 +101,11 @@ parameter. If this parameter isn't supported by your version of
smartctl then hdparm will be used. Note that hdparm isn't available
on all platforms.

For nvme disks you can get the warning and critical temperatures with the following command, that should report wctemp
and cctemp:

sudo nvme id-ctrl -H /dev/nvme0

=cut

use File::Spec::Functions qw(splitdir);
@@ -227,8 +239,16 @@ if (defined $ARGV[0]) {
my @dirs = splitdir($_);
print $d . ".label " . $dirs[-1] . "\n";
print $d . ".max 100\n";
print $d . ".warning 57\n";
print $d . ".critical 60\n";
my $warning = "57";
if (defined($ENV{$d . ".warning"})) {
$warning = $ENV{$d . ".warning"};
}
print $d . ".warning $warning\n";
my $critical = "60";
if (defined($ENV{$d . ".critical"})) {
$critical = $ENV{$d . ".critical"};
}
print $d . ".critical $critical\n";
my $id = get_drive_id($_, device_for_drive($_), $use_nocheck);
Contributor

Hi, thanks. This is good.

Could you please add `my $cfn = clean_fieldname($_);` at your line 237 and use it in place of `clean_fieldname($_)` in the rest of the patch?

I'm not sure that we should introduce default temperature warning/critical levels here. The temperatures you chose are fairly sane but too narrow compared to the operating temperature of some of my disks. The first of my disks I checked has an "operating" envelope from 0 to 65 and a non-operating one from -40 to 70. I don't know whether other disks are less or more temperature tolerant.

Having an env.warning and env.critical to use as defaults is entirely sane.

Munin::Plugin has an API to support this, print_thresholds, but there is no need to use it.
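
The fallback order discussed here (per-device setting, then a plugin-wide default, then the built-in value) could be sketched roughly as follows. This is only an illustration, not the code in the patch: `threshold_for` and `%BUILTIN` are made-up names, the plugin-wide `warning`/`critical` variables are the suggestion from this comment rather than something the plugin reads today, and it deliberately does not use `print_thresholds`.

```perl
use strict;
use warnings;

# Built-in fallbacks; 57/60 are the values the plugin already prints.
my %BUILTIN = (warning => 57, critical => 60);

# Resolve a threshold for one field (hypothetical helper):
#   1. per-device setting, e.g. env.nvme0n1.warning
#   2. plugin-wide setting, e.g. env.warning (assumed, not in the patch)
#   3. built-in default
sub threshold_for {
    my ($field, $kind, $env) = @_;
    $env //= \%ENV;    # allow a stand-in hash for testing
    for my $key ("$field.$kind", $kind) {
        return $env->{$key}
            if defined $env->{$key} && length $env->{$key};
    }
    return $BUILTIN{$kind};
}

# Example with a stand-in environment instead of the real %ENV.
my %env = ('nvme0n1.warning' => 70, 'critical' => 65);
for my $field (qw(nvme0n1 sda)) {
    for my $kind (qw(warning critical)) {
        print "$field.$kind ", threshold_for($field, $kind, \%env), "\n";
    }
}
```

With the stand-in environment above, only `nvme0n1.warning` takes the per-device value; both critical lines pick up the plugin-wide 65, and `sda.warning` falls back to the built-in 57.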

Contributor Author

I will update the PR later today with the $cfn suggestion. I will also add documentation on how to get the warning and critical temperatures with nvme-cli (nvme).

The warning 57 and critical 60 values were not chosen by me; they are already present in Munin. I just kept them for backwards compatibility.
We mainly use this to fix the warning and critical values for SSDs (which can be checked with `sudo nvme id-ctrl -H /dev/nvme0`), as they often go past the 60°C Munin currently uses as the critical temperature (which I guess was chosen for spinning disks).

Contributor Author

Updated to use the `$d` variable already present in the current Munin master (I missed that because I started the patch from the Munin version in Debian).

print $d . ".info $id\n";
}