-
Notifications
You must be signed in to change notification settings - Fork 44
Support bundles should include log files #7973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support bundles should include log files #7973
Conversation
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Using a4x2 I was able to successfully capture a support bundle that includes logs. We may want to consider the use of zstd for the log zips as it appears the macOS and illumos unzip -l bundle.zip
|
Created using spr 1.3.6-beta.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @papertigers, this is great! I've got a few small comments and suggestions, though overall it looks good to me. Thanks for adding the tests, those are excellent.
@@ -1222,8 +1222,8 @@ impl Zfs { | |||
snap_name: &'a str, | |||
properties: &'a [(&'a str, &'a str)], | |||
) -> Result<(), CreateSnapshotError> { | |||
let mut command = std::process::Command::new(ZFS); | |||
let mut cmd = command.arg("snapshot"); | |||
let mut command = std::process::Command::new(PFEXEC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why did we not need pfexec
before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think because as far as I can tell zone bundles and the sled diagnostics creates are the only consumers. I ran into the issue because I was running cargo nextest run
and I saw permission denied, where as sled-agent is running as root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this may be a "usable in tests" things -- adding the pfexec is harmless, but without it, we cannot run integration tests
illumos-utils/src/zfs.rs
Outdated
/// Note that if this is called on the root dataset such as | ||
/// `rpool/ROOT/<BE>` it will return "legacy/.zfs/snapshot/<SNAP_NAME>". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe provide a bit more context here, like why this is important or what one does with that information. Is the point that this isn't a full path anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: What is a "root dataset"? This is on an impl Snapshot
, so which snapshots is this true / not true for?
illumos-utils/src/zfs.rs
Outdated
pub fn full_path(&self) -> Result<Utf8PathBuf, GetValueError> { | ||
// TODO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to track this in an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed as #8023
method = GET, | ||
path = "/support/logs/zones", | ||
}] | ||
async fn support_logs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about what this does based on the name, given that it differs from the API path. Is this returning the names of the zones that have logs? Maybe add a docstring to help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 23430d7
method = GET, | ||
path = "/support/logs/download/{zone}", | ||
}] | ||
async fn support_logs_download( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here, a docstring would be very helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 23430d7
sled-diagnostics/src/logs.rs
Outdated
// 5 is an arbitrary amount of logs to grab, if we find that we are | ||
// missing important information while debugging we should bump this | ||
// value. | ||
for file in archived.iter().rev().take(5) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we provide this as an argument to this function? It's probably worthwhile to plumb that all the way to the API as well. It could default to 5, but I'm sure we're going to find situations where we're missing data. It'd be unfortunate to require a redeployment of sled-agent to get at that data.
sled-diagnostics/src/logs.rs
Outdated
snapshot_logfile: &Utf8Path, | ||
) -> Result<(), LogError> { | ||
let Some(log_name) = snapshot_logfile.file_name() else { | ||
debug!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a warning or error condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to a warning. This is not a hard error as I want to collect as many logs as we can in a support bundle rather than bailing out early.
zip_path, | ||
FullFileOptions::default() | ||
.compression_method(zip::CompressionMethod::Zstd) | ||
.compression_level(Some(3)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason not to maximize the compression level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some very minor comparisons locally that took into account speed vs size. So we are optimizing for speed here as the final compression doesn't matter as we are unpacking this zip on the nexus side before nexus assembles the final support bundle.
sled-diagnostics/src/logs.rs
Outdated
}; | ||
|
||
// We grab the first part of a log file which is prefixed with the log | ||
// type so that we gague our interest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling nit: "gauge"
sled-diagnostics/src/logs.rs
Outdated
if cockroach_log_prefix.contains(prefix) { | ||
let entry = interested | ||
.entry(prefix) | ||
.or_insert(CockroachExtraLog::default()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: or_default()
should work here.
Created using spr 1.3.6-beta.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - I echo all of @bnaecker 's comments!
@@ -1222,8 +1222,8 @@ impl Zfs { | |||
snap_name: &'a str, | |||
properties: &'a [(&'a str, &'a str)], | |||
) -> Result<(), CreateSnapshotError> { | |||
let mut command = std::process::Command::new(ZFS); | |||
let mut cmd = command.arg("snapshot"); | |||
let mut command = std::process::Command::new(PFEXEC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this may be a "usable in tests" things -- adding the pfexec is harmless, but without it, we cannot run integration tests
illumos-utils/src/zfs.rs
Outdated
/// Note that if this is called on the root dataset such as | ||
/// `rpool/ROOT/<BE>` it will return "legacy/.zfs/snapshot/<SNAP_NAME>". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: What is a "root dataset"? This is on an impl Snapshot
, so which snapshots is this true / not true for?
})??; | ||
|
||
// Cleanup the zip file since we no longer need it | ||
if let Err(e) = tokio::fs::remove_file(&output_file).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm: We clean this up so it doesn't get included in the final zipfile, right?
(I'm just confirming the "cancel safety" of collect_bundle_as_file
- even if we left this file around and cancelled the future, we'd still end up deleting the whole directory, so that would be fine?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's the idea.
@@ -790,6 +816,80 @@ async fn sha2_hash(file: &mut tokio::fs::File) -> anyhow::Result<ArtifactHash> { | |||
Ok(ArtifactHash(digest.as_slice().try_into()?)) | |||
} | |||
|
|||
/// For a given zone, save a zip of its log files into a support bundle path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document the "path" argument here, as the path to a per-sled directory that will end up zipfile.
sled-diagnostics/src/logs.rs
Outdated
"{SLED_DIAGNOSTICS_SNAPSHOT_PREFIX}{}", | ||
thread_rng() | ||
.sample_iter(Alphanumeric) | ||
.take(6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So:
- We aren't retrying here
- Multiple zones are taking snapshots concurrently
- https://en.wikipedia.org/wiki/Birthday_problem exists
What do you think about bumping up this number slightly higher from 6 to like, 12?
I think the odds here are still pretty unlikely to collide (I think there are 56800235584 possible file names with this current config) but increasing the length here from 6 -> 12 would make a collision have roughly the same likelihood as a UUIDv4 collision, which we consider "unlikely enough to not care about".
sled-diagnostics/src/logs.rs
Outdated
/// e.g. `/pool/ext/<UUID>/crypt/zone/<ZONE_NAME>/root/var/log/svc/<LOGFILE>` | ||
Current, | ||
/// Logs that have been archived by sled-agent into a debug dataset. | ||
/// e.g. `/pool/ext/<UUID>/crypt/debug/<ZONE_NAME/<LOGFILE>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// e.g. `/pool/ext/<UUID>/crypt/debug/<ZONE_NAME/<LOGFILE>` | |
/// e.g. `/pool/ext/<UUID>/crypt/debug/<ZONE_NAME>/<LOGFILE>` |
sled-diagnostics/src/logs.rs
Outdated
&name, | ||
&[SLED_DIAGNOSTICS_ZFS_PROPERTY_NAME], | ||
Some(illumos_utils::zfs::PropertySource::Local), | ||
) else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this was copied from the zone bundle code, but we probably want to match on this result and log the error, in this "else" case.
sled-diagnostics/src/logs.rs
Outdated
} | ||
|
||
#[derive(Debug)] | ||
struct SnapshotPermit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related: Why are these called permits?
sled-diagnostics/src/logs.rs
Outdated
let snapshots = get_sled_diagnostics_snapshots(zfs_filesystem); | ||
assert_eq!(snapshots.len(), 1, "duplicate snapshots not taken"); | ||
|
||
// // Free all of the permits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// // Free all of the permits | |
// Free all of the permits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just a couple nits
// We log any errors saving the zip file to disk and | ||
// continue on. | ||
if let Err(e) = log_collection_result { | ||
error!(&self.log, "failed to write logs outut: {e}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error!(&self.log, "failed to write logs outut: {e}"); | |
error!(&self.log, "failed to write logs output: {e}"); |
} | ||
} | ||
Err(err) => { | ||
tokio::fs::write(path.join("{zone}.logs.err"), err.to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will generate a literal {zone}
.
tokio::fs::write(path.join("{zone}.logs.err"), err.to_string()) | |
tokio::fs::write(path.join(format!("{zone}.logs.err")), err.to_string()) |
sled-diagnostics/src/logs.rs
Outdated
/// Cleanup snapshots that may have been left around due to unknown | ||
/// circumstances such as a crash. | ||
pub fn cleanup_snapshots(&self) { | ||
let diagnostic_snapshots = Zfs::list_snapshots().unwrap().into_iter().filter (|snap| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to assume Zfs::list_snapshots
will never return an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jinx! This was copied from zone bundles which also calls unwrap. I am not opposed to logging an error and allowing sled-agent to still come up in the face of errors, this seems like the right approach.
Created using spr 1.3.6-beta.1
@@ -1222,8 +1222,8 @@ impl Zfs { | |||
snap_name: &'a str, | |||
properties: &'a [(&'a str, &'a str)], | |||
) -> Result<(), CreateSnapshotError> { | |||
let mut command = std::process::Command::new(ZFS); | |||
let mut cmd = command.arg("snapshot"); | |||
let mut command = std::process::Command::new(PFEXEC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think because as far as I can tell zone bundles and the sled diagnostics creates are the only consumers. I ran into the issue because I was running cargo nextest run
and I saw permission denied, where as sled-agent is running as root.
sled-diagnostics/src/logs.rs
Outdated
/// Cleanup snapshots that may have been left around due to unknown | ||
/// circumstances such as a crash. | ||
pub fn cleanup_snapshots(&self) { | ||
let diagnostic_snapshots = Zfs::list_snapshots().unwrap().into_iter().filter (|snap| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this as it was when I pulled it in from the zone bundle code but do we want to block sled-agent from starting up with the unwrap()
or should we change this to a error!()
statement and carry on?
illumos-utils/src/zfs.rs
Outdated
pub fn full_path(&self) -> Result<Utf8PathBuf, GetValueError> { | ||
// TODO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed as #8023
})??; | ||
|
||
// Cleanup the zip file since we no longer need it | ||
if let Err(e) = tokio::fs::remove_file(&output_file).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's the idea.
method = GET, | ||
path = "/support/logs/zones", | ||
}] | ||
async fn support_logs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 23430d7
method = GET, | ||
path = "/support/logs/download/{zone}", | ||
}] | ||
async fn support_logs_download( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 23430d7
let mut archived: Vec<_> = service_logs | ||
.archived | ||
.into_iter() | ||
.filter(|log| log.path.as_str().contains("crypt/debug")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 23430d7
I added a comment about why we are searching for only crypt/debug.
sled-diagnostics/src/logs.rs
Outdated
snapshot_logfile: &Utf8Path, | ||
) -> Result<(), LogError> { | ||
let Some(log_name) = snapshot_logfile.file_name() else { | ||
debug!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to a warning. This is not a hard error as I want to collect as many logs as we can in a support bundle rather than bailing out early.
zip_path, | ||
FullFileOptions::default() | ||
.compression_method(zip::CompressionMethod::Zstd) | ||
.compression_level(Some(3)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some very minor comparisons locally that took into account speed vs size. So we are optimizing for speed here as the final compression doesn't matter as we are unpacking this zip on the nexus side before nexus assembles the final support bundle.
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
/// NB: Be careful when calling this method as it may return a `Utf8PathBuf` | ||
/// that does not actually map to a real filesystem path. On helios systems | ||
/// `rpool/ROOT/<BE>` for example will will return | ||
/// "legacy/.zfs/snapshot/<SNAP_NAME>" because the mountpoint of the dataset | ||
/// is a "legacy" mount. Additionally a fileystem with no mountpoint will | ||
/// have a zfs mountpoint property of "-". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this API, it seems really easy to misuse, even with this comment. I suppose we have filed an issue here, but I think we should prioritize fixing this quickly, rather than leaving it as tech debt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can pick this up right after we land this. I will probably steal the implementation from here and just update call sites
sled-diagnostics/src/logs.rs
Outdated
@@ -66,20 +65,25 @@ pub enum LogError { | |||
Zip(#[from] ZipError), | |||
} | |||
|
|||
///A ZFS snapshot that is taken by the `sled-diagnostics` crate and handles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
///A ZFS snapshot that is taken by the `sled-diagnostics` crate and handles | |
/// A ZFS snapshot that is taken by the `sled-diagnostics` crate and handles |
Created using spr 1.3.6-beta.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing further from me, thanks!
Take two of adding log files into support bundles.
This PR leverages oxlog to find all of the zones/services on a sled and then through a series of zfs snapshots will collect those logs and expose them as a zip file that nexus will slurp down and stick into the support bundle that is being collected.