Support bundles should include log files #7540
```rust
// Stream the log file contents into the support bundle log file on disk
//
// TODO MTZ: is there a better way to do this?
```
I don't know if there's a better solution here. Open to other ideas if someone knows of a better way.
I don't think it's necessary to convert `e` to a string, right? https://doc.rust-lang.org/std/io/struct.Error.html#method.other is already generic.
FWIW, we use the `tokio_util::io::StreamReader` with our TUF artifact downloading too, and we use a similar `map_err` pattern (see `update-common/src/artifacts/update_plan.rs` for context).
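As a stdlib-only sketch of the point above (the values here are purely illustrative), `io::Error::other` accepts a concrete error type directly, so the intermediate string conversion can be dropped:

```rust
use std::io;

fn main() {
    // io::Error::other is generic over E: Into<Box<dyn std::error::Error + Send + Sync>>,
    // so a concrete error can be passed without stringifying it first.
    let parse_err = "not-a-number".parse::<i32>().unwrap_err();
    let io_err = io::Error::other(parse_err);
    assert_eq!(io_err.kind(), io::ErrorKind::Other);
}
```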
```rust
async fn support_logs(
    _request_context: RequestContext<Self::Context>,
) -> Result<HttpResponseOk<SledDiagnosticsLogs>, HttpError> {
    // We return an empty file index for support bundle lifecycle testing
```
I think the other tests are enough to cover the logic of finding and validating log paths. If we want, we could add something like a single log path to the index, and for downloading the file we could send back a predefined set of bytes. Going a step further would mean mocking more of what oxlog does, which kind of feels wrong.
```rust
) -> HashMap<String, CockroachExtraLog> {
    // Known logging paths for cockroachdb:
    // https://www.cockroachlabs.com/docs/stable/logging-overview#logging-destinations
    let cockroach_logs = [
```
I know we discussed briefly on the weekly sync calls where this list of files should live, but we never came to a conclusion. I am happy to move this to oxlog itself, or to do so in a future PR when I end up shuffling some other oxlog internals around.
I'm okay punting this, but agreed that it does seem more like a fit within oxlog.
```diff
@@ -772,6 +802,44 @@ async fn sha2_hash(file: &mut tokio::fs::File) -> anyhow::Result<ArtifactHash> {
     Ok(ArtifactHash(digest.as_slice().try_into()?))
 }

 /// For a given zone and service, save it's log into a support bundle path.
```

Suggested change:

```diff
-/// For a given zone and service, save it's log into a support bundle path.
+/// For a given zone and service, save its log into a support bundle path.
```
```rust
use thiserror::Error;

pub type Zone = String;
pub type Service = BTreeMap<String, Vec<Utf8PathBuf>>;
```
This is kind of a confusing type. Is this supposed to map "service name" to "all logs within that service"? We might benefit from documentation.
Also: is the path relative to the zone, or is it the path to that log from the global zone?
I will add a comment around this for clarification.
The path is from oxlog's point of view, which is the global zone's point of view. The relationship looks like `zone -> service -> [log files]` and is inherited from how oxlog represents the log files it returns.
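A hedged sketch of what that documentation might look like (using `String` in place of `Utf8PathBuf` so the snippet stays stdlib-only; the example path is made up):

```rust
use std::collections::BTreeMap;

/// A zone name as oxlog reports it.
pub type Zone = String;

/// Maps a service name to every log file found for that service.
/// Paths are absolute, as seen from the global zone (oxlog's point of view).
pub type Service = BTreeMap<String, Vec<String>>;

fn main() {
    let mut index: BTreeMap<Zone, Service> = BTreeMap::new();
    index
        .entry("global".to_string())
        .or_default()
        .entry("sled-agent".to_string())
        .or_default()
        .push("/var/svc/log/oxide-sled-agent:default.log".to_string());
    assert_eq!(index["global"]["sled-agent"].len(), 1);
}
```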
```rust
/// (sled agent API)
#[derive(Deserialize, JsonSchema)]
pub struct SledDiagnosticsLogsFilePathParam {
    /// The path of the file to be included in the support bundle
```
Does this path need to match the path emitted by the `support_logs` GET API?
Yeah, that's the idea: you retrieve the index of all the log files, and then we fire off a request for each one.
```rust
#[derive(Deserialize, JsonSchema)]
pub struct SledDiagnosticsLogsFilePathParam {
    /// The path of the file to be included in the support bundle
    pub file: String,
```
Could this be a `Utf8PathBuf`?
```rust
normalize_path("/find/the/./../secret/file"),
Utf8PathBuf::from(r"/find/the/secret/file")
```
This seems wrong to me though. I understand wanting to not parse `..` here, but claiming this is a "normalized" variant of the path does not seem correct.
I wish there was something in the stdlib for this - `canonicalize` exists, but it tries to walk symlinks, which we don't necessarily need.
But https://github.com/rust-lang/cargo/blob/fede83ccf973457de319ba6fa0e36ead454d2e20/src/cargo/util/paths.rs#L61-L86 is how cargo defines normalization, and it treats `..` as a signal to pop from the path. We don't need to do that, but as mentioned earlier, ignoring it seems incorrect.
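For reference, a cargo-style lexical normalization (a sketch of the linked approach, not this PR's implementation) resolves `.` and pops on `..` without ever touching the filesystem:

```rust
use std::path::{Component, Path, PathBuf};

// Sketch of cargo-style lexical normalization: "." is dropped and ".."
// pops the previous component; the filesystem is never consulted.
fn normalize_path(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for comp in path.components() {
        match comp {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other),
        }
    }
    out
}

fn main() {
    // Under this scheme the ".." actually takes effect, unlike the test case above:
    assert_eq!(
        normalize_path(Path::new("/find/the/./../secret/file")),
        PathBuf::from("/find/secret/file")
    );
}
```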
Yeah, `canonicalize`, as far as I remember, actually examines the filesystem rather than attempting to normalize the path. As stated above, should we just return an error if we encounter a `..` path component?
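The return-an-error alternative floated here could be as simple as the following sketch (function name is hypothetical):

```rust
use std::path::{Component, Path};

// Reject any request path containing a ".." component instead of
// attempting to normalize it away.
fn check_no_parent_components(path: &Path) -> Result<(), String> {
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return Err(format!("path {:?} contains a '..' component", path));
    }
    Ok(())
}

fn main() {
    assert!(check_no_parent_components(Path::new("/find/the/secret/file")).is_ok());
    assert!(check_no_parent_components(Path::new("/find/the/../secret/file")).is_err());
}
```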
```rust
let valid_dirs =
    tokio::task::spawn_blocking(|| sled_diagnostics::valid_log_dirs())
        .await
        .map_err(Error::Join)?
        .map_err(Error::Logs)?;
```
minor: I'm curious about the performance impact here. Accessing each individual log file requires constructing this listing of all log directories - not a bad choice for a first implementation, but I'm curious how costly that is relative to reading out the logs themselves.
We don't have to do this yet, but as one example, we could cache this result, and refresh it + retry if `validate_log_dir` returns an error.
I will grab some data here again, but IIRC grabbing the directories themselves is relatively fast compared to retrieving all of the log files.
We had discussed earlier keeping this directory listing in memory and refreshing it, but for whatever reason we decided that wasn't a good idea - I would be happy to implement such a cache in the `get_log_file` code path.
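A rough sketch of the cache-and-retry idea (all names hypothetical; the real `valid_log_dirs` call is stood in for by a closure):

```rust
use std::collections::BTreeSet;

// Cache the expensive directory listing; if a path fails validation,
// refresh the cache once and retry in case the listing went stale.
struct LogDirCache {
    dirs: Option<BTreeSet<String>>,
}

impl LogDirCache {
    fn new() -> Self {
        Self { dirs: None }
    }

    fn validate(&mut self, path: &str, refresh: impl Fn() -> BTreeSet<String>) -> bool {
        if self.dirs.is_none() {
            self.dirs = Some(refresh());
        }
        if let Some(dirs) = &self.dirs {
            if dirs.iter().any(|d| path.starts_with(d.as_str())) {
                return true;
            }
        }
        // Possibly stale (e.g. a new debug dataset appeared): refresh and retry once.
        let fresh = refresh();
        let ok = fresh.iter().any(|d| path.starts_with(d.as_str()));
        self.dirs = Some(fresh);
        ok
    }
}

fn main() {
    let mut cache = LogDirCache::new();
    let listing = || BTreeSet::from(["/pool/ext/crypt/debug".to_string()]);
    assert!(cache.validate("/pool/ext/crypt/debug/global/oxide-sled-agent.log", listing));
    assert!(!cache.validate("/etc/shadow", listing));
}
```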
```rust
let mut sorted: HashMap<String, CockroachExtraLog> = HashMap::new();
for file in extra {
    if let Some(file_name) = file.path.file_name() {
        if let Some(file_parts) = file_name.split_once(".") {
```
Why do we need to split the `file_name` like this? Can't we:

- Check that the `file_name` starts with a prefix in `cockroach_logs`
- Sort by the whole `file_name`? It's going to be sorted by prefix anyway
I ended up doing it like this because the paths look like:

```
cockroach-health.log
cockroach-health.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2025-01-31T21_43_26Z.011435.log
cockroach-health.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2025-02-01T01_51_53Z.011486.log
...snip...
```

They get broken out into buckets so that we have the current generation and then all the rotated log files, which should preserve the order we find things in so that we can take the last n items.
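The bucketing described here can be sketched as follows (filenames shortened for illustration):

```rust
use std::collections::HashMap;

fn main() {
    // The prefix before the first '.' identifies the log "channel";
    // everything sharing a prefix lands in one bucket, in discovery order.
    let files = [
        "cockroach-health.log",
        "cockroach-health.oxz.root.2025-01-31T21_43_26Z.011435.log",
        "cockroach-health.oxz.root.2025-02-01T01_51_53Z.011486.log",
        "cockroach.log",
    ];

    let mut buckets: HashMap<&str, Vec<&str>> = HashMap::new();
    for file in files {
        if let Some((prefix, _rest)) = file.split_once('.') {
            buckets.entry(prefix).or_default().push(file);
        }
    }

    assert_eq!(buckets["cockroach-health"].len(), 3);
    assert_eq!(buckets["cockroach"], vec!["cockroach.log"]);
}
```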
```rust
};
let metadata = file.metadata().await?;

if !metadata.is_file() {
```
Just to confirm - are any of our logs symbolic links, or do we expect them all to be plain files?
I expect them to all be plain files but I can double check.
Looks good, happy to see that we could re-use so much of oxlog for this purpose. My main issue here is with the potential TOCTTOU issue if a log is rotated between "querying for the list of logs" and "downloading logs". While it's usually okay to return an imperfect view of logs, it would be a bummer to be so prone to losing recent logs because the list changes.
```rust
// Archived
logs.entry(svc.clone()).or_default().extend(
    svclogs.archived.into_iter().rev().take(5).map(|s| s.path),
```
What's the `rev().take(5)` for?
Recall that we don't yet have any log rotation cleanup logic and environments like dogfood contain many log files, so we are taking the most recent 5 log rotations.
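A toy illustration of that selection, assuming oxlog yields archived logs oldest-first:

```rust
fn main() {
    // Archived logs arrive oldest to newest; reversing and taking five
    // keeps only the five most recent rotations.
    let archived = vec!["r1", "r2", "r3", "r4", "r5", "r6", "r7"];
    let most_recent: Vec<_> = archived.into_iter().rev().take(5).collect();
    assert_eq!(most_recent, vec!["r7", "r6", "r5", "r4", "r3"]);
}
```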
Aren't we using https://github.com/oxidecomputer/omicron/blob/0f81eb138b85f9fd1444a4cdedf86ea655b7670c/smf/logadm/logadm.conf? I'm trying to map that to https://illumos.org/man/8/logadm, but doesn't that use the `-C` argument, which limits the total count of logs?
(I'm not disagreeing with you, just trying to better grok which logs are unbounded)
This PR is superseded by #7973
With this PR, support bundles now include log files in the final bundle.zip.
We rely on oxlog to find the log files for the global zone (gz) and non-global zones (ngz), as well as for all other Oxide services. Sled-agent then exposes an endpoint to get a listing of all known log files, while an additional endpoint lets you fetch the contents of these log files, validating the path so that we do not allow arbitrary file retrieval through the internal API.