
Conversation

papertigers
Contributor

@papertigers papertigers commented Feb 13, 2025

With this PR, support bundles now include log files in the final bundle.zip.

We rely on oxlog to find the log files for the global zone (gz) and non-global zones (ngz), as well as for all other Oxide services. Sled-agent then exposes an endpoint that returns a listing of all known log files, and an additional endpoint returns the contents of those files, validating the path so that we do not allow arbitrary file retrieval through the internal API.

papertigers and others added 14 commits February 13, 2025 20:27
@papertigers papertigers marked this pull request as ready for review March 5, 2025 19:05

// Stream the log file contents into the support bundle log file on disk
//
// TODO MTZ: is there a better way to do this?
Contributor Author

I don't know if there's a better solution here. Open to other ideas if someone knows of a better way.

Collaborator

I don't think it's necessary to convert e to a string, right? https://doc.rust-lang.org/std/io/struct.Error.html#method.other already is generic.

FWIW, we use the tokio_util::io::StreamReader with our TUF artifact downloading too, and we use a similar "map_err" pattern (see: update-common/src/artifacts/update_plan.rs for context)
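For illustration, a minimal sketch of that StreamReader + map_err pattern (the helper name write_log_to_bundle and the use of reqwest are assumptions for the example, not necessarily what this PR does): the body stream's error is mapped through io::Error::other and the reader is copied straight into the destination file.

use futures::TryStreamExt;
use tokio_util::io::StreamReader;

// Hypothetical helper: stream an HTTP response body into an already-open
// support bundle log file on disk.
async fn write_log_to_bundle(
    response: reqwest::Response,
    output: &mut tokio::fs::File,
) -> std::io::Result<u64> {
    // bytes_stream() yields Result<Bytes, reqwest::Error>; io::Error::other is
    // generic over the error type, so no to_string() conversion is needed.
    let stream = response.bytes_stream().map_err(std::io::Error::other);
    let mut reader = std::pin::pin!(StreamReader::new(stream));
    // Copy the reader's contents directly into the file.
    tokio::io::copy(&mut reader, output).await
}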

async fn support_logs(
    _request_context: RequestContext<Self::Context>,
) -> Result<HttpResponseOk<SledDiagnosticsLogs>, HttpError> {
    // We return an empty file index for support bundle lifecycle testing
Contributor Author

I think the other tests are enough to cover the logic of finding and validating log paths. If we wanted, we could add something like a single log path to the index and send back a predefined set of bytes when downloading the file. Going a step further would mean mocking more of what oxlog does, which kind of feels wrong.

@papertigers papertigers changed the title from "WIP oxlog in support bundles" to "Support bundles should include log files" Mar 5, 2025
) -> HashMap<String, CockroachExtraLog> {
    // Known logging paths for cockroachdb:
    // https://www.cockroachlabs.com/docs/stable/logging-overview#logging-destinations
    let cockroach_logs = [
Contributor Author

I know we discussed briefly on the weekly sync calls where this list of files should live but we never came to a conclusion. I am happy to move this to oxlog itself or to do so in a future PR when I end up shuffling some other oxlog internals around.

Collaborator

I'm okay punting this, but agreed that it does seem more like a fit within oxlog.

@@ -772,6 +802,44 @@ async fn sha2_hash(file: &mut tokio::fs::File) -> anyhow::Result<ArtifactHash> {
Ok(ArtifactHash(digest.as_slice().try_into()?))
}

/// For a given zone and service, save it's log into a support bundle path.
Collaborator

Suggested change
/// For a given zone and service, save it's log into a support bundle path.
/// For a given zone and service, save its log into a support bundle path.



use thiserror::Error;

pub type Zone = String;
pub type Service = BTreeMap<String, Vec<Utf8PathBuf>>;
Collaborator

This is kinda a confusing type. Is this supposed to be "service name" to "all logs within that service"?

We might benefit from documentation.

Also: Is the path the path relative to the zone? Or the path to that log from the global zone?

Contributor Author

I will add a comment around this for clarification.
The path is from the view of oxlog, which is the global zone's point of view. The relationship looks like zone -> service -> [log files], and it is inherited from how oxlog represents the log files it returns.
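As a rough sketch of what that clarifying documentation could look like (the wording here is illustrative, not taken from the PR):

use camino::Utf8PathBuf;
use std::collections::BTreeMap;

/// A zone name as reported by oxlog (global zone or non-global zone).
pub type Zone = String;

/// Maps a service name to all log files found for that service. Paths are
/// the ones oxlog discovered, expressed from the global zone's point of view.
pub type Service = BTreeMap<String, Vec<Utf8PathBuf>>;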

/// (sled agent API)
#[derive(Deserialize, JsonSchema)]
pub struct SledDiagnosticsLogsFilePathParam {
    /// The path of the file to be included in the support bundle
Collaborator

Does this path need to match the path emitted by the support_logs GET api?

Contributor Author

Yeah, that's the idea: you retrieve the index of all the log files, and then we fire off a request for each one.

#[derive(Deserialize, JsonSchema)]
pub struct SledDiagnosticsLogsFilePathParam {
    /// The path of the file to be included in the support bundle
    pub file: String,
Collaborator

Could this be a Utf8PathBuf?

Comment on lines +207 to +208
normalize_path("/find/the/./../secret/file"),
Utf8PathBuf::from(r"/find/the/secret/file")
Collaborator

This seems wrong to me though. I understand wanting to not parse .. here, but claiming this is a "normalized" variant of the path does not seem correct.

I wish there was something in the stdlib for this - canonicalize exists, but it tries to walk symlinks, which we don't necessarily need.

But https://github.com/rust-lang/cargo/blob/fede83ccf973457de319ba6fa0e36ead454d2e20/src/cargo/util/paths.rs#L61-L86 is how cargo defines normalization, and it treats .. as a signal to pop from the path. We don't need to do that, but as mentioned earlier, ignoring it seems incorrect.

Contributor Author

Yeah, canonicalize, as far as I remember, actually examines the filesystem rather than just normalizing the path. As stated above, should we just return an error if we encounter a .. path component?
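For example, a minimal sketch of erroring on .. (the function shape and error type here are assumptions, not the PR's code):

use camino::{Utf8Component, Utf8Path, Utf8PathBuf};

// Hypothetical variant: drop "." components but refuse ".." outright rather
// than silently ignoring it.
fn normalize_path(path: &Utf8Path) -> Result<Utf8PathBuf, String> {
    let mut normalized = Utf8PathBuf::new();
    for component in path.components() {
        match component {
            // "." adds nothing to the path, so skip it.
            Utf8Component::CurDir => {}
            // A log path handed to this API should never contain "..".
            Utf8Component::ParentDir => {
                return Err(format!("path contains a parent component: {path}"));
            }
            other => normalized.push(other.as_str()),
        }
    }
    Ok(normalized)
}

// Under this behavior, "/find/the/./../secret/file" is rejected instead of
// being "normalized" to "/find/the/secret/file".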

Comment on lines +84 to +88
let valid_dirs =
    tokio::task::spawn_blocking(|| sled_diagnostics::valid_log_dirs())
        .await
        .map_err(Error::Join)?
        .map_err(Error::Logs)?;
Collaborator

minor: I'm curious about the performance impact here. Accessing each individual log file requires constructing this directory of all log directories - not a bad choice for a first implementation, but I'm curious how costly that is relative to reading out the logs themselves.

We don't have to do this yet but as one example, we could cache this result, and refresh it + retry if validate_log_dir returns an error.

Contributor Author

I will grab some data here again, but IIRC grabbing the directories themselves is relatively fast compared to retrieving all of the log files.

We had discussed earlier keeping this directory in memory and refreshing it, but for whatever reason we decided that wasn't a good idea - I would be happy to implement such a cache in the get_log_file code path.
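A rough sketch of that cache idea (LogDirCache and fetch_dirs are hypothetical names; in practice fetch_dirs would wrap the spawn_blocking call around sled_diagnostics::valid_log_dirs()):

use camino::{Utf8Path, Utf8PathBuf};
use std::sync::Mutex;

// Hypothetical cache: remember the valid log directories, and on a
// validation miss rebuild the set once and retry, since zones (and their
// log directories) can come and go between requests.
#[derive(Default)]
struct LogDirCache {
    dirs: Mutex<Option<Vec<Utf8PathBuf>>>,
}

impl LogDirCache {
    fn validate(
        &self,
        path: &Utf8Path,
        fetch_dirs: impl Fn() -> Vec<Utf8PathBuf>,
    ) -> bool {
        let mut guard = self.dirs.lock().unwrap();
        // Populate the cache lazily on first use.
        let dirs = guard.get_or_insert_with(&fetch_dirs);
        if dirs.iter().any(|dir| path.starts_with(dir)) {
            return true;
        }
        // The cached view may be stale; refresh it and check one more time.
        *dirs = fetch_dirs();
        dirs.iter().any(|dir| path.starts_with(dir))
    }
}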


let mut sorted: HashMap<String, CockroachExtraLog> = HashMap::new();
for file in extra {
    if let Some(file_name) = file.path.file_name() {
        if let Some(file_parts) = file_name.split_once(".") {
Collaborator

Why do we need to split the file_name like this? Can't we:

  1. Check that the file_name starts with a prefix in cockroach_logs
  2. Sort by the whole file_name? It's going to be sorted by prefix anyway

Contributor Author

I ended up doing it like this because the paths look like:

cockroach-health.log
cockroach-health.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2025-01-31T21_43_26Z.011435.log
cockroach-health.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2025-02-01T01_51_53Z.011486.log
...snip...

They get broken out into buckets so that we have the current generation plus all the rotated log files, preserving the order we found them in so that we can take the last n items.
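As a simplified sketch of that bucketing (LogBucket here is a stand-in for the PR's CockroachExtraLog type):

use std::collections::HashMap;

// Hypothetical stand-in for CockroachExtraLog: the live log plus the rotated
// generations, kept in the order oxlog returned them.
#[derive(Default, Debug)]
struct LogBucket {
    current: Option<String>,
    rotated: Vec<String>,
}

fn bucket_cockroach_logs<'a>(
    file_names: impl Iterator<Item = &'a str>,
) -> HashMap<String, LogBucket> {
    let mut buckets: HashMap<String, LogBucket> = HashMap::new();
    for name in file_names {
        // "cockroach-health.log" splits into ("cockroach-health", "log");
        // a rotated file splits into ("cockroach-health", "oxz...<ts>.log").
        if let Some((prefix, rest)) = name.split_once('.') {
            let bucket = buckets.entry(prefix.to_string()).or_default();
            if rest == "log" {
                bucket.current = Some(name.to_string());
            } else {
                bucket.rotated.push(name.to_string());
            }
        }
    }
    buckets
}

From each bucket we can then keep the current file and the last few rotated entries without mixing up the different cockroach log streams.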

};
let metadata = file.metadata().await?;

if !metadata.is_file() {
Collaborator

Just to confirm - are any of our logs symbolic links, or do we expect them all to be plain files?

Contributor Author

I expect them to all be plain files but I can double check.

@smklein
Collaborator

smklein commented Mar 5, 2025

Looks good, happy to see that we could re-use so much of oxlog for this purpose.

My main issue here is with the potential TOCTTOU issue if a log is rotated between "querying for the list of logs" and "downloading logs". While it's usually okay to return an imperfect view of logs, it would be a bummer to be so prone to losing recent logs because the list changes.

@papertigers papertigers changed the base branch from spr/papertigers/main.wip-oxlog-in-support-bundles to main March 5, 2025 22:19

// Archived
logs.entry(svc.clone()).or_default().extend(
    svclogs.archived.into_iter().rev().take(5).map(|s| s.path),
Collaborator

What's the rev().take(5) for?

Contributor Author

Recall that we don't yet have any log rotation cleanup logic and environments like dogfood contain many log files, so we are taking the most recent 5 log rotations.

Collaborator

Aren't we using https://github.com/oxidecomputer/omicron/blob/0f81eb138b85f9fd1444a4cdedf86ea655b7670c/smf/logadm/logadm.conf ? I'm trying to map that to https://illumos.org/man/8/logadm , but doesn't that use the -C argument, which limits the total count of logs?

Collaborator

(I'm not disagreeing with you, just trying to better grok which logs are unbounded)

@twinfees twinfees assigned wfchandler and unassigned wfchandler Mar 18, 2025
@papertigers
Contributor Author

This PR is superseded by #7973

@papertigers papertigers deleted the spr/papertigers/wip-oxlog-in-support-bundles branch April 15, 2025 01:22