Support bundles should include log files #7540
```rust
// Stream the log file contents into the support bundle log file on disk
//
// TODO MTZ: is there a better way to do this?
```
I don't know if there's a better solution here. Open to other ideas if someone knows of a better way.
I don't think it's necessary to convert `e` to a string, right? https://doc.rust-lang.org/std/io/struct.Error.html#method.other is already generic.
FWIW, we use the `tokio_util::io::StreamReader` with our TUF artifact downloading too, and we use a similar `map_err` pattern (see `update-common/src/artifacts/update_plan.rs` for context).
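As a stdlib-only sketch of the point above (the values here are purely illustrative), `io::Error::other` accepts a concrete error type directly, so the intermediate string conversion can be dropped:

```rust
use std::io;

fn main() {
    // io::Error::other is generic over E: Into<Box<dyn std::error::Error + Send + Sync>>,
    // so a concrete error can be passed without stringifying it first.
    let parse_err = "not-a-number".parse::<i32>().unwrap_err();
    let io_err = io::Error::other(parse_err);
    assert_eq!(io_err.kind(), io::ErrorKind::Other);
}
```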
```rust
async fn support_logs(
    _request_context: RequestContext<Self::Context>,
) -> Result<HttpResponseOk<SledDiagnosticsLogs>, HttpError> {
    // We return an empty file index for support bundle lifecycle testing
```
I think the other tests are enough to cover the logic of finding and validating log paths. If we want, we could add something like a single log path to the index, and for downloading the file we could send back a predefined set of bytes. Going a step further would mean mocking more of what oxlog does, which kind of feels wrong.
```rust
) -> HashMap<String, CockroachExtraLog> {
    // Known logging paths for cockroachdb:
    // https://www.cockroachlabs.com/docs/stable/logging-overview#logging-destinations
    let cockroach_logs = [
```
I know we discussed briefly on the weekly sync calls where this list of files should live, but we never came to a conclusion. I am happy to move this to oxlog itself, or to do so in a future PR when I end up shuffling some other oxlog internals around.
I'm okay punting this, but agreed that it does seem more like a fit within oxlog.
```diff
@@ -772,6 +802,44 @@ async fn sha2_hash(file: &mut tokio::fs::File) -> anyhow::Result<ArtifactHash> {
     Ok(ArtifactHash(digest.as_slice().try_into()?))
 }

 /// For a given zone and service, save it's log into a support bundle path.
```

Suggested change:

```diff
-/// For a given zone and service, save it's log into a support bundle path.
+/// For a given zone and service, save its log into a support bundle path.
```
```rust
use thiserror::Error;

pub type Zone = String;
pub type Service = BTreeMap<String, Vec<Utf8PathBuf>>;
```
This is kind of a confusing type. Is this supposed to map "service name" to "all logs within that service"? We might benefit from documentation.
Also: is the path relative to the zone, or is it the path to that log from the global zone?
I will add a comment around this for clarification.
The path is from oxlog's point of view, which is the global zone's point of view. The relationship looks like `zone -> service -> [log files]` and is inherited from how oxlog represents the log files it returns.
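A hedged sketch of what that documentation might look like (using `String` in place of `Utf8PathBuf` so the snippet stays stdlib-only; the example path is made up):

```rust
use std::collections::BTreeMap;

/// A zone name as oxlog reports it.
pub type Zone = String;

/// Maps a service name to every log file found for that service.
/// Paths are absolute, as seen from the global zone (oxlog's point of view).
pub type Service = BTreeMap<String, Vec<String>>;

fn main() {
    let mut index: BTreeMap<Zone, Service> = BTreeMap::new();
    index
        .entry("global".to_string())
        .or_default()
        .entry("sled-agent".to_string())
        .or_default()
        .push("/var/svc/log/oxide-sled-agent:default.log".to_string());
    assert_eq!(index["global"]["sled-agent"].len(), 1);
}
```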
```rust
/// (sled agent API)
#[derive(Deserialize, JsonSchema)]
pub struct SledDiagnosticsLogsFilePathParam {
    /// The path of the file to be included in the support bundle
```
Does this path need to match the path emitted by the `support_logs` GET API?
Yeah, that's the idea: you retrieve the index of all the log files, and then we fire off a request for each one.
```rust
#[derive(Deserialize, JsonSchema)]
pub struct SledDiagnosticsLogsFilePathParam {
    /// The path of the file to be included in the support bundle
    pub file: String,
```
Could this be a `Utf8PathBuf`?
```rust
normalize_path("/find/the/./../secret/file"),
Utf8PathBuf::from(r"/find/the/secret/file")
```
This seems wrong to me though. I understand wanting to not parse `..` here, but claiming this is a "normalized" variant of the path does not seem correct.
I wish there was something in the stdlib for this - `canonicalize` exists, but it tries to walk symlinks, which we don't necessarily need.
But https://github.com/rust-lang/cargo/blob/fede83ccf973457de319ba6fa0e36ead454d2e20/src/cargo/util/paths.rs#L61-L86 is how cargo defines normalization, and it treats `..` as a signal to pop from the path. We don't need to do that, but as mentioned earlier, ignoring it seems incorrect.
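For reference, a cargo-style lexical normalization (a sketch of the linked approach, not this PR's implementation) resolves `.` and pops on `..` without ever touching the filesystem:

```rust
use std::path::{Component, Path, PathBuf};

// Sketch of cargo-style lexical normalization: "." is dropped and ".."
// pops the previous component; the filesystem is never consulted.
fn normalize_path(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for comp in path.components() {
        match comp {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other),
        }
    }
    out
}

fn main() {
    // Under this scheme the ".." actually takes effect, unlike the test case above:
    assert_eq!(
        normalize_path(Path::new("/find/the/./../secret/file")),
        PathBuf::from("/find/secret/file")
    );
}
```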
Yeah, `canonicalize`, as far as I remember, actually examines the filesystem rather than attempting to normalize the path. As stated above, should we just return an error if we encounter a `..` path component?
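The return-an-error alternative floated here could be as simple as the following sketch (function name is hypothetical):

```rust
use std::path::{Component, Path};

// Reject any request path containing a ".." component instead of
// attempting to normalize it away.
fn check_no_parent_components(path: &Path) -> Result<(), String> {
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return Err(format!("path {:?} contains a '..' component", path));
    }
    Ok(())
}

fn main() {
    assert!(check_no_parent_components(Path::new("/find/the/secret/file")).is_ok());
    assert!(check_no_parent_components(Path::new("/find/the/../secret/file")).is_err());
}
```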
```rust
let valid_dirs =
    tokio::task::spawn_blocking(|| sled_diagnostics::valid_log_dirs())
        .await
        .map_err(Error::Join)?
        .map_err(Error::Logs)?;
```
minor: I'm curious about the performance impact here. Accessing each individual log file requires constructing this listing of all log directories - not a bad choice for a first implementation, but I'm curious how costly that is relative to reading out the logs themselves.
We don't have to do this yet, but as one example, we could cache this result, and refresh it + retry if `validate_log_dir` returns an error.
I will grab some data here again, but IIRC grabbing the directories themselves is relatively fast compared to retrieving all of the log files.
We had discussed earlier keeping this directory listing in memory and refreshing it, but for whatever reason we decided that wasn't a good idea - I would be happy to implement such a cache in the `get_log_file` code path.
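A rough sketch of the cache-and-retry idea (all names hypothetical; the real `valid_log_dirs` call is stood in for by a closure):

```rust
use std::collections::BTreeSet;

// Cache the expensive directory listing; if a path fails validation,
// refresh the cache once and retry in case the listing went stale.
struct LogDirCache {
    dirs: Option<BTreeSet<String>>,
}

impl LogDirCache {
    fn new() -> Self {
        Self { dirs: None }
    }

    fn validate(&mut self, path: &str, refresh: impl Fn() -> BTreeSet<String>) -> bool {
        if self.dirs.is_none() {
            self.dirs = Some(refresh());
        }
        if let Some(dirs) = &self.dirs {
            if dirs.iter().any(|d| path.starts_with(d.as_str())) {
                return true;
            }
        }
        // Possibly stale (e.g. a new debug dataset appeared): refresh and retry once.
        let fresh = refresh();
        let ok = fresh.iter().any(|d| path.starts_with(d.as_str()));
        self.dirs = Some(fresh);
        ok
    }
}

fn main() {
    let mut cache = LogDirCache::new();
    let listing = || BTreeSet::from(["/pool/ext/crypt/debug".to_string()]);
    assert!(cache.validate("/pool/ext/crypt/debug/global/oxide-sled-agent.log", listing));
    assert!(!cache.validate("/etc/shadow", listing));
}
```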
```rust
let mut sorted: HashMap<String, CockroachExtraLog> = HashMap::new();
for file in extra {
    if let Some(file_name) = file.path.file_name() {
        if let Some(file_parts) = file_name.split_once(".") {
```
Why do we need to split the `file_name` like this? Can't we:

- Check that the `file_name` starts with a prefix in `cockroach_logs`
- Sort by the whole `file_name`? It's going to be sorted by prefix anyway
I ended up doing it like this because the paths look like:

```
cockroach-health.log
cockroach-health.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2025-01-31T21_43_26Z.011435.log
cockroach-health.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2025-02-01T01_51_53Z.011486.log
...snip...
```

They get broken out into buckets so that we have the current generation and then all the rotated log files, which should preserve the order we find things in so that we can take the last n items.
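The bucketing described here can be sketched as follows (filenames shortened for illustration):

```rust
use std::collections::HashMap;

fn main() {
    // The prefix before the first '.' identifies the log "channel";
    // everything sharing a prefix lands in one bucket, in discovery order.
    let files = [
        "cockroach-health.log",
        "cockroach-health.oxz.root.2025-01-31T21_43_26Z.011435.log",
        "cockroach-health.oxz.root.2025-02-01T01_51_53Z.011486.log",
        "cockroach.log",
    ];

    let mut buckets: HashMap<&str, Vec<&str>> = HashMap::new();
    for file in files {
        if let Some((prefix, _rest)) = file.split_once('.') {
            buckets.entry(prefix).or_default().push(file);
        }
    }

    assert_eq!(buckets["cockroach-health"].len(), 3);
    assert_eq!(buckets["cockroach"], vec!["cockroach.log"]);
}
```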
```rust
};
let metadata = file.metadata().await?;

if !metadata.is_file() {
```
Just to confirm - are any of our logs symbolic links, or do we expect them all to be plain files?
I expect them to all be plain files but I can double check.
Looks good, happy to see that we could re-use so much of oxlog for this purpose. My main issue here is with the potential TOCTTOU issue if a log is rotated between "querying for the list of logs" and "downloading logs". While it's usually okay to return an imperfect view of logs, it would be a bummer to be so prone to losing recent logs because the list changes.
```rust
// Archived
logs.entry(svc.clone()).or_default().extend(
    svclogs.archived.into_iter().rev().take(5).map(|s| s.path),
```
What's the `rev().take(5)` for?
Recall that we don't yet have any log rotation cleanup logic and environments like dogfood contain many log files, so we are taking the most recent 5 log rotations.
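A toy illustration of that selection, assuming oxlog yields archived logs oldest-first:

```rust
fn main() {
    // Archived logs arrive oldest to newest; reversing and taking five
    // keeps only the five most recent rotations.
    let archived = vec!["r1", "r2", "r3", "r4", "r5", "r6", "r7"];
    let most_recent: Vec<_> = archived.into_iter().rev().take(5).collect();
    assert_eq!(most_recent, vec!["r7", "r6", "r5", "r4", "r3"]);
}
```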
Aren't we using https://github.com/oxidecomputer/omicron/blob/0f81eb138b85f9fd1444a4cdedf86ea655b7670c/smf/logadm/logadm.conf? I'm trying to map that to https://illumos.org/man/8/logadm, but doesn't that use the `-C` argument, which limits the total count of logs?
(I'm not disagreeing with you, just trying to better grok which logs are unbounded)
This PR is superseded by #7973
With this PR, support bundles now include log files in the final bundle.zip.
We rely on oxlog to find the log files for the global zone (gz) and non-global zones (ngz), as well as for all other Oxide services. Sled-agent then exposes an endpoint to get a listing of all known log files, while an additional endpoint lets you fetch the contents of these log files, validating the path so that we do not allow arbitrary file retrieval through the internal API.