[SCHEMATIC-240] stripped google sheet information from open telemetry span status and message #1573

Open
wants to merge 20 commits into
base: develop

Conversation

linglp
Contributor

@linglp linglp commented Feb 6, 2025

Problem

When an error involving Google Sheets is logged, the message may include the URL of the Google Sheet being processed. That link then shows up in SigNoz, in both logs and traces.

Solution

Create a custom log processor to filter out the sensitive information in the log

Evidence that this is working

Screenshot 2025-02-11 at 3 38 46 PM

@linglp linglp requested a review from a team as a code owner February 6, 2025 20:57
@andrewelamb
Contributor

Perhaps we should split the Jira issue out, with another to further investigate doing this via OTEL/SigNoz instead of in the code?

try:
    wb.set_dataframe(manifest_df, (1, 1), fit=True)
except HttpError as ex:
    pattern = r"https://sheets\.googleapis\.com/v4/spreadsheets/[\w-]+"
Contributor

Will this catch all google sheet urls?

Collaborator

Also my question. Could a change to the google api, that would otherwise be non-breaking, change the format of the URL used so that it's no longer appropriately found in the message string?

Collaborator

A change to the format of the URL would break this regex pattern. The regex could be updated to look for any /v. (any API version) instead of just /v4. This is the only change I think we could reasonably anticipate, and we can resolve that potential issue now, ahead of time.
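A minimal sketch of a version-agnostic pattern, assuming the URL keeps its current `sheets.googleapis.com/v<N>/spreadsheets/<id>` shape. The function name `redact_sheets_url` and the placeholder text are illustrative and not taken from this PR.

```python
import re

# Matches any numeric API version (v4, v5, v10, ...) rather than only v4.
SHEETS_API_URL = re.compile(
    r"https://sheets\.googleapis\.com/v\d+/spreadsheets/[\w-]+"
)


def redact_sheets_url(
    message: str, placeholder: str = "<REDACTED_GOOGLE_SHEETS_URL>"
) -> str:
    """Replace every Sheets API URL found in `message` with `placeholder`."""
    return SHEETS_API_URL.sub(placeholder, message)


example = (
    "HttpError 400 when requesting "
    "https://sheets.googleapis.com/v5/spreadsheets/abc-123_XYZ/values:batchUpdate"
)
print(redact_sheets_url(example))
```

Note that this pattern only covers the API endpoint form. A link of the form `https://docs.google.com/spreadsheets/d/...` would need its own pattern.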

@linglp
Contributor Author

linglp commented Feb 6, 2025

@andrewelamb I'm not entirely sure if I'm on the right track, so I'd like to wait until @BryanFauble returns to confirm. This implementation removes Google Sheet links in spans, but I'm unsure if the ticket also requires a more general approach to stripping all sensitive information. I'll make further modifications as I discuss more with Bryan.

@thomasyu888
Member

thomasyu888 commented Feb 7, 2025

not entirely sure if I'm on the right track... This implementation removes Google Sheet links in spans, but I'm unsure if the ticket also requires a more general approach to stripping all sensitive information.

@linglp this is a good line of thought. Some questions to guide you: Is this the only place Google Sheets are logged? Off the top of your head, do you know of any other sensitive information that is logged? Is there functionality in OTEL that filters out logs (e.g. https://signoz.io/blog/sending-and-filtering-python-logs-with-opentelemetry/#how-the-default-filter-keeps-out-unwanted-logs)?

@andrewelamb what are your thoughts? What we want to avoid is sensitive data being transferred to signoz cloud.

I'll add my personal views after this discussion.

- wb.set_dataframe(manifest_df, (1, 1), fit=True)
+ try:
+     wb.set_dataframe(manifest_df, (1, 1), fit=True)
+ except HttpError as ex:
Member

Pending further discussion of whether we think this is the right approach, this new exception that's being caught should have a unit test.

This is a good example of where a tiny design before doing the work would be helpful. That said, sometimes you have to do a little bit before knowing your options.

Contributor

Agreed.

Collaborator

Instead of an exception being caught in this way I would rather not have us modify any application code to implement a solution here.

I think the idea of using a log or span processor in the OTEL Python SDK is the appropriate solution.

The reason why I say this is:
Using the OTEL SDK gives us one spot where all redaction logic lives, we do not need to hunt around the code base to find all the spots we need to modify. It also gives us the template to follow and apply to other projects where we want to implement similar functionality.

Additional thoughts @andrewelamb @SageGJ ?

@andrewelamb
Contributor

@thomasyu888 I don't have anything to add at the moment on data getting into SigNoz that shouldn't (except to agree that it's bad :) ). This is something I'll keep in mind as I start working more closely with SigNoz, however.

@SageGJ
Collaborator

SageGJ commented Feb 7, 2025

What are y'all's thoughts about wrapping/modifying sys.excepthook, catching HttpErrors within, and sanitizing the messages there?

Something like*

import sys

# HttpError is assumed to be imported from the Google API client in use.
def custom_except_hook(exc_type, value, traceback):
  if issubclass(exc_type, HttpError):
    # message sanitizing
    # message raising
    pass
  else:
    sys.__excepthook__(exc_type, value, traceback)

sys.excepthook = custom_except_hook
  • would apply to multiple locations where an error is raised that includes a google sheets url
  • could be extended to modify different error types that contain other sensitive information
  • avoids having to wrap multiple blocks of code in try/except statements

@linglp @andrewelamb @thomasyu888

*modified from here

@linglp
Contributor Author

linglp commented Feb 7, 2025

Thanks for all the discussion here. In retrospect, a design document could help clarify things further. To summarize, the Google Sheet link was originally found in SigNoz traces, and this solution specifically removes it from traces and spans within a single function call. My plan was to confirm with Bryan whether removing sensitive information from traces is necessary, as the ticket only mentioned "logs," and whether handling it directly in the function (rather than using a custom trace processor) is an acceptable approach. I can also document what I’ve tried and why those approaches didn’t work separately.

Since the error originates from Google APIs, any part of the system that interacts with them could potentially trigger it. If Bryan confirms this approach is acceptable, I can proceed with adding unit tests and considering how to wrap the exception.

@linglp linglp marked this pull request as draft February 7, 2025 20:26
@thomasyu888
Member

thomasyu888 commented Feb 10, 2025

What are y'all's thoughts about wrapping/modifying sys.excepthook, catching HttpErrors within, and sanitizing the messages there?

@SageGJ here are my thoughts. Using sys.excepthook to filter out sensitive Google API URLs from exception messages can work and is creative, but some considerations:

  1. Setting sys.excepthook changes how all unhandled exceptions are processed globally in one module. I wonder if it would have issues when it's run in a multi-threaded or multi-module environment.
  2. This only catches uncaught exceptions. If the HttpError is handled somewhere else (e.g., inside a try-except block), this function won't intercept it.
  3. Useful debug info could be suppressed. This is a general issue: we actually want these logs when schematic is run as a CLI/library; we just don't want to send the information to SigNoz, @linglp. For example, users of the schematic CLI should have these Google Sheet links returned to them.
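To illustrate the second consideration, here is a small sketch using only the standard library. It installs a recording hook (the name `recording_hook` is illustrative) and shows that an exception handled in a try/except block never reaches `sys.excepthook`, so any sanitizing done there would be skipped. A `ValueError` stands in for the real `HttpError`.

```python
import sys

calls = []


def recording_hook(exc_type, value, tb):
    # Record which exception types reach the hook instead of printing them.
    calls.append(exc_type.__name__)


original_hook = sys.excepthook
sys.excepthook = recording_hook
try:
    try:
        raise ValueError("https://sheets.googleapis.com/v4/spreadsheets/abc123")
    except ValueError as ex:
        # The exception is handled here, so the interpreter never calls
        # sys.excepthook for it.
        handled_message = str(ex)
finally:
    # Restore the previous hook so the sketch has no lasting side effects.
    sys.excepthook = original_hook

print(calls)            # the hook was never invoked
print(handled_message)  # the unsanitized URL is still available to loggers/spans
```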

@SageGJ
Collaborator

SageGJ commented Feb 11, 2025

@thomasyu888 thanks for adding!
For the points you've added:

  1. I agree with the concern about multi-threaded or multi module environments. I was envisioning applying this across all of schematic, in the __init__.py file since we could specifically catch and modify the appropriate HTTP errors with sheets urls in them and pass the other exceptions to the regular handler.
  2. If the error is caught and handled without being raised will it still be logged in signoz?
  3. I agree. There's also the concern when signoz is used locally with the library/cli where we'd want to censor this information from reaching signoz.

Given points 1 and 3, and if we decide we'd want to handle this within schematic and not within OTEL itself, we could modify __init__.py where the tracing is currently set up so that it also includes something like

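import os
import sys

# HttpError is assumed to be imported from the Google API client in use.
def custom_except_hook(exc_type, value, traceback):
    if issubclass(exc_type, HttpError):
        # message sanitizing
        # message raising
        pass
    else:
        sys.__excepthook__(exc_type, value, traceback)


# Only install the hook when an OTLP endpoint is configured.
signoz_enabled = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
if signoz_enabled:
    sys.excepthook = custom_except_hook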

I noticed tracing is still enabled in the absence of the OTEL headers so it might be better to check for the presence of TRACING_EXPORT_FORMAT or LOGGING_EXPORT_FORMAT for tracing in general.
I also realize changing how we process all exceptions could be a bit much so if we go this route we'd want to be really strict in selecting which ones are caught and to minimize side effects.

        self._exporter = exporter
        self._shutdown = False

    def redact_google_sheet(self, message: str) -> str:
Member

@thomasyu888 thomasyu888 Feb 12, 2025

It's important we keep the logs when people are running the CLI. If I'm not mistaken, I think some CLI commands rely on the gsheets link being returned.

My guess is that this would fail some of the CLI tests.

See:

google_sheet_result = [
    result
    for result in result_list
    if result.startswith("https://docs.google.com/spreadsheets/d/")
]
assert len(google_sheet_result) == 1
google_sheet_url = google_sheet_result[0]

Collaborator

I don't think this is a concern for CLI usage. The reason is that the log messages sent via OTEL are a (deep?) copy of the data sent to stdout/console.

If I understand this solution, it will only affect the messages making their way into SigNoz and leave them as is in something like CloudWatch. That is probably fine: if someone had access to CloudWatch, we'd already be in serious trouble. But access to SigNoz is going to be given more freely, so we want to curate it carefully.

However, this should be tested with CLI usage to verify these assumptions.
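One way to picture the "redact only the exported copy" idea is the stdlib-only sketch below. It is not the code in this PR: `ListHandler` and `RedactingHandler` are hypothetical stand-ins for the console handler and the OTEL export path, and the regex covers only the `docs.google.com` link form.

```python
import copy
import logging
import re

SHEETS_URL = re.compile(r"https://docs\.google\.com/spreadsheets/d/[\w-]+")


class ListHandler(logging.Handler):
    """Stand-in for a real handler; stores formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


class RedactingHandler(logging.Handler):
    """Forwards a redacted copy of each record to `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target

    def emit(self, record):
        clone = copy.copy(record)  # leave the shared record untouched
        clone.msg = SHEETS_URL.sub("<REDACTED>", record.getMessage())
        clone.args = None  # the message is already fully formatted
        self.target.handle(clone)


console = ListHandler()   # stands in for stdout / the CLI
exported = ListHandler()  # stands in for the OTEL log exporter

logger = logging.getLogger("redaction_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(RedactingHandler(exported))
logger.addHandler(console)

logger.info("Manifest: %s", "https://docs.google.com/spreadsheets/d/abc-123")
print(console.messages)
print(exported.messages)
```

Because the redaction works on a copy, the console handler still receives the full link even though it runs after the redacting handler.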

@BryanFauble
Collaborator

Thanks for the discussion everyone (@linglp @andrewelamb @thomasyu888 @linglp )

#1573 (comment)

Captures my thoughts. I would rather we utilize the OTEL Python SDK to handle the sanitization of the data, rather than the traditional native Python ways. By implementing the logic via processors in the Python SDK, it will work similarly to how it would be implemented within the OpenTelemetry Collector, shown here: https://opentelemetry.io/docs/collector/architecture/#pipelines

Specifically, it has the concept of using one or more processors to handle data transformation before that data is exported. Since we are developing code locally, where a collector may not always be present, it's important to be able to filter this data out before it leaves the machine where the data is produced. Technically, when this code is running in AWS we could use the collector approach, but I wanted us to figure out if we could do it in code before the collector (Lingling is proving that we can in this pull request!)

Comment on lines 142 to 145
if span.status.status_code == trace.StatusCode.ERROR:
    if span.events:
        redacted_span = self._create_redacted_span(span)
        self.export.export([redacted_span])
Collaborator

Suggested change
- if span.status.status_code == trace.StatusCode.ERROR:
-     if span.events:
-         redacted_span = self._create_redacted_span(span)
-         self.export.export([redacted_span])
+ if span.status.status_code == trace.StatusCode.ERROR:
+     if span.events:
+         redacted_span = self._create_redacted_span(span)
+         self.export.export([redacted_span])
+         return
+ self.export.export([span])

Collaborator

Is this needed? I'm not sure tbh

Collaborator

I don't think this is going to work with how we have this set up because of:
https://github.com/open-telemetry/opentelemetry-python/blob/a7fe4f8bac7fa36291c6acf86982bbb356e3ae6d/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py#L173-L175

Specifically:

  1. We have a BatchSpanProcessor that was already responsible for queuing up and exporting the spans to SigNoz
  2. Based on the code at the link above the span is sent to each processor, in this case the data is likely being sent twice. Once in the BatchSpanProcessor that is still being called, and once here.
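The double export can be pictured with a simplified model of the fan-out. Everything below is a hypothetical stand-in rather than the OTEL SDK API: spans are plain strings, and `end_span` mimics a span being handed to every registered processor in turn.

```python
class ListExporter:
    """Stand-in exporter that just remembers what it was asked to export."""

    def __init__(self):
        self.exported = []

    def export(self, spans):
        self.exported.extend(spans)


class ForwardingProcessor:
    """Stand-in for a processor that exports every finished span as is."""

    def __init__(self, exporter):
        self.exporter = exporter

    def on_end(self, span):
        self.exporter.export([span])


class RedactingProcessor(ForwardingProcessor):
    """Exports a redacted copy, but cannot stop the other processor."""

    def on_end(self, span):
        self.exporter.export([span.replace("sheet-url", "<REDACTED>")])


def end_span(processors, span):
    # Simplified model of the fan-out: each processor gets the same span.
    for processor in processors:
        processor.on_end(span)


backend = ListExporter()
end_span(
    [ForwardingProcessor(backend), RedactingProcessor(backend)],
    "error at sheet-url",
)
print(backend.exported)
```

In this model the backend receives two entries for one span, and the unredacted one still gets through, which is the concern raised above.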

Collaborator

Here is a hack for getting around this issue:

  1. Instead of using a processor to handle this logic (which, as we can see, won't work), we could monkey patch the _readable_span function call: https://github.com/open-telemetry/opentelemetry-python/blob/a7fe4f8bac7fa36291c6acf86982bbb356e3ae6d/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py#L906-L921
  2. In the monkey patch we: 1) Start by calling our sensitive data redaction process to strip data out of the span, 2) Return a call to the original function.

By monkey patching this, it would allow us to essentially "slip" in logic before a read-only span has been created.
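A rough sketch of that wrapping pattern, using a hypothetical stand-in `Span` class so it runs without the SDK installed. The attribute `status_description`, the regex, and the placeholder are assumptions for illustration; the real span object has a different structure.

```python
import re

SHEETS_URL = re.compile(
    r"https://sheets\.googleapis\.com/v\d+/spreadsheets/[\w-]+"
)


class Span:
    """Hypothetical stand-in for the SDK span class; not the real API."""

    def __init__(self, status_description):
        self.status_description = status_description

    def _readable_span(self):
        # The real method builds a read-only span; a dict is enough here.
        return {"status_description": self.status_description}


# Keep a reference to the original so the patch can delegate to it.
_original_readable_span = Span._readable_span


def _redacting_readable_span(self):
    # 1) Strip sensitive data from the still-mutable span...
    self.status_description = SHEETS_URL.sub(
        "<REDACTED>", self.status_description
    )
    # 2) ...then return a call to the original function.
    return _original_readable_span(self)


Span._readable_span = _redacting_readable_span

span = Span("HttpError for https://sheets.googleapis.com/v4/spreadsheets/abc-123")
print(span._readable_span())
```

One trade-off of this approach is that the redaction mutates the span itself, so anything else reading the span afterwards also sees the redacted text.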

Let me know what questions you have.

@linglp linglp marked this pull request as ready for review February 19, 2025 00:39
5 participants