-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SCHEMATIC-240] stripped google sheet information from open telemetry span status and message #1573
base: develop
Are you sure you want to change the base?
Conversation
Perhaps we should split the Jira issue out, with another to further investiagte doing this via OTEL/Signoz instead of the code? |
schematic/manifest/generator.py
Outdated
try: | ||
wb.set_dataframe(manifest_df, (1, 1), fit=True) | ||
except HttpError as ex: | ||
pattern = r"https://sheets\.googleapis\.com/v4/spreadsheets/[\w-]+" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this catch all google sheet urls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also my question. Could a change to the google api, that would otherwise be non-breaking, change the format of the URL used so that it's no longer appropriately found in the message string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A change to the format of the URL would break this regex pattern. The regex could be updated to look for any /v.
instead of just /v4
. This is the only thing I think we could reasonable anticipate that may change, and we can resolve that potential issue now ahead of time.
@andrewelamb I'm not entirely sure if I'm on the right track, so I'd like to wait until @BryanFauble returns to confirm. This implementation removes Google Sheet links in spans, but I'm unsure if the ticket also requires a more general approach to stripping all sensitive information. I'll make further modifications as I discuss more with Bryan. |
@linglp this is a good line of thought. Some questions to guide you. Is this the only place googlesheets are logged? Off the top of your head, do you know of any other sensitive information that is logged? Is there functionality with OTEL that filters out logs? (e,g https://signoz.io/blog/sending-and-filtering-python-logs-with-opentelemetry/#how-the-default-filter-keeps-out-unwanted-logs?) @andrewelamb what are your thoughts? What we want to avoid is sensitive data being transferred to signoz cloud. I'll add my personal views after this discussion. |
schematic/manifest/generator.py
Outdated
wb.set_dataframe(manifest_df, (1, 1), fit=True) | ||
try: | ||
wb.set_dataframe(manifest_df, (1, 1), fit=True) | ||
except HttpError as ex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending further discussion of whether we think this is the right approach, this new exception that's being caught should have a unit test.
This is a good example of tiny design before doing the work would be helpful, that said, sometimes you have to do a little bit before knowing your options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of an exception being caught in this way I would rather not have us modify any application code to implement a solution here.
I think the idea of using a log or span processor in the OTEL Python SDK is the appropriate solution.
The reason why I say this is:
Using the OTEL SDK gives us one spot where all redaction logic lives, we do not need to hunt around the code base to find all the spots we need to modify. It also gives us the template to follow and apply to other projects where we want to implement similar functionality.
Additional thoughts @andrewelamb @SageGJ ?
@thomasyu888 I don't have anythign to add at the moment on data getting into Signoz that shouldn't(Except to agree that it's bad :) ). This is somethign I'll keep in mind as I start working more cloesly with Signoz however. |
What are y'all's thoughts about wrapping/modifying sys.excepthook, catching Something like*
@linglp @andrewelamb @thomasyu888 *modified from here |
Thanks for all the discussion here. In retrospect, a design document could help clarify things further. To summarize, the Google Sheet link was originally found in SigNoz traces, and this solution specifically removes it from traces and spans within a single function call. My plan was to confirm with Bryan whether removing sensitive information from traces is necessary, as the ticket only mentioned "logs," and whether handling it directly in the function (rather than using a custom trace processor) is an acceptable approach. I can also document what I’ve tried and why those approaches didn’t work separately. Since the error originates from Google APIs, any part of the system that interacts with them could potentially trigger it. If Bryan confirms this approach is acceptable, I can proceed with adding unit tests and considering how to wrap the exception. |
@SageGJ here are my thoughts. Using
|
@thomasyu888 thanks for adding!
Given points 1 and 3, and if we decide we'd want to handle this within schematic and not within OTEL itself, we could modify
I noticed tracing is still enabled in the absence of the OTEL headers so it might be better to check for the presence of |
schematic/__init__.py
Outdated
self._exporter = exporter | ||
self._shutdown = False | ||
|
||
def redact_google_sheet(self, message: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important we keep the logs when people are running the CLI. If I'm not mistaken, I think some CLI commands rely on the gsheets link being returned.
My guess is that this would fail some of the CLI tests.
See:
schematic/tests/integration/test_commands.py
Lines 675 to 681 in c50c599
google_sheet_result = [ | |
result | |
for result in result_list | |
if result.startswith("https://docs.google.com/spreadsheets/d/") | |
] | |
assert len(google_sheet_result) == 1 | |
google_sheet_url = google_sheet_result[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a concern for CLI usage. The reason is that the log messages that are sent via OTEL is a (deep?) copy of the data sent to stdout/console.
If I understand this solution it will only affect the messages making their way into SigNoz, but leave them as is in something like cloudwatch. Which, is probably fine since if someone had access to cloud watch, we're already in serious trouble. But, access to SigNoz is going to be given more freely, so we want to curate it carefully.
However, it should be tested to verify these assumption with the cli usage
Thanks for the discussion everyone (@linglp @andrewelamb @thomasyu888 @linglp ) Captures my thoughts. I would rather we utilize the OTEL Python SDK to handle the sanitization of the data, rather than the traditional native python ways. By implementing the logic via processors in the Python SDK it will work similar to the concept of how it would be implemented within the OpenTelemtry Collector shown here: https://opentelemetry.io/docs/collector/architecture/#pipelines Specifically it has the concept of using one or more processors to handle data transformation before that data is exported. Since we are developing code locally without a collector that may always be present, It's important to be able to filter this data out before it leaves the machine where the data is produced. Technically when this code is running in AWS we could use the collector approach, but i wanted us to figure out if we could do it in code before the collector (Lingling is proving that we can in this pull request!) |
schematic/__init__.py
Outdated
if span.status.status_code == trace.StatusCode.ERROR: | ||
if span.events: | ||
redacted_span = self._create_redacted_span(span) | ||
self.export.export([redacted_span]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if span.status.status_code == trace.StatusCode.ERROR: | |
if span.events: | |
redacted_span = self._create_redacted_span(span) | |
self.export.export([redacted_span]) | |
if span.status.status_code == trace.StatusCode.ERROR: | |
if span.events: | |
redacted_span = self._create_redacted_span(span) | |
self.export.export([redacted_span]) | |
return | |
self.export.export([span]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? I'm not sure tbh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is going to work with how we have this set up because of:
https://github.com/open-telemetry/opentelemetry-python/blob/a7fe4f8bac7fa36291c6acf86982bbb356e3ae6d/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py#L173-L175
Specifically:
- We have a
BatchSpanProcessor
that was already responsible for queuing up and exporting the spans to SigNoz - Based on the code at the link above the span is sent to each processor, in this case the data is likely being sent twice. Once in the
BatchSpanProcessor
that is still being called, and once here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a hack how we can get around this issue:
- Instead of using a processor to handle this logic, as we can see, won't work we could monkey patch the
_readable_span
function call: https://github.com/open-telemetry/opentelemetry-python/blob/a7fe4f8bac7fa36291c6acf86982bbb356e3ae6d/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py#L906-L921 - In the monkey patch we: 1) Start by calling our sensitive data redaction process to strip data out of the span, 2) Return a call to the original function.
By monkey patching this, it would allow us to essentially "slip" in logic before a read-only span has been created.
Let me know what questions you have.
|
Problem
When an error involving Google Sheets is logged, the message may include the URL of the Google Sheet being processed and the google sheet link can be found in signoz both in log and in traces.
Solution
Create a custom log processor to filter out the sensitive information in the log
Evidence that this is working