Log stack traces on data nodes before they are cleared for transport #125732
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @benchaplin, I've created a changelog YAML for you.
I left some comments, thanks @benchaplin !
header = Boolean.parseBoolean(threadPool.getThreadContext().getHeaderOrDefault("error_trace", "false"));
}
if (header == false) {
    return listener.delegateResponse((l, e) -> {
        logger.debug(
            () -> format("[%s]%s Clearing stack trace before transport:", clusterService.localNode().getId(), request.shardId()),
This does look like the best place to add the logging, because it ensures the additional logging happens only in the cases where we suppress the stack trace, and before we do so.
If we log this at debug, we are not going to see it with the default log level, are we? I think we should use at least warn instead.
The error message also looks a little misleading. All we are interested in is the error itself, so I would log the same thing we'd get on the coord node, except that this time we'd get the stack trace.
There are a couple more aspects that deserve attention, I think:
- if we keep logging on the coord node, we should probably only log on the data nodes when the error trace is not requested; otherwise we just add redundant logging?
- if we keep logging on the coord node, the node acting as coord node may also act as a data node while serving a search request. That would lead to duplicated logging on that node, which may be OK but is not ideal.
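To make the ordering discussed here concrete, the following is a minimal sketch of "log the full trace on the data node, then clear it before transport, but only when error_trace is false". It is not the code from this PR: the class `StackTraceLoggingSketch`, the method `prepareForTransport`, the in-memory `LOG` list and the WARN prefix are all assumptions made for illustration, and it uses only the JDK instead of the Elasticsearch logger and transport classes.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the actual SearchService code.
final class StackTraceLoggingSketch {

    // Collects log lines in memory so the sketch needs no logging framework.
    static final List<String> LOG = new ArrayList<>();

    /**
     * Logs the full stack trace locally when the caller did not ask for it
     * (errorTrace == false), then strips the trace from the exception that
     * would be sent back to the coordinating node.
     */
    static Exception prepareForTransport(Exception e, boolean errorTrace, String nodeId, String shardId) {
        if (errorTrace) {
            // The trace travels with the exception, so logging it here as
            // well would only duplicate what the coordinating node sees.
            return e;
        }
        // Render the trace to a string BEFORE clearing it.
        StringWriter trace = new StringWriter();
        e.printStackTrace(new PrintWriter(trace, true));
        // Prefix with [nodeId][indexName][shard]; shardId is assumed to
        // already render as "[indexName][shard]".
        LOG.add("WARN [" + nodeId + "]" + shardId + " " + trace);
        // Drop the frames so they are not serialized over the wire.
        e.setStackTrace(new StackTraceElement[0]);
        return e;
    }
}
```

Calling `prepareForTransport(new NullPointerException("boom"), false, "node-1", "[my-index][0]")` adds one entry to `LOG` and returns the exception with an empty stack trace. Passing `true` leaves both the log and the trace untouched, which is the "no redundant logging" case from the first bullet.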
I've updated the log message to be more clear for users and raised the level to WARN on the same condition that the rest suppressed logger logs at WARN.
- Agreed, and that is the current behavior as this log is only wrapped in if (header == false) {.
- That is true. I think the shard failure logs on the coord node (see my example below) are important, but an argument could be made to remove the rest suppressed log if error_trace=false. Then again rest.suppressed is only one log line. But I imagine removing any of these logs would count as a breaking change (?), as alerts out there (like our own) might rely on them.
I wouldn't think changing the way we log is a breaking change. But I think this could be a follow-up.
@@ -32,24 +41,41 @@
public class SearchErrorTraceIT extends HttpSmokeTestCase {
    private BooleanSupplier hasStackTrace;

    private static final String loggerName = "org.elasticsearch.search.SearchService";
I'd probably use SearchService.class here.
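A small sketch of why the class literal is preferable to the hard-coded string. To stay self-contained it uses `java.util.ArrayList` as a stand-in for `SearchService`; the class `LoggerNameSketch` and its field names are hypothetical, and the exact form used in the final change is not shown in this thread.

```java
// Stand-in example: java.util.ArrayList plays the role of SearchService.
final class LoggerNameSketch {

    // Hard-coded name: silently goes stale if the class is renamed or moved.
    static final String HARD_CODED = "java.util.ArrayList";

    // Derived from the class literal: a rename or move either updates this
    // automatically (IDE refactoring) or fails to compile.
    static final String FROM_CLASS = java.util.ArrayList.class.getName();
}
```

Both constants hold the same fully qualified name today. Only the second one is checked by the compiler.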
Good call, done.
Map<String, Object> responseEntity = performRequestAndGetResponseEntityAfterDelay(searchRequest, TimeValue.ZERO);
String asyncExecutionId = (String) responseEntity.get("id");
Request request = new Request("GET", "/_async_search/" + asyncExecutionId);
while (responseEntity.get("is_running") instanceof Boolean isRunning && isRunning) {
I think you can use assertBusy here?
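For readers unfamiliar with the pattern, here is a minimal sketch of what an assertBusy-style helper does: it retries an assertion with a growing sleep until it passes or a deadline expires, which replaces an open-ended while loop. This is a simplified, hypothetical re-implementation (`BusyWaitSketch.awaitAssertion`, taking a plain `Runnable`), not the helper from the Elasticsearch test framework, whose signature and timeouts may differ.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified stand-in for an assertBusy-style test helper.
final class BusyWaitSketch {

    /**
     * Runs the assertion repeatedly until it stops throwing AssertionError
     * or maxWait has elapsed, in which case the last failure is rethrown.
     */
    static void awaitAssertion(Runnable assertion, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // assertion passed
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            // Back off exponentially, capped at 100 ms between attempts.
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }
}
```

In the test above, the body of the polling loop would become the assertion: fetch the async search status and assert that is_running is false. Unlike the bare loop, a search that never finishes then fails the test with a clear error instead of hanging.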
I think our main problem with this is that users are giving us (or we are giving ourselves) logs with the useful part of the backtrace removed. So I wonder if this patch really fixes that? Would the users see the missing part in the data node logs? Would they know how to get it and how to give it to us?
For reference, here's a brief example of what logs we have today + what I'm adding. I've thrown an NPE in SearchService to trigger them.
Setup: 3 nodes, 3 primary shards, 3 replicas.
(Coord node: we get 6 of these, one per shard)
(Coord node: after 6 above failures)
(Coord node:
(Data node: this PR's new log - we get 6 of these spread across the nodes, one per shard)
Edit - after b34afc1, the new log will match the level of the r.suppressed log, so it would be
Force-pushed from 9ef7abb to 9f527eb
(apologies for the force push, messed up my upstream merge)
left a couple of minors, LGTM otherwise
LGTM great work, thanks!
💔 Backport failed
You can use sqren/backport to manually backport by running
…lastic#125732)
We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (elastic#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
(cherry picked from commit 9f6eb1d)
# Conflicts:
# qa/smoke-test-http/src/internalClusterTest/java/org/elasticsearch/http/SearchErrorTraceIT.java
# server/src/main/java/org/elasticsearch/search/SearchService.java
# test/framework/src/main/java/org/elasticsearch/search/ErrorTraceHelper.java
# x-pack/plugin/async-search/src/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchErrorTraceIT.java
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
…nsport (#125732) (#126246) * Log stack traces on data nodes before they are cleared for transport (#125732) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)
…sport (#125732) (#126245) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)
…sport (#125732) (#126243) We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false. (cherry picked from commit 9f6eb1d)
#118266 cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false. However, all logging of exceptions happens on the coordinating node. That change therefore made it impossible to debug errors via stack trace when error_trace=false.
Here, I've logged the exception on the data node right before the stack trace is cleared. It's prefixed with [nodeId][indexName][shard] to match the rest.suppressed shard failures log on the coordinating node, allowing for easy error tracing from the coordinating node to the responsible data node.
Might this flood the (debug level) logs?
This change has the potential to log [# of shards] times for each index in a search. However, this log:
elasticsearch/server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java
Line 405 in 937bcd9