Last week, a few bad destination nodes made the token-push run time out. We saw the following (edited) log lines emitted:
2025-02-20 09:08:50 {"account":"sbndci","destPath":"/tmp/vt_****","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:50 {"account":"sbndci","destPath":"/tmp/vt_***-sbnd_ci","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:01 {"caller":"runAdminNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"sbnd-data-globus_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"hypot-gwms-test_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"dune-ci_ci","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"annie_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:07:52 {"experiment":"sbnd","level":"error","msg":"Error pushing vault tokens to destination node","node":"NODE","role":"production","time":"2025-02-20T09:07:52-06:00"}
So a number of notification handlers appeared to be blocked waiting on a single bad service. We need to refactor the notification handler so that if one service times out, we still go ahead and send the remaining messages (I suspect the service-level messages are fine and it's only the admin one that isn't), and then clean up the hanging goroutines properly.
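A minimal sketch of one possible approach, not the actual managed-tokens API: `runServiceNotificationHandlers`, `notificationSender`, and the timeout values are all hypothetical names, and the simulated slow service in `main` is only for illustration. The idea is to give each service its own timeout derived from the parent context, so one bad service only loses its own notification, and to wait on all goroutines so nothing is left hanging.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// notificationSender is a stand-in (assumption, not the real interface) for
// whatever actually delivers a service-level message; it is expected to
// respect context cancellation.
type notificationSender func(ctx context.Context, service string) error

// runServiceNotificationHandlers fans out one goroutine per service. Each
// goroutine gets its own timeout, so a single slow service only loses its own
// message while the others are still sent. The WaitGroup plus per-service
// contexts ensure every goroutine exits before this function returns.
func runServiceNotificationHandlers(ctx context.Context, services []string, send notificationSender, perServiceTimeout time.Duration) {
	var wg sync.WaitGroup
	for _, service := range services {
		wg.Add(1)
		go func(service string) {
			defer wg.Done()
			// Per-service timeout derived from the parent context.
			sCtx, cancel := context.WithTimeout(ctx, perServiceTimeout)
			defer cancel()
			if err := send(sCtx, service); err != nil {
				// Log and move on; do not block the other services.
				log.Printf("could not send notification for service %s: %v", service, err)
			}
		}(service)
	}
	wg.Wait() // all goroutines have exited; nothing left hanging
}

func main() {
	// Service names taken from the log lines above; the slow behavior of the
	// first one is simulated to show that the others still get through.
	services := []string{"sbnd-data-globus_production", "dune-ci_ci", "annie_production"}
	send := func(ctx context.Context, service string) error {
		if service == "sbnd-data-globus_production" {
			select {
			case <-time.After(5 * time.Second):
				return nil
			case <-ctx.Done():
				return ctx.Err() // times out after 2s, without stalling the rest
			}
		}
		return nil
	}
	runServiceNotificationHandlers(context.Background(), services, send, 2*time.Second)
}
```

The same pattern would apply to the admin handler: instead of one shared timeout gating every message, each send gets its own bounded context, and the caller only waits for goroutines that are guaranteed to return.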