Last week, a few bad destination nodes made the token-push run time out. We saw the following (edited) log lines emitted:
2025-02-20 09:08:50 {"account":"sbndci","destPath":"/tmp/vt_****","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:50 {"account":"sbndci","destPath":"/tmp/vt_***-sbnd_ci","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:01 {"caller":"runAdminNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"sbnd-data-globus_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"hypot-gwms-test_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"dune-ci_ci","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01 {"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"annie_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:07:52 {"experiment":"sbnd","level":"error","msg":"Error pushing vault tokens to destination node","node":"NODE","role":"production","time":"2025-02-20T09:07:52-06:00"}
So a number of notification handlers appeared to be blocked waiting on a single bad service. We need to refactor the notification handler so that if one service times out, we still go ahead and send the remaining messages (I suspect the service-level messages are fine and it's only the admin one that isn't), and then clean up the hanging goroutines properly.
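A minimal sketch of one possible approach, not the actual managed-tokens API: `runServiceNotificationHandlers`, `notificationSender`, and the timeout values are all hypothetical names, and the simulated slow service in `main` is only for illustration. The idea is to give each service its own timeout derived from the parent context, so one bad service only loses its own notification, and to wait on all goroutines so nothing is left hanging.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// notificationSender is a stand-in (assumption, not the real interface) for
// whatever actually delivers a service-level message; it is expected to
// respect context cancellation.
type notificationSender func(ctx context.Context, service string) error

// runServiceNotificationHandlers fans out one goroutine per service. Each
// goroutine gets its own timeout, so a single slow service only loses its own
// message while the others are still sent. The WaitGroup plus per-service
// contexts ensure every goroutine exits before this function returns.
func runServiceNotificationHandlers(ctx context.Context, services []string, send notificationSender, perServiceTimeout time.Duration) {
	var wg sync.WaitGroup
	for _, service := range services {
		wg.Add(1)
		go func(service string) {
			defer wg.Done()
			// Per-service timeout derived from the parent context.
			sCtx, cancel := context.WithTimeout(ctx, perServiceTimeout)
			defer cancel()
			if err := send(sCtx, service); err != nil {
				// Log and move on; do not block the other services.
				log.Printf("could not send notification for service %s: %v", service, err)
			}
		}(service)
	}
	wg.Wait() // all goroutines have exited; nothing left hanging
}

func main() {
	// Service names taken from the log lines above; the slow behavior of the
	// first one is simulated to show that the others still get through.
	services := []string{"sbnd-data-globus_production", "dune-ci_ci", "annie_production"}
	send := func(ctx context.Context, service string) error {
		if service == "sbnd-data-globus_production" {
			select {
			case <-time.After(5 * time.Second):
				return nil
			case <-ctx.Done():
				return ctx.Err() // times out after 2s, without stalling the rest
			}
		}
		return nil
	}
	runServiceNotificationHandlers(context.Background(), services, send, 2*time.Second)
}
```

The same pattern would apply to the admin handler: instead of one shared timeout gating every message, each send gets its own bounded context, and the caller only waits for goroutines that are guaranteed to return.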