Synchronization issue when context times out #119

Open
shreyb opened this issue Feb 24, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@shreyb
Collaborator

shreyb commented Feb 24, 2025

Last week, a few bad destination nodes made the token-push run time out. We saw the following (edited) log lines emitted:

2025-02-20 09:08:50	{"account":"sbndci","destPath":"/tmp/vt_****","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:50	{"account":"sbndci","destPath":"/tmp/vt_***-sbnd_ci","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:01	{"caller":"runAdminNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"sbnd-data-globus_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"hypot-gwms-test_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"dune-ci_ci","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"annie_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:07:52	{"experiment":"sbnd","level":"error","msg":"Error pushing vault tokens to destination node","node":"NODE","role":"production","time":"2025-02-20T09:07:52-06:00"}

So a number of notification handlers appear to have been waiting on a single bad service. We need to refactor the notification handler so that if one service times out, we go ahead and send the rest of the messages (I suspect the service-level messages are fine and only the admin handler is affected), and then clean up the hanging goroutines properly.

shreyb added the bug label Feb 24, 2025