Description
What happened?
Context: We host VM scale sets (VMSS) in Azure as Azure DevOps agents, using images built from the runner-images code, which is the source for the Microsoft-hosted Azure DevOps agents. We run Windows 2022, Windows 2019, Ubuntu 22.04, and Ubuntu 20.04, and all of them are affected. The issue occurs intermittently across scale sets in multiple regions and on every operating system listed.
For at least the last month, we have intermittently seen VMs fail with the following error in the Diagnostics tab of each pool in Azure DevOps:
Pipeline agent went offline unexpectedly
The VM then goes offline and drops any job it is running. This is causing big problems for us, because VMs go offline unexpectedly and, if we are unlucky, this happens while they are running a business-critical job.
We have checked our proxy and all firewalls for blocked traffic, and no blocks are reported from our VMSS subnets to any destination or port. All outbound traffic is allowed.
We saved an unhealthy Ubuntu 22.04 agent for investigation. The logs under /agent/_diag show a 401 error (scrubbed excerpts are in the log box below; I don't want to share sensitive information).
We have similarly looked at the log files under:
- /var/log/waagent.log
- /var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/enableagent.log
- /var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/extension.log
None of them reports anything out of the ordinary.
A similar issue is reported in #4826. As described there, running ./run.sh --diagnostics also fails with an error writing to the log.
Versions
Azure DevOps Services
Images built with runner-images on October 15th 2024
Azure Pipelines Agent v3.246.0
WA Linux Agent v2.11.1.12
Environment type (Please select at least one environment where you face this issue)
- Self-Hosted
- Microsoft Hosted
- VMSS Pool
- Container
Azure DevOps Server type
dev.azure.com (formerly visualstudio.com)
Azure DevOps Server Version (if applicable)
No response
Operating system
Ubuntu 22.04, Ubuntu 20.04, Windows 2022, Windows 2019
Version control system
No response
Relevant log output
[2024-10-25 10:35:40Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z ERR MessageListener] System.TimeoutException: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z INFO MessageListener] Retriable exception: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z ERR Terminal] WRITE ERROR: 2024-10-25 10:36:15Z: Agent connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
[2024-10-25 11:20:38Z INFO MessageListener] Sleeping for 13.802 seconds before retrying.
[2024-10-25 11:21:42Z INFO MessageListener] Sleeping for 7.404 seconds before retrying.
[2024-10-25 11:21:43Z INFO MessageListener] Sent GetAgentMessage to keep alive agent 926980, session '{scrubbed}'.
[2024-10-25 11:22:13Z WARN VisualStudioServices] Authentication failed with status code 401.
[2024-10-25 11:31:36Z INFO MessageListener] No message retrieved from session '{scrubbed}' within last 30 minutes.
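To make excerpts like the one above easier to triage, here is a minimal Python sketch that parses agent log lines of the form `[timestamp LEVEL Source] message`, collects timeout and 401 entries, and flags long pauses between consecutive entries. The line format is inferred only from the excerpt above, and the names `parse_line` and `summarize` and the 5-minute gap threshold are my own assumptions, not part of the agent.

```python
import re
from datetime import datetime, timedelta

# Assumed line format, inferred from the excerpt in this issue:
# [2024-10-25 10:36:15Z ERR MessageListener] System.TimeoutException: ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z\s+"
    r"(?P<level>\w+)\s+(?P<source>\w+)\]\s?(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, source, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    return ts, m.group("level"), m.group("source"), m.group("msg")


def summarize(lines, gap=timedelta(minutes=5)):
    """Collect timeout entries, 401 entries, and pauses longer than `gap`."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    timeouts = [e for e in entries if "timed out" in e[3]]
    auth_failures = [e for e in entries if "status code 401" in e[3]]
    # Compare each entry with the next one to find long silent periods.
    gaps = [
        (prev[0], cur[0])
        for prev, cur in zip(entries, entries[1:])
        if cur[0] - prev[0] > gap
    ]
    return {"timeouts": timeouts, "auth_failures": auth_failures, "gaps": gaps}
```

Run over the excerpt above, this would report, among other things, the roughly 44-minute jump between 10:36:15Z and 11:20:38Z. Because the excerpt is scrubbed and incomplete, a gap here may reflect removed lines rather than a silent agent, so it should be checked against the full file under /agent/_diag.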
Results from ./run.sh --diagnostics:
System.UnauthorizedAccessException: Access to the path '/agent/_diag/Agent_20241025-190446-utc.log' is denied.
---> System.IO.IOException: Permission denied
--- End of inner exception stack trace ---
at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at Microsoft.VisualStudio.Services.Agent.HostTraceListener.CreatePageLogWriter() in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 178
at Microsoft.VisualStudio.Services.Agent.HostTraceListener..ctor(String logFileDirectory, String logFilePrefix, Int32 pageSizeLimit, Int32 retentionDays) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 50
at Microsoft.VisualStudio.Services.Agent.HostContext..ctor(HostType hostType, String logFile) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostContext.cs:line 135
at Microsoft.VisualStudio.Services.Agent.Listener.Program.Main(String[] args) in /mnt/vss/_work/1/s/src/Agent.Listener/Program.cs:line 28
./run.sh: line 68: 24614 Aborted (core dumped) "$DIR"/bin/Agent.Listener run $*