
[BUG]: VMSS agents go offline unexpectedly due to 401 auth error #5023

@angaaruriakhil

What happened?

Context: We host Azure Virtual Machine Scale Sets (VMSS) as Azure DevOps agents, using images built from the runner-images code, which is the same source used for the Microsoft-hosted Azure DevOps agents. We run Windows 2022, Windows 2019, Ubuntu 22.04, and Ubuntu 20.04, and all of them are affected. The issue occurs intermittently across scale sets hosted in multiple regions.

For at least the last month, VMs have been failing intermittently with the following error, shown in the Diagnostics tab of each pool in Azure DevOps:

Pipeline agent went offline unexpectedly

The VM then goes offline and any job it was running is skipped. This is causing big problems for us: VMs go offline unexpectedly, and if we are unlucky this happens while they are running a business-critical job.

We have checked our proxy and all firewalls for blocked traffic, and no blocks are reported from our VMSS subnets to any destination or port. All outbound traffic is allowed.

We saved an unhealthy Ubuntu 22.04 agent for investigation. The logs under /agent/_diag show a 401 error. Scrubbed excerpts are in the log box, as I don't want to share sensitive information.

We have similarly looked at the log files under:

  • /var/log/waagent.log
  • /var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/enableagent.log
  • /var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/extension.log

None of them report anything out of the ordinary.
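
For anyone wanting to repeat this check across several saved VMs, here is a minimal Python sketch that scans a list of log files for lines that look related to this failure. The search patterns are assumptions taken from the excerpts in this report, and `find_suspect_lines` is a hypothetical helper name, not part of the agent or its tooling.

```python
import re
from pathlib import Path

# Assumption: these phrases are taken from the log excerpts in this report;
# extend the pattern if your logs word the failure differently.
PATTERNS = re.compile(r"status code 401|timed out|went offline", re.IGNORECASE)


def find_suspect_lines(paths):
    """Return (path, line_number, text) for every line matching PATTERNS."""
    hits = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip logs that do not exist on this image
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PATTERNS.search(line):
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # The three paths listed above; reading them usually needs root.
    logs = [
        "/var/log/waagent.log",
        "/var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/enableagent.log",
        "/var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/extension.log",
    ]
    for path, number, text in find_suspect_lines(logs):
        print(f"{path}:{number}: {text}")
```

An empty result only means none of the assumed phrases appear in those files; it does not rule out a problem logged in different words.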

A similar issue has been reported in #4826.

As reported in #4826, running ./run.sh --diagnostics also fails with an error writing to the log.

Versions

Azure DevOps Services
Images built with runner-images on October 15th 2024
Azure Pipelines Agent v3.246.0
WA Linux Agent v2.11.1.12

Environment type (Please select at least one environment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operating system

Ubuntu 22.04, Ubuntu 20.04, Windows 2022, Windows 2019

Version control system

No response

Relevant log output

[2024-10-25 10:35:40Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z ERR  MessageListener] System.TimeoutException: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z INFO MessageListener] Retriable exception: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z ERR  Terminal] WRITE ERROR: 2024-10-25 10:36:15Z: Agent connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
[2024-10-25 11:20:38Z INFO MessageListener] Sleeping for 13.802 seconds before retrying.
[2024-10-25 11:21:42Z INFO MessageListener] Sleeping for 7.404 seconds before retrying.
[2024-10-25 11:21:43Z INFO MessageListener] Sent GetAgentMessage to keep alive agent 926980, session '{scrubbed}'.
[2024-10-25 11:22:13Z WARN VisualStudioServices] Authentication failed with status code 401.

[2024-10-25 11:31:36Z INFO MessageListener] No message retrieved from session '{scrubbed}' within last 30 minutes.
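
The timestamps in the excerpt above are not evenly spaced, so it can help to list the silent periods between consecutive entries. Below is a minimal Python sketch for that. It assumes every line follows the `[timestamp LEVEL Source] message` layout seen here; `parse_line` and `find_gaps` are hypothetical helper names, and the 5-minute threshold is an arbitrary choice.

```python
import re
from datetime import datetime, timedelta

# Assumption: lines look like "[2024-10-25 10:36:15Z ERR  MessageListener] text".
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z\s+(\w+)\s+(\w+)\]\s?(.*)$"
)


def parse_line(line):
    """Return (timestamp, level, source, message), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, level, source, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), level, source, message


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current) timestamp pairs that are further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # ignore stack traces and other unstructured lines
        current = parsed[0]
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

In the excerpt above, the longest silent period is between 10:36:15Z and 11:20:38Z. Since the excerpt is scrubbed, lines may have been omitted, so a gap here does not prove the agent logged nothing during that time.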
Results from ./run.sh --diagnostics:

System.UnauthorizedAccessException: Access to the path '/agent/_diag/Agent_20241025-190446-utc.log' is denied.
---> System.IO.IOException: Permission denied
--- End of inner exception stack trace ---
at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at Microsoft.VisualStudio.Services.Agent.HostTraceListener.CreatePageLogWriter() in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 178
at Microsoft.VisualStudio.Services.Agent.HostTraceListener..ctor(String logFileDirectory, String logFilePrefix, Int32 pageSizeLimit, Int32 retentionDays) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 50
at Microsoft.VisualStudio.Services.Agent.HostContext..ctor(HostType hostType, String logFile) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostContext.cs:line 135
at Microsoft.VisualStudio.Services.Agent.Listener.Program.Main(String[] args) in /mnt/vss/_work/1/s/src/Agent.Listener/Program.cs:line 28
./run.sh: line 68: 24614 Aborted (core dumped) "$DIR"/bin/Agent.Listener run $*
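
The trace above shows the listener failing to create a log file under /agent/_diag. One possible cause, which is an assumption and not confirmed in this report, is that ./run.sh was started as a different user from the one that owns the agent directory. A minimal Python sketch to check this on a Linux VM follows; `describe_write_access` is a hypothetical helper name.

```python
import os
from pathlib import Path


def describe_write_access(directory):
    """Report whether the current user can create files in directory (Unix only)."""
    path = Path(directory)
    if not path.is_dir():
        return {"exists": False, "writable": False}
    info = path.stat()
    return {
        "exists": True,
        # Creating a file needs both write and search (execute) permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
        "owner_uid": info.st_uid,
        "current_uid": os.getuid(),
        "mode": oct(info.st_mode & 0o777),
    }


if __name__ == "__main__":
    # Path taken from the stack trace above.
    print(describe_write_access("/agent/_diag"))
```

If `writable` is false and `owner_uid` differs from `current_uid`, rerunning the diagnostics as the owning user would be a reasonable next step.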
