Skip to content

bug: token refresh failure after container inactivity in Serverless environment using ycm.WithCredentials() #1465

Open
@jaasco

Description

@jaasco

Bug Report

YDB GO SDK version:
3.76.5

Environment

Yandex Cloud Serverless Container
OS: Linux (within container)
CPU Architecture: x86_64

Current behavior:

When using ycm.WithCredentials() from the package github.com/ydb-platform/ydb-go-yc-metadata in a Serverless Container with provision_policy { min_instances = 1 }, the application runs normally for 12 hours. After 12 hours, the following error occurs during YDB requests:
transport/Unauthenticated (code = 16, source error = "rpc error: code = Unauthenticated desc = Unauthenticated", address: "ydb.serverless.yandexcloud.net:2135")
This issue arises because the Go routine responsible for refreshing the token stops when the container becomes inactive, and it does not restart upon the container becoming active again.

Expected behavior:

The token refresh routine should continue to function, even after container inactivity, ensuring that token expiration is properly handled and avoiding unauthenticated errors.

Steps to reproduce:

  1. Deploy an application using YDB GO SDK with ycm.WithCredentials() in a Yandex Cloud Serverless Container.
package main

import (
  "context"
  "os"

  "github.com/ydb-platform/ydb-go-sdk/v3"
  ycm "github.com/ydb-platform/ydb-go-yc-metadata"
)

func main() {
  ctx, cancel := context.WithCancel(context.Background())
  defer cancel()
  db, err := ydb.Open(ctx,
    os.Getenv("YDB_CONNECTION_STRING"),
    ycm.WithCredentials(),
    ycm.WithInternalCA(), // append Yandex Cloud certificates
  )
  if err != nil {
    panic(err)
  }
  defer db.Close(ctx)
  ...
}
  1. Set provision_policy { min_instances = 1 } for the container.
  2. Wait for 12 hours while the container becomes inactive and is reactivated.
  3. Observe the token expiration and the subsequent Unauthenticated errors in logs when attempting to access YDB.

Related code:

func (m *InstanceServiceAccountCredentials) refreshLoop() {
	defer m.timer.Stop()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	for {
		select {
		case <-m.done:
			// Set up error
			m.mu.Lock()
			m.token, m.err = "", &createTokenError{
				Cause:  errClosed,
				Reason: errClosed.Error(),
			}
			m.mu.Unlock()
			return
		case <-m.timer.C:
			m.refreshOnce(ctx)
		}
	}
}

Other information:

The issue was mitigated in our project by rewriting the logic around ycm.WithCredentials() to ensure the token refresh is retriggered even after container inactivity like this:

func (m *InstanceServiceAccountLazyCredentials) Token(ctx context.Context) (token string, err error) {
	onDone := trace.TraceOnGetToken(m.trace, &ctx)
	defer func() {
		onDone(m.token, m.err)
	}()

	// check token before launching loop
	m.mu.RLock()
	token, err = m.token, m.err
	expiry := m.expiry
	m.mu.RUnlock()

	fmt.Printf("Token() call: token is \"%s\", err is %v, expiry is %s\n", masked(token), err, expiry.Format("2006-01-02 15:04:05"))

	// there was some error with the token, so it is unusable now. Avoid refreshing as it might cause log explosion
	if err != nil {
		fmt.Printf("Token() call: err is %v, return empty token\n", err)
		return "", err
	}

	fmt.Printf("Token() call: err is nil, expiry is %s, time.Now() is %s\n", expiry.Format("2006-01-02 15:04:05"), time.Now().Format("2006-01-02 15:04:05"))

	// if the token is empty or is near its expiration time, refresh it
	if token == "" || time.Now().Unix() > expiry.Unix() {
		fmt.Printf("Token() call: calling refresh\n")
		m.refreshOnce(ctx) // updates m.token, m.err, m.expiry
	} else {
		fmt.Printf("Token() call: token is not empty, now is %d, expiry is %d\n", time.Now().Unix(), expiry.Unix())
	}

	return m.token, m.err
}

...

// Perform single refresh iteration.
// If token was obtained:
// 1. Clear current err;
// 2. Set up new token and expiration;
// Otherwise, if current token has expired, clear it and set up err.
func (m *InstanceServiceAccountLazyCredentials) refreshOnce(ctx context.Context) {
	now := time.Now()
	fmt.Printf("Token() call: calling metaCall %s\n", m.metadataURL)
	tok, err := m.metaCall(ctx, m.metadataURL, false)
	fmt.Printf("Token() call: metadata token: %s, expires in %v, error: %v\n", masked(tok.Token), tok.ExpiresIn, err)

	// Call has been performed, now updating fields
	m.mu.Lock()
	defer m.mu.Unlock()

	if err != nil {
		m.token = ""
		m.err = err
		return
	}

	// Renew values.
	m.token, m.expiry, m.err = tok.Token, now.Add(tok.ExpiresIn).Add(-time.Minute*time.Duration(715)), nil
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions