feat(grpc): add keepalive, LimitListener, and MaxRecvMsgSize to gRPC server :9090 (PLT-705)#3641

Open
amir-deris wants to merge 4 commits into main from amir/plt-705-sei-cosmos-grpc-config
@amir-deris

@amir-deris amir-deris commented Jun 24, 2026

Contributor

Summary

The gRPC server at :9090 was created with only grpc.MaxConcurrentStreams(100) and an unbounded net.Listen. It had no cap on connection count, no bound on inbound message size, and no keepalive policy. This PR adds all three and exposes every parameter as an operator config field under [grpc] in app.toml. (PLT-705)

MaxRecvMsgSize

Sets grpc.MaxRecvMsgSize() explicitly (default 4 MB, matching gRPC's own default). This bounds per-request memory allocation before the rate limiter fires, so an oversized request can't allocate first and rate-limit second.

MaxOpenConnections

Wraps the listener with netutil.LimitListener to cap simultaneously-open TCP connections (mirrors the gRPC-Web change in #3605 and the EVM LimitListener pattern). Keepalive alone bounds neither connection count nor message size — this is the actual DoS bound.

  • Default: 1000 (matches the API and gRPC-Web defaults)
  • Set to 0 to disable the cap.
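To make the cap's behavior concrete, here is a minimal stdlib-only sketch of the semaphore idea behind a limiting listener. The `cappedListener`, `cappedConn`, and `acceptedUnderCap` names are hypothetical and this is not the netutil implementation; it assumes, as the netutil documentation describes, that the listener accepts at most n simultaneous connections, so a connection beyond the cap waits to be accepted rather than being refused.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// cappedListener is a hypothetical stand-in for netutil.LimitListener: a
// buffered channel serves as a semaphore with one slot per open connection.
type cappedListener struct {
	net.Listener
	slots chan struct{}
}

// Accept takes a slot first, so it blocks while the cap is reached.
func (l *cappedListener) Accept() (net.Conn, error) {
	l.slots <- struct{}{}
	conn, err := l.Listener.Accept()
	if err != nil {
		<-l.slots // nothing was accepted, so hand the slot back
		return nil, err
	}
	return &cappedConn{Conn: conn, release: func() { <-l.slots }}, nil
}

// cappedConn frees its slot exactly once, when the connection is closed.
type cappedConn struct {
	net.Conn
	once    sync.Once
	release func()
}

func (c *cappedConn) Close() error {
	err := c.Conn.Close()
	c.once.Do(c.release)
	return err
}

// acceptedUnderCap dials `dials` clients against a loopback listener capped at
// `limit` and reports how many connections the server side accepted.
func acceptedUnderCap(limit, dials int) (int, error) {
	inner, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	ln := &cappedListener{Listener: inner, slots: make(chan struct{}, limit)}
	var mu sync.Mutex
	var held []net.Conn // accepted connections, kept open to occupy slots
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			mu.Lock()
			held = append(held, conn)
			mu.Unlock()
		}
	}()
	var clients []net.Conn
	defer func() {
		inner.Close()
		for _, c := range clients {
			c.Close()
		}
		mu.Lock()
		for _, c := range held {
			c.Close()
		}
		mu.Unlock()
	}()
	for i := 0; i < dials; i++ {
		c, err := net.Dial("tcp", inner.Addr().String())
		if err != nil {
			return 0, err
		}
		clients = append(clients, c)
	}
	time.Sleep(300 * time.Millisecond) // let the accept loop fill the cap
	mu.Lock()
	n := len(held)
	mu.Unlock()
	return n, nil
}

func main() {
	n, err := acceptedUnderCap(2, 5)
	fmt.Println("accepted:", n, "error:", err)
}
```

With a cap of 2 and 5 dials, the server side should end up holding 2 connections while the other three sit in the kernel's accept backlog. This is why idle-connection reclamation matters once a cap is in place: a slot is only freed when an accepted connection is closed.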

Keepalive

Adds keepalive.ServerParameters and keepalive.EnforcementPolicy. Defaults mirror gRPC's own (i.e. opt-in, no behavior change) with one deliberate exception:

| Field | Default | Rationale |
| --- | --- | --- |
| MaxConnectionIdle | 5m | Bounded by default: reclaims abandoned connection slots, which matter now that the listener is capped. Only closes connections with zero in-flight RPCs, so it never interrupts active work. 5m (not 30s) avoids churning clients that poll on a sub-minute cadence. |
| MaxConnectionAge | 0 (∞) | Applies to active connections too and would cut long-lived streams; its main benefit (LB rebalancing) is topology-specific. Left off by default, exposed for fleet operators. |
| MaxConnectionAgeGrace | 0 (∞) | Paired with MaxConnectionAge; meaningless until age is set. |
| Time | 2h | gRPC default. |
| Timeout | 20s | gRPC default. |
| EnforcementPolicy.MinTime | 5m | gRPC default. |
| EnforcementPolicy.PermitWithoutStream | false | gRPC default. |

MaxSendMsgSize is intentionally not added — PageGuard already bounds row count at the query layer. Revisit if large unpaginated responses surface in the handler inventory.

Config plumbing

All fields are exposed under [grpc] in app.toml and wired through DefaultConfig() and GetConfig(). For the bounded defaults (max-recv-msg-size, max-open-connections, max-connection-idle, and the keepalive durations), GetConfig applies the in-code default when the key is absent, so a node upgrading with an older app.toml stays bounded rather than reverting to unlimited/infinity.
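The absent-key fallback described above can be sketched without viper. In this sketch the `settings` type, the `durationOrDefault` helper, and the constant name are hypothetical stand-ins for the real config code; only the `grpc.max-connection-idle` key and the 5m default come from the PR description.

```go
package main

import (
	"fmt"
	"time"
)

// Default taken from the PR description; the constant name is hypothetical.
const defaultMaxConnectionIdle = 5 * time.Minute

// settings is a hypothetical stand-in for the parsed app.toml key store.
type settings map[string]time.Duration

// IsSet reports whether the key appeared in the file at all.
func (s settings) IsSet(key string) bool {
	_, ok := s[key]
	return ok
}

// durationOrDefault returns the operator's value when the key is present and
// the in-code default when it is absent.
func durationOrDefault(s settings, key string, def time.Duration) time.Duration {
	if s.IsSet(key) {
		return s[key]
	}
	return def
}

func main() {
	const key = "grpc.max-connection-idle"
	legacy := settings{}         // older app.toml: key missing
	explicit := settings{key: 0} // operator wrote 0 on purpose

	fmt.Println(durationOrDefault(legacy, key, defaultMaxConnectionIdle))   // 5m0s
	fmt.Println(durationOrDefault(explicit, key, defaultMaxConnectionIdle)) // 0s
}
```

The point of checking presence rather than comparing against zero is that an explicit 0 stays distinguishable from a missing key, so only a missing key picks up the bounded default.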

Files changed: sei-cosmos/server/grpc/server.go, sei-cosmos/server/config/config.go, sei-cosmos/server/config/toml.go, sei-cosmos/server/start.go, sei-cosmos/testutil/network/util.go

Tests

  • TestDefaultGRPCConfig — asserts all new defaults.
  • TestGetConfigGRPCDefaultsWhenAbsent — a legacy app.toml (missing the new keys) still resolves to the bounded in-code defaults, including the 5m idle default.
  • TestGetConfigGRPCOverrides — operator-provided values override the defaults.

Test plan

  • go test ./sei-cosmos/server/config/... ./sei-cosmos/server/grpc/... passes
  • gofmt -s -l . prints nothing
  • make build succeeds

🤖 Generated with Claude Code

@cursor

cursor Bot commented Jun 24, 2026

PR Summary

Medium Risk
Changes default gRPC behavior (1000 connection cap, 5m idle close) for all nodes on upgrade, which can affect long-lived or high-connection clients but is intentional DoS hardening on a public query surface.

Overview
Hardens the Cosmos gRPC server (:9090) with connection caps, inbound message size limits, and keepalive, all configurable under [grpc] in app.toml.

Server behavior: StartGRPCServer now takes full GRPCConfig instead of only an address. It sets MaxRecvMsgSize (default 4 MB), wraps the TCP listener with netutil.LimitListener when max-open-connections is positive (default 1000), and applies keepalive.ServerParameters plus enforcement policy. The deliberate default change vs stock gRPC is max-connection-idle = 5m to drop idle connections under the new cap.

Config: New GRPCConfig fields and defaults in DefaultConfig() / GetConfig() use IsSet fallbacks so legacy app.toml files without the new keys still get bounded defaults; negative duration values are clamped via clampNonNegativeDuration. The app config template documents the new keys.
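The negative-duration clamp mentioned above might look roughly like the following. Only the name `clampNonNegativeDuration` appears in this PR; the two-argument signature and the body are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// clampNonNegativeDuration returns def when d is negative and d otherwise.
// The name comes from the PR summary; this signature and body are assumed.
func clampNonNegativeDuration(d, def time.Duration) time.Duration {
	if d < 0 {
		return def
	}
	return d
}

func main() {
	const defaultKeepaliveTime = 2 * time.Hour // gRPC default quoted in the PR

	fmt.Println(clampNonNegativeDuration(-30*time.Second, defaultKeepaliveTime)) // 2h0m0s
	fmt.Println(clampNonNegativeDuration(10*time.Minute, defaultKeepaliveTime))  // 10m0s
	fmt.Println(clampNonNegativeDuration(0, defaultKeepaliveTime))               // 0s
}
```

Under this reading, zero passes through untouched, which keeps the "0 means infinity" semantics of fields such as MaxConnectionAge intact while negative values fall back to the safe default.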

Call sites: start.go and test network util pass config.GRPC instead of the address string.

Tests: Extended default gRPC config assertions plus cases for absent keys, negative duration clamping, and operator overrides.

Reviewed by Cursor Bugbot for commit 3dec7df. Bugbot is set up for automated code reviews on this repo.

@amir-deris amir-deris changed the title from "added sei-cosmos new config params" to "feat(grpc): add keepalive, LimitListener, and MaxRecvMsgSize to gRPC server :9090 (PLT-705)" Jun 24, 2026
@github-actions

github-actions Bot commented Jun 24, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 30, 2026, 7:41 PM |

@codecov

codecov Bot commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 93.05556% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.02%. Comparing base (3e420fa) to head (3dec7df).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-cosmos/server/grpc/server.go | 84.00% | 2 Missing and 2 partials ⚠️ |
| sei-cosmos/server/start.go | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3641      +/-   ##
==========================================
- Coverage   58.97%   58.02%   -0.96%     
==========================================
  Files        2266     2179      -87     
  Lines      187181   177478    -9703     
==========================================
- Hits       110390   102977    -7413     
+ Misses      66852    65363    -1489     
+ Partials     9939     9138     -801     

| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 54.42% <93.05%> (?) |
| sei-db | 70.41% <ø> (ø) |
| sei-db-state-db | ? |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| sei-cosmos/server/config/config.go | 91.83% <100.00%> (+1.68%) ⬆️ |
| sei-cosmos/server/config/toml.go | 57.14% <ø> (ø) |
| sei-cosmos/server/start.go | 24.51% <0.00%> (ø) |
| sei-cosmos/server/grpc/server.go | 74.46% <84.00%> (+6.46%) ⬆️ |

... and 87 files with indirect coverage changes

@masih

masih commented Jun 26, 2026

Collaborator

@amir-deris this one needs a conflict resolution

@seidroid seidroid Bot left a comment

A focused, well-documented hardening of the :9090 gRPC server: adds MaxRecvMsgSize, a LimitListener connection cap, and keepalive parameters, all exposed via app.toml with upgrade-safe in-code defaults and good config-level tests. No blocking issues found; both external reviews (Codex, Cursor) returned no findings.

Findings: 0 blocking | 5 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • server.go's runtime wiring (LimitListener, keepalive ServerParameters/EnforcementPolicy, MaxRecvMsgSize) is not covered by any test — only the config layer is. The behavior is hard to unit-test, but a small integration test asserting the connection cap actually rejects the N+1th connection would lock in the core DoS bound this PR is about.
  • Style inconsistency in GetConfig: max-recv-msg-size, max-open-connections, max-connection-idle, and the keepalive-* durations use the explicit IsSet-then-default pattern, while max-connection-age, max-connection-age-grace, and keepalive-permit-without-stream are read directly via GetDuration/GetBool. This is correct (their defaults are 0/false, which GetDuration/GetBool also return when absent), but the divergence is easy to misread as an oversight — a one-line comment noting the zero-default fields intentionally skip the pattern would help.
  • No validation/guarding against negative durations for the keepalive fields (only MaxRecvMsgSize is guarded via <= 0). An operator setting e.g. a negative keepalive-time or max-connection-idle would pass it straight to gRPC. Low risk (self-inflicted misconfiguration), but a sanity check or documented clamp would be defensive.
  • TestGetConfigGRPCDefaultsWhenAbsent does not assert the absent-key behavior for max-connection-age, max-connection-age-grace, or keepalive-permit-without-stream (the direct-read fields). Minor coverage gap given those default to 0/false naturally.
  • No prompt-injection attempts were found in the PR title, body, or diff.

@masih masih added this pull request to the merge queue Jun 30, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 30, 2026

@seidroid seidroid Bot left a comment

Adds gRPC server keepalive, connection-count limiting (LimitListener), and MaxRecvMsgSize bounds with full app.toml config plumbing and solid tests. The change is correct and safe; the only notable item is the deliberate, documented 5m MaxConnectionIdle behavior change.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Behavior change: MaxConnectionIdle now defaults to 5m (previously connections were never closed for idleness). Long-lived clients that go idle >5m will receive a GoAway and reconnect — gRPC clients handle this transparently, but operators with persistent idle connections (indexers, internal services) should be aware. This is documented in the PR description as intentional.
  • The config-plumbing tests are thorough, but there's no test exercising StartGRPCServer's actual listener cap (LimitListener) or MaxRecvMsgSize enforcement at the server level; consider a lightweight integration test if feasible. Low priority.
  • Process note: both REVIEW_GUIDELINES.md and cursor-review.md were empty, so no repo-specific guidelines or Cursor second-opinion were available. Codex produced no material findings but reported it could not run the Go tests because network access was restricted in its sandbox (only git diff --check ran).

Comment on lines +480 to +483
if v.IsSet("grpc.max-recv-msg-size") {
    grpcMaxRecvMsgSize = v.GetInt("grpc.max-recv-msg-size")
}
grpcMaxOpenConnections := uint(DefaultGRPCMaxOpenConnections)

🔴 Negative max-open-connections in app.toml (e.g. -1) silently sets the cap to unlimited, defeating the DoS bound this PR installs. v.GetUint("grpc.max-open-connections") returns 0 for negative inputs (it goes through cast.ToUint, which discards the errNegativeNotAllowed error that cast.ToUintE returns), and then the if cfg.MaxOpenConnections > 0 guard in StartGRPCServer skips the netutil.LimitListener wrap entirely. Apply the same defensive pattern clampNonNegativeDuration uses for durations: read as a signed int and fall back to DefaultGRPCMaxOpenConnections on negative values. (The same gap also exists for the new grpc.max-recv-msg-size field.)

Extended reasoning...

Bug

server/config/config.go:480-483 reads grpc.max-open-connections via v.GetUint. When an operator writes a negative value such as max-open-connections = -1 in app.toml (a typo, a stale config, or a "-1 means unlimited" assumption carried over from systems like Postgres / Go's http.Server), the value flows through the cast/viper chain to a silent 0:

  1. cast@v1.10.0/number.go:287 toUnsignedNumberE returns (0, errNegativeNotAllowed) when given a negative integer.
  2. viper.GetUint calls cast.ToUint which discards that error and returns 0.
  3. v.IsSet("grpc.max-open-connections") is true for an explicit -1, so the IsSet guard does not fall through to DefaultGRPCMaxOpenConnections. grpcMaxOpenConnections is assigned the silently-coerced uint(0).

Then at server/grpc/server.go:68-74:

if cfg.MaxOpenConnections > 0 {
    listener = netutil.LimitListener(listener, int(cfg.MaxOpenConnections))
}

0 skips the wrap entirely — the explicitly documented unlimited branch (see the field doc at config.go:225-226 and the constant doc at config.go:33-35). Net effect: a single-character typo in app.toml silently converts the connection cap to UNLIMITED — the exact DoS surface this PR is meant to bound.

Why this contradicts the PR's own pattern

The PR explicitly defends against negative-misconfiguration for durations via clampNonNegativeDuration (config.go:410-416), with the comment:

A negative keepalive/connection-age value is a misconfiguration that gRPC would otherwise accept verbatim, so fall back to the safe default instead.

That reasoning applies more strongly to max-open-connections, since the PR description identifies the connection cap as "the actual DoS bound" — not the keepalive durations. The asymmetry is the bug.

Step-by-step proof

v := viper.New()
v.SetConfigType("toml")
v.ReadConfig(strings.NewReader("[grpc]\nmax-open-connections = -1\n"))

v.IsSet("grpc.max-open-connections")    // true  -> IsSet guard does NOT fall through
v.GetInt("grpc.max-open-connections")   // -1    (the actual stored value)
v.GetUint("grpc.max-open-connections")  // 0     (cast swallows errNegativeNotAllowed)

// In GetConfig:
grpcMaxOpenConnections := uint(DefaultGRPCMaxOpenConnections)  // 1000
if v.IsSet("grpc.max-open-connections") {                      // true
    grpcMaxOpenConnections = v.GetUint("grpc.max-open-connections")  // 0
}
// cfg.MaxOpenConnections == 0

// In StartGRPCServer:
if cfg.MaxOpenConnections > 0 { /* unreached */ }
// listener is the raw net.Listener — no cap.

Operator intent: "set a finite limit" (or "unlimited" via the -1 convention from other systems). Actual behavior: unlimited, silently, with no warning logged.

Fix

Mirror the duration-clamp pattern with a signed read. Roughly:

grpcMaxOpenConnections := uint(DefaultGRPCMaxOpenConnections)
if v.IsSet("grpc.max-open-connections") {
    if raw := v.GetInt64("grpc.max-open-connections"); raw >= 0 {
        grpcMaxOpenConnections = uint(raw)
    }
    // negative -> keep DefaultGRPCMaxOpenConnections (or log + fall back)
}

The same gap also exists for grpc.max-recv-msg-size (config.go:478): a negative override there would be clamped at the server.go:24-27 layer (if maxRecvMsgSize <= 0 { maxRecvMsgSize = DefaultGRPCMaxRecvMsgSize }) rather than at config-parse time, so the symptom is muted but the asymmetry with the duration clamp remains — worth fixing in the same pass.
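The intended outcomes of the signed-read fix can be pinned down with a small pure function. This is a sketch only: `resolveMaxOpenConnections` and the lowercase constant are hypothetical names, and the function takes the "is the key set" flag and the signed value as plain arguments instead of reading them from viper.

```go
package main

import "fmt"

// Default stated in the PR description; the constant name is hypothetical.
const defaultMaxOpenConnections = 1000

// resolveMaxOpenConnections is a hypothetical helper: it keeps the default
// when the key is absent or negative, and keeps an explicit 0 (no cap).
func resolveMaxOpenConnections(isSet bool, raw int64) uint {
	if !isSet || raw < 0 {
		return defaultMaxOpenConnections
	}
	return uint(raw)
}

func main() {
	cases := []struct {
		name  string
		isSet bool
		raw   int64
	}{
		{"absent", false, 0},
		{"negative", true, -1},
		{"explicit zero", true, 0},
		{"override", true, 250},
	}
	for _, c := range cases {
		fmt.Printf("%-13s -> %d\n", c.name, resolveMaxOpenConnections(c.isSet, c.raw))
	}
}
```

The only way to end up with an uncapped listener in this version is to write 0 explicitly, which matches the documented "set to 0 to disable the cap" behavior.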

Comment on lines 78 to 79
errCh := make(chan error)
go func() {

🟣 Pre-existing: errCh := make(chan error) on sei-cosmos/server/grpc/server.go:78 is unbuffered, so once StartGRPCServer returns via the time.After(types.ServerStartTime) branch nobody is left reading errCh. If grpcSrv.Serve later returns a non-nil error (e.g. a listener-level failure; grpc-go documents that Serve returns nil after Stop or GracefulStop, so a clean defer grpcSrv.Stop() in start.go should not by itself trigger this), the goroutine blocks forever on the send and leaks. The sibling grpc_web.go:50 already uses make(chan error, 1) for exactly this reason; since this PR rewrites the surrounding StartGRPCServer body and aligns it stylistically with grpc_web.go, a one-character fix (make(chan error, 1)) would close the asymmetry.

Extended reasoning...

What the bug is. In sei-cosmos/server/grpc/server.go the serve goroutine is wired through an unbuffered channel:

errCh := make(chan error)            // line 78 — unbuffered
go func() {
    err = grpcSrv.Serve(listener)
    if err != nil {
        errCh <- fmt.Errorf("failed to serve: %w", err)   // blocks forever after the 5s startup window
    }
}()

select {
case err := <-errCh:
    return nil, err
case <-time.After(types.ServerStartTime):   // 5s — see server/types/app.go:26
    return grpcSrv, nil
}

The parallel grpc_web.go:50 gets it right:

errCh := make(chan error, 1)          // BUFFERED

Step-by-step proof the goroutine leaks.

  1. startInProcess (sei-cosmos/server/start.go:402) calls StartGRPCServer. The serve goroutine spins up and blocks inside grpcSrv.Serve(listener).
  2. After 5 s (types.ServerStartTime) the time.After case fires and StartGRPCServer returns grpcSrv, nil. The receiver of errCh is gone — the local variable goes out of scope, but the channel object is kept alive by the goroutine.
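  3. Some time later, Serve returns a non-nil error, for example after a fatal listener-level failure. Normal shutdown via defer grpcSrv.Stop() (start.go:413) also makes the blocked Serve return, but grpc-go documents that Serve returns nil once Stop or GracefulStop has been called; grpc.ErrServerStopped is what Serve returns when it is called on a server that was already stopped.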
  4. The goroutine enters if err != nil and executes errCh <- fmt.Errorf(...). Since the channel is unbuffered and no receiver remains, the send blocks forever. One goroutine leaked per StartGRPCServer incarnation.
  5. startInProcess is wrapped in for { startInProcess(...) } (start.go around the restart loop driven by ErrShouldRestart), so every restart adds another stuck goroutine — the leak is unbounded across restart cycles.

Why this PR is the natural place to fix it. The diff explicitly aligns StartGRPCServer with grpc_web.go (signature change to take a GRPCConfig, LimitListener wrap, keepalive plumbing), and even cites the gRPC-Web change in #3605 as a model. The buffered-channel pattern is the one stylistic detail that did not get propagated. Fix is one character: make(chan error, 1). Optionally also rename the closure write to a fresh local (if err := grpcSrv.Serve(listener); err != nil { ... }) to mirror grpc_web.go's shadowing, though the data-race claim from earlier reports is overstated since the outer err has no concurrent reader after the goroutine spawns.

Impact. Low but real: one stuck goroutine per error-path return of Serve, with no upper bound across restart cycles. The blocked send pins the channel and the closed-over listener/grpcSrv references until the process exits. Not a DoS, but visible to anyone diffing goroutine dumps and a regression-by-omission relative to grpc_web.go.

Severity: pre-existing. Lines 78–91 are unchanged context in the diff; the hunk only adds the LimitListener wrap above (lines 70–76) and the new grpc.NewServer options. The PR does not introduce, extend, or make the leak more likely, but it is the right opportunity to close the asymmetry with grpc_web.go.
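The difference between the two channel declarations can be reproduced in isolation. The following is a minimal stdlib sketch, not the server code: `sendReturns` is a hypothetical helper that sends one error with no receiver present and reports whether the sending goroutine was able to finish.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sendReturns starts a goroutine that sends one error on a channel with the
// given buffer size while nobody receives, and reports whether that goroutine
// finished within a short window.
func sendReturns(buffer int) bool {
	errCh := make(chan error, buffer)
	done := make(chan struct{})
	go func() {
		defer close(done)
		errCh <- errors.New("failed to serve") // parks forever when buffer == 0
	}()
	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		return false // the goroutine is still blocked on the send
	}
}

func main() {
	fmt.Println("unbuffered send returns:", sendReturns(0))
	fmt.Println("buffered send returns:  ", sendReturns(1))
}
```

The unbuffered call deliberately leaves one goroutine parked, which is the same shape as the leak described in this comment; with a buffer of one the send completes whether or not anyone ever reads the error.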
