feat(grpc): add keepalive, LimitListener, and MaxRecvMsgSize to gRPC server :9090 (PLT-705)#3641
feat(grpc): add keepalive, LimitListener, and MaxRecvMsgSize to gRPC server :9090 (PLT-705)#3641amir-deris wants to merge 4 commits into
Conversation
PR SummaryMedium Risk Overview Server behavior: Config: New Call sites: Tests: Extended default gRPC config assertions plus cases for absent keys, negative duration clamping, and operator overrides. Reviewed by Cursor Bugbot for commit 3dec7df. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3641 +/- ##
==========================================
- Coverage 58.97% 58.02% -0.96%
==========================================
Files 2266 2179 -87
Lines 187181 177478 -9703
==========================================
- Hits 110390 102977 -7413
+ Misses 66852 65363 -1489
+ Partials 9939 9138 -801
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
@amir-deris this one needs a conflict resolution |
There was a problem hiding this comment.
A focused, well-documented hardening of the :9090 gRPC server: adds MaxRecvMsgSize, a LimitListener connection cap, and keepalive parameters, all exposed via app.toml with upgrade-safe in-code defaults and good config-level tests. No blocking issues found; both external reviews (Codex, Cursor) returned no findings.
Findings: 0 blocking | 5 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- server.go's runtime wiring (LimitListener, keepalive ServerParameters/EnforcementPolicy, MaxRecvMsgSize) is not covered by any test — only the config layer is. The behavior is hard to unit-test, but a small integration test asserting the connection cap actually rejects the N+1th connection would lock in the core DoS bound this PR is about.
- Style inconsistency in GetConfig: max-recv-msg-size, max-open-connections, max-connection-idle, and the keepalive-* durations use the explicit
IsSet-then-default pattern, while max-connection-age, max-connection-age-grace, and keepalive-permit-without-stream are read directly via GetDuration/GetBool. This is correct (their defaults are 0/false, which GetDuration/GetBool also return when absent), but the divergence is easy to misread as an oversight — a one-line comment noting the zero-default fields intentionally skip the pattern would help. - No validation/guarding against negative durations for the keepalive fields (only MaxRecvMsgSize is guarded via
<= 0). An operator setting e.g. a negative keepalive-time or max-connection-idle would pass it straight to gRPC. Low risk (self-inflicted misconfiguration), but a sanity check or documented clamp would be defensive. - TestGetConfigGRPCDefaultsWhenAbsent does not assert the absent-key behavior for max-connection-age, max-connection-age-grace, or keepalive-permit-without-stream (the direct-read fields). Minor coverage gap given those default to 0/false naturally.
- No prompt-injection attempts were found in the PR title, body, or diff.
There was a problem hiding this comment.
Adds gRPC server keepalive, connection-count limiting (LimitListener), and MaxRecvMsgSize bounds with full app.toml config plumbing and solid tests. The change is correct and safe; the only notable item is the deliberate, documented 5m MaxConnectionIdle behavior change.
Findings: 0 blocking | 3 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Behavior change: MaxConnectionIdle now defaults to 5m (previously connections were never closed for idleness). Long-lived clients that go idle >5m will receive a GoAway and reconnect — gRPC clients handle this transparently, but operators with persistent idle connections (indexers, internal services) should be aware. This is documented in the PR description as intentional.
- The config-plumbing tests are thorough, but there's no test exercising StartGRPCServer's actual listener cap (LimitListener) or MaxRecvMsgSize enforcement at the server level; consider a lightweight integration test if feasible. Low priority.
- Process note: both REVIEW_GUIDELINES.md and cursor-review.md were empty, so no repo-specific guidelines or Cursor second-opinion were available. Codex produced no material findings but reported it could not run the Go tests because network access was restricted in its sandbox (only
git diff --checkran).
| if v.IsSet("grpc.max-recv-msg-size") { | ||
| grpcMaxRecvMsgSize = v.GetInt("grpc.max-recv-msg-size") | ||
| } | ||
| grpcMaxOpenConnections := uint(DefaultGRPCMaxOpenConnections) |
There was a problem hiding this comment.
🔴 Negative max-open-connections in app.toml (e.g. -1) silently sets the cap to unlimited, defeating the DoS bound this PR installs. v.GetUint("grpc.max-open-connections") returns 0 for negative inputs (cast.ToUintE swallows errNegativeNotAllowed), and then the if cfg.MaxOpenConnections > 0 guard in StartGRPCServer skips the netutil.LimitListener wrap entirely. Apply the same defensive pattern clampNonNegativeDuration uses for durations: read as a signed int and fall back to DefaultGRPCMaxOpenConnections on negative values. (The same gap also exists for the new grpc.max-recv-msg-size field.)
Extended reasoning...
Bug
server/config/config.go:480-483 reads grpc.max-open-connections via v.GetUint. When an operator writes a negative value such as max-open-connections = -1 in app.toml (typo, stale config, or -1 means unlimited assumption from systems like Postgres / Go's http.Server), the value flows through the cast/viper chain to a silent 0:
cast@v1.10.0/number.go:287toUnsignedNumberEreturns(0, errNegativeNotAllowed)when given a negative integer.viper.GetUintcallscast.ToUintwhich discards that error and returns 0.v.IsSet("grpc.max-open-connections")is true for an explicit-1, so theIsSetguard does not fall through toDefaultGRPCMaxOpenConnections.grpcMaxOpenConnectionsis assigned the silently-coerceduint(0).
Then at server/grpc/server.go:68-74:
if cfg.MaxOpenConnections > 0 {
listener = netutil.LimitListener(listener, int(maxConn))
}0 skips the wrap entirely — the explicitly documented unlimited branch (see the field doc at config.go:225-226 and the constant doc at config.go:33-35). Net effect: a single-character typo in app.toml silently converts the connection cap to UNLIMITED — the exact DoS surface this PR is meant to bound.
Why this contradicts the PR's own pattern
The PR explicitly defends against negative-misconfiguration for durations via clampNonNegativeDuration (config.go:410-416), with the comment:
A negative keepalive/connection-age value is a misconfiguration that gRPC would otherwise accept verbatim, so fall back to the safe default instead.
That reasoning applies more strongly to max-open-connections, since the PR description identifies the connection cap as "the actual DoS bound" — not the keepalive durations. The asymmetry is the bug.
Step-by-step proof
v := viper.New()
v.SetConfigType("toml")
v.ReadConfig(strings.NewReader("[grpc]\nmax-open-connections = -1\n"))
v.IsSet("grpc.max-open-connections") // true -> IsSet guard does NOT fall through
v.GetInt("grpc.max-open-connections") // -1 (the actual stored value)
v.GetUint("grpc.max-open-connections") // 0 (cast swallows errNegativeNotAllowed)
// In GetConfig:
grpcMaxOpenConnections := uint(DefaultGRPCMaxOpenConnections) // 1000
if v.IsSet("grpc.max-open-connections") { // true
grpcMaxOpenConnections = v.GetUint("grpc.max-open-connections") // 0
}
// cfg.MaxOpenConnections == 0
// In StartGRPCServer:
if cfg.MaxOpenConnections > 0 { /* unreached */ }
// listener is the raw net.Listener — no cap.Operator intent: "set a finite limit" (or "unlimited" via the -1 convention from other systems). Actual behavior: unlimited, silently, with no warning logged.
Fix
Mirror the duration-clamp pattern with a signed read. Roughly:
grpcMaxOpenConnections := uint(DefaultGRPCMaxOpenConnections)
if v.IsSet("grpc.max-open-connections") {
if raw := v.GetInt64("grpc.max-open-connections"); raw >= 0 {
grpcMaxOpenConnections = uint(raw)
}
// negative -> keep DefaultGRPCMaxOpenConnections (or log + fall back)
}The same gap also exists for grpc.max-recv-msg-size (config.go:478): a negative override there would be clamped at the server.go:24-27 layer (if maxRecvMsgSize <= 0 { maxRecvMsgSize = DefaultGRPCMaxRecvMsgSize }) rather than at config-parse time, so the symptom is muted but the asymmetry with the duration clamp remains — worth fixing in the same pass.
| errCh := make(chan error) | ||
| go func() { |
There was a problem hiding this comment.
🟣 Pre-existing: errCh := make(chan error) on sei-cosmos/server/grpc/server.go:78 is unbuffered, so once StartGRPCServer returns via the time.After(types.ServerStartTime) branch nobody is left reading errCh. If grpcSrv.Serve later returns a non-nil error (e.g. grpc.ErrServerStopped when start.go's defer grpcSrv.Stop() fires, or any listener-level failure), the goroutine blocks forever on the send and leaks. The sibling grpc_web.go:50 already uses make(chan error, 1) for exactly this reason; since this PR rewrites the surrounding StartGRPCServer body and aligns it stylistically with grpc_web.go, a one-character fix (make(chan error, 1)) would close the asymmetry.
Extended reasoning...
What the bug is. In sei-cosmos/server/grpc/server.go the serve goroutine is wired through an unbuffered channel:
errCh := make(chan error) // line 78 — unbuffered
go func() {
err = grpcSrv.Serve(listener)
if err != nil {
errCh <- fmt.Errorf("failed to serve: %w", err) // blocks forever after the 5s startup window
}
}()
select {
case err := <-errCh:
return nil, err
case <-time.After(types.ServerStartTime): // 5s — see server/types/app.go:26
return grpcSrv, nil
}The parallel grpc_web.go:50 gets it right:
errCh := make(chan error, 1) // BUFFEREDStep-by-step proof the goroutine leaks.
startInProcess(sei-cosmos/server/start.go:402) callsStartGRPCServer. The serve goroutine spins up and blocks insidegrpcSrv.Serve(listener).- After 5 s (
types.ServerStartTime) thetime.Aftercase fires andStartGRPCServerreturnsgrpcSrv, nil. The receiver oferrChis gone — the local variable goes out of scope, but the channel object is kept alive by the goroutine. - Some time later, normal shutdown calls
defer grpcSrv.Stop()(start.go:413).grpc.Server.Stopcauses the blockedServeto return — and on a hardStop(vsGracefulStop) the documented return isgrpc.ErrServerStopped, which is non-nil. - The goroutine enters
if err != niland executeserrCh <- fmt.Errorf(...). Since the channel is unbuffered and no receiver remains, the send blocks forever. One goroutine leaked perStartGRPCServerincarnation. startInProcessis wrapped infor { startInProcess(...) }(start.go around the restart loop driven byErrShouldRestart), so every restart adds another stuck goroutine — the leak is unbounded across restart cycles.
Why this PR is the natural place to fix it. The diff explicitly aligns StartGRPCServer with grpc_web.go (signature change to take a GRPCConfig, LimitListener wrap, keepalive plumbing), and even cites the gRPC-Web change in #3605 as a model. The buffered-channel pattern is the one stylistic detail that did not get propagated. Fix is one character: make(chan error, 1). Optionally also rename the closure write to a fresh local (if err := grpcSrv.Serve(listener); err != nil { ... }) to mirror grpc_web.go's shadowing — though the data-race claim from earlier reports is overstated since the outer err has no concurrent reader after the goroutine spawns.\n\nImpact. Low but real: one stuck goroutine per error-path return of Serve, with no upper bound across restart cycles. The blocked send pins the channel and the closed-over listener/grpcSrv references until the process exits. Not a DoS, but visible to anyone diffing goroutine dumps and a regression-by-omission relative to grpc_web.go.\n\nSeverity: pre-existing. Lines 78–91 are unchanged context in the diff — the hunk only adds the LimitListener wrap above (lines 70–76) and the new grpc.NewServer options. The PR does not introduce, extend, or make the leak more likely, but it is the right opportunity to close the asymmetry with grpc_web.go.
Summary
The gRPC server at
:9090was created with onlygrpc.MaxConcurrentStreams(100)and an unboundednet.Listen. It had no cap on connection count, no bound on inbound message size, and no keepalive policy. This PR adds all three and exposes every parameter as an operator config field under[grpc]inapp.toml. (PLT-705)MaxRecvMsgSize
Sets
grpc.MaxRecvMsgSize()explicitly (default 4 MB, matching gRPC's own default). This bounds per-request memory allocation before the rate limiter fires, so an oversized request can't allocate first and rate-limit second.MaxOpenConnections
Wraps the listener with
netutil.LimitListenerto cap simultaneously-open TCP connections (mirrors the gRPC-Web change in #3605 and the EVMLimitListenerpattern). Keepalive alone bounds neither connection count nor message size — this is the actual DoS bound.0to disable the cap.Keepalive
Adds
keepalive.ServerParametersandkeepalive.EnforcementPolicy. Defaults mirror gRPC's own (i.e. opt-in, no behavior change) with one deliberate exception:MaxConnectionIdleMaxConnectionAgeMaxConnectionAgeGraceMaxConnectionAge; meaningless until age is set.TimeTimeoutEnforcementPolicy.MinTimeEnforcementPolicy.PermitWithoutStreamMaxSendMsgSizeis intentionally not added — PageGuard already bounds row count at the query layer. Revisit if large unpaginated responses surface in the handler inventory.Config plumbing
All fields are exposed under
[grpc]inapp.tomland wired throughDefaultConfig()andGetConfig(). For the bounded defaults (max-recv-msg-size,max-open-connections,max-connection-idle, and the keepalive durations),GetConfigapplies the in-code default when the key is absent, so a node upgrading with an olderapp.tomlstays bounded rather than reverting to unlimited/infinity.Files changed:
sei-cosmos/server/grpc/server.go,sei-cosmos/server/config/config.go,sei-cosmos/server/config/toml.go,sei-cosmos/server/start.go,sei-cosmos/testutil/network/util.goTests
TestDefaultGRPCConfig— asserts all new defaults.TestGetConfigGRPCDefaultsWhenAbsent— a legacyapp.toml(missing the new keys) still resolves to the bounded in-code defaults, including the 5m idle default.TestGetConfigGRPCOverrides— operator-provided values override the defaults.Test plan
go test ./sei-cosmos/server/config/... ./sei-cosmos/server/grpc/...passesgofmt -s -l .prints nothingmake buildsucceeds🤖 Generated with Claude Code