Skip to content

fix(webapp): recover from ClickHouse JSON parse failures on out-of-range integers#3759

Open
matt-aitken wants to merge 2 commits into
mainfrom
feature/tri-9755-runs-replication-bigint-recovery
Open

fix(webapp): recover from ClickHouse JSON parse failures on out-of-range integers#3759
matt-aitken wants to merge 2 commits into
mainfrom
feature/tri-9755-runs-replication-bigint-recovery

Conversation

@matt-aitken
Copy link
Copy Markdown
Member

Summary

Second class of poisoned-row failure in the runs replication path. PR #3708 plugged lone UTF-16 surrogates; this one handles bare JSON integer literals outside ClickHouse's Int64..UInt64 range. Recovery stays purely reactive — the existing sanitizeRows walker just gains an extra branch, so the hot replication path pays nothing on healthy rows.

Fixes the still-firing customer-facing symptom from TRI-9755: scan-social-profiles runs continued to be stranded in EXECUTING on the Tasks page after #3708 deployed. CloudWatch showed Dropped batch — ClickHouse JSON parse error but sanitizer found nothing to fix firing 8/8 times since the previous deploy (zero successful sanitizations). Root cause: upstream JS Number precision loss on a 21-digit Google Plus ID (117039831458782873093117039831458782870000) — the precision-lossy value still serialises as a bare integer that exceeds UInt64.MAX, which ClickHouse rejects with INCORRECT_DATA.

How the bug ships

The customer task emits an output containing a Poshmark profile's spec_format:

{"key":"gp_id","proper_key":"Gp Id","value":117039831458782870000,"type":"int"}

That value is 1.17e20 — comfortably above UInt64.MAX (1.84e19) but comfortably below 1e21. Number.prototype.toString only switches to exponential form at |value| >= 1e21, so JSON.stringify emits the bare token 117039831458782870000 and the ClickHouse JSON(max_dynamic_paths) column fails with:

Code: 117. DB::Exception: Cannot parse JSON object here: {…}: (while reading the value of key output): (at row 1)
: While executing ParallelParsingBlockInputFormat. (INCORRECT_DATA) (version 25.12.x)

Same error verbatim as prod. The same number quoted ("117039831458782870000") inserts fine — ClickHouse's dynamic JSON column accepts a String subtype on the same path.

What changed

apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts:

  • New private isUnsafeJsonInteger(value) helper — true iff value is a finite integer-valued JS Number where |value| < 1e21 (so JSON.stringify emits integer form, not exponent) and value falls outside [Int64.MIN, UInt64.MAX].
  • sanitizeUnknownInPlace gains a number-branch: when the predicate holds, replace the Number with String(value). The downstream JSON column dynamic-types the path as String for that row — fine, since the value was already precision-lossy upstream (no JS Number above 2^53 is numerically meaningful anyway).
  • Float-valued numbers, large floats (>= 1e21), NaN and Infinity are left alone — JSON.stringify emits them with exponents or as null, both of which ClickHouse accepts.

apps/webapp/test/sanitizeRowsOnParseError.test.ts: four new unit tests + an extension to sanitizeRows covering surrogate + integer fixes counted together across rows. The unit suite now covers:

  • Positive value above UInt64.MAX (117039831458782870000 — the actual prod value)
  • Negative value below Int64.MIN
  • Boundary values pass through (42, Number.MAX_SAFE_INTEGER, 2^63)
  • Non-integer numbers untouched (floats, 1e25, NaN, Infinity)
  • The actual scan-social-profiles nested shape — finds the offending gp_id deep inside output.data.profiles[].spec_format[].platform_variables[].value

.server-changes/runs-replication-bigint-recovery.md — release notes entry.

Why reactive, not pre-flight

#prepareJson runs millions of times per day on the replication hot path. Walking every JSON tree to look for oversized integers would add bounded-but-real CPU on every healthy row. sanitizeRows only fires after a ClickHouse parse-error rejection, which is a few times a day platform-wide. Extending it costs effectively zero on healthy traffic and gains us recovery on the rare poisoned row.

Verification

  • Reproduced 1:1 in a throwaway Docker clickhouse/clickhouse-server:25.12.11.4 (closest available to the prod 25.12.1.1579 build). Pre-sanitize JSON fails with the exact prod error; post-sanitize JSON inserts cleanly and the row is readable with gp_id stored as a String subtype.
  • pnpm --filter webapp exec vitest run test/sanitizeRowsOnParseError.test.ts — 22/22 passing (18 existing + 4 new).
  • pnpm run typecheck --filter webapp — clean.

Test plan

  • pnpm run typecheck --filter webapp
  • Unit tests pass against new + existing cases
  • End-to-end Docker ClickHouse repro confirms recovery
  • Post-deploy: confirm Sanitizing batch after ClickHouse JSON parse error warns fire instead of Dropped batch … errors when scan-social-profiles outputs trip CH again
  • Post-deploy: confirm permanentlyDroppedBatches counter stops climbing in /stp/trigger-app-prod/ecs/replication/service-container/process-logs

What this does NOT do

🤖 Generated with Claude Code

…nge integers

Second class of poisoned-row failure in the runs replication path. PR
#3708 handled lone UTF-16 surrogates; this one handles bare JSON integer
literals that exceed ClickHouse's Int64/UInt64 range.

ClickHouse's `JSON(max_dynamic_paths=...)` column fits each bare integer
token into Int64 (signed) or UInt64 (unsigned). Bare integers strictly
outside `[-2^63, 2^64 - 1]` are rejected with `INCORRECT_DATA` (no
silent fallback to Float64). JS Numbers that are integer-valued but
above `Number.MAX_SAFE_INTEGER` still serialise via JSON.stringify as
bare integer tokens (no exponent) while `|value| < 1e21`, so any such
Number lands on the wire as a token CH cannot accept.

Customer-facing symptom: `scan-social-profiles` runs continued to be
stranded in `EXECUTING` on the Tasks page even after the surrogate fix
landed. CloudWatch showed `Dropped batch — ClickHouse JSON parse error
but sanitizer found nothing to fix` firing 8/8 times since the previous
deploy. Root cause: upstream JS Number precision loss on a 21-digit
Google Plus ID (`117039831458782873093` → `117039831458782870000`) —
the precision-lossy value still serialises as a bare integer that
exceeds UInt64.MAX, which CH rejects. Reproduced end-to-end against
ClickHouse 25.12.11.4 in Docker with the exact `Cannot parse JSON
object here` error from prod.

`apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts`:

- New private `isUnsafeJsonInteger(value)` helper — true iff value is
  a finite, integer-valued JS Number where `|value| < 1e21` (i.e.
  JSON.stringify emits integer form, not exponent) AND `value` falls
  outside `[Int64.MIN, UInt64.MAX]`.
- `sanitizeUnknownInPlace` gains a number-branch: when the predicate
  holds, replace the Number with its string form. CH's dynamic JSON
  column accepts a `String` subtype on the same path, so the row
  inserts cleanly on retry. The numeric value was already
  precision-lossy upstream (JS Number can't represent integers above
  2^53 faithfully), so type-flipping to string is information-preserving
  relative to what arrived.
- Float-valued numbers and large floats (>= 1e21, NaN, Infinity) are
  left alone — JSON.stringify emits them with exponents or as `null`,
  both of which CH accepts.

Recovery stays purely reactive — no extra cost on the hot replication
path. The sanitizer only runs after a ClickHouse parse-error rejection,
so healthy rows pay nothing.

`apps/webapp/test/sanitizeRowsOnParseError.test.ts`: four new unit tests
covering positive/negative out-of-range integers, boundary values
(MAX_SAFE_INTEGER, 2^63, UInt64.MAX itself), non-integer numbers, and
the actual `scan-social-profiles` nested shape with `gp_id:
117039831458782870000`. Plus an extension to `sanitizeRows` that
verifies surrogate and integer fixes are counted together across rows.

`.server-changes/runs-replication-bigint-recovery.md` — release notes.

Refs TRI-9755.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 27, 2026

⚠️ No Changeset found

Latest commit: b2cc562

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e7a5fccd-1806-4852-9e28-6b110de00f15

📥 Commits

Reviewing files that changed from the base of the PR and between 73f80f8 and b2cc562.

📒 Files selected for processing (2)
  • apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts
  • apps/webapp/test/sanitizeRowsOnParseError.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • apps/webapp/test/sanitizeRowsOnParseError.test.ts
  • apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: typecheck / typecheck
  • GitHub Check: Analyze (javascript-typescript)

Hidden review stack artifact:

Walkthrough

This pull request extends the runs-replication row sanitizer to handle out-of-range integers that JSON-serialize as bare integer tokens. The change adds an isUnsafeJsonInteger predicate to detect JavaScript Numbers outside ClickHouse's Int64/UInt64 range, and updates the sanitizeUnknownInPlace walker to convert such values to their string form, preventing INCORRECT_DATA parser failures on dynamic JSON columns. The implementation includes comprehensive test coverage for safe/unsafe boundary conditions and nested object scanning, plus a changelog entry documenting the reactive recovery mechanism.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR objectives are entirely unrelated to the linked issue #3 about documentation setup (Mintlify configuration). The actual PR implements ClickHouse JSON parse error recovery with no connection to documentation. Clarify the actual linked issue(s) this PR addresses. The linked issue #3 appears to be incorrect; verify the correct issue number for this ClickHouse recovery fix (likely TRI-9755).
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately and concisely describes the main change: adding recovery from ClickHouse JSON parse failures caused by out-of-range integers.
Description check ✅ Passed The description is comprehensive and well-structured, covering summary, root cause analysis, implementation details, testing, and verification, though it does not follow the provided template structure exactly.
Out of Scope Changes check ✅ Passed All changes are tightly scoped to the ClickHouse JSON parse error recovery: modifications to the sanitizer, comprehensive test coverage, and a release notes entry directly support the stated objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/tri-9755-runs-replication-bigint-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

devin-ai-integration[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

…ness

Address review (CodeRabbit + Devin convergent finding): the literal
`18446744073709551615` in JS source rounds to 2^64 in float64 (the
spacing around 2^64 is 2048), so `value > 18446744073709551615` was
effectively `value > 2**64` and missed the case where a JS Number's
float64 value is *exactly* 2^64. `JSON.stringify(2**64)` emits
"18446744073709552000" — a bare integer above UInt64.MAX that
ClickHouse rejects — so the sanitizer would have let that row through
unchanged and the batch would still drop.

Switch to BigInt comparison against named `UINT64_MAX` / `INT64_MIN`
constants. `BigInt(value)` is safe because we already gate on
`Number.isInteger(value)`. The negative side is unaffected (Int64.MIN
= -2^63 is exactly representable in float64) but the BigInt form keeps
both sides symmetric and self-documenting.

Regression test added: `sanitizeUnknownInPlace(2 ** 64)` must produce
"18446744073709552000" with fixed=1. A naïve `>` literal comparison
would not catch this.

23/23 sanitizer tests green; webapp typecheck clean.
@matt-aitken matt-aitken force-pushed the feature/tri-9755-runs-replication-bigint-recovery branch from 73f80f8 to b2cc562 Compare May 27, 2026 06:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant