Commit eefb96c
fix(webapp): recover from ClickHouse JSON parse failures in runs replication (#3708)
## Summary
On a ClickHouse `Cannot parse JSON object` rejection,
`RunsReplicationService` now sanitizes lone UTF-16 surrogates across the
failing batch via the existing `sanitizeRows` helper and retries once.
If the sanitizer found nothing or the retry also fails, the batch is
dropped loudly with a counter increment, so the surrounding
`#insertWithRetry` layer doesn't spin three more times on a
deterministic failure. Non-parse errors propagate unchanged.
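The classification step that decides between "recover" and "propagate" can be sketched as below. This is a minimal illustration, not the real helpers: `looksLikeJsonParseError` and `rowNumberFromError` are hypothetical stand-ins for `isClickHouseJsonParseError` and `parseRowNumberFromError`, and the matching is assumed to rely only on the `Cannot parse JSON object` and `at row N` fragments quoted in this description.

```typescript
// Hypothetical stand-ins for the shared helpers; the real implementations
// live in sanitizeRowsOnParseError.server.ts and may match differently.

function errorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

// True only for the deterministic parse rejection; anything else
// (timeouts, connection resets, ...) should propagate unchanged.
function looksLikeJsonParseError(error: unknown): boolean {
  return errorMessage(error).includes("Cannot parse JSON object");
}

// Extracts the "at row N" hint if present. The hint is only useful for
// logging, since the row number is not a stable index into the batch.
function rowNumberFromError(error: unknown): number | undefined {
  const match = /at row (\d+)/.exec(errorMessage(error));
  return match ? Number(match[1]) : undefined;
}
```

Keeping the predicate narrow is the point of the design: a transient failure that is misclassified as a parse error would be sanitized and possibly dropped instead of retried by the outer layer.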
Mirrors the pattern from #3659 (for `ClickhouseEventRepository`) — same
root cause (lone UTF-16 surrogates in user-provided JSON), same recovery
shape, **reusing the same shared helpers** (`sanitizeRows`,
`isClickHouseJsonParseError`, `parseRowNumberFromError`).
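For readers unfamiliar with the root cause, the following sketch shows one way to find and neutralize lone UTF-16 surrogates in nested JSON values. It is an assumption-laden illustration: `replaceLoneSurrogates` and `sanitizeValue` are hypothetical names, replacing with U+FFFD is an assumed policy, and the real `sanitizeRows` helper may behave differently.

```typescript
const REPLACEMENT = "\uFFFD"; // assumed replacement character

// Replaces any high surrogate not followed by a low surrogate, and any
// low surrogate not preceded by a high one. Valid pairs are kept intact.
function replaceLoneSurrogates(input: string): { value: string; changed: boolean } {
  let out = "";
  let changed = false;
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i);
    const isHigh = code >= 0xd800 && code <= 0xdbff;
    const isLow = code >= 0xdc00 && code <= 0xdfff;
    if (isHigh && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += input.slice(i, i + 2); // well-formed pair, copy both units
        i++;
        continue;
      }
    }
    if (isHigh || isLow) {
      out += REPLACEMENT;
      changed = true;
    } else {
      out += input.charAt(i);
    }
  }
  return { value: out, changed };
}

// Walks a JSON-like value and counts how many strings were repaired, so a
// caller can tell "sanitizer found nothing" apart from "something was fixed".
// Object keys are left untouched in this sketch.
function sanitizeValue(value: unknown, stats: { fixed: number }): unknown {
  if (typeof value === "string") {
    const result = replaceLoneSurrogates(value);
    if (result.changed) stats.fixed++;
    return result.value;
  }
  if (Array.isArray(value)) return value.map((item) => sanitizeValue(item, stats));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) out[key] = sanitizeValue(inner, stats);
    return out;
  }
  return value;
}
```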
Fixes the customer-facing symptom from
[TRI-9755](https://linear.app/triggerdotdev/issue/TRI-9755): a single
row's poisoned `output` JSON used to take down the
`COMPLETED_SUCCESSFULLY` UPDATE events for its 50+ batch-mates,
stranding them in `EXECUTING` in ClickHouse forever and inflating
"Running" counts on the Tasks page. Confirmed in production that this is
ongoing: ~120k stale rows accumulated in a single 5-hour burst on
2026-05-18, with a smaller continuous leak before and after.
## What changed
`apps/webapp/app/services/runsReplicationService.server.ts`:
- Imports the three helpers from
`~/v3/eventRepository/sanitizeRowsOnParseError.server` (no duplication;
no move).
- New private `#insertWithJsonParseRecovery<T>(rows, doInsert,
contextLabel, attempt)` — generic over `TaskRunInsertArray[]` and
`PayloadInsertArray[]`, structurally identical to
`ClickhouseEventRepository.#insertWithJsonParseRecovery`. Try → on parse
error sanitize the whole batch (the `at row N` hint is logged but not
used to slice — semantics under `input_format_parallel_parsing` aren't
stable) → retry once → drop with loud log if sanitizer found nothing OR
retry still fails.
- `#insertTaskRunInserts` and `#insertPayloadInserts` extract a
`doInsert` closure and hand it to the wrapper. Existing error logging,
span recording, and `recordSpanError` are preserved inside the closure.
- New `private _permanentlyDroppedBatches = 0` counter with a public
getter, for ops dashboards and tests (matches the events-repo
convention). One shared counter for both insert sites — granularity
comes from the `contextLabel` (`task_runs_v2` /
`raw_task_runs_payload_v1`) on every log line.
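The control flow described in the bullets above can be sketched roughly as follows. This is not the code from the commit: `RecoveringInserter`, its `insert` method, the `Sanitizer` type, and the returned status string are hypothetical, and the real `#insertWithJsonParseRecovery` takes `(rows, doInsert, contextLabel, attempt)`. Rethrowing a non-parse error on the retry is also an assumption; the description only says the batch is dropped when the retry still fails.

```typescript
type Sanitizer<T> = (rows: T[]) => { rows: T[]; fixedCount: number };
type InsertOutcome = "inserted" | "recovered" | "dropped";

class RecoveringInserter {
  private _permanentlyDroppedBatches = 0;
  private readonly isParseError: (error: unknown) => boolean;

  constructor(isParseError: (error: unknown) => boolean) {
    this.isParseError = isParseError;
  }

  // Shared counter for every insert site; the contextLabel in the logs
  // tells the sites apart.
  get permanentlyDroppedBatches(): number {
    return this._permanentlyDroppedBatches;
  }

  async insert<T>(
    rows: T[],
    doInsert: (rows: T[]) => Promise<void>,
    sanitize: Sanitizer<T>,
    contextLabel: string
  ): Promise<InsertOutcome> {
    try {
      await doInsert(rows);
      return "inserted";
    } catch (error) {
      // Non-parse errors propagate so the outer retry layer can handle them.
      if (!this.isParseError(error)) throw error;

      // Sanitize the whole batch rather than slicing by the row hint.
      const { rows: cleaned, fixedCount } = sanitize(rows);
      if (fixedCount === 0) return this.drop(contextLabel, rows.length, "sanitizer found nothing");

      console.warn(`[${contextLabel}] sanitized ${fixedCount} value(s), retrying once`);
      try {
        await doInsert(cleaned);
        return "recovered";
      } catch (retryError) {
        if (!this.isParseError(retryError)) throw retryError; // assumption, see lead-in
        return this.drop(contextLabel, rows.length, "retry still failed");
      }
    }
  }

  private drop(contextLabel: string, rowCount: number, reason: string): InsertOutcome {
    this._permanentlyDroppedBatches++;
    console.error(`[${contextLabel}] dropping batch of ${rowCount} row(s): ${reason}`);
    return "dropped";
  }
}
```

Returning normally after a drop, rather than throwing, is what keeps an outer retry loop from repeating an insert that will fail the same way every time.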
`.server-changes/runs-replication-utf16-recovery.md` — release notes
entry.
## Why no new tests
The shared helpers already have full unit + real-ClickHouse contract
coverage from #3659
(`apps/webapp/test/sanitizeRowsOnParseError.test.ts`,
`apps/webapp/test/otlpUtf16Sanitization.integration.test.ts`). The new
wrapper is a line-for-line structural port. Adding a parallel
integration test would require synthesizing bad data that *escapes* the
preemptive `detectBadJsonStrings` check in `#prepareJson` but still
trips ClickHouse, which is non-trivial without hand-crafted fixtures and
wouldn't cover any new logic.
## What this does NOT do
- Doesn't touch the ~120k existing stale `EXECUTING` rows in production.
That needs a reconciliation/backfill sweep (separate ticket — TRI-9755
fix #3).
- Doesn't sanitize the `error` column path
(`runsReplicationService.server.ts:932`, `const errorData = { data:
run.error };`). Reactive recovery will catch it if it ever poisons a
batch, but feeding it through `#prepareJson` like `output` is a cheap
follow-up.
## Test plan
- [x] `pnpm run typecheck --filter webapp` — clean
- [ ] Post-deploy: confirm `permanentlyDroppedBatches` counter stays at
zero (or near-zero) in
`/stp/trigger-app-prod/ecs/replication/service-container/process-logs`,
and watch for `Sanitizing batch after ClickHouse JSON parse error` warns
to confirm recovery is firing on real traffic
- [ ] Post-deploy: confirm the rate of new
"EXECUTING-but-actually-COMPLETED" zombies in ClickHouse flattens
(current rate ≈ tens-to-hundreds per hour platform-wide)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

1 parent 9f64bf4 commit eefb96c
2 files changed
Lines changed: 189 additions & 38 deletions
File tree
- .server-changes
- apps/webapp/app/services
`.server-changes/runs-replication-utf16-recovery.md`: 20 additions
`apps/webapp/app/services/runsReplicationService.server.ts`: 169 additions & 38 deletions