[fix](variant) Keep first duplicate Variant JSON path#63082
[fix](variant) Keep first duplicate Variant JSON path#63082eldenmoon wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR introduces a backend config to change how VARIANT parsing handles duplicate JSON object keys: when enabled, it keeps the first occurrence instead of erroring on duplicate paths. It also adds test coverage for this behavior, while removing several existing VARIANT nested regression suites and adding a new export/load regression for wide VARIANT data.
Changes:
- Add BE config
variant_enable_duplicate_json_path_check(default off) and implement “keep first duplicate key” behavior during JSON traversal for VARIANT extraction. - Add BE unit tests and a VARIANT regression test + data file to validate duplicate-key behavior (insert + stream load).
- Add an export/load regression test for VARIANT with many potential subcolumns; remove multiple VARIANT nested regression suites.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/variant_p0/nested2.groovy | Deleted an existing VARIANT nested-type conflict regression suite. |
| regression-test/suites/variant_p0/nested/sql/q01.sql | Deleted SQL outputs for a removed nested VARIANT load suite. |
| regression-test/suites/variant_p0/nested/nested_in_top_array.groovy | Deleted a regression suite covering nested-in-top-array VARIANT behavior. |
| regression-test/suites/variant_p0/nested/load.groovy | Deleted a (mostly commented) nested VARIANT load regression suite. |
| regression-test/suites/variant_p0/nested.groovy | Deleted a large VARIANT nested regression suite. |
| regression-test/suites/variant_p0/duplicate_json_path.groovy | Added regression coverage for keeping the first duplicate JSON key when the new BE config is enabled. |
| regression-test/suites/export_p2/test_export_variant_10k_columns.groovy | Added export→S3→reload regression for wide/sparse VARIANT; currently has correctness and cleanup issues. |
| regression-test/data/variant_p0/duplicate_json_path.json | Added stream-load input with duplicate JSON keys. |
| be/test/storage/segment/variant_util_test.cpp | Added unit tests validating “keep first duplicate key” behavior in both subcolumn-only and doc-mode parsing. |
| be/src/util/json/json_parser.cpp | Added duplicate-key skipping logic during object traversal when the config is enabled. |
| be/src/common/config.h | Declared the new BE config flag. |
| be/src/common/config.cpp | Defined the new BE config flag with default false. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (!config::variant_enable_duplicate_json_path_check) { | ||
| for (auto it = object.begin(); it != object.end(); ++it) { | ||
| const auto& [key, value] = *it; | ||
| traverse_object_member(key, value); | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| std::unordered_set<PathInData, PathInData::Hash> visited_member_paths; | ||
| visited_member_paths.reserve(object.size()); | ||
| for (auto it = object.begin(); it != object.end(); ++it) { | ||
| const auto& [key, value] = *it; | ||
| check_key_length(key); | ||
| ctx.builder.append(key, false); | ||
| PathInData member_path(ctx.builder.get_parts()); | ||
| if (!visited_member_paths.insert(std::move(member_path)).second) { | ||
| ctx.builder.pop_back(); | ||
| continue; | ||
| } | ||
| traverse(value, ctx); | ||
| ctx.builder.pop_back(); | ||
| } |
| suite("duplicate_json_path", "p0") { | ||
| def customBeConfig = [ | ||
| variant_enable_duplicate_json_path_check: true | ||
| ] | ||
| setBeConfigTemporary(customBeConfig) { |
a7bc101 to
abc6911
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one additional blocking correctness issue beyond the already-known review threads. The enabled duplicate-path check still allows flat dotted keys and nested object paths to produce the same downstream Variant path string, so the load/WAL failure this PR is meant to prevent can still occur for inputs such as {"a.b":1,"a":{"b":2}}.
Critical checkpoint conclusions:
- Goal/test: The goal is to optionally keep the first duplicate Variant JSON path and avoid duplicate/conflicting subcolumn paths; current tests cover repeated object member names but not dotted-key versus nested-path conflicts, and the implementation does not fully accomplish the goal.
- Scope: The change is small and focused, but the local duplicate key check is too narrow for Variant path identity.
- Concurrency/lifecycle: No new concurrency or special lifecycle management is introduced; the mutable config follows existing BE config patterns.
- Configuration: A disabled-by-default mutable BE config is added; behavior is observed at parse time.
- Compatibility: No storage/protocol format change was found.
- Parallel paths: Existing review thread already covers the JSONB conversion path separately; this comment covers a distinct subcolumn/doc-value path conflict.
- Tests: Added unit/regression coverage misses the dotted-key conflict case; no tests were run by me in this automated review.
- Observability/performance: No additional observability requirement found; the extra check is not on an obviously problematic hot path when disabled.
- User focus: No additional user-provided review focus was supplied.
| const auto& [key, value] = *it; | ||
| check_key_length(key); | ||
| ctx.builder.append(key, false); | ||
| PathInData member_path(ctx.builder.get_parts()); |
There was a problem hiding this comment.
This check deduplicates by PathInData parts, but downstream Variant duplicate detection and doc-value serialization use PathInData::get_path() (the dot-joined string). A flat key and a nested path can therefore bypass this set while still producing the same logical path string. For example, with the config enabled, {"a.b":1,"a":{"b":2}} traverses top-level members a.b (one part) and a (one part), then emits leaf paths whose get_path() values are both a.b; OnlySubcolumns later fails ColumnVariant::sanitize() in the writer and OnlyDocValueColumn hits the seen_paths duplicate exception. That is the same class of load/WAL failure the new config is intended to avoid, but it is not covered by the added tests. Please deduplicate on the emitted Variant path identity (or otherwise handle flat-dot/nested conflicts) so the first value is kept for this case too.
abc6911 to
89faa50
Compare
### What problem does this PR solve?
Issue Number: CIR-19941
Related PR: None
Problem Summary: Variant JSON parsing could emit duplicate or conflicting subcolumn paths when a JSON object contains duplicate keys, including flat dotted keys and nested object paths that normalize to the same downstream Variant path. This may later fail load or WAL replay with subcolumn type size mismatch. Add a disabled-by-default BE config to enable duplicate JSON path checking and, when enabled, keep the first object member for a duplicated path.
### Release note
Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate JSON object path during Variant parsing when enabled.
### Check List (For Author)
- Test: Unit Test and Regression test
- Unit Test: ./run-be-ut.sh --run --filter='VariantUtilTest.ParseDuplicateJsonPaths*'
- Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,172.19.0.1 no_proxy=127.0.0.1,localhost,172.19.0.1 ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s duplicate_json_path
- Behavior changed: Yes. When variant_enable_duplicate_json_path_check is enabled, duplicate JSON object paths keep the first member and skip later duplicate members; default is false.
- Does this need documentation: No
89faa50 to
baa8522
Compare
Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate JSON object key during Variant parsing when enabled.