Skip to content

[fix](variant) Keep first duplicate Variant JSON path#63082

Open
eldenmoon wants to merge 1 commit intoapache:masterfrom
eldenmoon:codex/cir-19941-duplicate-json-path
Open

[fix](variant) Keep first duplicate Variant JSON path#63082
eldenmoon wants to merge 1 commit intoapache:masterfrom
eldenmoon:codex/cir-19941-duplicate-json-path

Conversation

@eldenmoon
Copy link
Copy Markdown
Member

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate JSON object key during Variant parsing when enabled.

Copilot AI review requested due to automatic review settings May 8, 2026 10:21
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a backend config to change how VARIANT parsing handles duplicate JSON object keys: when enabled, it keeps the first occurrence instead of erroring on duplicate paths. It also adds test coverage for this behavior, while removing several existing VARIANT nested regression suites and adding a new export/load regression for wide VARIANT data.

Changes:

  • Add BE config variant_enable_duplicate_json_path_check (default off) and implement “keep first duplicate key” behavior during JSON traversal for VARIANT extraction.
  • Add BE unit tests and a VARIANT regression test + data file to validate duplicate-key behavior (insert + stream load).
  • Add an export/load regression test for VARIANT with many potential subcolumns; remove multiple VARIANT nested regression suites.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
regression-test/suites/variant_p0/nested2.groovy Deleted an existing VARIANT nested-type conflict regression suite.
regression-test/suites/variant_p0/nested/sql/q01.sql Deleted SQL outputs for a removed nested VARIANT load suite.
regression-test/suites/variant_p0/nested/nested_in_top_array.groovy Deleted a regression suite covering nested-in-top-array VARIANT behavior.
regression-test/suites/variant_p0/nested/load.groovy Deleted a (mostly commented) nested VARIANT load regression suite.
regression-test/suites/variant_p0/nested.groovy Deleted a large VARIANT nested regression suite.
regression-test/suites/variant_p0/duplicate_json_path.groovy Added regression coverage for keeping the first duplicate JSON key when the new BE config is enabled.
regression-test/suites/export_p2/test_export_variant_10k_columns.groovy Added export→S3→reload regression for wide/sparse VARIANT; currently has correctness and cleanup issues.
regression-test/data/variant_p0/duplicate_json_path.json Added stream-load input with duplicate JSON keys.
be/test/storage/segment/variant_util_test.cpp Added unit tests validating “keep first duplicate key” behavior in both subcolumn-only and doc-mode parsing.
be/src/util/json/json_parser.cpp Added duplicate-key skipping logic during object traversal when the config is enabled.
be/src/common/config.h Declared the new BE config flag.
be/src/common/config.cpp Defined the new BE config flag with default false.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread be/src/util/json/json_parser.cpp Outdated
Comment on lines 111 to 132
if (!config::variant_enable_duplicate_json_path_check) {
for (auto it = object.begin(); it != object.end(); ++it) {
const auto& [key, value] = *it;
traverse_object_member(key, value);
}
return;
}

std::unordered_set<PathInData, PathInData::Hash> visited_member_paths;
visited_member_paths.reserve(object.size());
for (auto it = object.begin(); it != object.end(); ++it) {
const auto& [key, value] = *it;
check_key_length(key);
ctx.builder.append(key, false);
PathInData member_path(ctx.builder.get_parts());
if (!visited_member_paths.insert(std::move(member_path)).second) {
ctx.builder.pop_back();
continue;
}
traverse(value, ctx);
ctx.builder.pop_back();
}
Comment on lines +18 to +22
suite("duplicate_json_path", "p0") {
def customBeConfig = [
variant_enable_duplicate_json_path_check: true
]
setBeConfigTemporary(customBeConfig) {
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch from a7bc101 to abc6911 Compare May 8, 2026 10:42
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional blocking correctness issue beyond the already-known review threads. The enabled duplicate-path check still allows flat dotted keys and nested object paths to produce the same downstream Variant path string, so the load/WAL failure this PR is meant to prevent can still occur for inputs such as {"a.b":1,"a":{"b":2}}.

Critical checkpoint conclusions:

  • Goal/test: The goal is to optionally keep the first duplicate Variant JSON path and avoid duplicate/conflicting subcolumn paths; current tests cover repeated object member names but not dotted-key versus nested-path conflicts, and the implementation does not fully accomplish the goal.
  • Scope: The change is small and focused, but the local duplicate key check is too narrow for Variant path identity.
  • Concurrency/lifecycle: No new concurrency or special lifecycle management is introduced; the mutable config follows existing BE config patterns.
  • Configuration: A disabled-by-default mutable BE config is added; behavior is observed at parse time.
  • Compatibility: No storage/protocol format change was found.
  • Parallel paths: Existing review thread already covers the JSONB conversion path separately; this comment covers a distinct subcolumn/doc-value path conflict.
  • Tests: Added unit/regression coverage misses the dotted-key conflict case; no tests were run by me in this automated review.
  • Observability/performance: No additional observability requirement found; the extra check is not on an obviously problematic hot path when disabled.
  • User focus: No additional user-provided review focus was supplied.

const auto& [key, value] = *it;
check_key_length(key);
ctx.builder.append(key, false);
PathInData member_path(ctx.builder.get_parts());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check deduplicates by PathInData parts, but downstream Variant duplicate detection and doc-value serialization use PathInData::get_path() (the dot-joined string). A flat key and a nested path can therefore bypass this set while still producing the same logical path string. For example, with the config enabled, {"a.b":1,"a":{"b":2}} traverses top-level members a.b (one part) and a (one part), then emits leaf paths whose get_path() values are both a.b; OnlySubcolumns later fails ColumnVariant::sanitize() in the writer and OnlyDocValueColumn hits the seen_paths duplicate exception. That is the same class of load/WAL failure the new config is intended to avoid, but it is not covered by the added tests. Please deduplicate on the emitted Variant path identity (or otherwise handle flat-dot/nested conflicts) so the first value is kept for this case too.

@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch from abc6911 to 89faa50 Compare May 8, 2026 11:55
### What problem does this PR solve?

Issue Number: CIR-19941

Related PR: None

Problem Summary: Variant JSON parsing could emit duplicate or conflicting subcolumn paths when a JSON object contains duplicate keys, including flat dotted keys and nested object paths that normalize to the same downstream Variant path. This may later fail load or WAL replay with subcolumn type size mismatch. Add a disabled-by-default BE config to enable duplicate JSON path checking and, when enabled, keep the first object member for a duplicated path.

### Release note

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate JSON object path during Variant parsing when enabled.

### Check List (For Author)

- Test: Unit Test and Regression test
    - Unit Test: ./run-be-ut.sh --run --filter='VariantUtilTest.ParseDuplicateJsonPaths*'
    - Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,172.19.0.1 no_proxy=127.0.0.1,localhost,172.19.0.1 ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s duplicate_json_path
- Behavior changed: Yes. When variant_enable_duplicate_json_path_check is enabled, duplicate JSON object paths keep the first member and skip later duplicate members; default is false.
- Does this need documentation: No
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch from 89faa50 to baa8522 Compare May 8, 2026 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants