fix: track merge key in transaction for concurrent merge_insert conflict detection #6051

Draft
ozzieba wants to merge 1 commit into lance-format:main from purpleplatform:spec/merge-key-in-transaction

@ozzieba ozzieba commented Feb 27, 2026

Summary

Fixes lancedb/lancedb#2463, #4585
Supersedes #6018 (addresses reviewer feedback from @jackye1995)

Concurrent merge_insert operations that insert the same new key silently
produce duplicate rows when the schema lacks unenforced-primary-key metadata.
Per review feedback, the right fix is to track the merge key in the
transaction model rather than relying solely on the presence of a bloom filter.

Changes

Spec change: transaction.proto

Add repeated int32 merge_key_field_ids = 9 to the Update message. When
non-empty, this indicates the transaction is a merge insert and records which
columns were used as the merge key (the ON columns). This enables conflict
resolution to detect incompatible concurrent merge inserts even before checking
bloom filters.

Update KeyExistenceFilter comments to remove the requirement that field IDs
must match an unenforced primary key — they now represent the merge key.
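
As a rough illustration of what the new field means for readers of the transaction, the sketch below models it in plain Rust. `UpdateSketch` and its methods are hypothetical names invented for this example, not the generated protobuf type or anything in the Lance codebase. It relies only on the proto3 behavior that an absent repeated field decodes as an empty list, and it assumes merge keys are compared as ordered lists of field IDs.

```rust
/// Hypothetical mirror of the part of the `Update` message discussed above.
/// Not the generated protobuf type.
#[derive(Debug, Default, Clone, PartialEq)]
struct UpdateSketch {
    /// Field IDs of the ON columns. Empty for non-merge updates and for
    /// transactions written before the field existed.
    merge_key_field_ids: Vec<i32>,
}

impl UpdateSketch {
    /// A non-empty merge key marks the transaction as a merge insert.
    fn is_merge_insert(&self) -> bool {
        !self.merge_key_field_ids.is_empty()
    }

    /// Assumption: two merge keys match only if the ID lists are identical,
    /// including order.
    fn same_merge_key(&self, other: &Self) -> bool {
        self.merge_key_field_ids == other.merge_key_field_ids
    }
}
```

A transaction from an older writer corresponds to `UpdateSketch::default()`, for which `is_merge_insert()` returns `false`.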

Conflict resolution

  1. Different merge keys → conflict: If two concurrent merge inserts use
    different ON columns (e.g., one merges on id, another on name), their
    bloom filters are incompatible and cannot be compared. This is now detected
    via merge_key_field_ids and treated as a retryable conflict.

  2. Same merge key → bloom filter check: If the merge keys match, the
    existing bloom filter intersection check determines whether the inserted
    rows overlap.

  3. Asymmetric bloom filters → conflict: (Some, None) and (None, Some)
    are both conservatively treated as conflicts (fixes the original bug where
    (None, Some) fell through silently).

  4. Backward compatible: Empty merge_key_field_ids means "not a merge
    insert" — the existing (None, None) fall-through is preserved for older
    transactions and regular updates.
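
The four rules above can be sketched as a single decision function. This is a simplified model and not the code in this PR: `UpdateInfo`, `Outcome`, and `check_conflict` are hypothetical names, the bloom filter is replaced by an exact `HashSet` of key hashes, and the order in which the real resolver applies the rules is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical view of one committed or in-flight update.
struct UpdateInfo {
    /// ON-column field IDs; empty means "not a merge insert".
    merge_key_field_ids: Vec<i32>,
    /// Stand-in for the bloom filter of inserted keys (exact set, for clarity).
    inserted_keys: Option<HashSet<u64>>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    /// This check found no conflict; other checks may still apply.
    Compatible,
    RetryableConflict,
}

fn check_conflict(a: &UpdateInfo, b: &UpdateInfo) -> Outcome {
    let both_merge =
        !a.merge_key_field_ids.is_empty() && !b.merge_key_field_ids.is_empty();
    // Rule 1: filters built over different ON columns cannot be compared.
    if both_merge && a.merge_key_field_ids != b.merge_key_field_ids {
        return Outcome::RetryableConflict;
    }
    match (&a.inserted_keys, &b.inserted_keys) {
        // Rule 2: same merge key, so compare the inserted keys.
        (Some(x), Some(y)) if x.is_disjoint(y) => Outcome::Compatible,
        (Some(_), Some(_)) => Outcome::RetryableConflict,
        // Rule 3: only one side has a filter, so be conservative.
        (Some(_), None) | (None, Some(_)) => Outcome::RetryableConflict,
        // Rule 4: neither side has a filter (older or non-merge updates).
        (None, None) => Outcome::Compatible,
    }
}
```

With a real bloom filter the `(Some, Some)` arm can report a conflict for disjoint key sets (a false positive), which only costs a retry; it cannot miss a true overlap.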

Always emit bloom filter

The bloom filter is now always included for merge insert operations, regardless
of whether the schema has unenforced-primary-key metadata. The is_primary_key
gate has been removed.
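
To make the role of the filter concrete, here is a toy bloom filter built unconditionally from the inserted keys, with a conservative intersection test. `ToyBloom`, its size, and its hash scheme are assumptions chosen for illustration and do not reflect Lance's actual filter layout or parameters. The intersection test is only meaningful when both filters share the same size and hash functions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const WORDS: usize = 16; // 16 * 64 = 1024 bits (illustrative size)
const NUM_HASHES: u64 = 3; // illustrative number of hash functions

/// Toy bloom filter over inserted merge keys.
#[derive(Clone, Default)]
struct ToyBloom {
    bits: [u64; WORDS],
}

impl ToyBloom {
    /// Derive the bit position for `key` under the hash function `seed`.
    fn bit_index<T: Hash>(key: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() % (WORDS as u64 * 64)) as usize
    }

    fn insert<T: Hash>(&mut self, key: &T) {
        for seed in 0..NUM_HASHES {
            let i = Self::bit_index(key, seed);
            self.bits[i / 64] |= 1u64 << (i % 64);
        }
    }

    /// Build the filter for every inserted key, with no primary-key gate.
    fn from_keys<T: Hash>(keys: &[T]) -> Self {
        let mut filter = Self::default();
        for key in keys {
            filter.insert(key);
        }
        filter
    }

    /// Conservative check: a shared key sets the same bits in both filters,
    /// so no common bit means no common key. A common bit may be a false
    /// positive.
    fn may_intersect(&self, other: &Self) -> bool {
        self.bits.iter().zip(other.bits.iter()).any(|(a, b)| a & b != 0)
    }
}
```

Two writers that both insert key `3` always produce filters for which `may_intersect` returns `true`; an empty filter never intersects anything.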

Note on spec change process

Per @jackye1995's comment, this adds a new field to transaction.proto, which
is a spec change. Happy to create a separate discussion for a community vote
if needed (similar to #5485).

The change is backward compatible: older writers produce an empty
merge_key_field_ids, which the new conflict resolver handles correctly.

Test plan

  • Unit tests for merge key conflict detection (5 new tests):
    • Different merge keys → conflict
    • Same key, disjoint bloom filters → OK
    • Same key, overlapping bloom filters → conflict
    • Merge insert vs non-merge update (asymmetric bloom) → conflict
    • Both non-merge updates (None, None) → OK (backward compat)
  • All 40 existing conflict resolver tests pass
  • All 122 existing merge_insert tests pass
  • cargo clippy clean, cargo fmt clean

…ict detection

Add `merge_key_field_ids` to the Update operation in the transaction proto
so conflict resolution can detect incompatible concurrent merge inserts.

- Always include bloom filter for inserted rows regardless of PK metadata
- Different merge keys (ON columns) are treated as conflicts
- Asymmetric bloom filter pairs (Some, None) are treated as conflicts
- Backward compatible: empty merge_key_field_ids for non-merge updates

Refs: lancedb/lancedb#2463, lance-format#4585, lance-format#6018

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot added the bug, python, and java labels on Feb 27, 2026

Labels

bug, java, python

Development

Successfully merging this pull request may close these issues.

bug: concurrent upsert can generate duplicates

1 participant