@dantengsky dantengsky commented Sep 29, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

🤖 Generated with Claude Code

This PR implements a data integrity guard that prevents undrop operations on tables whose data may have been partially or fully cleaned up by vacuum. The core principle is: once vacuum has started for a retention period, tables dropped at or before that period's cutoff can never be restored, so users cannot accidentally restore tables with incomplete or inconsistent data.

Problem Statement

Previously, there was a dangerous race condition where:

  1. User drops a table
  2. Vacuum process starts cleaning up the table's data
  3. User attempts undrop while vacuum is in progress
  4. Undrop succeeds but table data is incomplete/corrupted

This could lead to silent data corruption and inconsistent database state.

Design Philosophy

Tenant-Level Global Watermark Design

We chose a tenant-level global vacuum watermark approach instead of per-table or per-database granularity for several key reasons:

  1. Simplicity & Clarity: A single timestamp per tenant is easier to reason about, implement, and maintain
  2. Sufficient for Use Case: Vacuum operations are typically tenant-wide administrative tasks, making global coordination natural
  3. Consistent Semantics: All undrop operations within a tenant follow the same rules, reducing cognitive overhead
  4. Performance: Single KV read/write per tenant vs. potentially thousands for per-table tracking
  5. Smooth Evolution Path: The current design can be extended to table/DB-level granularity without breaking changes:
    • Future: VacuumRetentionIdent::new_table(tenant, db_id, table_id)
    • Current: VacuumRetentionIdent::new_global(tenant)
    • The underlying infrastructure supports both patterns seamlessly

Implementation Approach

1. Monotonic Timestamp Mechanism

VacuumWatermark {
    time: DateTime<Utc>, // Monotonically increasing, never decreases
}
  • Vacuum Phase: Sets timestamp when vacuum starts for a retention period
  • Undrop Phase: Compares table's drop_time against vacuum timestamp
  • Safety Rule: drop_time <= vacuum_timestamp → undrop FORBIDDEN
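
The safety rule above can be sketched as a pure function. This is a minimal illustration, not the PR's code: `Timestamp` is a plain `i64` of epoch seconds standing in for `DateTime<Utc>`, and `undrop_allowed` is a hypothetical name.

```rust
/// Seconds since the Unix epoch; a stand-in for `DateTime<Utc>` so the
/// sketch needs no external crates (assumption, not the PR's type).
pub type Timestamp = i64;

/// Simplified mirror of the `VacuumWatermark` struct from the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VacuumWatermark {
    pub time: Timestamp,
}

/// Returns `true` when undrop may proceed (hypothetical helper).
/// `None` means no vacuum has ever recorded a watermark for the tenant.
pub fn undrop_allowed(drop_time: Timestamp, watermark: Option<VacuumWatermark>) -> bool {
    match watermark {
        // Never vacuumed: nothing can have been cleaned up yet.
        None => true,
        // Safety rule: drop_time <= watermark means undrop is forbidden.
        Some(wm) => drop_time > wm.time,
    }
}
```

Note that the comparison is inclusive on the blocked side: a table dropped exactly at the watermark time is treated as unsafe.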

2. Atomic Compare-and-Swap (CAS) Operations

fetch_set_vacuum_timestamp() // Returns OLD value, atomic update
  • Uses crud_upsert_with for race-safe timestamp updates
  • Handles concurrent vacuum operations safely through atomic CAS
  • Only advances timestamp forward (monotonic property)
  • Multiple vacuum processes may race, but updates remain consistent
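
One way to picture the forward-only update is a read, compare, conditional-write loop over a sequence-numbered value. The sketch below uses a toy in-memory `WatermarkStore` (hypothetical, standing in for the MetaStore) and does not reproduce `crud_upsert_with`; it only illustrates returning the old value while never moving the timestamp backwards.

```rust
use std::sync::Mutex;

/// Toy single-key store: every successful write bumps `seq`.
/// Hypothetical stand-in for the MetaStore KV API.
#[derive(Default)]
pub struct WatermarkStore {
    inner: Mutex<(u64, Option<i64>)>, // (seq, watermark in epoch seconds)
}

impl WatermarkStore {
    /// Read the current sequence number and value.
    pub fn get(&self) -> (u64, Option<i64>) {
        *self.inner.lock().unwrap()
    }

    /// Write `value` only if the stored seq still equals `expected_seq`.
    pub fn compare_and_set(&self, expected_seq: u64, value: i64) -> bool {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 != expected_seq {
            return false; // another writer got in between read and write
        }
        *guard = (expected_seq + 1, Some(value));
        true
    }
}

/// Advance the watermark to `new_time` if that moves it forward.
/// Returns the value observed before the call (`None` = never set).
pub fn fetch_set_vacuum_timestamp(store: &WatermarkStore, new_time: i64) -> Option<i64> {
    loop {
        let (seq, current) = store.get();
        let should_update = match current {
            None => true,
            Some(t) => new_time > t,
        };
        if !should_update || store.compare_and_set(seq, new_time) {
            return current;
        }
        // The conditional write lost a race: re-read and try again.
    }
}
```

Calling it with 100, then 200, then 150 on a fresh store leaves the watermark at 200; the third call is a no-op that still reports the value it observed.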

3. Multi-Layer Concurrent Safety

// Transaction conditions handle race conditions safely
txn.condition.push(txn_cond_eq_seq(&vacuum_ident, vacuum_seq));
  • Concurrent Operations Handled: Multiple vacuum/undrop operations can race
  • Safe Resolution: Atomic transactions ensure consistent outcomes
  • Prevents TOCTOU: Time-of-check to time-of-use races are closed, because the watermark seq read during the check must still match at commit
  • Read-Check-Write Pattern: Sequence-based validation rejects writes that are based on stale reads
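
To make the sequence condition concrete, here is a minimal sketch of a conditional transaction over a toy key-value map. `Kv`, `transact`, and the key names used with it are hypothetical; the real code builds a MetaStore transaction with `txn_cond_eq_seq`, which this does not reproduce.

```rust
use std::collections::HashMap;

/// Toy KV store where each key carries a sequence number that is bumped
/// on every write; a missing key has seq 0. Hypothetical MetaStore stand-in.
#[derive(Default)]
pub struct Kv {
    data: HashMap<String, (u64, String)>,
}

impl Kv {
    /// Current sequence number of `key` (0 if the key was never written).
    pub fn seq(&self, key: &str) -> u64 {
        self.data.get(key).map(|(seq, _)| *seq).unwrap_or(0)
    }

    /// Unconditional write that bumps the key's sequence number.
    pub fn put(&mut self, key: &str, value: &str) {
        let next = self.seq(key) + 1;
        self.data.insert(key.to_string(), (next, value.to_string()));
    }

    /// Apply `writes` only if every `(key, expected_seq)` condition holds.
    /// Returns `false` and changes nothing if any condition fails.
    pub fn transact(&mut self, conditions: &[(&str, u64)], writes: &[(&str, &str)]) -> bool {
        if conditions.iter().any(|(key, seq)| self.seq(key) != *seq) {
            return false;
        }
        for (key, value) in writes {
            self.put(key, value);
        }
        true
    }
}
```

If undrop reads the watermark's seq, and a vacuum then writes the watermark before undrop commits, the condition on the watermark key no longer holds and the whole undrop transaction is rejected.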

4. Fail-Safe Error Handling

  • Critical Path Protection: Vacuum ABORTS if timestamp setting fails
  • No Silent Failures: All errors are propagated and logged
  • Consistency First: Never proceed with data cleanup without protection markers
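
The ordering constraint (watermark first, cleanup only on success) can be sketched with injected closures. Everything here is illustrative: `VacuumError`, `vacuum_drop_tables`, and the closure parameters are hypothetical names, and the dry-run branch is simplified to "touch nothing".

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum VacuumError {
    WatermarkNotSet(String),
}

/// Runs one vacuum pass. `set_watermark` and `cleanup` are injected so the
/// ordering can be shown without a real MetaStore or object storage.
pub fn vacuum_drop_tables<S, C>(
    dry_run: bool,
    retention_boundary: i64,
    set_watermark: S,
    cleanup: C,
) -> Result<(), VacuumError>
where
    S: FnOnce(i64) -> Result<(), String>,
    C: FnOnce(),
{
    if !dry_run {
        // Fail-safe: if the watermark cannot be recorded, return the error
        // before any data is removed.
        set_watermark(retention_boundary).map_err(VacuumError::WatermarkNotSet)?;
        cleanup();
    }
    // In this sketch a dry run modifies no state at all.
    Ok(())
}
```

The `?` on the watermark step is what makes the failure path safe: `cleanup` is unreachable unless the protection marker was written.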

Key Implementation Details

VacuumWatermark Structure (Rust)

pub struct VacuumWatermark {
    pub time: DateTime<Utc>, // Monotonic timestamp
}

Protobuf Serialization Format

message VacuumWatermark {  // Consistent naming with Rust struct
  uint64 ver = 100;        // Protocol versioning
  uint64 min_reader_ver = 101;
  string time = 1;         // Timestamp string
}

Integration Points

  1. Vacuum Trigger: VacuumDropTablesInterpreter::execute2()
    • Sets watermark timestamp BEFORE any data cleanup begins
    • Operation fails if watermark cannot be established
  2. Protection Check: handle_undrop_table() with early validation
    • Compares table drop time against vacuum watermark
    • Atomic transaction with sequence validation
  3. Metadata Storage: VacuumRetentionIdent for tenant-global coordination

MetaStore Storage Details

The vacuum watermark is stored as a KV pair in the MetaStore:

Key Format:   __fd_vacuum_retention_ts/{tenant_name}
Key Example:  __fd_vacuum_retention_ts/default
Value Type:   VacuumWatermark (protobuf serialized)
Scope:        Global per tenant (one watermark per tenant)

Design Rationale: The tenant-scoped key format enables:

  • Tenant Isolation: Each tenant has independent vacuum protection
  • Simple Lookup: Single key read for undrop validation
  • Future Extension: Key format supports adding database/table specificity later
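
As a small illustration of the key layout, the helper below formats the tenant-global key shown above. The function name is hypothetical, and any escaping the real key encoder applies to tenant names is not modeled.

```rust
/// Key prefix quoted in the PR description.
const VACUUM_RETENTION_PREFIX: &str = "__fd_vacuum_retention_ts";

/// Builds the tenant-global watermark key (hypothetical helper; the real
/// code derives the key from `VacuumRetentionIdent`).
pub fn vacuum_watermark_key(tenant: &str) -> String {
    format!("{}/{}", VACUUM_RETENTION_PREFIX, tenant)
}
```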

Backwards Compatibility

  • Protobuf v152: New version maintains compatibility with existing deployments
  • Optional Semantics: Option<VacuumWatermark> handles "never set" vs "set to value" correctly
  • Graceful Fallback: Existing undrop behavior preserved when no vacuum timestamp exists

Safety Guarantees

  1. Primary Guarantee: Tables with potentially incomplete data cannot be restored
  2. Temporal Consistency: Vacuum timestamp creates a clear "point of no return"
  3. Concurrent Safety: Race conditions handled safely under high concurrency
  4. Atomic Operations: Partial state updates impossible
  5. Monotonic Progress: Protection level can only increase, never decrease
  6. Fail-Safe Operation: Vacuum aborts if protection cannot be established

Critical Safety Flow

1. User initiates: VACUUM DROP TABLE
2. System sets: vacuum_timestamp = retention_boundary
3. ❌ If step 2 fails → ABORT (no data cleanup)
4. ✅ If step 2 succeeds → proceed with data cleanup
5. Later: User attempts UNDROP TABLE
6. System checks: table.drop_time <= vacuum_timestamp?
7. ❌ If true → REJECT (data may be incomplete)
8. ✅ If false → allow undrop

Undrop Behavior Matrix (Sample Scenarios)

| Scenario | Vacuum Watermark | Table Drop Time | Undrop Result | Reason |
|---|---|---|---|---|
| Pre-vacuum State | None (never set) | Any time | ALLOWED | No vacuum has run yet - data guaranteed safe |
| Post-vacuum Risk | Set to 2023-12-01 (example) | Dropped 2023-11-15 (example) | BLOCKED | Drop predates vacuum - data may be cleaned |
| Post-vacuum Safe | Set to 2023-12-01 (example) | Dropped 2023-12-10 (example) | ALLOWED | Drop postdates vacuum - data guaranteed intact |

Key Insight: When no vacuum watermark exists, undrop operations are both safe and correct - it means no vacuum cleanup has ever been performed on this tenant, so all dropped table data remains intact. This also ensures smooth migration from older versions without breaking existing workflows.

State Flow Diagram

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│                 │    │                  │    │                 │
│   Table Active  │───▶│  Table Dropped   │───▶│ Vacuum Started  │
│                 │    │  (safe to undrop)│    │ (sets watermark)│
└─────────────────┘    └──────────────────┘    └─────────────────┘
         ▲                        │                       │
         │                        │                       ▼
         │                        │            ┌─────────────────┐
         │                        │            │                 │
         │                        │            │ Data Cleaning   │
         │                        │            │   In Progress   │
         │                        │            │                 │
         │                        │            └─────────────────┘
         │                        │                       │
         │                        ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│                 │    │                  │    │                 │
│ Table Restored  │◀───│  Undrop Request  │───▶│ Undrop Blocked  │
│                 │    │                  │    │ (Data Unsafe)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         ▲                        │
         │                        │
         └─── Vacuum watermark ───┘
              check passes
          (drop_time > watermark)


Vacuum Watermark Timeline Example:
═══════════════════════════════════════════════════════════════════

Timeline (Scenario: data_retention_time_in_days = 30):

Oct-15        Nov-01       Nov-20       Dec-01       Jan-05
│             │            │            │            │
TableA        │            TableB       VACUUM       UNDROP
Dropped       │            Dropped     EXECUTION    Requests
│             │                         (sets        │
│             │                         watermark    │
│             │                         = Nov-01)    │
│             │                                      │
│             │                                      └─ TableA: ❌ BLOCKED
│             │                                         (Oct-15 ≤ Nov-01)
│             │                          
│             │                                      └─ TableB: ✅ ALLOWED
│             │                                         (Nov-20 > Nov-01)
│             │                          
│             └─ Watermark boundary      
│                (retention cutoff)      
│                                        
└─ TableA dropped before watermark       
   (data potentially cleaned)            

Key insight: Watermark = vacuum_execution_time - data_retention_time_in_days

Legend:
  ❌ BLOCKED  = Undrop rejected (table dropped at or before the watermark, data potentially cleaned)
  ✅ ALLOWED  = Undrop permitted (table dropped after the watermark, data safe)

Note: The watermark (Nov-01, set by the vacuum run on Dec-01) creates a temporal
protection boundary - any table dropped at or before this time may have had its
data cleaned by vacuum operations.
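
The timeline arithmetic can be checked with a few lines of code. In this sketch times are plain epoch-style seconds, and `day(n)` is a hypothetical helper that counts days from Oct-01 of the example year (so Oct-15 is day 14, Nov-01 is day 31, Nov-20 is day 50, and Dec-01 is day 61).

```rust
const SECS_PER_DAY: i64 = 24 * 60 * 60;

/// Watermark = vacuum execution time minus the retention window.
pub fn retention_boundary(vacuum_time_secs: i64, retention_days: i64) -> i64 {
    vacuum_time_secs - retention_days * SECS_PER_DAY
}

/// Undrop is blocked when the drop happened at or before the watermark.
pub fn is_blocked(drop_time_secs: i64, watermark_secs: i64) -> bool {
    drop_time_secs <= watermark_secs
}

/// Seconds for "day N", where day 0 is Oct-01 of the example year
/// (illustrative helper only).
pub fn day(n: i64) -> i64 {
    n * SECS_PER_DAY
}
```

With a 30-day retention, `retention_boundary(day(61), 30)` equals `day(31)`, i.e. Nov-01; TableA (day 14) is blocked and TableB (day 50) is allowed, matching the timeline.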

Implementation Logic Flow

VACUUM DROP TABLE Operation:
┌─────────────────────────────────────────────────────────────┐
│ VacuumDropTablesInterpreter::execute2()                    │
├─────────────────────────────────────────────────────────────┤
│ 1. Calculate retention_time = now() - retention_days       │
│ 2. if (!dry_run) {                                         │
│      try: fetch_set_vacuum_timestamp(tenant, retention_time)│
│      catch: ABORT vacuum (fail-safe)                       │
│    }                                                        │
│ 3. Proceed with data cleanup...                            │
└─────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│ fetch_set_vacuum_timestamp() - Meta CRUD Operation         │
├─────────────────────────────────────────────────────────────┤
│ Conditions:                                                 │
│   - Atomic read-modify-write operation                     │
│   - Monotonic timestamp validation                         │
│                                                             │
│ Operations:                                                 │
│   - Read: current = kv_api.get(VacuumRetentionIdent)       │
│   - Check: new_time > current.time OR current.is_none()    │
│   - Write: kv_api.upsert(new_timestamp) if should_update   │
│                                                             │
│ Result:                                                     │
│   - Returns: previous value (None if never set before)     │
│   - Guarantees: Watermark only advances forward            │
│                                                             │
│ Note: Implementation details simplified for clarity         │
│       Uses MetaStore's CRUD operations                     │
└─────────────────────────────────────────────────────────────┘

UNDROP TABLE Operation:
┌─────────────────────────────────────────────────────────────┐
│ handle_undrop_table()                                       │
├─────────────────────────────────────────────────────────────┤
│ 1. Get table metadata (drop_time)                          │
│ 2. vacuum_watermark = kv_api.get_pb(VacuumRetentionIdent)  │
│ 3. Early validation:                                       │
│    match vacuum_watermark {                                │
│      None => continue,           // ✅ Legacy compatibility │
│      Some(wm) => {                                         │
│        if drop_time <= wm.time {                           │
│          return ERROR_RETENTION_GUARD  // ❌ Data unsafe   │
│        }                                                   │
│      }                                                     │
│    }                                                       │
│ 4. Begin atomic transaction with seq check...              │
└─────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│ Atomic Transaction - Mini KV Transaction (Concurrent Safe) │
├─────────────────────────────────────────────────────────────┤
│ Conditions:                                                 │
│   - table_meta.seq == expected_seq                         │
│   - vacuum_watermark.seq == expected_seq  // Key Safety!   │
│   - database_meta.seq == expected_seq                      │
│                                                             │
│ Operations:                                                 │
│   - table_meta.drop_on = None                              │
│   - Update database references                             │
│   - Restore table name mappings                            │
│                                                             │
│ Result:                                                     │
│   - SUCCESS: Table restored safely                         │
│   - FAILURE: Concurrent vacuum detected, retry/abort       │
│                                                             │
│ Note: Uses MetaStore KV transactions                       │
│       All conditions must pass for transaction to succeed  │
└─────────────────────────────────────────────────────────────┘

Concurrent Safety Example

Timeline: T1 → T2 → T3 → T4 → T5

T1: Table dropped
T2: Vacuum process A reads current watermark (None)
T3: Vacuum process B reads current watermark (None)
T4: Process A updates watermark to retention_time → SUCCESS (first update)
T5: Process B attempts same retention_time → NO UPDATE NEEDED (timestamp unchanged)
    ↳ Both operations succeed: A sets the watermark, B detects no change needed

Alternative scenario - Different timestamps:
T4: Process A updates watermark to time_1 → SUCCESS
T5: Process B updates watermark to time_2 > time_1 → SUCCESS (monotonic advance)
    ↳ Watermark advances monotonically: None → time_1 → time_2

Later: Undrop reads watermark with seq, includes seq check in transaction
       If concurrent vacuum updates watermark during undrop → transaction fails
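
The racing scenarios above can be simulated with threads. This sketch swaps the MetaStore's conditional write for a plain `Mutex` (a deliberate simplification) and only demonstrates the outcome: whatever the interleaving, the final watermark is the largest timestamp any process proposed. `SharedWatermark`, `advance`, and `race` are hypothetical names.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Shared watermark; `None` means "never set".
pub type SharedWatermark = Arc<Mutex<Option<i64>>>;

/// Advance the watermark under the lock, never moving it backwards.
pub fn advance(watermark: &SharedWatermark, new_time: i64) {
    let mut current = watermark.lock().unwrap();
    if current.map_or(true, |t| new_time > t) {
        *current = Some(new_time);
    }
}

/// Run one vacuum "process" per timestamp concurrently and return the
/// final watermark once all of them have finished.
pub fn race(times: Vec<i64>) -> Option<i64> {
    let watermark: SharedWatermark = Arc::new(Mutex::new(None));
    let handles: Vec<_> = times
        .into_iter()
        .map(|t| {
            let wm = Arc::clone(&watermark);
            thread::spawn(move || advance(&wm, t))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *watermark.lock().unwrap();
    result
}
```

Two processes proposing the same timestamp leave it unchanged after the first write, and different timestamps converge on the later one, regardless of which thread runs first.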

Test Coverage

  • Unit Tests: Core API behavior and monotonic property validation
  • Integration Tests: End-to-end vacuum-undrop workflows
  • Concurrency Tests: Race condition handling validation
  • Compatibility Tests: Protobuf serialization/deserialization (v152)
  • Error Handling: Failure mode validation and fail-safe behavior
  • Timing Tests: Protection boundary verification

Files Modified

  • Core Logic: garbage_collection_api.rs, schema_api.rs
  • Integration: interpreter_vacuum_drop_tables.rs
  • Data Model: vacuum_watermark.rs, vacuum_retention_ident.rs
  • Serialization: vacuum_retention_from_to_protobuf_impl.rs
  • Protobuf: vacuum_watermark.proto
  • Error Handling: app_error.rs, exception_code.rs
  • Tests: schema_api_test_suite.rs, v152_vacuum_retention.rs

This implementation blocks undrop for any table whose data vacuum may already have removed, at the cost of one KV read per undrop and one watermark update per vacuum run. The tenant-level design keeps the rules simple today and leaves room to add database- or table-level granularity later.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


Add retention guard mechanism to prevent undrop operations after vacuum
has started, ensuring data consistency by blocking restoration of tables
whose data may have been partially or fully cleaned up.

Key changes:
- Add VacuumRetention metadata with monotonic timestamp semantics
- Implement fetch_set_vacuum_timestamp API with correct CAS behavior
- Integrate retention checks in vacuum drop table workflow
- Add retention guard validation in undrop table operations
- Include comprehensive error handling and user-friendly messages
- Add protobuf serialization support with v151 compatibility
- Provide full integration test coverage

Fixes data integrity issue where undrop could succeed on tables with
incomplete S3 data after vacuum cleanup has begun.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) on Sep 29, 2025
dantengsky and others added 21 commits September 30, 2025 00:18
Make error message more user-friendly by:
- Using clear language about why undrop is blocked
- Including vacuum start timestamp for better context
- Removing technical jargon like 'retention guard' and 'precedes'

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Rename VacuumRetention to VacuumWatermark for clarity
- Remove unnecessary fields: updated_by, updated_at, version
- Keep only essential 'time' field for monotonic timestamp tracking
- Update protobuf conversion and tests accordingly
- Maintain API compatibility and retention guard functionality

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Move vacuum timestamp setting from gc_drop_tables to VacuumDropTablesInterpreter::execute2
- Use actual retention settings instead of hardcoded 7 days
- Set timestamp before vacuum operation starts for better timing
- Simplify gc_drop_tables to focus only on metadata cleanup
- Improve separation of concerns between business logic and cleanup operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Only set vacuum timestamp when NOT in dry run mode
- Maintains consistency with existing dry run behavior for metadata cleanup
- Dry run should not modify any state including vacuum watermark
- Preserves read-only nature of dry run operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Returns Option<VacuumWatermark> instead
…alues

- Replace unwrap_or_default() with explicit Option handling in fetch_set_vacuum_timestamp
- Use clear match semantics: None = never set, Some = previous value
- Update get_vacuum_timestamp to return Option<VacuumWatermark> instead of using epoch default
- Fix schema_api retention guard to only check when vacuum timestamp is actually set
- Update tests to handle proper Option semantics for first-time vs subsequent calls
- Remove dependency on artificial epoch time as "unset" indicator
- Improve type safety by letting Option express the "possibly unset" state

This eliminates confusion around artificial default values and makes the
vacuum watermark semantics clearer through the type system.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Adds vacuum timestamp seq checking to undrop transaction conditions to prevent
race conditions where vacuum and undrop operations could execute concurrently,
leading to data inconsistency.

- Read vacuum timestamp with seq before undrop transaction
- Add vacuum timestamp seq check to transaction conditions
- Ensures undrop fails atomically if vacuum timestamp changes during operation
- Existing test coverage in vacuum_retention_timestamp validates concurrent scenario

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The vacuum_retention_timestamp test was failing because it tried to create
a table without first creating the database. Added util.create_db() call
to fix the test execution.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Fixed timing assertion failure by:
- Capturing drop_time after the actual drop operation completes
- Using relative time comparison instead of exact equality
- Ensuring vacuum_time is always after drop_time as expected

The test now properly validates that undrop is blocked when vacuum timestamp
is set after the drop time, without timing precision issues.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Ran `cargo fmt --all` to ensure consistent code formatting across
vacuum retention implementation files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Updated test_decode_v152_vacuum_retention to match the standard proto-conv test format:
- Added complete copyright header
- Added serialized bytes array for backward compatibility testing
- Added test_load_old call to test deserialization from v152 format
- Added proper documentation comments about byte array immutability
- Used correct serialized bytes generated by test framework

Follows the same pattern as other version tests like v150_role_comment.rs
to ensure proper backward compatibility testing.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The get_vacuum_timestamp method was only used in tests and had no other
consumers. Removed it to follow YAGNI principle and simplify the API:

- Removed get_vacuum_timestamp from GarbageCollectionApi trait
- Updated test to use direct KV API call (kv_api.get_pb)
- Added VacuumRetentionIdent import to test file

This aligns with the existing pattern in undrop logic which also uses
direct KV API access for reading vacuum timestamps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Make vacuum timestamp setting a blocking operation that must succeed
before any data cleanup begins. This prevents a critical race condition
where data could be cleaned up without proper undrop protection.

Changes:
- Convert timestamp setting from best-effort to critical operation
- Vacuum operation now fails fast if timestamp cannot be set
- Added detailed error message explaining the safety abort
- Ensures vacuum watermark always precedes any data cleanup

This maintains the core safety guarantee: tables that may have incomplete
data after vacuum can never be restored via undrop.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Improve consistency with existing codebase naming conventions:

1. Rename protobuf message from VacuumRetention to VacuumWatermark
   - Aligns with Rust struct name (VacuumWatermark)
   - Follows existing pattern: Rust struct name = Protobuf message name
   - Examples: DatabaseMeta, CatalogMeta, IndexMeta all use same names

2. Remove unused protobuf fields
   - Removed: updated_by, updated_at, version (all set to empty/0)
   - Kept: ver, min_reader_ver (required for versioning), time (core data)
   - Maintains same serialization bytes (backward compatible)

3. Simplify proto-conv implementation
   - Removed unused field mappings in to_pb()
   - Cleaner, more maintainable conversion code

The protobuf now only contains the essential fields actually used,
following the principle of minimal necessary data representation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Rename protobuf definition file to match the message name inside it.
This improves consistency and makes the codebase easier to navigate:

- File: vacuum_retention.proto → vacuum_watermark.proto
- Message: VacuumWatermark (unchanged)
- Rust struct: VacuumWatermark (unchanged)

Following the common pattern where protobuf files are named after
their primary message type.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Replace extreme epoch timestamp (1970-01-01) with realistic test value:
- Old: timestamp(0, 0) → 1970-01-01 00:00:00 UTC (too extreme)
- New: timestamp(1702603569, 0) → 2023-12-15 01:26:09 UTC (realistic)

This aligns with other protobuf tests in the codebase that use similar
recent timestamps for better test readability and maintainability.
Updated corresponding serialized bytes array to match new timestamp.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

github-actions bot commented Oct 9, 2025

🤖 Smart Auto-retry Analysis

Workflow: 18381430863

📊 Summary

  • Total Jobs: 23
  • Failed Jobs: 3
  • Retryable: 0
  • Code Issues: 3

NO RETRY NEEDED

All failures appear to be code/test issues requiring manual fixes.

🔍 Job Details

  • linux / build: Not retryable (Code/Test)
  • linux / test_unit: Not retryable (Code/Test)
  • linux / check: Not retryable (Code/Test)

🤖 About

Automated analysis using job annotations to distinguish infrastructure issues (auto-retried) from code/test issues (manual fixes needed).
