david-leifker
Collaborator

@david-leifker david-leifker commented Oct 6, 2025

LoadIndices Upgrade - High-Performance Bulk Metadata Loading

Documentation: https://github.com/datahub-project/datahub/blob/45a2fb3e2236540c31568984a92d3ba98fc15a9f/docs/how/load-indices.md

🎯 Feature Overview

This PR introduces LoadIndices, a high-performance upgrade job that bulk loads metadata aspects directly from the database into Elasticsearch/OpenSearch indices. Unlike the traditional restore operation, which prioritizes correctness, LoadIndices is optimized for speed and throughput during initial deployments and large-scale data migrations.

🚀 Core Capabilities

High-Performance Bulk Loading

  • Direct database streaming from metadata_aspect_v2 table to search indices
  • Optimized bulk operations with automatic refresh interval management
  • Bypasses Kafka MCL pipeline for maximum performance during bulk loads
  • Configurable transaction isolation using READ_UNCOMMITTED for faster scanning
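
As a rough illustration of the streaming-plus-bulk pattern listed above, the following Python sketch groups a row stream into fixed-size batches and renders each batch in the newline-delimited format that the Elasticsearch/OpenSearch `_bulk` endpoint accepts. It is not the PR's implementation: the row shape (`urn`, `document`), the use of the URN as the document `_id`, and the index name are assumptions made for the example.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched(rows: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `size` rows without materializing the whole stream."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_bulk_body(rows: List[Dict], index: str) -> str:
    """Render rows as a _bulk request body: one action line, then one source line."""
    lines = []
    for row in rows:
        # Hypothetical mapping: the URN doubles as the document id.
        action = {"index": {"_index": index, "_id": row["urn"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row["document"]))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

In a real loader, `rows` would be a server-side cursor over the aspect table, and each body would be sent as a single bulk request.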

Intelligent Index Management

  • Automatic refresh control - disables refresh intervals during bulk loading for optimal performance
  • Smart index optimization - re-enables refresh intervals after completion
  • Comprehensive progress monitoring with real-time reporting and performance metrics
  • Elasticsearch/OpenSearch compatibility with optimized bulk write operations
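
The refresh handling described above can be sketched as a context manager that sets `index.refresh_interval` to `-1` before the load and restores it afterwards, even if the load fails. The settings payload matches the index settings API of Elasticsearch/OpenSearch; the `put_settings` client method, the `RecordingClient` stand-in, and the `1s` restore value are assumptions for illustration only and do not reflect how the PR wires this up.

```python
from contextlib import contextmanager


class RecordingClient:
    """Hypothetical stand-in for a search client; it only records settings calls."""

    def __init__(self):
        self.calls = []

    def put_settings(self, index, body):
        self.calls.append((index, body))


@contextmanager
def refresh_disabled(client, index, restore_to="1s"):
    """Disable index refresh for the duration of the block, then restore it."""
    client.put_settings(index, {"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        # Runs on success and on failure, so the index is not left unrefreshed.
        client.put_settings(index, {"index": {"refresh_interval": restore_to}})
```

Restoring in a `finally` block matters here: if a bulk load aborts midway and refresh stays disabled, newly indexed documents would not become searchable until someone resets the setting.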

Enterprise Integration

  • Spring Boot integration with conditional configuration based on command-line arguments
  • Production-ready architecture with proper error handling and recovery mechanisms
  • Comprehensive documentation with usage guidelines and performance considerations

⚡ Performance Benefits

  • Significantly faster than traditional restore operations for large datasets
  • Optimized for initial deployments and data center migrations
  • Bulk operation efficiency with minimal overhead and intelligent request distribution
  • Scalable architecture designed for enterprise-scale metadata volumes

📋 Key Use Cases

  • Initial DataHub deployments with large existing metadata
  • Data center migrations requiring bulk index population
  • Disaster recovery scenarios with large-scale data restoration
  • Development environment setup with comprehensive test data

⚠️ Design Philosophy

LoadIndices makes strategic architectural trade-offs to prioritize performance over consistency:

  • Bypasses Kafka MCL event pipeline for direct database-to-index streaming
  • Uses READ_UNCOMMITTED transactions for faster database scanning; on databases that honor this isolation level, the scan may read rows that are not yet committed
  • Optimized for bulk operations rather than real-time consistency

This feature lets organizations populate their DataHub search indices efficiently during initial setup or large-scale migrations, providing a fast path to an operational DataHub with searchable metadata.

@github-actions bot added the docs, product, and devops labels on Oct 6, 2025
@datahub-cyborg bot added the needs-review label on Oct 6, 2025

codecov bot commented Oct 6, 2025

Bundle Report

Bundle size has no change ✅

alwaysmeticulous bot commented Oct 7, 2025

✅ Meticulous spotted 0 visual differences across 971 screens tested.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 7f486f9.

@datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label on Oct 8, 2025
@datahub-cyborg bot added the pending-submitter-merge label and removed the pending-submitter-response label on Oct 9, 2025
Labels: devops, docs, pending-submitter-merge, product
3 participants