Trino support #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open · ooke wants to merge 13 commits into master from trino
Conversation
- Add Trino table definitions to create_tables.sql (all 8 TPC-H tables)
- Add Trino handlers to create_indexes.sql and analyze_tables.sql
- Add get_table_column_types() to the TPC-H workload for type metadata
- Fix TYPE_MISMATCH error in Trino load_data() by converting CSV strings to proper Python types (int, Decimal, date) before INSERT (sketched below)
- Centralize Trino connection defaults in _get_connection_defaults()
- Update config to use the hive catalog with a benchmark schema
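A minimal sketch of the CSV-to-Python conversion behind the load_data() fix, assuming it is keyed off per-column type strings such as those returned by get_table_column_types(); the helper name and type labels are illustrative, not the PR's exact code:

```python
# Hedged sketch: convert CSV string fields to the Python types Trino expects.
from datetime import date
from decimal import Decimal

def convert_row(row, column_types):
    converted = []
    for value, col_type in zip(row, column_types):
        if col_type in ("bigint", "integer"):
            converted.append(int(value))
        elif col_type.startswith("decimal"):
            converted.append(Decimal(value))
        elif col_type == "date":
            converted.append(date.fromisoformat(value))
        else:
            converted.append(value)  # varchar columns pass through unchanged
    return converted
```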
- Replace slow Python row-by-row INSERT with zero-copy Parquet loading
- Generate Parquet files directly into the Hive warehouse via tpchgen-cli
- Use external tables with format='PARQUET' for instant data access (see the sketch below)
- Add Trino support to TPC-H query templates (Q01, Q11, Q13, Q15, Q22)
- Fix trailing semicolon issue in Trino query execution
- Fix type mismatch in Q22 (varchar vs. integer comparison)
- Add markers.py to the package creator for remote deployment
- Pass the hive_warehouse variable to SQL templates
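Registering pre-generated Parquet files as an external Hive table through the Trino client looks roughly like the sketch below; the table columns, warehouse path, and connection parameters are placeholders, and the rstrip reflects the trailing-semicolon fix mentioned above:

```python
# Illustrative only: create an external table over Parquet files that
# tpchgen-cli has already written into the Hive warehouse directory.
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="benchmark",
                           catalog="hive", schema="benchmark")
cur = conn.cursor()

create_sql = """
CREATE TABLE IF NOT EXISTS nation (
    n_nationkey BIGINT,
    n_name      VARCHAR(25),
    n_regionkey BIGINT,
    n_comment   VARCHAR(152)
)
WITH (
    external_location = 'file:///opt/hive/warehouse/nation',
    format = 'PARQUET'
)
"""
# Trino rejects trailing semicolons, so strip them before execution.
cur.execute(create_sql.strip().rstrip(";"))
```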
- Fix _should_minimize to check path parts instead of a substring match (it was skipping files whenever the package path contained 'test'); see the comparison below
- Remove redundant cast(Path, ...) in trino.py
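The difference between the substring check and the path-parts check, shown on an assumed example path:

```python
from pathlib import Path

path = Path("/opt/packages/latest/bench/trino.py")

# Substring check: matches because "latest" contains "test" -> wrong result.
"test" in str(path)    # True

# Path-parts check: only matches an actual "test" path component.
"test" in path.parts   # False
```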
Introduces a storage abstraction module (condensed sketch below) with:
- StorageBackend: abstract base class with a factory method
- LocalStorage: local filesystem storage using file:// URLs
- S3Storage: S3 storage using s3a:// URLs with boto3

The S3Storage implementation includes:
- IAM role-based authentication (default credential chain)
- Auto-creation of the bucket if it doesn't exist
- Multipart upload support for large files
- Dynamic Python dependency declaration (boto3)

This enables Trino (and future systems) to use either local or S3 storage for external table data.
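A condensed sketch of how such an abstraction could be laid out; the method names beyond the class names listed above (from_config, upload) are assumptions, not the PR's exact API:

```python
import os
import shutil
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @classmethod
    def from_config(cls, config: dict) -> "StorageBackend":
        # Factory method: choose a backend from the configured storage kind.
        if config.get("kind") == "s3":
            return S3Storage(bucket=config["bucket"])
        return LocalStorage(root=config.get("root", "/tmp/benchmark-data"))

    @abstractmethod
    def upload(self, local_path: str, table: str) -> str:
        """Store the file and return the URL external tables should point at."""

class LocalStorage(StorageBackend):
    def __init__(self, root: str):
        self.root = root

    def upload(self, local_path: str, table: str) -> str:
        dest = os.path.join(self.root, table)
        os.makedirs(dest, exist_ok=True)
        shutil.copy(local_path, dest)
        return f"file://{dest}"

class S3Storage(StorageBackend):
    def __init__(self, bucket: str):
        import boto3  # declared as a dynamic dependency by the storage module
        self.bucket = bucket
        self.client = boto3.client("s3")  # default credential chain / IAM role

    def upload(self, local_path: str, table: str) -> str:
        key = f"{table}/{os.path.basename(local_path)}"
        # upload_file uses multipart uploads for large files automatically.
        self.client.upload_file(local_path, self.bucket, key)
        return f"s3a://{self.bucket}/{table}"
```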
Systems and workloads can now declare external table support via a SUPPORTS_EXTERNAL_TABLES flag. This enables capability-based branching instead of system-kind checks (see the sketch below).

SystemUnderTest additions:
- SUPPORTS_EXTERNAL_TABLES class attribute
- get_storage_backend() method for storage access

Workload additions:
- SUPPORTS_EXTERNAL_TABLES class attribute
- get_external_table_format() for the data format (default: PARQUET)
- get_external_table_columns() for column definitions
- get_table_names() for table enumeration
- generate_data_for_external_tables() for data generation
- get_data_locations() for storage URL mapping

execute_setup_script() now passes data_locations to SQL templates when the system supports external tables.
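The capability check could look roughly like this; only the attribute and method names listed above come from the PR, while the surrounding plumbing (template object, data_locations shape) is assumed:

```python
def execute_setup_script(system, workload, template):
    context = {}
    if getattr(system, "SUPPORTS_EXTERNAL_TABLES", False) and \
       getattr(workload, "SUPPORTS_EXTERNAL_TABLES", False):
        # Capability-based branching: no check against the concrete system kind.
        context["data_locations"] = workload.get_data_locations()
    sql = template.render(**context)  # assumed Jinja-style template object
    system.execute(sql)
```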
TPC-H workload now supports external tables for systems like Trino:
- SUPPORTS_EXTERNAL_TABLES = True
- Implements get_external_table_columns() using existing type mappings
- Implements generate_data_for_external_tables() with Parquet generation
- Refactored prepare() to use capability-based branching
For external table systems:
1. Generate Parquet files to temp directory
2. Upload to storage backend (local or S3)
3. Create external tables pointing to data location
4. No load_data() needed - data accessed directly
SQL template updated to use dynamic {{ data_locations.table }}
variables instead of hardcoded paths, supporting both file://
and s3a:// URLs transparently.
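Putting the four steps together, the external-table path of prepare() might look like the sketch below; every helper not named in the commit messages is a placeholder, and the template rendering assumes Jinja-style {{ data_locations.<table> }} substitution:

```python
import tempfile

def prepare(self, system):
    if self.SUPPORTS_EXTERNAL_TABLES and system.SUPPORTS_EXTERNAL_TABLES:
        storage = system.get_storage_backend()
        with tempfile.TemporaryDirectory() as tmp:
            # 1. Generate Parquet files into a temp directory.
            files = self.generate_data_for_external_tables(tmp)
            # 2. Upload to the storage backend (file:// or s3a:// URLs).
            locations = {table: storage.upload(path, table)
                         for table, path in files.items()}
        # 3. Create external tables pointing at the uploaded data; the SQL
        #    template reads the locations via {{ data_locations.<table> }}.
        system.execute_setup_script("create_tables.sql",
                                    data_locations=locations)
        # 4. No load_data() step: Trino reads the Parquet files in place.
    else:
        system.execute_setup_script("create_tables.sql")
        self.load_data(system)
```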
Trino now supports S3 storage for external tables:
- SUPPORTS_EXTERNAL_TABLES = True
- Storage backend created from config (local or S3)
- Multinode clusters require S3 storage (validated at init)

Key changes:
- _create_storage_backend() creates LocalStorage or S3Storage
- _validate_multinode_storage() enforces S3 for node_count > 1 (sketched below)
- get_storage_backend() returns the configured storage
- Removed deprecated hive.allow-drop-table properties

The validation provides a clear error message when multinode is configured without S3, suggesting the required configuration.
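A minimal sketch of what that validation amounts to; the attribute names and error wording are illustrative:

```python
def _validate_multinode_storage(self):
    if self.node_count > 1 and not isinstance(self.storage, S3Storage):
        raise ValueError(
            "Trino clusters with node_count > 1 require S3 storage for "
            "external tables; configure an S3 bucket for this system."
        )
```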
Package creator now:
- Copies the storage module when Trino is configured
- Adds Trino to the kind_to_class mapping for proper initialization

The new _copy_storage_module_if_needed() method copies all storage module files when any configured system needs storage backends (currently Trino).
Configure S3 storage for the multinode Trino cluster, as required by the new storage validation.
- Add an s3_buckets variable to Terraform for IAM role provisioning
- Create the IAM role, policy, and instance profile when S3 buckets are configured
- Grant S3 permissions: Get, Put, Delete, List, HeadObject, CreateBucket
- Attach the instance profile to EC2 instances automatically
- Add _collect_s3_buckets() to the infrastructure manager to detect S3 storage (sketched below)
- Include storage-specific dependencies (boto3) in package requirements
- Pass the S3 bucket list to Terraform as a variable
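One plausible shape of the bucket-collection step; how exactly the system configs expose their storage settings is an assumption:

```python
def _collect_s3_buckets(systems_config: list[dict]) -> list[str]:
    # Walk every configured system and pick up S3 storage buckets so the
    # Terraform s3_buckets variable can drive IAM role provisioning.
    buckets = set()
    for system in systems_config:
        storage = system.get("storage", {})
        if storage.get("kind") == "s3" and storage.get("bucket"):
            buckets.add(storage["bucket"])
    return sorted(buckets)
```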
- Include the storage config in extract_workload_connection_info for remote packages
- Remove the Hive Metastore service dependency (the file-based metastore works standalone)
- Simplify multinode setup by using the file-based metastore on all nodes
- Remove the coordinator-specific Hive Metastore installation
- Set has_local_metastore=False for the systemd service (no external dependency)

This enables Trino multinode clusters to work with S3 storage without requiring a running Hive Metastore service.