Conversation

@ooke (Member) commented Jan 5, 2026

No description provided.

ooke added 4 commits January 6, 2026 17:11
- Add Trino table definitions to create_tables.sql (all 8 TPC-H tables)
- Add Trino handlers to create_indexes.sql and analyze_tables.sql
- Add get_table_column_types() to TPC-H workload for type metadata
- Fix TYPE_MISMATCH error in Trino load_data() by converting CSV
  strings to proper Python types (int, Decimal, date) before INSERT
  (see the sketch after this list)
- Centralize Trino connection defaults in _get_connection_defaults()
- Update config to use hive catalog with benchmark schema
- Replace slow Python row-by-row INSERT with zero-copy Parquet loading
- Generate Parquet files directly to Hive warehouse via tpchgen-cli
- Use external tables with format='PARQUET' for instant data access
- Add Trino support to TPC-H query templates (Q01, Q11, Q13, Q15, Q22)
- Fix trailing semicolon issue in Trino query execution
- Fix type mismatch in Q22 (varchar vs integer comparison)
- Add markers.py to package creator for remote deployment
- Pass hive_warehouse variable to SQL templates
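
The TYPE_MISMATCH fix above amounts to coercing CSV strings before handing rows to the Trino client. Below is a minimal sketch, assuming column type strings come from get_table_column_types(); the actual conversion in load_data() may cover more types, and this row-by-row path was later superseded by the Parquet loading described above.

```python
from datetime import date
from decimal import Decimal


def convert_row(row: list[str], column_types: list[str]) -> list:
    """Convert raw CSV strings into the Python types Trino expects on INSERT."""
    converted = []
    for value, col_type in zip(row, column_types):
        col_type = col_type.lower()
        if col_type in ("bigint", "integer"):
            converted.append(int(value))
        elif col_type.startswith("decimal"):
            converted.append(Decimal(value))
        elif col_type == "date":
            converted.append(date.fromisoformat(value))
        else:
            # varchar/char columns pass through unchanged
            converted.append(value)
    return converted
```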
ooke added 9 commits January 6, 2026 18:15
- Fix _should_minimize to check path parts instead of substring
  (was skipping files when package path contained 'test')
- Remove redundant cast(Path, ...) in trino.py
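
A minimal sketch of the path-parts check, assuming a hypothetical skip list; the real _should_minimize logic and return convention may differ.

```python
from pathlib import Path

SKIP_PARTS = {"test", "tests"}  # hypothetical skip list


def _should_minimize(path: Path) -> bool:
    # A substring check ("test" in str(path)) also matched package paths that
    # merely contain "test" somewhere in a directory name; comparing whole
    # path components only skips files that actually live under a test directory.
    return not any(part in SKIP_PARTS for part in path.parts)
```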
Introduces a storage abstraction module with:
- StorageBackend: Abstract base class with factory method
- LocalStorage: Local filesystem storage using file:// URLs
- S3Storage: S3 storage using s3a:// URLs with boto3

The S3Storage implementation includes:
- IAM role-based authentication (default credential chain)
- Auto-create bucket if it doesn't exist
- Multipart upload support for large files
- Dynamic Python dependency declaration (boto3)

This enables Trino (and future systems) to use either local
or S3 storage for external table data.
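
A condensed sketch of that abstraction, assuming a create() factory name and a single upload() method; the real interface likely exposes more operations.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    @classmethod
    def create(cls, config: dict) -> "StorageBackend":
        """Factory: pick a backend based on configuration (assumed keys)."""
        if config.get("type") == "s3":
            return S3Storage(config["bucket"])
        return LocalStorage(Path(config.get("path", "/tmp/storage")))

    @abstractmethod
    def upload(self, local_file: Path, key: str) -> str:
        """Upload a file and return its URL (file:// or s3a://)."""


class LocalStorage(StorageBackend):
    def __init__(self, root: Path) -> None:
        self.root = root

    def upload(self, local_file: Path, key: str) -> str:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_file, target)
        return f"file://{target}"


class S3Storage(StorageBackend):
    def __init__(self, bucket: str) -> None:
        import boto3  # relies on the default credential chain (IAM role)

        self.bucket = bucket
        self.client = boto3.client("s3")

    def upload(self, local_file: Path, key: str) -> str:
        # upload_file handles multipart uploads for large files automatically
        self.client.upload_file(str(local_file), self.bucket, key)
        return f"s3a://{self.bucket}/{key}"
```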
Systems and workloads can now declare external table support via a
SUPPORTS_EXTERNAL_TABLES flag. This enables capability-based
branching instead of system-kind checks.

SystemUnderTest additions:
- SUPPORTS_EXTERNAL_TABLES class attribute
- get_storage_backend() method for storage access

Workload additions:
- SUPPORTS_EXTERNAL_TABLES class attribute
- get_external_table_format() for data format (default: PARQUET)
- get_external_table_columns() for column definitions
- get_table_names() for table enumeration
- generate_data_for_external_tables() for data generation
- get_data_locations() for storage URL mapping

execute_setup_script() now passes data_locations to SQL templates
when the system supports external tables.
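
A sketch of how execute_setup_script() might thread the capability into template rendering; the helper names (template_vars, run_sql_template) and the get_data_locations() call shape are assumptions.

```python
def execute_setup_script(system, workload, script: str, template_vars: dict) -> None:
    if system.SUPPORTS_EXTERNAL_TABLES and workload.SUPPORTS_EXTERNAL_TABLES:
        # Map each table name to its storage URL (file:// or s3a://) so the
        # SQL template can point external tables at the generated data.
        template_vars["data_locations"] = workload.get_data_locations()
    system.run_sql_template(script, template_vars)  # hypothetical rendering helper
```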
TPC-H workload now supports external tables for systems like Trino:
- SUPPORTS_EXTERNAL_TABLES = True
- Implements get_external_table_columns() using existing type mappings
- Implements generate_data_for_external_tables() with Parquet generation
- Refactored prepare() to use capability-based branching

For external table systems:
1. Generate Parquet files to temp directory
2. Upload to storage backend (local or S3)
3. Create external tables pointing to data location
4. No load_data() needed - data accessed directly

SQL template updated to use dynamic {{ data_locations.table }}
variables instead of hardcoded paths, supporting both file://
and s3a:// URLs transparently.
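
For illustration, rendering such a template with Jinja could look like the sketch below; the column list is truncated, the templating engine is assumed to be Jinja-style, and the bucket name is made up.

```python
from jinja2 import Template

DDL = Template("""
CREATE TABLE IF NOT EXISTS lineitem (
    l_orderkey BIGINT,
    l_quantity DECIMAL(15, 2)
    -- remaining TPC-H columns omitted
)
WITH (
    format = 'PARQUET',
    external_location = '{{ data_locations.lineitem }}'
)
""")

# The same template works for local and S3 storage:
print(DDL.render(data_locations={"lineitem": "s3a://example-bucket/tpch/lineitem"}))
print(DDL.render(data_locations={"lineitem": "file:///var/lib/hive/warehouse/lineitem"}))
```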
Trino now supports S3 storage for external tables:
- SUPPORTS_EXTERNAL_TABLES = True
- Storage backend created from config (local or S3)
- Multinode clusters require S3 storage (validated at init)

Key changes:
- _create_storage_backend() creates LocalStorage or S3Storage
- _validate_multinode_storage() enforces S3 for node_count > 1
- get_storage_backend() returns configured storage
- Removed deprecated hive.allow-drop-table properties

Validation provides a clear error message when a multinode cluster is
configured without S3, suggesting the required configuration.
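
A sketch of that validation, with assumed config keys and error wording.

```python
def _validate_multinode_storage(node_count: int, storage_config: dict) -> None:
    """Reject multinode Trino configurations that are not backed by S3."""
    if node_count > 1 and storage_config.get("type") != "s3":
        raise ValueError(
            "Trino clusters with node_count > 1 require S3 storage for "
            "external tables; configure storage type 's3' with a bucket, "
            "or run a single node."
        )
```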
Package creator now:
- Copies storage module when Trino is configured
- Adds Trino to kind_to_class mapping for proper initialization

New _copy_storage_module_if_needed() method copies all storage
module files when any configured system needs storage backends
(currently Trino).
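
A sketch of the copy step, assuming the storage module lives next to the package creator; the source layout and the configured-system check are illustrative.

```python
import shutil
from pathlib import Path


def _copy_storage_module_if_needed(configured_kinds: set[str], package_dir: Path) -> None:
    """Copy the storage backends into the remote package when a system needs them."""
    if "trino" not in configured_kinds:  # currently only Trino uses storage backends
        return
    source = Path(__file__).parent / "storage"  # assumed module location
    shutil.copytree(source, package_dir / "storage", dirs_exist_ok=True)
```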
Configure S3 storage for the multinode Trino cluster, as required by
the new storage validation.

- Add s3_buckets variable to Terraform for IAM role provisioning
- Create IAM role, policy, and instance profile when S3 buckets configured
- Grant S3 permissions: Get, Put, Delete, List, HeadObject, CreateBucket
- Attach instance profile to EC2 instances automatically
- Add _collect_s3_buckets() to infrastructure manager to detect S3 storage (sketched below)
- Include storage-specific dependencies (boto3) in package requirements
- Pass S3 bucket list to Terraform as variable
- Include storage config in extract_workload_connection_info for remote packages
- Remove Hive Metastore service dependency (file-based metastore works standalone)
- Simplify multinode setup by using file-based metastore on all nodes
- Remove coordinator-specific Hive Metastore installation
- Set has_local_metastore=False for systemd service (no external dependency)

This enables Trino multinode clusters to work with S3 storage without
requiring a running Hive Metastore service.
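
A sketch of _collect_s3_buckets(), assuming a per-system storage config with type and bucket keys; the result feeds the Terraform s3_buckets variable.

```python
def _collect_s3_buckets(systems_config: list[dict]) -> list[str]:
    """Gather every bucket referenced by an S3 storage config so Terraform
    can provision the IAM role, policy, and instance profile for it."""
    buckets = {
        system["storage"]["bucket"]
        for system in systems_config
        if system.get("storage", {}).get("type") == "s3"
        and system["storage"].get("bucket")
    }
    return sorted(buckets)
```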
@ooke marked this pull request as ready for review January 9, 2026 08:55