Conversation

@ooke (Member) commented Jan 5, 2026

No description provided.

ooke added 4 commits January 6, 2026 17:11
- Add Trino table definitions to create_tables.sql (all 8 TPC-H tables)
- Add Trino handlers to create_indexes.sql and analyze_tables.sql
- Add get_table_column_types() to TPC-H workload for type metadata
- Fix TYPE_MISMATCH error in Trino load_data() by converting CSV
  strings to proper Python types (int, Decimal, date) before INSERT
  (see the sketch after this list)
- Centralize Trino connection defaults in _get_connection_defaults()
- Update config to use hive catalog with benchmark schema
- Replace slow Python row-by-row INSERT with zero-copy Parquet loading
- Generate Parquet files directly to Hive warehouse via tpchgen-cli
- Use external tables with format='PARQUET' for instant data access
- Add Trino support to TPC-H query templates (Q01, Q11, Q13, Q15, Q22)
- Fix trailing semicolon issue in Trino query execution
- Fix type mismatch in Q22 (varchar vs integer comparison)
- Add markers.py to package creator for remote deployment
- Pass hive_warehouse variable to SQL templates
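
The TYPE_MISMATCH fix above amounts to coercing CSV strings before handing rows to the Trino client. Below is a minimal sketch, assuming column type strings come from get_table_column_types(); the actual conversion in load_data() may cover more types, and this row-by-row path was later superseded by the Parquet loading described above.

```python
from datetime import date
from decimal import Decimal


def convert_row(row: list[str], column_types: list[str]) -> list:
    """Convert raw CSV strings into the Python types Trino expects on INSERT."""
    converted = []
    for value, col_type in zip(row, column_types):
        col_type = col_type.lower()
        if col_type in ("bigint", "integer"):
            converted.append(int(value))
        elif col_type.startswith("decimal"):
            converted.append(Decimal(value))
        elif col_type == "date":
            converted.append(date.fromisoformat(value))
        else:
            # varchar/char columns pass through unchanged
            converted.append(value)
    return converted
```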
ooke added 9 commits January 6, 2026 18:15
- Fix _should_minimize to check path parts instead of substring
  (was skipping files when package path contained 'test')
- Remove redundant cast(Path, ...) in trino.py
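
A minimal sketch of the path-parts check, assuming a hypothetical skip list; the real _should_minimize logic and return convention may differ.

```python
from pathlib import Path

SKIP_PARTS = {"test", "tests"}  # hypothetical skip list


def _should_minimize(path: Path) -> bool:
    # A substring check ("test" in str(path)) also matched package paths that
    # merely contain "test" somewhere in a directory name; comparing whole
    # path components only skips files that actually live under a test directory.
    return not any(part in SKIP_PARTS for part in path.parts)
```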
Introduces a storage abstraction module with:
- StorageBackend: Abstract base class with factory method
- LocalStorage: Local filesystem storage using file:// URLs
- S3Storage: S3 storage using s3a:// URLs with boto3

The S3Storage implementation includes:
- IAM role-based authentication (default credential chain)
- Auto-create bucket if it doesn't exist
- Multipart upload support for large files
- Dynamic Python dependency declaration (boto3)

This enables Trino (and future systems) to use either local
or S3 storage for external table data.
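
A condensed sketch of that abstraction, assuming a create() factory name and a single upload() method; the real interface likely exposes more operations.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    @classmethod
    def create(cls, config: dict) -> "StorageBackend":
        """Factory: pick a backend based on configuration (assumed keys)."""
        if config.get("type") == "s3":
            return S3Storage(config["bucket"])
        return LocalStorage(Path(config.get("path", "/tmp/storage")))

    @abstractmethod
    def upload(self, local_file: Path, key: str) -> str:
        """Upload a file and return its URL (file:// or s3a://)."""


class LocalStorage(StorageBackend):
    def __init__(self, root: Path) -> None:
        self.root = root

    def upload(self, local_file: Path, key: str) -> str:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_file, target)
        return f"file://{target}"


class S3Storage(StorageBackend):
    def __init__(self, bucket: str) -> None:
        import boto3  # relies on the default credential chain (IAM role)

        self.bucket = bucket
        self.client = boto3.client("s3")

    def upload(self, local_file: Path, key: str) -> str:
        # upload_file handles multipart uploads for large files automatically
        self.client.upload_file(str(local_file), self.bucket, key)
        return f"s3a://{self.bucket}/{key}"
```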
Systems and workloads can now declare external table support via a
SUPPORTS_EXTERNAL_TABLES flag. This enables capability-based
branching instead of system-kind checks.

SystemUnderTest additions:
- SUPPORTS_EXTERNAL_TABLES class attribute
- get_storage_backend() method for storage access

Workload additions:
- SUPPORTS_EXTERNAL_TABLES class attribute
- get_external_table_format() for data format (default: PARQUET)
- get_external_table_columns() for column definitions
- get_table_names() for table enumeration
- generate_data_for_external_tables() for data generation
- get_data_locations() for storage URL mapping

execute_setup_script() now passes data_locations to SQL templates
when the system supports external tables.
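
A sketch of how execute_setup_script() might thread the capability into template rendering; the helper names (template_vars, run_sql_template) and the get_data_locations() call shape are assumptions.

```python
def execute_setup_script(system, workload, script: str, template_vars: dict) -> None:
    if system.SUPPORTS_EXTERNAL_TABLES and workload.SUPPORTS_EXTERNAL_TABLES:
        # Map each table name to its storage URL (file:// or s3a://) so the
        # SQL template can point external tables at the generated data.
        template_vars["data_locations"] = workload.get_data_locations()
    system.run_sql_template(script, template_vars)  # hypothetical rendering helper
```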
TPC-H workload now supports external tables for systems like Trino:
- SUPPORTS_EXTERNAL_TABLES = True
- Implements get_external_table_columns() using existing type mappings
- Implements generate_data_for_external_tables() with Parquet generation
- Refactored prepare() to use capability-based branching

For external table systems:
1. Generate Parquet files to temp directory
2. Upload to storage backend (local or S3)
3. Create external tables pointing to data location
4. No load_data() needed - data accessed directly

SQL template updated to use dynamic {{ data_locations.table }}
variables instead of hardcoded paths, supporting both file://
and s3a:// URLs transparently.
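
For illustration, rendering such a template with Jinja could look like the sketch below; the column list is truncated, the templating engine is assumed to be Jinja-style, and the bucket name is made up.

```python
from jinja2 import Template

DDL = Template("""
CREATE TABLE IF NOT EXISTS lineitem (
    l_orderkey BIGINT,
    l_quantity DECIMAL(15, 2)
    -- remaining TPC-H columns omitted
)
WITH (
    format = 'PARQUET',
    external_location = '{{ data_locations.lineitem }}'
)
""")

# The same template works for local and S3 storage:
print(DDL.render(data_locations={"lineitem": "s3a://example-bucket/tpch/lineitem"}))
print(DDL.render(data_locations={"lineitem": "file:///var/lib/hive/warehouse/lineitem"}))
```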
Trino now supports S3 storage for external tables:
- SUPPORTS_EXTERNAL_TABLES = True
- Storage backend created from config (local or S3)
- Multinode clusters require S3 storage (validated at init)

Key changes:
- _create_storage_backend() creates LocalStorage or S3Storage
- _validate_multinode_storage() enforces S3 for node_count > 1
- get_storage_backend() returns configured storage
- Removed deprecated hive.allow-drop-table properties

Validation provides a clear error message when a multinode cluster is
configured without S3, suggesting the required configuration.
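
A sketch of that validation, with assumed config keys and error wording.

```python
def _validate_multinode_storage(node_count: int, storage_config: dict) -> None:
    """Reject multinode Trino configurations that are not backed by S3."""
    if node_count > 1 and storage_config.get("type") != "s3":
        raise ValueError(
            "Trino clusters with node_count > 1 require S3 storage for "
            "external tables; configure storage type 's3' with a bucket, "
            "or run a single node."
        )
```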
Package creator now:
- Copies storage module when Trino is configured
- Adds Trino to kind_to_class mapping for proper initialization

New _copy_storage_module_if_needed() method copies all storage
module files when any configured system needs storage backends
(currently Trino).
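
A sketch of the copy step, assuming the storage module lives next to the package creator; the source layout and the configured-system check are illustrative.

```python
import shutil
from pathlib import Path


def _copy_storage_module_if_needed(configured_kinds: set[str], package_dir: Path) -> None:
    """Copy the storage backends into the remote package when a system needs them."""
    if "trino" not in configured_kinds:  # currently only Trino uses storage backends
        return
    source = Path(__file__).parent / "storage"  # assumed module location
    shutil.copytree(source, package_dir / "storage", dirs_exist_ok=True)
```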
Configure S3 storage for the multinode Trino cluster, as required by
the new storage validation.

- Add s3_buckets variable to Terraform for IAM role provisioning
- Create IAM role, policy, and instance profile when S3 buckets configured
- Grant S3 permissions: Get, Put, Delete, List, HeadObject, CreateBucket
- Attach instance profile to EC2 instances automatically
- Add _collect_s3_buckets() to infrastructure manager to detect S3 storage (sketched below)
- Include storage-specific dependencies (boto3) in package requirements
- Pass S3 bucket list to Terraform as variable
- Include storage config in extract_workload_connection_info for remote packages
- Remove Hive Metastore service dependency (file-based metastore works standalone)
- Simplify multinode setup by using file-based metastore on all nodes
- Remove coordinator-specific Hive Metastore installation
- Set has_local_metastore=False for systemd service (no external dependency)

This enables Trino multinode clusters to work with S3 storage without
requiring a running Hive Metastore service.
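
A sketch of _collect_s3_buckets(), assuming a per-system storage config with type and bucket keys; the result feeds the Terraform s3_buckets variable.

```python
def _collect_s3_buckets(systems_config: list[dict]) -> list[str]:
    """Gather every bucket referenced by an S3 storage config so Terraform
    can provision the IAM role, policy, and instance profile for it."""
    buckets = {
        system["storage"]["bucket"]
        for system in systems_config
        if system.get("storage", {}).get("type") == "s3"
        and system["storage"].get("bucket")
    }
    return sorted(buckets)
```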
@ooke marked this pull request as ready for review January 9, 2026 08:55