Database Benchmark: Multi-Engine Analytics with Apache Iceberg


A comprehensive benchmarking environment for cybersecurity analytics, featuring native-format performance baselines, Splunk DB Connect overhead testing, and an Apache Iceberg multi-engine architecture. Designed for Apple Silicon (M3/M3 Pro/M3 Max), with AWS deployment options for fair production comparisons.


🎯 Purpose

This project benchmarks three key architectural patterns for cybersecurity log analysis:

1. Native Performance Baseline

Compare raw query performance across database engines on their native formats:

  • PostgreSQL 16 - Traditional relational (ARM64 native)
  • ClickHouse 24.1 - Columnar OLAP with NEON SIMD (ARM64 native)
  • StarRocks 3.2 - MPP analytics (Rosetta 2)

2. Splunk DB Connect Overhead

Measure performance impact when querying databases via Splunk's dbxquery command:

  • Direct query: Database → Application (baseline)
  • Proxy query: Database → Splunk → Application (measure overhead)
  • Expected overhead: 100-200ms added latency

3. Apache Iceberg Multi-Engine

Test the open lakehouse pattern, in which multiple engines query shared data:

  • Trino - Create and manage Iceberg tables (full read/write)
  • ClickHouse - Query Iceberg tables (read-only, ~20x slower than native)
  • StarRocks - Query and modify Iceberg tables (read/write, ~4x slower)
  • Trade-off: Flexibility vs Performance

🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          Query Engines                              │
├────────────┬────────────┬────────────┬────────────┬─────────────────┤
│ PostgreSQL │ ClickHouse │ StarRocks  │   Trino    │     Splunk      │
│  (Native)  │  (Native)  │  (Native)  │ (Iceberg)  │  (DB Connect)   │
└──────┬─────┴─────┬──────┴──────┬─────┴──────┬─────┴─────────┬───────┘
       │           │             │            │               │
       │           └─────────────┼────────────┘               │
       │                         │                            │
       │           ┌─────────────▼──────────────┐             │
       │           │   Hive Metastore Catalog   │             │
       │           │ (Iceberg metadata: schemas,│             │
       │           │  snapshots, partitions)    │             │
       │           └─────────────┬──────────────┘             │
       │                         │                            │
       │           ┌─────────────▼──────────────┐             │
       │           │  MinIO (S3-compatible)     │             │
       │           │  /warehouse/cybersecurity/ │             │
       │           │    - Parquet data files    │             │
       │           │    - Avro metadata files   │             │
       │           └────────────────────────────┘             │
       │                                                      │
       └──────────────────────────────────────────────────────┘
                   Native Table Storage + Splunk DB Connect

Key Components

| Component | Role | Port | Architecture |
|---|---|---|---|
| PostgreSQL | Relational database | 5432 | ARM64 native |
| ClickHouse | Columnar OLAP + Iceberg engine | 8123 | ARM64 native |
| StarRocks | MPP analytics + Iceberg catalog | 9030 | Rosetta 2 |
| Splunk | SIEM + DB Connect proxy | 8000 | Rosetta 2 |
| Trino | Federated SQL + Iceberg coordinator | 8080 | ARM64 native |
| MinIO | S3-compatible object storage | 9000/9001 | ARM64 native |
| Hive Metastore | Iceberg catalog backend | 9083 | Rosetta 2 |
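
For a quick liveness check of these components, the sketch below simply opens a TCP connection to each published port (ports are taken from the table above; this only confirms the port is reachable on localhost, not that the service has finished initializing):

# check_ports.py - minimal TCP reachability probe for the benchmark services.
# Ports come from the Key Components table; hostnames assume the default
# docker-compose port mappings to localhost.
import socket

SERVICES = {
    "PostgreSQL": 5432,
    "ClickHouse HTTP": 8123,
    "StarRocks MySQL": 9030,
    "Splunk Web": 8000,
    "Trino": 8080,
    "MinIO": 9000,
    "Hive Metastore": 9083,
}

for name, port in SERVICES.items():
    try:
        # Attempt a TCP connection to the published port on localhost.
        with socket.create_connection(("localhost", port), timeout=3):
            print(f"{name}: port {port} reachable")
    except OSError as exc:
        print(f"{name}: port {port} NOT reachable ({exc})")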

🚀 Quick Start

Prerequisites

  • Hardware: MacBook Pro M3/M3 Pro/M3 Max
  • Memory: 24GB recommended (18GB minimum)
  • Storage: 150GB free space
  • macOS: 14.0 Sonoma or later
  • Docker Desktop: 24.0+ with Rosetta 2 enabled
  • Python: 3.10+ with pip

Installation

# Clone the repository
git clone https://github.com/yourusername/splunk-db-connect-benchmark.git
cd splunk-db-connect-benchmark

# Copy environment template
cp .env.example .env

# Start all services
docker-compose -f docker-compose.m3.yml up -d

# Wait for all services to be healthy (5-10 minutes)
docker-compose -f docker-compose.m3.yml ps

Initialize Data and Iceberg Tables

# Step 1: Load data into native databases (PostgreSQL, ClickHouse, StarRocks)
bash scripts/phase4_load_data.sh

# Step 2: Initialize MinIO and create Iceberg tables
bash scripts/setup_iceberg.sh

# Step 3: Configure ClickHouse Iceberg engine
bash scripts/configure_clickhouse_iceberg.sh

# Step 4: Configure StarRocks Iceberg catalog
bash scripts/configure_starrocks_iceberg.sh

# Step 5: Set up Splunk DB Connect (requires manual DB Connect app installation)
bash scripts/setup_splunk_dbconnect.sh
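
Once these scripts finish, a quick sanity check is to compare row counts across engines. Below is a minimal sketch using the same Python drivers installed for the benchmarks (database, table, and credential names are taken from the examples later in this README and may differ in your setup):

# verify_load.py - sketch: confirm every engine sees the loaded data.
# Assumes the default credentials and names listed under "Access Services".
import psycopg2
import pymysql
import clickhouse_connect

counts = {}

# PostgreSQL (native)
pg = psycopg2.connect(host="localhost", port=5432, user="postgres",
                      password="postgres123", dbname="cybersecurity")
with pg.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM security_logs")
    counts["postgresql"] = cur.fetchone()[0]

# ClickHouse: native MergeTree table and the Iceberg-backed table
ch = clickhouse_connect.get_client(host="localhost", port=8123)
counts["clickhouse_native"] = ch.query(
    "SELECT COUNT() FROM cybersecurity.security_logs").result_rows[0][0]
counts["clickhouse_iceberg"] = ch.query(
    "SELECT COUNT() FROM iceberg_db.security_logs").result_rows[0][0]

# StarRocks (MySQL protocol on the frontend)
sr = pymysql.connect(host="localhost", port=9030, user="root", database="cybersecurity")
with sr.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM security_logs")
    counts["starrocks"] = cur.fetchone()[0]

print(counts)  # all engines should report the same row count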

Run Benchmarks

# Install Python dependencies
pip3 install psycopg2-binary clickhouse-connect pymysql

# Run all 3 benchmarks
cd benchmarks
./run_all.sh

# Or run individually:
python3 01_native_baseline.py          # Native performance baseline
python3 02_splunk_dbxquery_overhead.py # Splunk proxy overhead
python3 03_iceberg_multi_engine.py     # Iceberg multi-engine performance

# View results
ls -lh results/

📊 Benchmark Suite

Test 1: Native Performance Baseline

Purpose: Establish baseline query performance on native database formats

Queries:

  • Count all records
  • Aggregate by event type (GROUP BY + ORDER BY)
  • Filter failed login events (WHERE + HAVING)
  • Time range aggregation (last 7 days)
  • Top data transfer events (ORDER BY + LIMIT)

Expected Results:

| Database | Simple Query | Complex Aggregation |
|---|---|---|
| ClickHouse | 10-20 ms | 30-50 ms |
| PostgreSQL | 50-100 ms | 150-300 ms |
| StarRocks | 30-50 ms | 70-120 ms |

Script: benchmarks/01_native_baseline.py
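
The script follows the pattern sketched below: run each query against each engine's native table and record wall-clock latency (a simplified illustration, not the script itself; hosts, ports, and credentials are taken from the Access Services table later in this README):

# Sketch of the native-baseline measurement (simplified, not the full script).
import time
import psycopg2                 # PostgreSQL driver
import clickhouse_connect       # ClickHouse HTTP client
import pymysql                  # MySQL-protocol client (StarRocks FE)

QUERY = "SELECT event_type, COUNT(*) FROM security_logs GROUP BY event_type"

def timed(run):
    """Execute a callable and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    run()
    return (time.perf_counter() - start) * 1000

# PostgreSQL (native)
pg = psycopg2.connect(host="localhost", port=5432, user="postgres",
                      password="postgres123", dbname="cybersecurity")
with pg.cursor() as cur:
    print(f"PostgreSQL: {timed(lambda: cur.execute(QUERY)):.1f} ms")

# ClickHouse (native MergeTree)
ch = clickhouse_connect.get_client(host="localhost", port=8123, database="cybersecurity")
print(f"ClickHouse: {timed(lambda: ch.query(QUERY)):.1f} ms")

# StarRocks (MySQL protocol on the frontend)
sr = pymysql.connect(host="localhost", port=9030, user="root", database="cybersecurity")
with sr.cursor() as cur:
    print(f"StarRocks:  {timed(lambda: cur.execute(QUERY)):.1f} ms")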


Test 2: Splunk DB Connect Overhead

Purpose: Measure latency added by Splunk's dbxquery proxy layer

Method:

  1. Query database directly (baseline)
  2. Query same database via Splunk dbxquery
  3. Calculate overhead: splunk_latency - direct_latency

Expected Overhead:

| Database | Direct Query | Via dbxquery | Overhead |
|---|---|---|---|
| PostgreSQL | 50-100 ms | 150-300 ms | +100-200 ms |
| ClickHouse | 10-20 ms | 110-220 ms | +100-200 ms |
| StarRocks | 30-50 ms | 130-250 ms | +100-200 ms |

Key Finding: Splunk dbxquery adds a consistent 100-200 ms of overhead regardless of the underlying database's speed.

Script: benchmarks/02_splunk_dbxquery_overhead.py

Documentation: docs/SPLUNK_DBXQUERY_LIMITATIONS.md
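
A minimal sketch of the overhead calculation (illustrative only, not the benchmark script; it assumes the requests library, the Splunk management API at https://localhost:8089 with the default admin / changeme credentials, and a DB Connect connection named postgresql_conn as in the query examples below):

# Sketch: overhead = (latency via Splunk dbxquery) - (direct latency).
import time
import psycopg2
import requests

QUERY = "SELECT event_type, COUNT(*) AS count FROM security_logs GROUP BY event_type"

# 1. Direct query against PostgreSQL (baseline)
conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="postgres123", dbname="cybersecurity")
start = time.perf_counter()
with conn.cursor() as cur:
    cur.execute(QUERY)
    cur.fetchall()
direct_ms = (time.perf_counter() - start) * 1000

# 2. Same query proxied through Splunk's dbxquery search command
start = time.perf_counter()
resp = requests.post(
    "https://localhost:8089/services/search/jobs/export",
    auth=("admin", "changeme"),
    data={"search": f'| dbxquery connection="postgresql_conn" query="{QUERY}"',
          "output_mode": "json"},
    verify=False,  # the Splunk container uses a self-signed certificate
)
resp.raise_for_status()
splunk_ms = (time.perf_counter() - start) * 1000

print(f"direct: {direct_ms:.0f} ms | via dbxquery: {splunk_ms:.0f} ms | "
      f"overhead: {splunk_ms - direct_ms:.0f} ms")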


Test 3: Iceberg Multi-Engine Performance

Purpose: Test Apache Iceberg table format with multiple query engines

Architecture:

  • Shared Data: Iceberg tables stored in MinIO (Parquet format)
  • Metadata: Hive Metastore catalog (tracks schemas, snapshots)
  • Engines: Trino (read/write), ClickHouse (read-only), StarRocks (read/write)

Comparison: Native format vs Iceberg format for the same query

Expected Results:

| Engine | Native Format | Iceberg Format | Slowdown |
|---|---|---|---|
| ClickHouse | 10-20 ms | 200-500 ms | 20-25x slower |
| StarRocks | 30-50 ms | 120-200 ms | 4-6x slower |

Trade-off:

  • Benefit: Multi-engine access, ACID transactions, schema evolution, time travel
  • Cost: 4-25x slower query performance

Script: benchmarks/03_iceberg_multi_engine.py

Documentation: docs/ICEBERG_MULTI_ENGINE.md
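
A minimal sketch of the native-vs-Iceberg comparison on ClickHouse (illustrative only; it assumes the native table in the cybersecurity database and the Iceberg-backed table exposed through iceberg_db, as created by the setup scripts above):

# Sketch: same aggregation on ClickHouse's native table vs. the Iceberg table.
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

def run_ms(sql: str) -> float:
    """Execute a query and return wall-clock latency in milliseconds."""
    start = time.perf_counter()
    client.query(sql)
    return (time.perf_counter() - start) * 1000

native_ms = run_ms(
    "SELECT event_type, COUNT() FROM cybersecurity.security_logs GROUP BY event_type")
iceberg_ms = run_ms(
    "SELECT event_type, COUNT() FROM iceberg_db.security_logs GROUP BY event_type")

print(f"native: {native_ms:.0f} ms | iceberg: {iceberg_ms:.0f} ms | "
      f"slowdown: {iceberg_ms / native_ms:.1f}x")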


📁 Project Structure

splunk-db-connect-benchmark/
├── README.md                        # This file
├── docker-compose.m3.yml            # Complete M3 deployment
├── .env.example                     # Environment variables
│
├── configs/                         # Database configurations
│   ├── postgresql/                  # PostgreSQL configs
│   ├── clickhouse/                  # ClickHouse configs
│   ├── starrocks/                   # StarRocks FE/BE configs
│   ├── splunk/                      # Splunk configs
│   └── trino/                       # Trino configs (NEW)
│       ├── config.properties        # Trino coordinator settings
│       ├── jvm.config               # JVM tuning for M3
│       ├── node.properties          # Node identification
│       ├── log.properties           # Logging configuration
│       └── catalog/
│           └── iceberg.properties   # Iceberg catalog connector
│
├── sql/                             # Schema definitions
│   ├── postgresql_schema.sql        # PostgreSQL tables
│   ├── clickhouse_schema.sql        # ClickHouse tables
│   ├── starrocks_schema.sql         # StarRocks tables
│   └── iceberg_schema.sql           # Iceberg tables (NEW)
│
├── scripts/                         # Setup and utility scripts
│   ├── setup_all.sh                 # Master setup script
│   ├── setup_iceberg.sh             # Initialize Iceberg (NEW)
│   ├── configure_clickhouse_iceberg.sh  # ClickHouse Iceberg engine (NEW)
│   ├── configure_starrocks_iceberg.sh   # StarRocks Iceberg catalog (NEW)
│   ├── setup_splunk_dbconnect.sh    # Splunk DB Connect (NEW)
│   ├── phase1_verify_system.sh      # System verification
│   ├── phase2_configure_docker.sh   # Docker configuration
│   ├── phase3_deploy_containers.sh  # Container deployment
│   ├── phase4_load_data.sh          # Data loading
│   └── cleanup.sh                   # Environment cleanup
│
├── benchmarks/                      # Benchmark scripts (NEW)
│   ├── 01_native_baseline.py        # Native performance baseline
│   ├── 02_splunk_dbxquery_overhead.py  # Splunk proxy overhead
│   ├── 03_iceberg_multi_engine.py   # Iceberg multi-engine
│   ├── run_all.sh                   # Execute all benchmarks
│   └── results/                     # Benchmark results (JSON)
│
├── docs/                            # Documentation
│   ├── SPLUNK_DBXQUERY_LIMITATIONS.md  # Splunk dbxquery analysis (NEW)
│   ├── ICEBERG_MULTI_ENGINE.md      # Iceberg architecture guide (NEW)
│   ├── TROUBLESHOOTING.md           # Common issues
│   └── ARCHITECTURE.md              # System architecture
│
└── data/                            # Persistent data (gitignored)
    ├── postgresql/                  # PostgreSQL data
    ├── clickhouse/                  # ClickHouse data
    ├── starrocks-fe/                # StarRocks frontend
    ├── starrocks-be/                # StarRocks backend
    ├── splunk/                      # Splunk data
    ├── minio/                       # MinIO object storage (NEW)
    └── hive-metastore/              # Hive Metastore metadata (NEW)

🎮 Using the Environment

Access Services

| Service | Endpoint | Credentials | Purpose |
|---|---|---|---|
| PostgreSQL | localhost:5432 | postgres / postgres123 | Relational database |
| ClickHouse HTTP | http://localhost:8123 | default / (none) | Columnar OLAP |
| ClickHouse Native | localhost:9000 | default / (none) | Native protocol |
| StarRocks FE | http://localhost:8030 | root / (none) | Web UI |
| StarRocks MySQL | localhost:9030 | root / (none) | Query interface |
| Splunk Web | http://localhost:8000 | admin / changeme | SIEM UI |
| Splunk API | https://localhost:8089 | admin / changeme | Management API |
| Trino Web | http://localhost:8080 | (none) | Query UI |
| MinIO Console | http://localhost:9001 | admin / password123 | Object storage UI |
| MinIO API | http://localhost:9000 | admin / password123 | S3-compatible API |

Query Examples

PostgreSQL (Native)

docker exec -it benchmark-postgres psql -U postgres -d cybersecurity -c \
  "SELECT event_type, COUNT(*) FROM security_logs GROUP BY event_type ORDER BY COUNT(*) DESC LIMIT 10;"

ClickHouse (Native MergeTree)

curl 'http://localhost:8123/' --data-binary \
  "SELECT event_type, COUNT() as count FROM cybersecurity.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10 FORMAT Pretty"

ClickHouse (Iceberg Table Engine)

docker exec -it benchmark-clickhouse clickhouse-client --query \
  "SELECT event_type, COUNT() as count FROM iceberg_db.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"

StarRocks (Native)

docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -D cybersecurity -e \
  "SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10;"

StarRocks (Iceberg External Catalog)

docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "SELECT event_type, COUNT(*) as count FROM iceberg_catalog.cybersecurity.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10;"

Trino (Iceberg)

docker exec -it benchmark-trino trino --server localhost:8080 --catalog iceberg --schema cybersecurity --execute \
  "SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"

Splunk DB Connect (dbxquery)

docker exec benchmark-splunk /opt/splunk/bin/splunk search \
  '| dbxquery connection="postgresql_conn" query="SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"' \
  -auth admin:changeme

📈 Performance Expectations

Native Database Performance (M3 Platform)

| Database | Simple Query | Aggregation | Join Query | Architecture |
|---|---|---|---|---|
| ClickHouse | 10-20 ms | 30-50 ms | 50-100 ms | ARM64 native (optimal) |
| PostgreSQL | 50-100 ms | 150-300 ms | 300-500 ms | ARM64 native (optimal) |
| StarRocks | 40-60 ms | 80-150 ms | 150-250 ms | Rosetta 2 (15-20% overhead) |
| Splunk | 200-500 ms | 500-2000 ms | 1000-5000 ms | Rosetta 2 (30-40% overhead) |

Splunk DB Connect Overhead

  • Additional Latency: +100-200 ms per query
  • Reason: Splunk search pipeline + JDBC driver + data serialization
  • Acceptable For: Ad-hoc queries, dashboards (< 100 queries/hour)
  • Not Suitable For: Real-time analytics (> 1000 queries/hour)

Iceberg Multi-Engine Performance

| Engine | Native | Iceberg | Slowdown | Notes |
|---|---|---|---|---|
| ClickHouse | 10-20 ms | 200-500 ms | 20-25x | Read-only, high overhead |
| StarRocks | 30-50 ms | 120-200 ms | 4-6x | Read-write, moderate overhead |
| Trino | - | 300-800 ms | (baseline) | Full Iceberg control |

Trade-offs:

  • Benefits: Multi-engine access, ACID transactions, schema evolution, time travel, unified governance
  • Costs: 4-25x slower than native formats, more complex infrastructure

🔄 Iceberg Use Cases

When to Use Iceberg

Good Use Cases:

  1. Data Governance: Centralized access control across multiple query engines
  2. Schema Evolution: Frequent schema changes without data rewrites
  3. Time Travel: Audit trails, debugging, regulatory compliance
  4. Multi-Engine Federation: Trino for ETL, ClickHouse for dashboards, StarRocks for analytics
  5. Open Architecture: Avoid vendor lock-in, switch engines without data migration

Not Suitable For:

  1. Real-Time Dashboards: Sub-second query latency required
  2. High-Frequency Queries: > 1000 queries/second
  3. OLTP Workloads: Transactional applications with frequent updates
  4. Single Engine: If only using one query engine, native format is faster

Decision Matrix

| Requirement | Native Format | Iceberg Format |
|---|---|---|
| Performance | ✅ Best | ❌ 4-25x slower |
| Multi-engine | ❌ Requires ETL | ✅ Native support |
| Schema evolution | ❌ Rewrite data | ✅ Metadata-only |
| Time travel | ❌ Not supported | ✅ Full support |
| ACID across engines | ❌ No | ✅ Yes |
| Setup complexity | ✅ Simple | ❌ Complex |

📚 Documentation

Core Documentation

  • docs/ARCHITECTURE.md - System architecture
  • docs/TROUBLESHOOTING.md - Common issues
  • docs/SPLUNK_DBXQUERY_LIMITATIONS.md - Splunk dbxquery analysis
  • docs/ICEBERG_MULTI_ENGINE.md - Iceberg architecture guide

Benchmark Scripts

  • benchmarks/01_native_baseline.py - Native performance baseline
  • benchmarks/02_splunk_dbxquery_overhead.py - Splunk proxy overhead
  • benchmarks/03_iceberg_multi_engine.py - Iceberg multi-engine performance


🛠️ Advanced Configuration

Increase Iceberg Query Performance

-- Trino: Compact small files (reduces metadata overhead)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE optimize;

-- Trino: Expire old snapshots (cleanup history)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE expire_snapshots(retention_threshold => '7d');

-- Trino: Remove orphan files (cleanup unused data)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE remove_orphan_files(retention_threshold => '30d');

Configure Splunk DB Connect Connections

See scripts/setup_splunk_dbconnect.sh for automated configuration.

Manual Configuration:

  1. Install Splunk DB Connect app from Splunkbase
  2. Create database identities (credentials)
  3. Create database connections (PostgreSQL, ClickHouse, StarRocks)
  4. Test connections via Splunk UI

Tune Database Memory (M3)

docker-compose.m3.yml:

services:
  postgresql:
    deploy:
      resources:
        limits:
          memory: 4G    # Increase if needed

  clickhouse:
    deploy:
      resources:
        limits:
          memory: 8G    # High memory for fast queries

  trino:
    deploy:
      resources:
        limits:
          memory: 6G    # JVM heap for Iceberg metadata

🌐 AWS Deployment (Fair Comparison)

Why AWS?

Problem on M3:

  • StarRocks (Rosetta 2): 15-20% slower than native
  • Splunk (Rosetta 2): 30-40% slower than native
  • Unfair comparison between ARM64-native engines (PostgreSQL, ClickHouse, Trino) and Rosetta 2 engines (StarRocks, Splunk)

Solution:

  • Deploy on AWS with native architectures for all databases
  • ClickHouse: c7g.4xlarge (ARM64 Graviton3)
  • StarRocks: m6i.4xlarge (x86_64 Intel)
  • Splunk: m6i.4xlarge (x86_64 Intel)
  • Fair apples-to-apples comparison

Cost-Optimized AWS Deployment

Use spot instances for short-term benchmarking:

| Component | Instance | Spot Cost/Day |
|---|---|---|
| PostgreSQL | r7g.xlarge | $3.84 |
| ClickHouse | c7g.2xlarge | $5.02 |
| StarRocks | m6i.2xlarge | $6.67 |
| Splunk | m6i.2xlarge | $6.67 |
| Trino | m7g.2xlarge | $4.70 |
| MinIO | t4g.large | $0.54 |
| Hive Metastore | t3a.medium | $0.27 |
| Total |  | ~$28/day |

Run for 3 days: ~$85 total for comprehensive benchmarking

Terraform Deployment: Coming soon (see GitHub Issues)


⚠️ Troubleshooting

Iceberg Tables Not Visible

ClickHouse:

# Verify MinIO connectivity
curl http://localhost:9000/minio/health/live

# Check Iceberg table engine
docker exec -it benchmark-clickhouse clickhouse-client --query \
  "SHOW TABLES FROM iceberg_db"

StarRocks:

# Refresh Iceberg catalog
docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "REFRESH EXTERNAL TABLE iceberg_catalog.cybersecurity.security_logs;"

# Verify catalog
docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "SHOW CATALOGS;"

Splunk DB Connect Not Working

Error: "dbxquery command not found"

Solution:

  1. Install DB Connect app from Splunkbase: https://splunkbase.splunk.com/app/2686
  2. Place .spl file in ./data/splunk/dbconnect/
  3. Restart Splunk: docker-compose -f docker-compose.m3.yml restart splunk
  4. Run setup script: bash scripts/setup_splunk_dbconnect.sh

Trino Queries Timeout

Error: "Query exceeded maximum time limit"

Solution:

# configs/trino/config.properties
query.max-execution-time=30m
query.max-memory-per-node=4GB

Out of Memory on M3

Symptom: Docker containers crash, system freezes

Solution:

  1. Increase Docker Desktop memory: Settings → Resources → Memory → 20GB (for 24GB Mac)
  2. Reduce concurrent services: Stop Splunk/Trino if not actively testing
  3. Lower database memory limits in docker-compose.m3.yml

🤝 Contributing

Contributions are welcome! Areas for improvement:

  1. Benchmark Enhancements:

    • Add more query types (window functions, CTEs, subqueries)
    • Test with larger datasets (1M, 10M, 100M rows)
    • Measure concurrent query performance
  2. Iceberg Features:

    • Time travel query benchmarks
    • Schema evolution performance
    • ACID transaction testing
  3. AWS Deployment:

    • Terraform scripts for automated deployment
    • CloudWatch monitoring dashboards
    • Cost optimization strategies
  4. Documentation:

    • Video tutorials
    • Best practices guide
    • Case studies

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • The PostgreSQL, ClickHouse, StarRocks, Splunk, and Trino teams for their excellent database systems
  • Apache Iceberg community for open table format
  • Docker for consistent cross-platform development
  • Apple for M-series ARM64 architecture
  • AWS for Graviton processors

📞 Support


Built for the cybersecurity and database communities

Tested on: MacBook Pro M3 Pro (24GB) | macOS 15.0 Sequoia | Docker Desktop 4.26
