cbb330 commented Nov 18, 2025

Summary

[Issue] Briefly discuss the summary of the changes made in this pull request in 2-3 lines.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include the commands run and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

cbb330 added 30 commits November 3, 2025 14:48
- 20 test cases (10 Spark SQL + 10 Java API)
- Pre-assigned to 10 team members
- Comprehensive test prompts and templates
- Automated results collection script
- Reference documentation included
- Fixed grep errors with special characters and emojis
- Corrected bug detection logic (was showing bugs on all tests)
- Added error suppression for grep commands
- Now works correctly on macOS
- Step-by-step instructions for team members (clone, execute, commit)
- Status emoji guide (🔲 → 🔄 → ✅/❌)
- Git workflow (pull before starting, push after completing)
- Progress monitoring section for organizers
- Tips for smooth collaboration
- Makes it crystal clear how to participate in the bug bash
New Files:
- QUICKSTART.md: Fast setup guide with 3 options
- start-testing.sh: Interactive setup wizard
- spark-shell-command.sh: One-liner spark-shell launcher

Features:
- Guides team through SSH/ksudo steps
- Auto-generates personalized log directories
- Shows correct spark-shell command with OpenHouse configs
- Displays test assignments and quick reference
- Creates session logs: logs/{name}/session_{timestamp}.log

Updated:
- README.md: Added links to QUICKSTART and new scripts
- File structure: Documented new helper scripts

Makes it fast and easy for team members to:
1. SSH to ltx1-holdemgw03.grid.linkedin.com
2. Authenticate with ksudo -e openhouse
3. Start spark-shell with correct configs
4. Begin testing immediately

Usage:
  ./start-testing.sh (interactive)
  OR
  ./spark-shell-command.sh your-name (on gateway)
…nce, better status guidance

Changes:
- Replace 'Copy and Run' with 'How to Start Testing' (numbered steps)
- Add comprehensive operation table (Spark SQL vs Java API side-by-side)
- Include DataFile creation example for Java tests
- Clarify status update with exact markdown syntax to edit
- Better formatting with bold labels and clear sections

Quick Reference now includes:
- Write data, Create branch, Cherry-pick, Fast-forward, Expire, WAP ops
- Java API DataFiles.builder() example for test file creation
- All operations shown in both SQL and Java
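
For concreteness, here is a minimal sketch of what those branch operations look like as Spark SQL run from spark-shell. The table and branch names are illustrative, and the procedure calls assume Iceberg's stock cherrypick_snapshot and fast_forward procedures are exposed through the openhouse catalog; the exact commands in the quick reference may differ.

  // Illustrative table/branch names; adjust to your assigned test
  val tbl = "openhouse.u_openhouse.test_branch_demo"
  spark.sql(s"CREATE TABLE $tbl (name string)")
  spark.sql(s"INSERT INTO $tbl VALUES ('row1')")       // write data
  spark.sql(s"ALTER TABLE $tbl CREATE BRANCH audit")   // create branch

  // Look up a snapshot id from the metadata table (cherry-pick usually targets a staged WAP snapshot)
  val snapId = spark.sql(s"SELECT snapshot_id FROM $tbl.snapshots ORDER BY committed_at DESC LIMIT 1").first.getLong(0)
  spark.sql(s"CALL openhouse.system.cherrypick_snapshot('u_openhouse.test_branch_demo', $snapId)")

  // Fast-forward the 'audit' branch to the current state of 'main'
  spark.sql(s"CALL openhouse.system.fast_forward('u_openhouse.test_branch_demo', 'audit', 'main')")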

Status update now shows exact syntax:
  **Status:** 🔲 NOT STARTED → 🔄 IN PROGRESS → ✅ PASS/❌ FAIL
cbb330 and others added 30 commits November 18, 2025 15:28
- Script now runs on local machine, shows commands to copy-paste
- No need to clone repo on gateway - work from local repo instead
- Shows personalized test assignments
- Generates exact 3-step workflow: ssh → ksudo → spark-shell
- Updated README and QUICKSTART to reflect local execution

Workflow:
1. Run ./start-testing.sh locally
2. Enter your name
3. Copy-paste the 3 commands shown
4. Start testing on gateway
- Script now automatically SSHs to gateway
- Runs ksudo authentication
- Starts spark-shell with correct config
- All in one command - no manual steps!
- Uses ssh -t for proper pseudo-terminal allocation
- Updated docs to reflect automated workflow

Workflow:
1. Run ./start-testing.sh locally
2. Enter your name
3. Authenticate when prompted (2FA/ksudo)
4. spark-shell starts automatically
5. Start testing!

Much simpler for team members - just one script to run.
Changes:
- start-testing.sh now only shows info (assignments, tips, commands)
- Generates logs/{name}/connect.sh script for actual connection
- Updated ksudo command: ksudo -s OPENHOUSE,HDFS,WEBHDFS,SWEBHDFS,HCAT,RM -e openhouse -- bash -c 'spark-shell...'
- Shows full quick reference table with SQL/Java API commands
- Displays testing tips (table names, status updates, cleanup)
- Two-step workflow:
  1. ./start-testing.sh (setup & info)
  2. logs/{name}/connect.sh (connect & start)

Benefits:
- Users can review all info before connecting
- Separate script can be rerun if connection drops
- Proper ksudo service list for HDFS/HCAT access
- Clean separation of concerns
Added to Quick Reference Commands:
1. Create Table example with dummy columns (id INT, name STRING)
2. Java API imports - all necessary Iceberg and OpenHouse imports
   - org.apache.iceberg._
   - org.apache.iceberg.catalog._
   - org.apache.iceberg.types.Types._
   - org.apache.iceberg.data._
   - org.apache.iceberg.spark._
   - com.linkedin.openhouse.spark.OpenHouseSparkUtils

3. Common types & accessors:
   - How to get catalog from spark session
   - How to load Table
   - How to access Snapshot
   - How to get TableMetadata

4. Query current snapshot ID and parent ID examples

This makes it much easier for team members to get started
without hunting for import statements or type definitions.
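
As a rough illustration of those accessors, the snippet below loads a table through Iceberg's Spark3Util helper and reads the current snapshot, its parent, and the TableMetadata. It assumes the standard org.apache.iceberg package names; as a later commit notes, this environment ships them relocated, so the import prefix would need to change accordingly. The table name is illustrative.

  import org.apache.iceberg.{BaseTable, Snapshot, Table, TableMetadata}
  import org.apache.iceberg.spark.Spark3Util

  // Load the Iceberg Table behind a catalog identifier (illustrative name)
  val table: Table = Spark3Util.loadIcebergTable(spark, "openhouse.u_openhouse.test_xxx")

  // Current snapshot id and its parent id (parentId is null for the first snapshot)
  val snap: Snapshot = table.currentSnapshot()
  val snapshotId = snap.snapshotId()
  val parentId = snap.parentId()

  // TableMetadata via the underlying table operations
  val meta: TableMetadata = table.asInstanceOf[BaseTable].operations().current()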
The bash -c wrapper was causing spark-shell to receive a quit signal
immediately upon startup.

Changed from:
  ksudo ... -- bash -c 'spark-shell ...'

To:
  ksudo ... -- spark-shell ...

This allows spark-shell to properly receive stdin and stay interactive.
The ksudo -- syntax directly passes the command without shell wrapping.
Changed from single-line ksudo -- spark-shell to multi-line:
1. ksudo authenticates
2. exec spark-shell runs with credentials

Before:
  ksudo ... -- spark-shell ...

After:
  ksudo -s OPENHOUSE,HDFS,WEBHDFS,SWEBHDFS,HCAT,RM -e openhouse
  exec spark-shell --conf ...

This gives spark-shell proper terminal control after authentication.
Using exec replaces the shell process with spark-shell for clean interaction.
The issue: ksudo creates an interactive subshell that waits for input.
Attempts to pipe or exec spark-shell after ksudo don't work because
ksudo's subshell consumes input differently than expected.

New approach: Clear 3-step manual instructions
1. SSH to gateway
2. Run ksudo (creates authenticated subshell)
3. Manually run spark-shell in that subshell

Benefits:
- Works reliably with ksudo's interactive subshell behavior
- spark-shell gets full terminal control
- Clear, simple workflow
- Saves spark-shell command to file for easy reference

The spark-shell command is saved to logs/{name}/spark-shell-cmd.txt
for easy copy-pasting.
Iceberg classes in LinkedIn's OpenHouse are shaded/relocated under:
  com.linkedin.openhouse.relocated.org.apache.iceberg.*

Changed all import statements from:
  import org.apache.iceberg._
  import org.apache.iceberg.catalog._
  import org.apache.iceberg.types.Types._
  import org.apache.iceberg.data._
  import org.apache.iceberg.spark._

To:
  import com.linkedin.openhouse.relocated.org.apache.iceberg._
  import com.linkedin.openhouse.relocated.org.apache.iceberg.catalog._
  import com.linkedin.openhouse.relocated.org.apache.iceberg.types.Types._
  import com.linkedin.openhouse.relocated.org.apache.iceberg.data._
  import com.linkedin.openhouse.relocated.org.apache.iceberg.spark._

Also updated SparkCatalog cast to use the relocated package.

Now imports will work correctly in spark-shell without errors.
Changed from:
  com.linkedin.openhouse.relocated.org.apache.iceberg.*

To:
  liopenhouse.relocated.org.apache.iceberg.*

LinkedIn's internal package structure uses 'liopenhouse' as the base
package for relocated/shaded Iceberg dependencies.
OpenHouseSparkUtils class doesn't exist in the codebase.
Removed the import line from the quick reference.

The core Iceberg imports are sufficient for most testing needs.
Added pointers to existing test files as examples:
- BranchTestSpark3_5.java: Comprehensive Spark SQL multi-branch tests
- WapIdJavaTest.java: Java API WAP workflow example

Team members can reference these files to see working examples
of the operations they need to test.
Updated all table references from:
  openhouse.d1.test_xxx

To:
  openhouse.u_openhouse.test_xxx

This affects:
- CREATE TABLE examples in start-testing.sh
- Identifier.of() examples in Java API section
- All metadata queries (snapshots, refs, branches)
- DROP TABLE cleanup command
- TEMPLATE.md verification queries

Using u_openhouse database for all bug bash testing.
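
A minimal sketch of what the updated Java API identifier presumably looks like (the test name is illustrative):

  import org.apache.spark.sql.connector.catalog.Identifier

  // Table lives under the u_openhouse database in the openhouse catalog
  val ident = Identifier.of(Array("u_openhouse"), "test_xxx")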
…house

Changed database from d1 to u_openhouse in:
- create-test-files.sh SQL test template
- create-test-files.sh Java test template
- All 20 regenerated test result files (sql-* and java-*)

All test files now use openhouse.u_openhouse as the database.
Added: val timestamp = System.currentTimeMillis()

This allows the ${timestamp} variable in table names to be set dynamically
in spark-shell. Updated create-test-files.sh and regenerated all 10 SQL
test result files.
Changed from raw SQL to Scala spark.sql() calls:
- Changed code block language from 'sql' to 'scala'
- Wrapped all SQL statements in spark.sql(s"...")
- Added string interpolation with 's' prefix for ${timestamp}
- Changed verification queries to use .show(false)
- Updated all 10 SQL test result files

This fixes the 'not found: value CREATE' error when running tests.
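
Putting the last two changes together, a test step presumably looks roughly like this (the table name is illustrative):

  // Unique table name per run; ${timestamp} is filled in by the s"..." interpolator
  val timestamp = System.currentTimeMillis()
  spark.sql(s"CREATE TABLE openhouse.u_openhouse.test_demo_${timestamp} (name string)")
  spark.sql(s"INSERT INTO openhouse.u_openhouse.test_demo_${timestamp} VALUES ('row1')")

  // Verification queries use .show(false) so output columns are not truncated
  spark.sql(s"SELECT * FROM openhouse.u_openhouse.test_demo_${timestamp}").show(false)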
Manually updated sql-08-rohit.md, sql-09-selena.md, and sql-10-shanthoosh.md
to match the format of other test files:
- Changed from `sql` to `scala` code blocks
- Added val timestamp = System.currentTimeMillis()
- Wrapped SQL in spark.sql(s"...")
- Changed d1 to u_openhouse
- Updated verification queries to use .show(false)

All 10 SQL test files now use the correct spark-shell syntax.
Removed unnecessary 'USING iceberg' clause that was causing errors:
- Updated create-test-files.sh template
- Regenerated all 10 SQL test files
- Updated start-testing.sh example command

CREATE TABLE now uses simple syntax:
  spark.sql(s"CREATE TABLE openhouse.u_openhouse.test_xxx (name string)")
Added comprehensive Quick Reference section to all 20 test files:

SQL tests (sql-1 through sql-10):
- Common Spark SQL operations
- WAP configuration
- Branch operations
- Cherry-pick and fast-forward commands
- Query examples for snapshots, refs, and branch data

Java tests (java-1 through java-10):
- Java API imports with relocated packages
- Catalog and table access
- Snapshot operations
- Branch reference management
- Table metadata queries

Now when testers open a result file in vim, they have all the
reference commands right there without switching to other docs.
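
For reference, the kind of WAP configuration and metadata queries that quick reference covers can be sketched as follows, assuming Iceberg's stock write.wap.enabled table property, spark.wap.id session conf, and the .snapshots/.refs metadata tables; names are illustrative and the exact commands in the test files may differ.

  val tbl = "openhouse.u_openhouse.test_wap_demo"
  spark.sql(s"CREATE TABLE $tbl (name string)")

  // Stage a write under a WAP id instead of publishing it to main
  spark.sql(s"ALTER TABLE $tbl SET TBLPROPERTIES ('write.wap.enabled'='true')")
  spark.conf.set("spark.wap.id", "bugbash-1")
  spark.sql(s"INSERT INTO $tbl VALUES ('staged-row')")

  // Inspect snapshots, refs, and branch data via metadata tables and branch reads
  spark.sql(s"SELECT snapshot_id, parent_id, summary FROM $tbl.snapshots").show(false)
  spark.sql(s"SELECT * FROM $tbl.refs").show(false)
  spark.sql(s"SELECT * FROM $tbl VERSION AS OF 'main'").show(false)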
## New Tests (SQL 11-17 + Java 11-17)
New assignees: simbarashe, aastha, jiefan, zhe, kevin, junhao, ruolin

SQL Tests:
- SQL-11: Interleaved WAP and Direct Commits on Same Branch
- SQL-12: Branch from WAP Snapshot Before Cherry-Pick
- SQL-13: Concurrent Branch Commits During Fast-Forward Window
- SQL-14: WAP Branch Target with Non-Existent Branch
- SQL-15: Snapshot Expiration with Cross-Branch Dependencies
- SQL-16: Rename Branch via Ref Management
- SQL-17: WAP ID Collision and Override

Java Tests:
- Java-11: Transactional Multi-Branch Update with Rollback
- Java-12: Branch Creation from Detached Snapshot
- Java-13: Parallel Branch Append with Metadata Conflicts
- Java-14: Snapshot Ref with Custom Metadata Properties
- Java-15: Cross-Table Snapshot Reference Attempt
- Java-16: Bulk Branch Creation and Snapshot Reuse
- Java-17: Snapshot Replace with WAP Metadata Preservation

## Template Improvements
- Added tableName variable in Quick Reference for easier copy-paste
- Simplified from 'Steps Executed' to 'Input' section
- Simplified from complex verification sections to single 'Output' section
- Removed verbose Expected/Actual Results table
- Streamlined Issues Found section
- All existing test files updated to new format
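
The tableName variable mentioned above presumably boils down to something like the following (the name is illustrative), which also keeps the cleanup step a one-liner:

  // One copy-pasteable name reused by every step in a test file
  val timestamp = System.currentTimeMillis()
  val tableName = s"openhouse.u_openhouse.test_sql_11_${timestamp}"
  spark.sql(s"CREATE TABLE $tableName (name string)")
  // ... test steps ...
  spark.sql(s"DROP TABLE $tableName")   // cleanup in u_openhouse when done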

## Updates
- Updated assignments.md: 20 tests → 34 tests
- Updated create-test-files.sh with new tests and improved template
- Fixed cleanup reminder to use u_openhouse instead of d1
- Total: 34 test files (17 SQL + 17 Java)
Test case with fast-forward race