
Conversation

@yaooqinn (Member) commented Jan 23, 2025

What changes were proposed in this pull request?

This PR adds a SparkDataFramePi Scala example that works for both Spark Connect and Classic.
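
For illustration, a rough sketch of what such a DataFrame-based pi example can look like (not necessarily the exact code merged by this PR; just a minimal version using only public SparkSession/DataFrame APIs):

```scala
// Illustrative sketch only; the merged SparkDataFramePi may differ in detail.
package org.apache.spark.examples.sql

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkDataFramePi {
  def main(args: Array[String]): Unit = {
    // getOrCreate() picks up whatever spark-submit configures (--remote or --master),
    // so the same compiled class runs against a Spark Connect server or a classic cluster.
    val spark = SparkSession.builder().appName("Spark DataFrame Pi").getOrCreate()

    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000L * slices

    // Monte Carlo estimate: sample n points in [-1, 1] x [-1, 1]
    // and count how many land inside the unit circle.
    val count = spark.range(n)
      .select((rand() * 2 - 1).as("x"), (rand() * 2 - 1).as("y"))
      .where(col("x") * col("x") + col("y") * col("y") <= 1)
      .count()

    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}
```

Because the example contains no Connect-specific code, the submission arguments alone decide which mode it runs in, as the test below demonstrates.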

Why are the changes needed?

The SparkPi example, often the first step for users getting to know Spark, should also be able to run in Spark Connect mode.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Manually built and tested:

bin/spark-submit --remote 'sc://localhost' --class org.apache.spark.examples.sql.SparkDataFramePi examples/jars/spark-examples_2.13-4.1.0-SNAPSHOT.jar
WARNING: Using incubator modules: jdk.incubator.vector
25/01/23 15:00:03 INFO BaseAllocator: Debug mode disabled. Enable with the VM option -Darrow.memory.debug.allocator=true.
25/01/23 15:00:03 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
25/01/23 15:00:03 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
Pi is roughly 3.1388756943784717
25/01/23 15:00:04 INFO ShutdownHookManager: Shutdown hook called
25/01/23 15:00:04 INFO ShutdownHookManager: Deleting directory /private/var/folders/84/dgr9ykwn6yndcmq1kjxqvk200000gn/T/spark-25ed842e-5888-47ce-bb0b-442385d643cb

Was this patch authored or co-authored using generative AI tooling?

no

@yaooqinn (Member, Author)

cc @cloud-fan @dongjoon-hyun @HyukjinKwon, thank you!

@github-actions bot added the SQL label on Jan 23, 2025
@yaooqinn changed the title from "[SPARK-50917][EXAMPLES] Make SparkPi Scala example spark-connect compatible" to "[SPARK-50917][EXAMPLES] Add SparkSQLPi Scala example to work both for Connect and Classic" on Jan 23, 2025
@dongjoon-hyun (Member) left a comment

I understand what you're aiming for, but this is not SQL from a user's perspective, @yaooqinn.

We should distinguish SQL from Spark Connect because Apache Spark already has Spark SQL modules and user interfaces such as JDBC and the spark-sql shell. Could you revise the name? 😄

@yaooqinn (Member, Author)

I'd rename it with the fully qualified class name org.apache.spark.examples.sql.connect.SparkConnectPi.

@dongjoon-hyun (Member) commented Jan 24, 2025

Thank you. Please revise the PR title and description accordingly too.

@yaooqinn changed the title from "[SPARK-50917][EXAMPLES] Add SparkSQLPi Scala example to work both for Connect and Classic" to "[SPARK-50917][EXAMPLES] Add SparkConnectPi Scala example to work both for Connect and Classic" on Jan 24, 2025
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/** Computes an approximation to pi with SparkSession/DataFrame APIs */
@cloud-fan (Contributor) commented Jan 24, 2025

How is this different from the SQL example? My understanding is that the example should just use public SQL/DataFrame APIs, and then it will work for both Classic and Spark Connect. We should encourage users to use Spark SQL correctly (not rely on private APIs), and in the example we can enable or disable Spark Connect based on the arguments.

@yaooqinn (Member, Author)

IIUC, this example seems to be exactly what you described. Or are you just concerned about the class name?

@cloud-fan (Contributor)

My point is: why do we need to mention Spark Connect here? This is just a normal Spark SQL program, and Spark Connect can support it because it doesn't use private APIs.
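
(To illustrate the point in this thread: whether Spark Connect is used is decided purely by the submission arguments, not by the example code. The --remote command below is the one from the PR description; the --master invocation is an assumed, standard classic-mode equivalent shown only for comparison.)

```sh
# Spark Connect mode: run the example against a Connect server (command from the PR description).
bin/spark-submit --remote 'sc://localhost' \
  --class org.apache.spark.examples.sql.SparkDataFramePi \
  examples/jars/spark-examples_2.13-4.1.0-SNAPSHOT.jar

# Classic mode: same jar and class, submitted to a regular master instead (illustrative).
bin/spark-submit --master 'local[*]' \
  --class org.apache.spark.examples.sql.SparkDataFramePi \
  examples/jars/spark-examples_2.13-4.1.0-SNAPSHOT.jar
```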

@dongjoon-hyun (Member)

To @yaooqinn, the PR description still seems to be outdated~

bin/spark-submit --remote 'sc://localhost' --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.13-4.1.0-SNAPSHOT.jar

@yaooqinn (Member, Author)

Updated. Thank you, @dongjoon-hyun.

@dongjoon-hyun (Member) left a comment

+1, LGTM from my side. Thank you, @yaooqinn.

Since there seems to be an ongoing discussion with @cloud-fan, I'll leave this to you and him.

@cloud-fan (Contributor)

My point is that we don't need Spark Connect-specific examples. All legitimate SQL examples (those that don't use private APIs) should just work with Spark Connect.

I think it's a good idea to have more examples using the DataFrame APIs instead of RDDs. How about SparkDataFramePi?
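
(For context on the RDD vs. DataFrame point: the existing SparkPi example drives its computation through spark.sparkContext and the RDD API, which a Spark Connect client does not expose, so only a DataFrame-based variant can run in both modes. Below is an approximate sketch of the classic RDD-style core for spark-shell, not the exact code of either example.)

```scala
// Approximate core of the classic, RDD-based SparkPi; paste into spark-shell (classic mode),
// where `spark` is predefined. spark.sparkContext is not available over Spark Connect,
// which is why a DataFrame-only rewrite is needed for --remote submissions.
val n = 1000000
val count = spark.sparkContext.parallelize(1 until n).map { _ =>
  val x = math.random() * 2 - 1
  val y = math.random() * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")
```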

@dongjoon-hyun (Member)

+1 for the naming change (SparkConnectPi -> SparkDataFramePi).

@yaooqinn (Member, Author)

Thank you, @cloud-fan and @dongjoon-hyun. SparkDataFramePi sounds good to me.

@yaooqinn changed the title from "[SPARK-50917][EXAMPLES] Add SparkConnectPi Scala example to work both for Connect and Classic" to "[SPARK-50917][EXAMPLES] Add Pi Scala example to work both for Connect and Classic" on Feb 10, 2025
@dongjoon-hyun pushed a commit that referenced this pull request on Feb 10, 2025:

[SPARK-50917][EXAMPLES] Add Pi Scala example to work both for Connect and Classic

### What changes were proposed in this pull request?

This PR adds a SparkDataFramePi Scala example that works for both Spark Connect and Classic.

### Why are the changes needed?

The SparkPi example, often the first step for users getting to know Spark, should also be able to run in Spark Connect mode.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manually built and tested:

```log
bin/spark-submit --remote 'sc://localhost' --class org.apache.spark.examples.sql.SparkDataFramePi examples/jars/spark-examples_2.13-4.1.0-SNAPSHOT.jar
WARNING: Using incubator modules: jdk.incubator.vector
25/01/23 15:00:03 INFO BaseAllocator: Debug mode disabled. Enable with the VM option -Darrow.memory.debug.allocator=true.
25/01/23 15:00:03 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
25/01/23 15:00:03 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
Pi is roughly 3.1388756943784717
25/01/23 15:00:04 INFO ShutdownHookManager: Shutdown hook called
25/01/23 15:00:04 INFO ShutdownHookManager: Deleting directory /private/var/folders/84/dgr9ykwn6yndcmq1kjxqvk200000gn/T/spark-25ed842e-5888-47ce-bb0b-442385d643cb
```

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #49617 from yaooqinn/SPARK-50917.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e823afa)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun (Member)

Merged to master/4.0.

Thank you, @yaooqinn, @cloud-fan, @HyukjinKwon.

@yaooqinn (Member, Author)

Thank you very much, @dongjoon-hyun, @cloud-fan, @HyukjinKwon.

@yaooqinn deleted the SPARK-50917 branch on February 11, 2025 at 02:04