[SPARK-50917][EXAMPLES] Add Pi Scala example to work both for Connect and Classic #49617
Conversation
cc @cloud-fan @dongjoon-hyun @HyukjinKwon, thank you!
Makes sense.
I understand what you're aiming for, but this is not SQL from a user's perspective, @yaooqinn. We should distinguish SQL vs. Spark Connect, because Apache Spark already has Spark SQL modules and user interfaces like JDBC and the spark-sql shell. Could you revise the name? 😄
I'd rename it with the FQDN as org.apache.spark.examples.sql.connect.SparkConnectPi
Thank you. Please revise the PR title and description accordingly too.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/** Computes an approximation to pi with SparkSession/DataFrame APIs */
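The diff excerpt above shows only the imports and doc comment. For context, a minimal sketch of a DataFrame-only Pi computation consistent with that header might look like the following; this is an illustration, not necessarily the exact code merged in this PR, and the object name SparkDataFramePiSketch is hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/** Computes an approximation to pi with SparkSession/DataFrame APIs (sketch). */
object SparkDataFramePiSketch {
  def main(args: Array[String]): Unit = {
    // getOrCreate() returns a Connect session under `spark-submit --remote`
    // and a Classic session otherwise, so this code can run in both modes.
    val spark = SparkSession.builder().appName("Spark Pi").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000L * slices
    // Monte Carlo estimation: the fraction of uniform random points in the
    // square [-1, 1] x [-1, 1] that land inside the unit circle is pi / 4.
    val count = spark.range(n)
      .select((rand() * 2 - 1).as("x"), (rand() * 2 - 1).as("y"))
      .where(col("x") * col("x") + col("y") * col("y") <= 1)
      .count()
    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}
```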
How is this different from the SQL example? My understanding is that the example should just use public SQL/DataFrame APIs and then it will work for both classic and Spark Connect. We should encourage users to use Spark SQL correctly (don't rely on private APIs), and in the example we can enable or disable Spark Connect w.r.t. the arguments.
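As a rough illustration of that last point, an example could pick Connect vs. Classic from its own arguments. The `--use-connect` flag and the endpoint below are hypothetical, not from this PR, and the `builder().remote(...)` wiring is an assumption based on the Spark Connect client API:

```scala
import org.apache.spark.sql.SparkSession

object ConnectOrClassicSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical: choose Connect vs Classic from a program argument.
    // builder().remote(...) targets a Spark Connect server; a plain
    // getOrCreate() yields a Classic session when no remote is configured.
    val spark =
      if (args.contains("--use-connect")) {
        SparkSession.builder().remote("sc://localhost").getOrCreate()
      } else {
        SparkSession.builder().appName("Spark Pi").getOrCreate()
      }
    println(s"Running against Spark ${spark.version}")
    spark.stop()
  }
}
```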
IIUC, this example seems to be exactly what you described. Or were you just concerned about the class name?
My point is: why do we need to mention Spark Connect here? This is just a normal Spark SQL program, and Spark Connect can support it because it doesn't use private APIs.
To @yaooqinn, the PR description still seems to be outdated~
Updated. Thank you, @dongjoon-hyun.
+1, LGTM from my side. Thank you, @yaooqinn.
Since there seems to be an ongoing discussion with @cloud-fan, I'll leave this to you and him.
My point is that we don't need Spark Connect-specific examples. All legal SQL examples (ones that don't use private APIs) should just work with Spark Connect. I think it's a good idea to have more examples using DataFrame APIs instead of RDD; how about SparkDataFramePi?
+1 for the naming change (SparkConnectPi -> SparkDataFramePi).
Thank you @cloud-fan and @dongjoon-hyun, SparkDataFramePi sounds good to me.
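For contrast with the DataFrame approach, the core of the existing RDD-based SparkPi example looks roughly like the sketch below (reconstructed from memory of the classic example, so details may differ). SparkContext-based code in this style is not available through Spark Connect, which is the motivation for a DataFrame-only variant:

```scala
import org.apache.spark.sql.SparkSession

object RddPiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Spark Pi").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt
    // spark.sparkContext is a Classic-only API: Spark Connect sessions do
    // not expose a SparkContext, so this style cannot run over Connect.
    val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
      val x = math.random() * 2 - 1
      val y = math.random() * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}
```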
…aFramePi.scala Co-authored-by: Wenchen Fan <[email protected]>
[SPARK-50917][EXAMPLES] Add Pi Scala example to work both for Connect and Classic

### What changes were proposed in this pull request?

This PR adds a SparkDataFramePi Scala example to work both for Connect and Classic.

### Why are the changes needed?

The SparkPi example, often the first step for users getting to know Spark, should be able to run in Spark Connect mode.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually built and tested:

```log
bin/spark-submit --remote 'sc://localhost' --class org.apache.spark.examples.sql.SparkDataFramePi examples/jars/spark-examples_2.13-4.1.0-SNAPSHOT.jar
WARNING: Using incubator modules: jdk.incubator.vector
25/01/23 15:00:03 INFO BaseAllocator: Debug mode disabled. Enable with the VM option -Darrow.memory.debug.allocator=true.
25/01/23 15:00:03 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
25/01/23 15:00:03 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
Pi is roughly 3.1388756943784717
25/01/23 15:00:04 INFO ShutdownHookManager: Shutdown hook called
25/01/23 15:00:04 INFO ShutdownHookManager: Deleting directory /private/var/folders/84/dgr9ykwn6yndcmq1kjxqvk200000gn/T/spark-25ed842e-5888-47ce-bb0b-442385d643cb
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49617 from yaooqinn/SPARK-50917.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e823afa)
Signed-off-by: Dongjoon Hyun <[email protected]>
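Presumably, running the same jar in Classic mode just means dropping the `--remote 'sc://localhost'` option from the `spark-submit` invocation above; the example itself is mode-agnostic.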
Merged to master/4.0. Thank you, @yaooqinn, @cloud-fan, @HyukjinKwon.
Thank you very much, @dongjoon-hyun, @cloud-fan, @HyukjinKwon.