[SPARK-40665][CONNECT] Avoid embedding Spark Connect in the Apache Spark binary release

### What changes were proposed in this pull request?

This PR proposes to:

1. Move `connect` to `connector/connect` to be consistent with Kafka and Avro.
2. Stop including Spark Connect in the default Apache Spark binary release.
3. Fix the module dependency in `modules.py`.
4. Fix and clean up the usages in `README.md`.
5. Clean up the PySpark test structure to be consistent with other PySpark tests.

### Why are the changes needed?

To make it consistent with Avro and Kafka; see also https://github.com/apache/spark/pull/37710/files#r978291019

### Does this PR introduce _any_ user-facing change?

No, Spark Connect has not been released yet.

The usage of this module would change from:

```bash
./bin/spark-shell \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

to:

```bash
./bin/spark-shell \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```
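For illustration only, a hypothetical equivalent for PySpark users, assuming the same package coordinates and plugin class shown above:

```bash
# Hypothetical PySpark equivalent; the package coordinates and plugin class
# are copied from the spark-shell example above.
./bin/pyspark \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```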

### How was this patch tested?

CI in the PR should verify this.
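As a rough local sanity check (not a substitute for CI), the relocated PySpark tests could presumably also be run by test name; the module name below is taken from the `modules.py` change in this commit, and `--testnames` is the flag already used in the Spark Connect README:

```bash
# Hypothetical local run of the relocated Spark Connect tests;
# the test module name comes from the modules.py diff in this commit.
python/run-tests --testnames 'pyspark.sql.tests.test_connect_basic'
```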

Closes apache#38109 from HyukjinKwon/SPARK-40665.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon committed Oct 6, 2022
1 parent 95cfdc6 commit e217633
Showing 37 changed files with 80 additions and 130 deletions.
2 changes: 1 addition & 1 deletion .github/labeler.yml
@@ -152,6 +152,6 @@ WEB UI:
DEPLOY:
- "sbin/**/*"
CONNECT:
- "connect/**/*"
- "connector/connect/**/*"
- "**/sql/sparkconnect/**/*"
- "python/pyspark/sql/**/connect/**/*"
2 changes: 1 addition & 1 deletion .github/workflows/build_and_test.yml
@@ -334,7 +334,7 @@ jobs:
- >-
pyspark-pandas-slow
- >-
pyspark-sql-connect
pyspark-connect
env:
MODULES_TO_TEST: ${{ matrix.modules }}
HADOOP_PROFILE: ${{ inputs.hadoop }}
5 changes: 0 additions & 5 deletions assembly/pom.xml
@@ -74,11 +74,6 @@
<artifactId>spark-repl_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-connect_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>

<!--
Because we don't shade dependencies anymore, we need to restore Guava to compile scope so
File renamed without changes.
@@ -19,7 +19,7 @@ set -ex
SPARK_HOME="$(cd "`dirname $0`"/../..; pwd)"
cd "$SPARK_HOME"

pushd connect/src/main
pushd connector/connect/src/main

LICENSE=$(cat <<'EOF'
#
@@ -79,4 +79,4 @@ for f in `find gen/proto/python -name "*.py*"`; do
done
# Clean up everything.
rm -Rf gen
rm -Rf gen
2 changes: 1 addition & 1 deletion connect/pom.xml → connector/connect/pom.xml
@@ -23,7 +23,7 @@
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.4.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
<relativePath>../../pom.xml</relativePath>
</parent>

<artifactId>spark-connect_2.12</artifactId>
File renamed without changes.
File renamed without changes.
File renamed without changes.
16 changes: 0 additions & 16 deletions dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -6,9 +6,7 @@ ST4/4.0.4//ST4-4.0.4.jar
activation/1.1.1//activation-1.1.1.jar
aircompressor/0.21//aircompressor-0.21.jar
algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar
animal-sniffer-annotations/1.19//animal-sniffer-annotations-1.19.jar
annotations/17.0.0//annotations-17.0.0.jar
annotations/4.1.1.4//annotations-4.1.1.4.jar
antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar
antlr4-runtime/4.9.3//antlr4-runtime-4.9.3.jar
aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar
@@ -64,19 +62,9 @@ datanucleus-core/4.1.17//datanucleus-core-4.1.17.jar
datanucleus-rdbms/4.1.19//datanucleus-rdbms-4.1.19.jar
derby/10.14.2.0//derby-10.14.2.0.jar
dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
error_prone_annotations/2.10.0//error_prone_annotations-2.10.0.jar
failureaccess/1.0.1//failureaccess-1.0.1.jar
flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar
gcs-connector/hadoop2-2.2.7/shaded/gcs-connector-hadoop2-2.2.7-shaded.jar
gmetric4j/1.0.10//gmetric4j-1.0.10.jar
grpc-api/1.47.0//grpc-api-1.47.0.jar
grpc-context/1.47.0//grpc-context-1.47.0.jar
grpc-core/1.47.0//grpc-core-1.47.0.jar
grpc-netty-shaded/1.47.0//grpc-netty-shaded-1.47.0.jar
grpc-protobuf-lite/1.47.0//grpc-protobuf-lite-1.47.0.jar
grpc-protobuf/1.47.0//grpc-protobuf-1.47.0.jar
grpc-services/1.47.0//grpc-services-1.47.0.jar
grpc-stub/1.47.0//grpc-stub-1.47.0.jar
gson/2.2.4//gson-2.2.4.jar
guava/14.0.1//guava-14.0.1.jar
guice-servlet/3.0//guice-servlet-3.0.jar
@@ -122,7 +110,6 @@ httpclient/4.5.13//httpclient-4.5.13.jar
httpcore/4.4.14//httpcore-4.4.14.jar
istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
ivy/2.5.0//ivy-2.5.0.jar
j2objc-annotations/1.3//j2objc-annotations-1.3.jar
jackson-annotations/2.13.4//jackson-annotations-2.13.4.jar
jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar
jackson-core/2.13.4//jackson-core-2.13.4.jar
@@ -245,10 +232,7 @@ parquet-encoding/1.12.3//parquet-encoding-1.12.3.jar
parquet-format-structures/1.12.3//parquet-format-structures-1.12.3.jar
parquet-hadoop/1.12.3//parquet-hadoop-1.12.3.jar
parquet-jackson/1.12.3//parquet-jackson-1.12.3.jar
perfmark-api/0.25.0//perfmark-api-0.25.0.jar
pickle/1.2//pickle-1.2.jar
proto-google-common-protos/2.0.1//proto-google-common-protos-2.0.1.jar
protobuf-java-util/3.19.2//protobuf-java-util-3.19.2.jar
protobuf-java/2.5.0//protobuf-java-2.5.0.jar
py4j/0.10.9.7//py4j-0.10.9.7.jar
remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
16 changes: 0 additions & 16 deletions dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -10,9 +10,7 @@ aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar
aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar
aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar
aliyun-sdk-oss/3.13.0//aliyun-sdk-oss-3.13.0.jar
animal-sniffer-annotations/1.19//animal-sniffer-annotations-1.19.jar
annotations/17.0.0//annotations-17.0.0.jar
annotations/4.1.1.4//annotations-4.1.1.4.jar
antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar
antlr4-runtime/4.9.3//antlr4-runtime-4.9.3.jar
aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar
@@ -61,19 +59,9 @@ datanucleus-core/4.1.17//datanucleus-core-4.1.17.jar
datanucleus-rdbms/4.1.19//datanucleus-rdbms-4.1.19.jar
derby/10.14.2.0//derby-10.14.2.0.jar
dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
error_prone_annotations/2.10.0//error_prone_annotations-2.10.0.jar
failureaccess/1.0.1//failureaccess-1.0.1.jar
flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar
gcs-connector/hadoop3-2.2.7/shaded/gcs-connector-hadoop3-2.2.7-shaded.jar
gmetric4j/1.0.10//gmetric4j-1.0.10.jar
grpc-api/1.47.0//grpc-api-1.47.0.jar
grpc-context/1.47.0//grpc-context-1.47.0.jar
grpc-core/1.47.0//grpc-core-1.47.0.jar
grpc-netty-shaded/1.47.0//grpc-netty-shaded-1.47.0.jar
grpc-protobuf-lite/1.47.0//grpc-protobuf-lite-1.47.0.jar
grpc-protobuf/1.47.0//grpc-protobuf-1.47.0.jar
grpc-services/1.47.0//grpc-services-1.47.0.jar
grpc-stub/1.47.0//grpc-stub-1.47.0.jar
gson/2.2.4//gson-2.2.4.jar
guava/14.0.1//guava-14.0.1.jar
hadoop-aliyun/3.3.4//hadoop-aliyun-3.3.4.jar
@@ -110,7 +98,6 @@ httpcore/4.4.14//httpcore-4.4.14.jar
ini4j/0.5.4//ini4j-0.5.4.jar
istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
ivy/2.5.0//ivy-2.5.0.jar
j2objc-annotations/1.3//j2objc-annotations-1.3.jar
jackson-annotations/2.13.4//jackson-annotations-2.13.4.jar
jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar
jackson-core/2.13.4//jackson-core-2.13.4.jar
@@ -232,10 +219,7 @@ parquet-encoding/1.12.3//parquet-encoding-1.12.3.jar
parquet-format-structures/1.12.3//parquet-format-structures-1.12.3.jar
parquet-hadoop/1.12.3//parquet-hadoop-1.12.3.jar
parquet-jackson/1.12.3//parquet-jackson-1.12.3.jar
perfmark-api/0.25.0//perfmark-api-0.25.0.jar
pickle/1.2//pickle-1.2.jar
proto-google-common-protos/2.0.1//proto-google-common-protos-2.0.1.jar
protobuf-java-util/3.19.2//protobuf-java-util-3.19.2.jar
protobuf-java/2.5.0//protobuf-java-2.5.0.jar
py4j/0.10.9.7//py4j-0.10.9.7.jar
remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
25 changes: 18 additions & 7 deletions dev/sparktestsupport/modules.py
@@ -271,6 +271,17 @@ def __hash__(self):
],
)

connect = Module(
    name="connect",
    dependencies=[sql],
    source_file_regexes=[
        "connector/connect",
    ],
    sbt_test_goals=[
        "connect/test",
    ],
)

sketch = Module(
    name="sketch",
    dependencies=[tags],
@@ -473,18 +484,18 @@ def __hash__(self):
],
)

pyspark_sql = Module(
name="pyspark-sql-connect",
dependencies=[pyspark_core, hive, avro],
pyspark_connect = Module(
name="pyspark-connect",
dependencies=[pyspark_sql, connect],
source_file_regexes=["python/pyspark/sql/connect"],
python_test_goals=[
# doctests
# No doctests yet.
# unittests
"pyspark.sql.tests.connect.test_column_expressions",
"pyspark.sql.tests.connect.test_plan_only",
"pyspark.sql.tests.connect.test_select_ops",
"pyspark.sql.tests.connect.test_spark_connect",
"pyspark.sql.tests.test_connect_column_expressions",
"pyspark.sql.tests.test_connect_plan_only",
"pyspark.sql.tests.test_connect_select_ops",
"pyspark.sql.tests.test_connect_basic",
],
excluded_python_implementations=[
"PyPy" # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and
20 changes: 10 additions & 10 deletions dev/sparktestsupport/utils.py
@@ -108,23 +108,23 @@ def determine_modules_to_test(changed_modules, deduplicated=True):
['graphx', 'examples']
>>> [x.name for x in determine_modules_to_test([modules.sql])]
... # doctest: +NORMALIZE_WHITESPACE
['sql', 'avro', 'docker-integration-tests', 'hive', 'mllib', 'sql-kafka-0-10', 'examples',
'hive-thriftserver', 'pyspark-sql', 'pyspark-sql-connect', 'repl', 'sparkr',
['sql', 'avro', 'connect', 'docker-integration-tests', 'hive', 'mllib', 'sql-kafka-0-10',
'examples', 'hive-thriftserver', 'pyspark-sql', 'repl', 'sparkr', 'pyspark-connect',
'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-ml']
>>> sorted([x.name for x in determine_modules_to_test(
... [modules.sparkr, modules.sql], deduplicated=False)])
... # doctest: +NORMALIZE_WHITESPACE
['avro', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver', 'mllib',
'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-sql',
'pyspark-sql-connect', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver',
'mllib', 'pyspark-connect', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas',
'pyspark-pandas-slow', 'pyspark-sql', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
>>> sorted([x.name for x in determine_modules_to_test(
... [modules.sql, modules.core], deduplicated=False)])
... # doctest: +NORMALIZE_WHITESPACE
['avro', 'catalyst', 'core', 'docker-integration-tests', 'examples', 'graphx', 'hive',
'hive-thriftserver', 'mllib', 'mllib-local', 'pyspark-core', 'pyspark-ml', 'pyspark-mllib',
'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql',
'pyspark-sql-connect', 'pyspark-streaming', 'repl', 'root', 'sparkr', 'sql',
'sql-kafka-0-10', 'streaming', 'streaming-kafka-0-10', 'streaming-kinesis-asl']
['avro', 'catalyst', 'connect', 'core', 'docker-integration-tests', 'examples', 'graphx',
'hive', 'hive-thriftserver', 'mllib', 'mllib-local', 'pyspark-connect', 'pyspark-core',
'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource',
'pyspark-sql', 'pyspark-streaming', 'repl', 'root', 'sparkr', 'sql', 'sql-kafka-0-10',
'streaming', 'streaming-kafka-0-10', 'streaming-kinesis-asl']
"""
modules_to_test = set()
for module in changed_modules:
2 changes: 1 addition & 1 deletion pom.xml
@@ -100,7 +100,7 @@
<module>connector/kafka-0-10-assembly</module>
<module>connector/kafka-0-10-sql</module>
<module>connector/avro</module>
<module>connect</module>
<module>connector/connect</module>
<!-- See additional modules enabled by profiles below -->
</modules>

9 changes: 3 additions & 6 deletions python/mypy.ini
@@ -26,12 +26,6 @@ warn_redundant_casts = True
[mypy-pyspark.sql.connect.proto.*]
ignore_errors = True

; TODO(SPARK-40537) reenable mypi support.
[mypy-pyspark.sql.tests.connect.*]
disallow_untyped_defs = False
ignore_missing_imports = True
ignore_errors = True

; Allow untyped def in internal modules and tests

[mypy-pyspark.daemon]
Expand Down Expand Up @@ -78,6 +72,9 @@ disallow_untyped_defs = False

[mypy-pyspark.sql.tests.*]
disallow_untyped_defs = False
; TODO(SPARK-40537) reenable mypi support.
ignore_missing_imports = True
ignore_errors = True

[mypy-pyspark.sql.pandas.serializers]
disallow_untyped_defs = False
17 changes: 9 additions & 8 deletions python/pyspark/sql/connect/README.md
@@ -1,5 +1,4 @@

# [EXPERIMENTAL] Spark Connect
# Spark Connect

**Spark Connect is a strictly experimental feature and under heavy development.
All APIs should be considered volatile and should not be used in production.**
@@ -8,30 +7,32 @@ This module contains the implementation of Spark Connect which is a logical plan
facade for the implementation in Spark. Spark Connect is directly integrated into the build
of Spark. To enable it, you only need to activate the driver plugin for Spark Connect.




## Build

1. Build Spark as usual per the documentation.

2. Build and package the Spark Connect package

```bash
./build/mvn -Phive package
```

or
```shell

```bash
./build/sbt -Phive package
```

## Run Spark Shell

```bash
./bin/spark-shell --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
./bin/spark-shell \
--packages org.apache.spark:spark-connect_2.12:3.4.0 \
--conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

## Run Tests


```bash
./run-tests --testnames 'pyspark.sql.tests.connect.test_spark_connect'
```
16 changes: 0 additions & 16 deletions python/pyspark/sql/tests/connect/__init__.py

This file was deleted.

20 changes: 0 additions & 20 deletions python/pyspark/sql/tests/connect/utils/__init__.py

This file was deleted.

@@ -22,13 +22,18 @@
from pyspark.sql import SparkSession, Row
from pyspark.sql.connect.client import RemoteSparkSession
from pyspark.sql.connect.function_builder import udf
from pyspark.testing.connectutils import should_test_connect, connect_requirement_message
from pyspark.testing.utils import ReusedPySparkTestCase


@unittest.skipIf(not should_test_connect, connect_requirement_message)
class SparkConnectSQLTestCase(ReusedPySparkTestCase):
"""Parent test fixture class for all Spark Connect related
test cases."""

connect = RemoteSparkSession
tbl_name = str

@classmethod
def setUpClass(cls: Any) -> None:
ReusedPySparkTestCase.setUpClass()
@@ -55,7 +60,6 @@ def spark_connect_test_data(cls: Any) -> None:

class SparkConnectTests(SparkConnectSQLTestCase):
def test_simple_read(self) -> None:
"""Tests that we can access the Spark Connect GRPC service locally."""
df = self.connect.read.table(self.tbl_name)
data = df.limit(10).toPandas()
# Check that the limit is applied
@@ -77,7 +81,7 @@ def test_simple_explain_string(self) -> None:


if __name__ == "__main__":
from pyspark.sql.tests.connect.test_spark_connect import * # noqa: F401
from pyspark.sql.tests.test_connect_basic import * # noqa: F401

try:
import xmlrunner # type: ignore