A wrapper around the Apache Spark Connect client with additional functionality that lets applications communicate with a remote Dataproc Spark cluster over the Spark Connect protocol without any additional setup.
Install the package:

.. code-block:: console

   pip install dataproc_spark_connect
To uninstall:

.. code-block:: console

   pip uninstall dataproc_spark_connect
This client requires permissions to manage Dataproc sessions and session templates. If you are running the client outside of Google Cloud, you must set the following environment variables (an example of setting them follows the list):

- GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads.
- GOOGLE_CLOUD_REGION - The Compute Engine region where you run the Spark workload.
- GOOGLE_APPLICATION_CREDENTIALS - The path to your application credentials file.
- DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The location of the session config file, such as tests/integration/resources/session.textproto.
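For example, when running from a local script or notebook, the variables can be set from Python before the session is created. This is only a sketch; the project ID, region, and key-file path below are placeholders, not real values:

.. code-block:: python

   import os

   # Placeholder values; replace with your own project, region, and credentials path.
   os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"
   os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"
   os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

   # Optional: point the client at a session config file.
   os.environ["DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG"] = "tests/integration/resources/session.textproto"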
- Install the latest version of the Dataproc Python client and Dataproc Spark Connect modules:

  .. code-block:: console

     pip install google_cloud_dataproc --force-reinstall
     pip install dataproc_spark_connect --force-reinstall
- Add the required import into your PySpark application or notebook:

  .. code-block:: python

     from google.cloud.spark_connect import GoogleSparkSession
- There are two ways to create a Spark session (a short sanity check follows the two examples):
  - Start a Spark session using the properties defined in DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG:

    .. code-block:: python

       spark = GoogleSparkSession.builder.getOrCreate()
  - Start a Spark session with the following code instead of using a config file:

    .. code-block:: python

       from google.cloud.dataproc_v1 import SparkConnectConfig
       from google.cloud.dataproc_v1 import Session

       dataproc_config = Session()
       dataproc_config.spark_connect_session = SparkConnectConfig()
       dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
       dataproc_config.runtime_config.version = '3.0'

       spark = GoogleSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
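Either way, the returned object is a regular Spark Connect SparkSession, so standard DataFrame operations run on the remote Dataproc session. As a rough sanity check (a sketch assuming spark was created by one of the snippets above):

.. code-block:: python

   # Runs on the remote Dataproc session created above.
   df = spark.createDataFrame([(1, "spark"), (2, "connect")], ["id", "name"])
   df.show()

   # Close the client connection when you are done.
   spark.stop()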
- As this client runs the Spark workload on Dataproc, your project will be billed according to Dataproc Serverless pricing. This applies even if you are running the client from a non-GCE instance.
- Install the requirements in a virtual environment:

  .. code-block:: console

     pip install -r requirements.txt
- Build the code:

  .. code-block:: console

     python setup.py sdist bdist_wheel
- Copy the generated .whl file to Cloud Storage. Use the version specified in the setup.py file.

  .. code-block:: console

     VERSION=<version>
     gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
- Download the new SDK on Vertex, then uninstall the old version and install the new one.

  .. code-block:: console

     %%bash
     export VERSION=<version>
     gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
     yes | pip uninstall dataproc_spark_connect
     pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
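After reinstalling, it can help to confirm that the notebook kernel picks up the new version. A minimal check, assuming the distribution name matches the wheel built above and that the kernel has been restarted so the old version is no longer loaded:

.. code-block:: python

   # Print the installed version of the client package.
   from importlib.metadata import version

   print(version("dataproc_spark_connect"))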