Describe the bug

databricks-connect is a common tool used to interact with data stored in Databricks using Spark. This package essentially installs its own "custom" version of pyspark, so it conflicts with pyspark if it is already installed; the official docs point this out repeatedly. Since pandera version 0.20.4, where support for pyspark connect dataframes was introduced, it is currently not possible to use pandera pyspark models with databricks-connect. This is due to three issues:

1. When databricks-connect is installed (at least in my version, which is 11.3), the "custom" version of pyspark that gets installed reports a version equal to the Databricks runtime. Without getting into details, these versions are typically above 10, whereas the highest released version of pyspark is currently 3.5.4.
2. The current pandera codebase is littered with these types of version checks, for example in pandera.backends.pyspark.register (a sketch is shown below).
3. When the code from 2. checks the version from 1., it tries to import from pyspark.sql.connect. The issue is that the pyspark code installed by databricks-connect does not have a pyspark.sql.connect module.
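To make point 2. concrete, here is a minimal sketch of that kind of check; it illustrates the pattern rather than quoting the exact code from pandera.backends.pyspark.register:

```python
from packaging import version
import pyspark

CURRENT_PYSPARK_VERSION = version.parse(pyspark.__version__)

# If the reported pyspark version looks new enough, the Spark Connect API is
# assumed to exist and is imported unconditionally. Under databricks-connect
# the reported version is the runtime version (e.g. 11.3), so this branch is
# taken even though pyspark.sql.connect is not actually present.
if CURRENT_PYSPARK_VERSION >= version.parse("3.4"):
    from pyspark.sql.connect import dataframe as psc
```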
Code Sample, a copy-pastable example
Steps to replicate:
pip install "pandera==0.22.1"
pip install "databricks-connect==11.3.*"
Run any pandera.pyspark import, e.g.:
```python
from pandera.pyspark import DataFrameModel
```
You should see an error about the missing connect module from pyspark
Expected behavior
Ideally, either state this dependency explicitly in the docs or, better yet, refactor the if statements to only accept pyspark versions between 3.4 and the latest released pyspark version. A draft of how the if statements could look:
```python
import json
import urllib.request

from packaging import version
import pyspark

# Retrieve the latest PySpark version from PyPI
latest_pyspark_version = version.parse(
    json.load(urllib.request.urlopen("https://pypi.org/pypi/pyspark/json"))["info"]["version"]
)

# Handles optional Spark Connect imports for pyspark >= 3.4 and <= latest available version
CURRENT_PYSPARK_VERSION = version.parse(pyspark.__version__)
if version.parse("3.4") <= CURRENT_PYSPARK_VERSION <= latest_pyspark_version:
    from pyspark.sql.connect import dataframe as psc
```
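A variant that avoids querying PyPI at import time would be to attempt the import and degrade gracefully when the module is missing; again a sketch under the same assumptions, not the actual pandera code:

```python
from packaging import version
import pyspark

CURRENT_PYSPARK_VERSION = version.parse(pyspark.__version__)

# Only attempt the Spark Connect import for versions that could plausibly ship
# it, and tolerate distributions (such as the pyspark installed by
# databricks-connect) that report a high version but lack the module.
psc = None
if CURRENT_PYSPARK_VERSION >= version.parse("3.4"):
    try:
        from pyspark.sql.connect import dataframe as psc
    except ImportError:
        psc = None
```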
Desktop:
- OS: macOS
- Browser: Chrome