
Versions >0.20.3 incompatible with older versions of databricks-connect #1914

marrov opened this issue on Feb 20, 2025
marrov commented Feb 20, 2025

Describe the bug
databricks-connect is a common tool for interacting with data stored in Databricks using Spark. The package essentially installs its own "custom" version of pyspark, so it conflicts with pyspark if that is already installed; the official docs point this out repeatedly. Since pandera version 0.20.4, where support for Spark Connect pyspark dataframes was introduced, it is currently not possible to use pandera pyspark models together with databricks-connect. This is due to three issues:

  1. When databricks-connect is installed (at least in my version, which is 11.3), the "custom" pyspark it installs reports a version equal to the Databricks runtime version. Without getting into details, these versions are typically above 10, whereas the highest actual pyspark release at the time of writing is 3.5.4 (see the sketch after this list).
  2. The current pandera codebase is littered with checks of this kind (example from pandera/backends/pyspark/register.py):
# Handles optional Spark Connect imports for pyspark>=3.4 (if available)
CURRENT_PYSPARK_VERSION = version.parse(pyspark.__version__)
if CURRENT_PYSPARK_VERSION >= version.parse("3.4"):
    from pyspark.sql.connect import dataframe as psc
  3. When the code from point 2 checks the version from point 1, the comparison passes and it tries to import from pyspark.sql.connect. The problem is that the pyspark installed by databricks-connect does not have a pyspark.sql.connect module.
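
The version mismatch from point 1 can be demonstrated directly. A minimal sketch, assuming databricks-connect 11.3 reports a runtime-style version string (the exact string may differ per release):

import pyspark
from packaging import version

# Under databricks-connect 11.3 this prints a runtime-style version
# (assumption: something like "11.3.x"), not an upstream Spark release.
print(pyspark.__version__)

# pandera's guard therefore evaluates to True and the connect import is attempted:
print(version.parse(pyspark.__version__) >= version.parse("3.4"))  # True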

Code Sample, a copy-pastable example

Steps to replicate:

  1. pip install "pandera==0.22.1"
  2. pip install "databricks-connect==11.3.*"
  3. Run any pandera.pyspark import, e.g.:
from pandera.pyspark import DataFrameModel

You should see an import error about the missing pyspark.sql.connect module.
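
To confirm the module is genuinely absent from the databricks-connect build of pyspark (rather than pandera mis-importing it), a quick check with importlib, assuming the parent pyspark.sql package itself imports cleanly:

import importlib.util

# find_spec returns None when the submodule cannot be located; that is the
# case for the pyspark shipped with databricks-connect 11.3.
print(importlib.util.find_spec("pyspark.sql.connect"))  # None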

Expected behavior

Ideally, either state this dependency explicitly in the docs or, better yet, refactor the if statements so that they only treat pyspark versions between 3.4 and the latest released pyspark as Spark Connect capable. A draft of how the checks could look:

import json
import urllib.request
from packaging import version
import pyspark

# Retrieve the latest PySpark version from PyPI
latest_pyspark_version = version.parse(
    json.load(urllib.request.urlopen("https://pypi.org/pypi/pyspark/json"))["info"]["version"]
)

# Handles optional Spark Connect imports for pyspark>=3.4 and <= latest available version
CURRENT_PYSPARK_VERSION = version.parse(pyspark.__version__)
if version.parse("3.4") <= CURRENT_PYSPARK_VERSION <= latest_pyspark_version:
    from pyspark.sql.connect import dataframe as psc
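
Note that querying PyPI at import time adds a network dependency. A more self-contained alternative (a sketch, not pandera's current code) would be to simply attempt the import and fall back gracefully:

# Network-free alternative sketch: probe for the module instead of trusting
# the version string reported by the installed pyspark.
try:
    from pyspark.sql.connect import dataframe as psc
except ImportError:  # raised by databricks-connect builds lacking pyspark.sql.connect
    psc = None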

Desktop:

  • OS: macOS
  • Browser: Chrome
marrov added the bug label on Feb 20, 2025