
Versions >0.20.3 incompatible with older versions of databricks-connect #1914

marrov opened this issue on Feb 20, 2025
marrov commented Feb 20, 2025

Describe the bug
databricks-connect is a common tool for interacting with data stored in Databricks using Spark. The package essentially installs its own "custom" version of pyspark, so it conflicts with pyspark if that is already installed; the official docs point this out repeatedly. Since pandera version 0.20.4, where support for Spark Connect pyspark dataframes was introduced, it is currently not possible to use pandera pyspark models together with databricks-connect. This is due to three issues:

  1. When databricks-connect is installed (at least in my version, which is 11.3), the "custom" pyspark it installs reports a version equal to the Databricks runtime version. Without getting into details, these versions are typically above 10, whereas the highest actual pyspark release at the time of writing is 3.5.4 (see the sketch after this list).
  2. The current pandera codebase is littered with checks of this kind (example from pandera/backends/pyspark/register.py):
# Handles optional Spark Connect imports for pyspark>=3.4 (if available)
CURRENT_PYSPARK_VERSION = version.parse(pyspark.__version__)
if CURRENT_PYSPARK_VERSION >= version.parse("3.4"):
    from pyspark.sql.connect import dataframe as psc
  3. When the code from point 2 checks the version from point 1, the comparison passes and it tries to import from pyspark.sql.connect. The problem is that the pyspark installed by databricks-connect does not have a pyspark.sql.connect module.
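
The version mismatch from point 1 can be demonstrated directly. A minimal sketch, assuming databricks-connect 11.3 reports a runtime-style version string (the exact string may differ per release):

import pyspark
from packaging import version

# Under databricks-connect 11.3 this prints a runtime-style version
# (assumption: something like "11.3.x"), not an upstream Spark release.
print(pyspark.__version__)

# pandera's guard therefore evaluates to True and the connect import is attempted:
print(version.parse(pyspark.__version__) >= version.parse("3.4"))  # True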

Code Sample, a copy-pastable example

Steps to replicate:

  1. pip install "pandera==0.22.1"
  2. pip install "databricks-connect==11.3.*"
  3. Run any pandera.pyspark import, e.g.:
from pandera.pyspark import DataFrameModel

You should see an import error about the missing pyspark.sql.connect module.
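
To confirm the module is genuinely absent from the databricks-connect build of pyspark (rather than pandera mis-importing it), a quick check with importlib, assuming the parent pyspark.sql package itself imports cleanly:

import importlib.util

# find_spec returns None when the submodule cannot be located; that is the
# case for the pyspark shipped with databricks-connect 11.3.
print(importlib.util.find_spec("pyspark.sql.connect"))  # None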

Expected behavior

Ideally, either state this dependency explicitly in the docs or, better yet, refactor the if statements so that they only treat pyspark versions between 3.4 and the latest released pyspark as Spark Connect capable. A draft of how the checks could look:

import json
import urllib.request
from packaging import version
import pyspark

# Retrieve the latest PySpark version from PyPI
latest_pyspark_version = version.parse(
    json.load(urllib.request.urlopen("https://pypi.org/pypi/pyspark/json"))["info"]["version"]
)

# Handles optional Spark Connect imports for pyspark>=3.4 and <= latest available version
CURRENT_PYSPARK_VERSION = version.parse(pyspark.__version__)
if version.parse("3.4") <= CURRENT_PYSPARK_VERSION <= latest_pyspark_version:
    from pyspark.sql.connect import dataframe as psc
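
Note that querying PyPI at import time adds a network dependency. A more self-contained alternative (a sketch, not pandera's current code) would be to simply attempt the import and fall back gracefully:

# Network-free alternative sketch: probe for the module instead of trusting
# the version string reported by the installed pyspark.
try:
    from pyspark.sql.connect import dataframe as psc
except ImportError:  # raised by databricks-connect builds lacking pyspark.sql.connect
    psc = None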

Desktop:

  • OS: macOS
  • Browser: Chrome
marrov added the bug label on Feb 20, 2025