BUG: Polars DataFrameModel.validate crashes with sample specified #1912

Open
3 tasks done
m-richards opened this issue Feb 15, 2025 · 1 comment
Labels
bug Something isn't working

m-richards commented Feb 15, 2025

Describe the bug

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandera.polars as pa

import pandas as pd
import polars as pl

from pandera.typing import Series


class SchemaPolars(pa.DataFrameModel):
    col1: Series[int]
    col2: Series[int]

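# Build a small polars DataFrame by way of pandas.
pandas_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [1, 2, 3]})
polars_df = pl.from_pandas(pandas_df)

# Validating with `sample` set raises the AttributeError shown in the traceback below.
result2 = SchemaPolars.validate(polars_df, sample=10)
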
Traceback (most recent call last):
  File "C:\Data\myDocuments\Code\python_other\pandera\fork\tester_sample.py", line 24, in <module>
    result2 = SchemaPolars.validate(lazyframe, sample=10)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Data\myDocuments\Code\python_other\pandera\fork\pandera\api\dataframe\model.py", line 289, in validate
    cls.to_schema().validate(
  File "C:\Data\myDocuments\Code\python_other\pandera\fork\pandera\api\polars\container.py", line 64, in validate
    output = self.get_backend(check_obj).validate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Data\myDocuments\Code\python_other\pandera\fork\pandera\backends\polars\container.py", line 80, in validate
    sample = self.subsample(check_obj, head, tail, sample, random_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Data\myDocuments\Code\python_other\pandera\fork\pandera\backends\polars\base.py", line 50, in subsample
    check_obj.sample(sample, random_state=random_state)
    ^^^^^^^^^^^^^^^^
AttributeError: 'LazyFrame' object has no attribute 'sample'

Expected behavior

Validation with sample specified should either work or raise a NotImplementedError if sampling is not supported.

Desktop (please complete the following information):

  • OS: Windows 11, using python 3.12.8

Additional context

I came across this while running mypy over pandera.
The implementation calls .sample, which is a pl.DataFrame method, but pl.LazyFrame has no equivalent.

pola-rs/polars#3933 discusses this with some potential workarounds listed, e.g.
lazy_df.with_row_index().filter(col("index").hash(seed)%10 == 1).drop("index")

@m-richards m-richards added the bug Something isn't working label Feb 15, 2025
@cosmicBboy

Thanks for finding this, @m-richards! Yes, this should only work with pl.DataFrame and raise a NotImplementedError with pl.LazyFrame.
