SNOW-2126350: Cannot use Snowpark LightGBM wrappers with sparse columns if the first 100 rows of a column are all missing values. #159

@zyzil

Description

We have several LightGBM regression models that we are trying to refactor to use the Snowflake distributed wrappers. We've run into an issue where the Snowpark code that infers model signatures fails on datasets that have columns with missing values. We rely on LightGBM's built-in handling of missing values. Training on the same data with the Snowflake versions raises an error because the signatures can't be inferred. There also doesn't seem to be a way to manually specify the signatures to work around it.

I've distilled it down to a small reproducible example.

import numpy as np
import pandas as pd
from snowflake.ml.modeling.lightgbm import LGBMRegressor  # type: ignore
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()

# Create a DataFrame with 4 columns of random floats and a 5th column (E) whose first 100 rows are NaN
num_rows = 200
data = np.random.rand(num_rows, 4)
fifth_col = np.concatenate([np.full(100, np.nan), np.random.rand(100)])

df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
df['E'] = fifth_col
df.head(10)

spdf = session.create_dataframe(df)

input_cols = ['A', 'B', 'C', 'E']
label_cols = 'D'

lgb_model = LGBMRegressor(
    input_cols=input_cols,
    label_cols=label_cols,
    objective='rmse',
    reg_sqrt=True,
    verbosity=-1,
    boosting_type='gbdt',
    random_state=42,
)

t = lgb_model.fit(spdf)

Error:

C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\model_signature.py:71: UserWarning: The sample input has 200 rows. Using the first 100 rows to define the inputs and outputs of the model and the data types of each. Use `signatures` parameter to specify model inputs and outputs manually if the automatic inference is not correct.
  warnings.warn(
Traceback (most recent call last):
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 539, in wrap
    return ctx.run(execute_func_with_statement_params)
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 500, in execute_func_with_statement_params
    result = func(*args, **kwargs)
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\modeling\framework\base.py", line 440, in fit
    return self._fit(dataset)
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\modeling\lightgbm\lgbm_regressor.py", line 290, in _fit
    self._generate_model_signatures(dataset)
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\modeling\lightgbm\lgbm_regressor.py", line 1107, in _generate_model_signatures
    inputs = list(_infer_signature(_truncate_data(dataset[self.input_cols], INFER_SIGNATURE_MAX_ROWS), "input", use_snowflake_identifiers=True))
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\model_signature.py", line 114, in _infer_signature
    signature = handler.infer_signature(data, role)
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\_signatures\snowpark_handler.py", line 54, in infer_signature
    return pandas_handler.PandasDataFrameHandler.infer_signature(
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\_signatures\pandas_handler.py", line 143, in infer_signature
    raise snowml_exceptions.SnowflakeMLException(
snowflake.ml._internal.exceptions.exceptions.SnowflakeMLException: ValueError('(2112) Data Validation Error: There is no non-null data in column E so the signature cannot be inferred.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\dev\src\poc\ds-goat-snowflake\issue.py", line 32, in <module>
    t = lgb_model.fit(spdf)
  File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 571, in wrap
    raise me.original_exception from e
ValueError: (2112) Data Validation Error: There is no non-null data in column E so the signature cannot be inferred.
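
To make the trigger easier to see without a Snowflake session, here is a minimal pandas-only sketch. It assumes the 100-row sample mentioned in the warning above is simply the first 100 rows; `SAMPLE_ROWS`, `all_null_in_head`, and `move_complete_rows_first` are hypothetical names of my own, not part of snowflake-ml-python. The second helper is only a possible workaround idea: reorder the rows so the most complete ones come first. I have not verified that Snowpark preserves this row order after `session.create_dataframe`, so treat it as untested.

```python
from typing import List

import numpy as np
import pandas as pd

# Assumption: mirrors the 100-row limit mentioned in the UserWarning above.
SAMPLE_ROWS = 100


def all_null_in_head(df: pd.DataFrame, n: int = SAMPLE_ROWS) -> List[str]:
    """Return the columns that have no non-null value in the first n rows."""
    head = df.head(n)
    return [col for col in head.columns if head[col].isna().all()]


def move_complete_rows_first(df: pd.DataFrame) -> pd.DataFrame:
    """Stable-sort rows so those with the fewest missing values come first."""
    order = df.isna().sum(axis=1).sort_values(kind="stable").index
    return df.loc[order].reset_index(drop=True)


# Same shape as the repro: column E is NaN for the first 100 of 200 rows.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((200, 4)), columns=["A", "B", "C", "D"])
df["E"] = np.concatenate([np.full(100, np.nan), rng.random(100)])

print(all_null_in_head(df))                            # ['E']
print(all_null_in_head(move_complete_rows_first(df)))  # []
```

Running `all_null_in_head` on the real training data before calling `fit` should list the columns that would hit error 2112 under that assumption.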
