Open
Description
We have several LightGBM regression models that we are trying to refactor to use the Snowflake distributed wrappers. The Snowpark code that infers model signatures fails on datasets that have columns with missing values. We rely on LightGBM's built-in handling of missing values, but training on the same data with the Snowflake version raises an error because the signature cannot be inferred. There also doesn't seem to be a way to specify the signatures manually to work around it.
I've distilled it down to a small reproducible example.
import numpy as np
import pandas as pd
from snowflake.ml.modeling.lightgbm import LGBMRegressor # type: ignore
from snowflake.snowpark import Session
session = Session.builder.getOrCreate()
# Create a DataFrame with 4 columns of random floats and a 5th column whose first 100 values are NaN
num_rows = 200
data = np.random.rand(num_rows, 4)
fifth_col = np.concatenate([np.full(100, np.nan), np.random.rand(100)])
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
df['E'] = fifth_col
df.head(10)
spdf = session.create_dataframe(df)
input_cols = ['A', 'B', 'C', 'E']
label_cols = 'D'
lgb_model = LGBMRegressor(
    input_cols=input_cols,
    label_cols=label_cols,
    objective='rmse',
    reg_sqrt=True,
    verbosity=-1,
    boosting_type='gbdt',
    random_state=42,
)
t = lgb_model.fit(spdf)
Error:
C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\model_signature.py:71: UserWarning: The sample input has 200 rows. Using the first 100 rows to define the inputs and outputs of the model and the data types of each. Use `signatures` parameter to specify model inputs and outputs manually if the automatic inference is not correct.
warnings.warn(
Traceback (most recent call last):
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 539, in wrap
return ctx.run(execute_func_with_statement_params)
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 500, in execute_func_with_statement_params
result = func(*args, **kwargs)
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\modeling\framework\base.py", line 440, in fit
return self._fit(dataset)
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\modeling\lightgbm\lgbm_regressor.py", line 290, in _fit
self._generate_model_signatures(dataset)
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\modeling\lightgbm\lgbm_regressor.py", line 1107, in _generate_model_signatures
inputs = list(_infer_signature(_truncate_data(dataset[self.input_cols], INFER_SIGNATURE_MAX_ROWS), "input", use_snowflake_identifiers=True))
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\model_signature.py", line 114, in _infer_signature
signature = handler.infer_signature(data, role)
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\_signatures\snowpark_handler.py", line 54, in infer_signature
return pandas_handler.PandasDataFrameHandler.infer_signature(
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\model\_signatures\pandas_handler.py", line 143, in infer_signature
raise snowml_exceptions.SnowflakeMLException(
snowflake.ml._internal.exceptions.exceptions.SnowflakeMLException: ValueError('(2112) Data Validation Error: There is no non-null data in column E so the signature cannot be inferred.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\dev\src\poc\ds-goat-snowflake\issue.py", line 32, in <module>
t = lgb_model.fit(spdf)
File "C:\dev\src\poc\ds-goat-snowflake\.venv\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 571, in wrap
raise me.original_exception from e
ValueError: (2112) Data Validation Error: There is no non-null data in column E so the signature cannot be inferred.
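The warning and the error above, read together, suggest that only the first 100 rows are sampled for signature inference, and in the example those are exactly the rows where column E is NaN. The following is a minimal pandas-only sketch of that reading, not a confirmed fix: it checks which columns are entirely null in the sampled head, then shuffles the rows so the head is very likely to contain non-null values. The helper `all_null_in_head` and the constant `INFER_ROWS` are hypothetical names introduced here; the value 100 is taken from the warning text. Whether the shuffled order survives `session.create_dataframe` and the library's truncation step is an assumption that has not been verified.

```python
import numpy as np
import pandas as pd

# Assumption: 100 is the row count quoted in the UserWarning above.
INFER_ROWS = 100

# Rebuild the same shape of data as the reproducible example.
rng = np.random.default_rng(42)
num_rows = 200
df = pd.DataFrame(rng.random((num_rows, 4)), columns=["A", "B", "C", "D"])
df["E"] = np.concatenate([np.full(100, np.nan), rng.random(100)])


def all_null_in_head(frame: pd.DataFrame, n: int = INFER_ROWS) -> list:
    """Return the columns whose first n rows contain no non-null value."""
    head = frame.head(n)
    return [col for col in head.columns if head[col].isna().all()]


# Column E has no non-null value in the first 100 rows.
print(all_null_in_head(df))  # ['E']

# Shuffle the rows so the sampled head mixes null and non-null values.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
print(all_null_in_head(shuffled))
```

If the second call reports no all-null columns, the shuffled frame could be passed to `session.create_dataframe` in place of `df` to see whether `fit` gets past signature inference. This only sidesteps the sampling problem; it does not help for a column that is sparse enough to be null throughout any 100-row sample.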