Description
Context
Environment
- snowflake-ml-python version: 1.20.0
- snowflake-snowpark-python: 1.28.0
- snowflake-connector-python[pandas]: 3.17.3
- Python versions tested: 3.10, 3.11
NOTE: Also fails in a notebook environment in Snowsight
Use case
When using Snowpark to run a LinearRegression, the fit call fails regardless of local package setup or Python version because of how the train method in snowpark_trainer.py behaves. The failure can be reproduced with a SQL query that defines a temporary stored procedure for a Python function.
Impact
This was part of vetting Snowflake vs Amazon Sagemaker for AI/ML engineering at a business. The experience was so incredibly frustrating - wasting days of development time - that it made Snowflake look unusable and very immature in comparison.
Repro Steps
Run this SQL to create a procedure with the packages defined by snowpark_trainer.py and run some Python that examines the environment and its packages. The package list below is the one used when running LinearRegression.
-- Temporary procedure that reports the Python runtime and installed packages
CREATE OR REPLACE TEMPORARY PROCEDURE DEBUG_ENV_PROBE()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'
PACKAGES=('snowflake-snowpark-python','snowflake-telemetry-python','numpy==2.2.5','scikit-learn==1.7.2','cloudpickle==3.1.1','xgboost==3.1.2')
HANDLER = 'probe'
AS
$$
import sys
import os
import importlib.metadata

def probe(session):
    report = []
    report.append("--- PYTHON RUNTIME ---")
    report.append(f"Version: {sys.version}")
    report.append("\n--- INSTALLED PACKAGES ---")
    # Collect (name, version) for every installed distribution, sorted by name
    dists = sorted(
        [(d.metadata["Name"], d.version) for d in importlib.metadata.distributions()],
        key=lambda x: x[0].lower()
    )
    for name, version in dists:
        report.append(f"{name} == {version}")
    return "\n".join(report)
$$;
-- Run the probe
CALL DEBUG_ENV_PROBE();
Expected
This should succeed and print out information about the Python version and packages.
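For comparison, the same probe logic can be run against a local interpreter to see what the report looks like when the environment resolves. This is only a sketch: `local_probe` is a name made up for this example, and it inspects the local environment, not the Snowflake one.

```python
import importlib.metadata
import sys


def local_probe() -> str:
    """Build the same report as the stored procedure, for the local interpreter."""
    report = [
        "--- PYTHON RUNTIME ---",
        f"Version: {sys.version}",
        "",
        "--- INSTALLED PACKAGES ---",
    ]
    # Some distributions may lack a Name field, so fall back to a placeholder.
    dists = sorted(
        ((d.metadata.get("Name") or "<unknown>", d.version)
         for d in importlib.metadata.distributions()),
        key=lambda x: x[0].lower(),
    )
    for name, version in dists:
        report.append(f"{name} == {version}")
    return "\n".join(report)


print(local_probe())
```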
Actual
A package conflict error, as described here:
https://community.snowflake.com/s/article/package-conflicts-were-detected-when-starting-the-snowflake-notebook
Additional information
Which packages are causing conflicts?
Performing either of these changes to the packages list makes the query succeed:
Add snowflake-ml-python
PACKAGES = ('snowflake-snowpark-python', 'snowflake-telemetry-python', 'numpy==2.2.5', 'scikit-learn==1.7.2', 'cloudpickle==3.1.1', 'xgboost==3.1.2', 'snowflake-ml-python')
Remove snowflake-telemetry-python
PACKAGES=('snowflake-snowpark-python', 'numpy==2.2.5','scikit-learn==1.7.2','cloudpickle==3.1.1','xgboost==3.1.2')
It's unclear to me why the dependency solver succeeds when the ML package is added or the telemetry package is removed.
(The end-user cannot add or remove packages though - even by changing the session packages.)
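The two working variants above were found by hand. A small helper along these lines could generate every leave-one-out `PACKAGES` clause to paste into the probe procedure, which makes it easier to narrow down which package triggers the conflict. The function names are hypothetical and the script only builds strings; it does not talk to Snowflake.

```python
# Package list copied from the failing probe procedure.
BASE_PACKAGES = [
    "snowflake-snowpark-python",
    "snowflake-telemetry-python",
    "numpy==2.2.5",
    "scikit-learn==1.7.2",
    "cloudpickle==3.1.1",
    "xgboost==3.1.2",
]


def packages_clause(packages):
    """Render a list of package specs as a SQL PACKAGES=(...) clause."""
    quoted = ",".join(f"'{p}'" for p in packages)
    return f"PACKAGES=({quoted})"


def leave_one_out(packages):
    """Yield (removed_package, remaining_packages) for each package in turn."""
    for i, removed in enumerate(packages):
        yield removed, packages[:i] + packages[i + 1:]


for removed, remaining in leave_one_out(BASE_PACKAGES):
    print(f"-- without {removed}")
    print(packages_clause(remaining))
```

Each printed clause can replace the `PACKAGES` line in the probe; the variants that succeed point at the package (or pin) involved in the conflict.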
Where is the source code?
This is the snowpark_trainer.py line that defines packages used:
https://github.com/snowflakedb/snowflake-ml-python/blob/release_1.20.0/snowflake/ml/modeling/_internal/snowpark_implementations/snowpark_trainer.py#L234
Suggested fixes
- Consider automated testing of package combinations. In theory this failure could be caught by a test that simply checks whether a LinearRegression model can be trained.
- The error message is very misleading, since the packages are out of the end-user's control. This caused our engineer to waste days looking into the dependencies, thinking they could address the issue.
  Since Snowpark must do translations and can introduce bugs itself, consider catching the error and adding a note to that effect to the message shown to users.
- It's not clear to me whether the bug is in the package list or in package conflict resolution (solving), but one of them needs to be fixed for this to work.
  I'm not experienced enough to know if this is true, but it seems like snowflake-ml-python should be in the list of packages. If that's not the case, then the telemetry package and the solver need to be looked at.
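As an illustration of the error-message suggestion, a wrapper along these lines could attach the library-chosen package list to whatever error comes back. This is a sketch only: `TrainingEnvironmentError` and `run_with_package_context` are hypothetical names, not part of snowflake-ml-python, and the real error type raised by Snowflake is not assumed here.

```python
class TrainingEnvironmentError(RuntimeError):
    """Hypothetical error type that adds package context to a training failure."""


def run_with_package_context(train_fn, packages):
    """Call train_fn and, on failure, re-raise with the package list attached.

    train_fn: zero-argument callable that performs the training (assumption).
    packages: the package specs the library selected for the stored procedure.
    """
    try:
        return train_fn()
    except Exception as exc:
        raise TrainingEnvironmentError(
            "Training failed inside a stored procedure whose package list is "
            f"chosen by the library, not by the caller: {packages}. "
            "Changing the session packages will not affect it. "
            f"Original error: {exc}"
        ) from exc
```

Keeping the original exception as the cause preserves the solver's own message while making it clear that the package list is not something the user can change.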
Thank you for your attention.
We've continued our vetting process by avoiding Snowpark and using SPCS, but of course this would be a valuable feature that Snowflake offers that Sagemaker cannot. It would be nice to know if it is something we can rely on in the future.