snowpark_trainer.py specifies incomplete PACKAGES for sproc, which leads to red herring error message #192

@katechumbarova

Context

Environment

  • snowflake-ml-python version: 1.20.0
  • snowflake-snowpark-python: 1.28.0
  • snowflake-connector-python[pandas]: 3.17.3
  • Python versions tested: 3.10, 3.11

NOTE: Also fails in a notebook environment in Snowsight

Use case

When using Snowpark to run a LinearRegression, the fit execution fails regardless of local package setup or Python version, due to the behavior of the train method in snowpark_trainer.py. The failure can be reproduced with a SQL query that defines a temporary stored procedure for a Python function.

Impact

This was part of vetting Snowflake vs Amazon Sagemaker for AI/ML engineering at a business. The experience was so incredibly frustrating - wasting days of development time - that it made Snowflake look unusable and very immature in comparison.

Repro Steps

Run this SQL to simulate the packages defined by snowpark_trainer.py & run some Python to examine the environment and its packages. The example below is from running LinearRegression.

CREATE OR REPLACE TEMPORARY PROCEDURE DEBUG_ENV_PROBE()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'
PACKAGES=('snowflake-snowpark-python','snowflake-telemetry-python','numpy==2.2.5','scikit-learn==1.7.2','cloudpickle==3.1.1','xgboost==3.1.2')
HANDLER = 'probe'
AS
$$
import sys
import os
import importlib.metadata

def probe(session):
    report = []

    report.append(f"--- PYTHON RUNTIME ---")
    report.append(f"Version: {sys.version}")

    report.append(f"\n--- INSTALLED PACKAGES ---")
    dists = sorted(
        [(d.metadata["Name"], d.version) for d in importlib.metadata.distributions()],
        key=lambda x: x[0].lower()
    )
    for name, version in dists:
        report.append(f"{name} == {version}")

    return "\n".join(report)
$$;

-- Run the probe
CALL DEBUG_ENV_PROBE();

Expected

This should succeed and print out information about the Python version and packages.

Actual

A package conflict error, as described here:
https://community.snowflake.com/s/article/package-conflicts-were-detected-when-starting-the-snowflake-notebook

Additional information

Which packages are causing conflicts?

Performing either of these changes to the packages list makes the query succeed:

Add snowflake-ml-python
PACKAGES = ('snowflake-snowpark-python', 'snowflake-telemetry-python', 'numpy==2.2.5', 'scikit-learn==1.7.2', 'cloudpickle==3.1.1', 'xgboost==3.1.2', 'snowflake-ml-python')

Remove snowflake-telemetry-python
PACKAGES=('snowflake-snowpark-python', 'numpy==2.2.5','scikit-learn==1.7.2','cloudpickle==3.1.1','xgboost==3.1.2')

It's unclear to me why the dependency solver succeeds when the ML package is added or the telemetry package is removed.

(The end-user cannot add or remove packages though - even by changing the session packages.)
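
One way to narrow down which package triggers the conflict is to generate leave-one-out variants of the PACKAGES list and try each one in the probe procedure. The following is a minimal sketch, assuming the base list from the repro above; `packages_clause` and `leave_one_out` are hypothetical helper names, not part of any Snowflake library.

```python
# Base list copied from the repro steps in this issue.
BASE_PACKAGES = [
    "snowflake-snowpark-python",
    "snowflake-telemetry-python",
    "numpy==2.2.5",
    "scikit-learn==1.7.2",
    "cloudpickle==3.1.1",
    "xgboost==3.1.2",
]


def packages_clause(packages):
    """Render a list of package specs as a SQL PACKAGES clause."""
    quoted = ",".join(f"'{p}'" for p in packages)
    return f"PACKAGES=({quoted})"


def leave_one_out(packages):
    """Yield (dropped_package, remaining_packages) for each package in turn."""
    for i, dropped in enumerate(packages):
        yield dropped, packages[:i] + packages[i + 1:]


if __name__ == "__main__":
    # Print one PACKAGES clause per variant, labeled with the dropped package.
    for dropped, remaining in leave_one_out(BASE_PACKAGES):
        print(f"-- without {dropped}")
        print(packages_clause(remaining))
```

Each printed clause can be pasted into the probe procedure in place of the original PACKAGES line to see which removals make the conflict disappear.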

Where is the source code?

This is the snowpark_trainer.py line that defines packages used:
https://github.com/snowflakedb/snowflake-ml-python/blob/release_1.20.0/snowflake/ml/modeling/_internal/snowpark_implementations/snowpark_trainer.py#L234

Suggested fixes

  1. Consider automated testing of package combinations. In theory, a test that simply checks whether a LinearRegression model can be trained should catch this.

  2. The error message is very misleading since the packages are out of the end-user's control. This caused our engineer to waste days looking into the dependencies thinking they could address the issue.

Since Snowpark must do translations and can introduce bugs itself, consider catching the error and adding this to the errors shown to users.

  3. It's not clear to me whether the bug is in the package list or in package conflict resolution (solving), but one of them needs to be fixed for this to work.

I'm not experienced enough to know if this is true, but it seems like snowflake-ml-python should be in the list of packages. If that's not the case, then the telemetry package and the solver need to be looked at.
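
The second suggestion (surfacing that the package list is outside the user's control) could look roughly like the following. This is a minimal sketch, not the library's actual code: `PackageConflictError`, `LibraryPackageError`, `register_sproc`, and `register_with_context` are all hypothetical stand-ins, since the real exception type raised for package conflicts is not shown in this issue.

```python
class PackageConflictError(Exception):
    """Hypothetical stand-in for the error reported on a package conflict."""


class LibraryPackageError(RuntimeError):
    """Hypothetical error that tells the user the package list came from the library."""


def register_with_context(register_sproc, packages):
    """Call register_sproc(packages); on a conflict, re-raise with added context.

    register_sproc is a hypothetical callable that registers the stored procedure.
    """
    try:
        return register_sproc(packages)
    except PackageConflictError as exc:
        # Chain the original error so the solver's details are not lost.
        raise LibraryPackageError(
            "Package conflict in a package list chosen internally by the library, "
            f"not by your session: {packages}. This is likely a library bug; "
            f"original error: {exc}"
        ) from exc
```

The point of the wrapper is that the user sees both the original conflict and a clear statement that changing local or session packages will not resolve it.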

Thank you for your attention.

We've continued our vetting process by avoiding Snowpark and using SPCS, but of course this would be a valuable feature that Snowflake offers that Sagemaker cannot. It would be nice to know if it is something we can rely on in the future.
