Failing to deploy with an 800 MB sklearn model #2353
Replies: 4 comments
-
Hello, sorry for the delay in responding. Is the problem you are experiencing reproducible? Do you happen to have data and an entry point script that we can use to reproduce it? In my experience, timeout issues are usually related to OOM issues, but given the size of the model (800 MB) and the amount of memory on the chosen instance, it may be due to something else.
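One way to check the OOM hypothesis before redeploying is to measure how much memory the load step allocates on a local machine. The sketch below uses only the standard library's `tracemalloc`. `measure_peak_memory` is a hypothetical helper name, and the `bytearray` is just a stand-in for the real `joblib.load(...)` call. `tracemalloc` only sees allocations that go through Python's allocators, so the number is best read as a lower bound, not as the container's actual footprint.

```python
import tracemalloc


def measure_peak_memory(load_fn):
    """Run load_fn() and return (result, peak_mib).

    peak_mib is the peak size of Python-tracked allocations during the call,
    in MiB. Hypothetical helper, not part of any SageMaker or sklearn API.
    """
    tracemalloc.start()
    try:
        result = load_fn()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Stand-in for the real load; replace the lambda with the joblib.load call.
    _, peak_mib = measure_peak_memory(lambda: bytearray(50 * 1024 * 1024))
    print(f"peak allocation while loading: {peak_mib:.1f} MiB")
```

If the peak is far below the instance's memory, OOM becomes a less likely explanation and load time is the next thing to look at.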
-
Hello, maybe I got something wrong, but I was not able to find any information related to this.
-
Hi, According to the logs (when it fails deploying with the timeout), it happens when loading the model. Note that
Note that the log message

import os

import joblib  # assumption: older scikit-learn versions used `from sklearn.externals import joblib`


def model_fn(model_dir):
    """Loads the model for deployment.

    model_dir: (string) specifies location of saved model
    """
    print("loading model...", model_dir)
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    print("...model loaded")
    return model

The fact that it works on a

=====
Update
Additional information: I realized that |
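Since the logs point at the load step, it may help to log how long that step actually takes, so the number can be compared with the ping health check timing out. This is a minimal sketch using only the standard library. `log_duration` is a hypothetical helper, and the `time.sleep` call stands in for the `joblib.load(...)` line in `model_fn`.

```python
import time
from contextlib import contextmanager


@contextmanager
def log_duration(label, log=print):
    """Log how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log(f"{label} finished in {elapsed:.2f}s")


if __name__ == "__main__":
    # Inside model_fn, this block would wrap the joblib.load(...) call.
    with log_duration("model load"):
        time.sleep(0.1)  # stand-in for the real load
```

Because the message is emitted in a `finally` clause, it also shows up when the load fails partway through.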
-
I have a question: if I am deploying the endpoint using boto3's SageMaker client, how do I send the |
-
System Information
Describe the problem
I am trying to deploy a logistic regression model with SageMaker scikit-learn. When I train on 1/10 of the data, I can deploy without problems using the commands below. When I train on all the data, training succeeds and my model is around 800 MB, but the deployment fails with these errors:
Minimal repro / logs
"in the Jupyter notebook"
ValueError: Error hosting endpoint sagemaker-scikit-learn-2019-01-17-12-59-16-371: Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
"in the CloudWatch console"
2019/01/17 14:29:00 [error] 25#25: *47 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.32.0.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/ping", host: "model.aws.local:8080"
from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn_sentiment.py'

# role, sagemaker_session and data_location are defined earlier in the notebook
sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="ml.m4.4xlarge",
    sagemaker_session=sagemaker_session)

sklearn_preprocessor.fit({'train': data_location})

predictor = sklearn_preprocessor.deploy(initial_instance_count=1, instance_type="ml.c5.4xlarge")