DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY = os.path.join(os.getcwd(), ".sagemaker/mms/models") forces the model export directory to live under the current working directory of the running process. On some SageMaker instances that directory sits on a relatively small partition that can't be extended. Letting an environment variable override this value would allow larger models to be downloaded on a variety of instances (e.g. ml.g5.16xlarge).
To reproduce the problem, try this model (other large models fail the same way):
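A minimal sketch of what the requested override could look like. The environment variable name HF_MODEL_EXPORT_DIRECTORY and the helper resolve_export_directory are hypothetical, chosen only for illustration; they are not existing toolkit settings.

```python
import os

# Hypothetical variable name, used only for illustration.
EXPORT_DIR_ENV_VAR = "HF_MODEL_EXPORT_DIRECTORY"


def resolve_export_directory(environ=None, cwd=None):
    """Return the model export directory, preferring an environment override."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    # Current behaviour: always export under the working directory.
    default = os.path.join(cwd, ".sagemaker/mms/models")
    override = environ.get(EXPORT_DIR_ENV_VAR)
    # An unset or empty variable falls back to the current behaviour.
    return override if override else default


DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY = resolve_export_directory()
```

With something like this in place, the export directory could be pointed at a larger volume by adding the variable to the env dict passed to the model, while deployments that don't set it keep today's path.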
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role with SageMaker permissions (works inside a SageMaker notebook)
role = sagemaker.get_execution_role()

# Hub model configuration; every Hugging Face cache is redirected to /tmp
hub = {
    'HF_MODEL_ID': 'Salesforce/instructblip-flan-t5-xxl',
    'HF_TASK': 'image-to-text',
    'SM_NUM_GPUS': '1',
    'HF_HOME': '/tmp/hf_home',
    'HF_ASSETS_CACHE': '/tmp/hf_assets_cache',
    'HF_DATASETS_CACHE': '/tmp/hf_cache',
    'HF_DATASETS_HOME': '/tmp/hf_home',
    'HF_HUB_CACHE': '/tmp/hf_hub_cache',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.37.0',
    pytorch_version='2.1.0',
    py_version='py310',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g5.16xlarge',  # ec2 instance type
    # volume_size=256
)
The error in CloudWatch is similar to:
OSError: [Errno 28] No space left on device: '/tmp/hf_hub_cache/tmpd1hcphh0' -> '/.sagemaker/mms/models/Salesforce__instructblip-flan-t5-xxl/pytorch_model-00001-of-00005.bin'
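The traceback shows the file being moved from the hub cache under /tmp into the export directory under the working directory, so the two paths can sit on different partitions. As a rough diagnostic sketch (the helper names free_gib and nearest_existing are illustrative, and the paths are the ones from this report), the free space on each side could be compared like this:

```python
import os
import shutil


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def nearest_existing(path):
    """Walk up from `path` until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return path


if __name__ == "__main__":
    locations = [
        ("hub cache", "/tmp/hf_hub_cache"),
        ("export dir", os.path.join(os.getcwd(), ".sagemaker/mms/models")),
    ]
    for label, path in locations:
        target = nearest_existing(path)
        print(f"{label}: {free_gib(target):.1f} GiB free at {target}")
```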