---
layout: post
title: "Mastering Custom SageMaker Deployment: A Comprehensive Guide"
description:
  "A deep dive into the intricacies of deploying custom models to Amazon
  SageMaker"
image: /assets/images/custom-sagemaker.jpg
project: false
permalink: "/blog/:title/"
tags:
  - machine-learning
  - aws
  - sagemaker
---

## Introduction

Recently, I embarked on what I initially thought would be a straightforward
journey: deploying a custom BERT-style model, trained on Databricks and packaged
by MLflow, to Amazon SageMaker. Little did I know that this endeavor would lead
me down a rabbit hole of CUDA errors, SageMaker-specific arguments, and
scattered documentation. This two-week journey of dissecting SageMaker internals
and scouring the web for information has equipped me with valuable insights that
I'm eager to share. Consider this post your guide if you ever find yourself in a
similar predicament.

## The Scenario

Imagine a BERT-style model trained for token classification (essentially Named
Entity Recognition or NER), enhanced with post-processing steps to refine its
output by eliminating false positives based on a predefined list of terms. This
model was developed on Databricks, leveraging its seamless MLflow integration
for experiment tracking and logging. To maintain consistency, we opted to
package the model as an MLflow bundle with a
[custom class](https://mlflow.org/blog/custom-pyfunc) encapsulating all
necessary post-processing steps, with the intention of deploying it to SageMaker
directly via MLflow.
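
For concreteness, a wrapper of this kind might look roughly like the sketch
below. This is not the actual model code: the class name, artifact keys, and
denylist format are placeholders for illustration.

```python
import mlflow.pyfunc


class NerWithPostprocessing(mlflow.pyfunc.PythonModel):
    """Hypothetical pyfunc wrapper: NER pipeline plus denylist-based filtering."""

    def load_context(self, context):
        from transformers import pipeline

        # "hf_model" and "denylist" are placeholder artifact keys registered at log_model() time
        self.ner = pipeline(
            "token-classification",
            model=context.artifacts["hf_model"],
            aggregation_strategy="simple",
        )
        with open(context.artifacts["denylist"]) as f:
            self.denylist = {line.strip().lower() for line in f if line.strip()}

    def predict(self, context, model_input):
        # model_input is expected to be a list of strings to tag
        results = []
        for text in model_input:
            entities = self.ner(text)
            # drop predictions whose surface form appears on the denylist
            results.append(
                [e for e in entities if e["word"].lower() not in self.denylist]
            )
        return results
```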

Our initial approach was to use MLflow's built-in method for creating a
SageMaker-compatible image:

```bash
mlflow sagemaker build-and-push-container
```

This command is designed to generate a SageMaker-compatible image for seamless
deployment. However, we quickly discovered a significant limitation: the
resulting image is CPU-only, unsuitable for our Large Language Model (LLM) needs
(Yes, BERT is an LLM). This realization necessitated a pivot to manually
creating and pushing our own GPU-enabled image to Amazon Elastic Container
Registry (ECR) for SageMaker deployment.
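
In practice, "manually creating and pushing our own image" is the usual
build-tag-push dance against ECR, roughly like the sketch below (the account ID,
region, and repository name are placeholders):

```bash
# Placeholders: substitute your own account ID, region, and repository name
ACCOUNT=123456789012
REGION=eu-west-1
REPO=custom-bert-ner

# Authenticate Docker against your private ECR registry
aws ecr get-login-password --region "$REGION" \
  | docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# Build the GPU-enabled image and push it to ECR
docker build -t "$REPO" .
docker tag "$REPO:latest" "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
```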

## Diving into the Deep End

### The Dockerfile Dilemma

Creating a Dockerfile is typically straightforward, but SageMaker introduces its
own set of requirements and nuances. Key questions emerged:

- What should be the entrypoint?
- Are there CUDA version restrictions?
- How does SageMaker execute the Docker image?

The answers weren't immediately clear from the available documentation.

### Initial Attempts and Roadblocks

Drawing inspiration from a colleague's SageMaker deployment Dockerfile, which
curiously lacked an entrypoint, I initially opted for `nvidia/cuda` as the base
image and attempted to use MLflow for model serving:

```docker
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04

# Install Python and necessary packages
RUN apt update && apt install -y python3 python3-pip
RUN pip install mlflow==2.15.1 transformers torch sagemaker

# Set the entrypoint
ENTRYPOINT ["mlflow", "models", "serve", "-m", "/opt/ml/model", "-h", "0.0.0.0", "-p", "8080"]
```

This approach, however, led to an unexpected error regarding an unknown `serve`
argument. After some investigation and experimentation, including removing the
entrypoint entirely, we encountered more cryptic errors, suggesting deeper
issues with our container configuration.

## Going Fully Custom

After hitting multiple dead ends, I decided to take a step back and rethink our
approach. The next day, I opted to drop the use of MLflow for serving and
instead create a custom API inside the Docker image using FastAPI. We would
still use MLflow to load the model file, as our post-processing logic was
defined there. This approach allowed me to test the image locally, and it worked
fine. However, once deployed to SageMaker, we encountered the same `serve`
error.
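
For reference, testing locally just meant mounting the MLflow model directory
where SageMaker would place it and hitting the API; the image name and model
path below are placeholders, and the endpoint paths and payload shape match the
final `inference.py` shown later in this post:

```bash
# Run the container locally, mounting the unpacked MLflow model where
# SageMaker would place it
docker run --rm --gpus all -p 8080:8080 \
  -v /path/to/mlflow-model:/opt/ml/model \
  custom-bert-ner:latest

# Health check
curl http://localhost:8080/ping

# Prediction request
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"text": ["Jane Doe visited the Acme Corp office in Berlin."]}'
```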

This persistent issue prompted a deeper investigation into how SageMaker
actually starts these containers. The AWS documentation, unfortunately, didn't
provide a clear, consolidated answer. Ironically, it was Amazon Q, AWS's own AI
assistant, that provided the crucial information, outlining exactly how
SageMaker starts containers and what the requirements are for a container to
work properly.

### The SageMaker Container Startup Revelation

The key revelation was that SageMaker passes the `serve` argument when starting
the container:

```bash
docker run --volume /path/to/model:/opt/ml/model image_name serve
```

This insight was a game-changer. It explained why our previous attempts failed
and provided a clear direction for our solution. To handle this, we needed to
make our entrypoint a shell script that could correctly consume and handle
passed arguments.

### Unraveling SageMaker's Requirements

Through further research and experimentation (mostly bugging Q about it), we
pieced together the following requirements for a SageMaker-compatible container:

1. **API Endpoints**: The container must expose two specific endpoints:

   - `/invocations` for handling prediction requests
   - `/ping` for health checks

2. **Port Binding**: The Docker image must be labeled to accept port binding, as
   SageMaker dynamically changes the deployment port.

3. **Port Configuration**: The server inside the container should use the port
   specified in the `SAGEMAKER_BIND_TO_PORT` environment variable.

4. **Custom Inference Script**: We need to inform SageMaker that we're using a
   custom inference script by specifying its name as an environment variable
   when creating the `Model` object (see the sketch right after this list).

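Requirement 4 is the easiest one to miss. With the SageMaker Python SDK it looks
roughly like the following sketch; the image URI, S3 path, role ARN, and
instance type are placeholders, and the exact environment variable name is my
assumption rather than something spelled out in the post above:

```python
from sagemaker.model import Model

# All URIs, ARNs, and names below are placeholders for illustration
model = Model(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/custom-bert-ner:latest",
    model_data="s3://my-bucket/models/bert-ner/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    # Assumption: SAGEMAKER_PROGRAM is the variable that names the custom inference script
    env={"SAGEMAKER_PROGRAM": "inference.py"},
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
```
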
These requirements led us to create the following files:

### entrypoint.sh

```bash
#!/bin/bash

# Use the port SageMaker provides via SAGEMAKER_BIND_TO_PORT, or default to 8080
PORT=${SAGEMAKER_BIND_TO_PORT:-8080}

# start the API server
# yes, exec really is necessary here: it replaces the shell so uvicorn runs as
# PID 1 and receives SageMaker's stop signals directly
exec uvicorn inference:app --host 0.0.0.0 --port $PORT --workers 4
```

### inference.py

```python
import mlflow.pyfunc
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Dict, Any

app = FastAPI()

# Load the model from the default SageMaker model directory
model = mlflow.pyfunc.load_model("/opt/ml/model")

class PredictionRequest(BaseModel):
    text: List[str]

class PredictionResponse(BaseModel):
    predictions: List[Dict[str, Any]]

@app.get("/ping")
def ping():
    return {"status": "ok"}

@app.post("/invocations", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    try:
        predictions = model.predict(request.text)
        return PredictionResponse(predictions=predictions)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    import os

    # For local runs only; the env var (when set) arrives as a string, so cast to int
    port = int(os.environ.get("SAGEMAKER_BIND_TO_PORT", 8080))
    uvicorn.run(app, host="0.0.0.0", port=port)
```

### requirements.txt

```
mlflow==2.15.1
cloudpickle==2.2.1
fastapi
uvicorn
pydantic
transformers
```

## Additional Challenges

Even with these requirements in place, we faced two more significant hurdles:

1. **CUDA Version Issues**: We found that the CUDA version of the Docker image
   needed to match the instance type's CUDA version, at least on the major
   version. This was particularly challenging because, as usual, the CUDA
   versions of the various instance types aren't clearly documented anywhere.

2. **Python Version Compatibility**: We discovered that the Python version,
   along with some dependencies like MLflow and cloudpickle, needed to match the
   versions used during model packaging. This led to our decision to use pyenv
   in the Dockerfile to ensure we had the correct Python version (the sketch
   after this list shows where to find the versions to match).

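Neither version is hard to find once you know where to look. A couple of quick
checks (assuming the model artifact is available at `/opt/ml/model` and was
logged with a reasonably recent MLflow):

```bash
# On the target GPU instance type, nvidia-smi reports the highest CUDA version
# supported by the installed driver (top-right corner of the header)
nvidia-smi

# The MLflow model artifact records the environment it was packaged with;
# these files tell you what the container has to reproduce
cat /opt/ml/model/python_env.yaml    # exact Python version used for packaging
cat /opt/ml/model/requirements.txt   # pinned package versions (mlflow, cloudpickle, ...)
```
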
These additional considerations resulted in our final, more complex Dockerfile:

```docker
FROM nvidia/cuda:11.4.3-runtime-ubuntu20.04

# sagemaker labels
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

# install system build dependencies (required by pyenv)
ENV HOME="/root"
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR ${HOME}
RUN apt update && apt install -y \
    build-essential \
    curl \
    git \
    libssl-dev \
    zlib1g-dev \
    libbz2-dev \
    libreadline-dev \
    libsqlite3-dev \
    lzma \
    liblzma-dev \
    wget \
    xz-utils \
    tk-dev \
    libffi-dev \
    python3-dev \
    gnupg

# cleanup to reduce image size
RUN apt clean && rm -rf /var/lib/apt/lists/*

# install pyenv
RUN git clone --depth=1 https://github.com/pyenv/pyenv.git .pyenv
ENV PYENV_ROOT="${HOME}/.pyenv"
ENV PATH="${PYENV_ROOT}/shims:${PYENV_ROOT}/bin:${PATH}"

# install correct python version
ENV PYTHON_VERSION=3.11
RUN pyenv install ${PYTHON_VERSION}
RUN pyenv global ${PYTHON_VERSION}

# install compatible pytorch version
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install necessary Python packages
COPY ./requirements.txt .
RUN pip install -r requirements.txt

# copy code
COPY . /app
WORKDIR /app

# Define the entry point for the container
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
```

This journey through the intricacies of custom SageMaker deployment has been
both challenging and enlightening. It underscores the complexity of deploying
sophisticated machine learning models in cloud environments and the importance
of understanding the underlying infrastructure. Key takeaways from this
experience include the critical nature of proper documentation (if anyone from
AWS is reading this), and the value of persistence in problem-solving. We've
learned that successful deployment often requires a deep dive into system
internals, a willingness to experiment, and the ability to piece together
information from various sources. This process has not only resulted in a
working solution but has also equipped us with valuable knowledge for future
deployments. Developing the model that was deployed is another story for another
blog post, when I'm bored enough to recreate it with non-proprietary data
(brainrot classifier?).