<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Adding New End-to-End Tests for Documentation Examples

## IMPORTANT: Code Examples in This File

**The bash code examples in this documentation use backslashes (`\`) before the triple backticks** to prevent them from being parsed as actual test commands by the test framework.

**When copying examples from this file, you MUST remove the backslashes (`\`) before using them.**

For example, this file shows examples like `\```bash` but you should write ` ```bash ` (without the backslash).

---

This guide explains how to add new end-to-end tests for server examples in the AIPerf documentation.

## Overview

The end-to-end test framework automatically discovers and tests server examples from markdown documentation files. It:
1. Parses markdown files for specially tagged bash commands
2. Builds an AIPerf Docker container
3. For each discovered server:
   - Runs the server setup command
   - Waits for the server to become healthy
   - Executes AIPerf benchmark commands
   - Validates results and cleans up

## How Tests Are Discovered

The test parser (`parser.py`) scans all markdown files (`*.md`) in the repository and looks for HTML comment tags with specific patterns:

- **Setup commands**: `<!-- setup-{server-name}-endpoint-server -->`
- **Health checks**: `<!-- health-check-{server-name}-endpoint-server -->`
- **AIPerf commands**: `<!-- aiperf-run-{server-name}-endpoint-server -->`

Each tag must be followed by a bash code block (` ```bash ... ``` `) containing the actual command; the examples below also close each block with a matching end tag (e.g. `<!-- /setup-{server-name}-endpoint-server -->`).
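
For a quick preview of which tags already exist in the repository, a plain grep over the markdown files is usually enough. This is only an approximation of what `parser.py` does; the authoritative check is the dry run described later in this guide.

```bash
# List every line that opens a setup, health-check, or aiperf-run tag.
grep -rnE '<!-- (setup|health-check|aiperf-run)-[A-Za-z0-9_-]+-endpoint-server -->' --include='*.md' .
```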

## Adding a New Server Test

To add tests for a new server, you need to add three types of tagged commands to your documentation:

### 1. Server Setup Command

Tag the bash command that starts your server:

```markdown
<!-- setup-myserver-endpoint-server -->
\```bash
# Start your server
docker run --gpus all -p 8000:8000 myserver/image:latest \
  --model my-model \
  --host 0.0.0.0 --port 8000
\```
<!-- /setup-myserver-endpoint-server -->
```

**Important notes:**
- The server name (`myserver` in this example) must be consistent across all three tag types
- The setup command runs in the background
- The command should start a long-running server process (a quick way to check this locally is sketched below)
- Use port 8000 or ensure your health check targets the correct port
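
Before tagging a setup command, it can help to launch it detached and confirm that the server actually stays up. A minimal sketch, reusing the placeholder image and flags from the example above (the container name `myserver-test` is only for illustration):

```bash
# Start the example server detached, wait briefly, then confirm it is still running.
docker run -d --name myserver-test --gpus all -p 8000:8000 myserver/image:latest \
  --model my-model \
  --host 0.0.0.0 --port 8000
sleep 10
docker ps --filter name=myserver-test    # the container should be listed as "Up"
docker logs --tail 20 myserver-test      # inspect startup output
docker rm -f myserver-test               # clean up after the check
```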

### 2. Health Check Command

Tag a bash command that waits for your server to be ready:

```markdown
<!-- health-check-myserver-endpoint-server -->
\```bash
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/health -H "Content-Type: application/json")" != "200" ]; do sleep 2; done' || { echo "Server not ready after 15min"; exit 1; }
\```
<!-- /health-check-myserver-endpoint-server -->
```

**Important notes:**
- The health check should poll the server until it responds successfully
- Use a reasonable timeout (e.g., 900 seconds = 15 minutes)
- The command must exit with code 0 when the server is healthy (a quick local check is sketched below)
- The command must exit with a non-zero code if the server fails to start
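
To confirm the exit-code behavior before tagging the command, run the health check by hand against a locally started server and print its status. This assumes the server from the setup example is already listening on `localhost:8000/health`:

```bash
# Run the health check once with a short timeout and report what the framework would see.
timeout 60 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/health)" != "200" ]; do sleep 2; done'
echo "health check exit code: $?"   # 0 = healthy, non-zero = not ready within the timeout
```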

### 3. AIPerf Run Commands

Tag one or more AIPerf benchmark commands:

```markdown
<!-- aiperf-run-myserver-endpoint-server -->
\```bash
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 10 \
  --max-tokens 100
\```
<!-- /aiperf-run-myserver-endpoint-server -->
```

You can have multiple `aiperf-run` commands for the same server. Each will be executed sequentially:

```markdown
<!-- aiperf-run-myserver-endpoint-server -->
\```bash
# First test: streaming mode
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 10
\```
<!-- /aiperf-run-myserver-endpoint-server -->

<!-- aiperf-run-myserver-endpoint-server -->
\```bash
# Second test: non-streaming mode
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --num-prompts 10
\```
<!-- /aiperf-run-myserver-endpoint-server -->
```

**Important notes:**
- Do NOT include the `--ui-type` flag; the test framework adds `--ui-type simple` automatically
- Each command is executed inside the AIPerf Docker container
- Commands should complete in a reasonable time (default timeout: 300 seconds); a quick local check is sketched below
- Use small values for `--num-prompts` and `--max-tokens` to keep tests fast
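
One way to verify that a benchmark command fits the 300-second budget before tagging it is to wrap a local run in `timeout`. The model name and flags below are the placeholders from the examples above:

```bash
# Abort with a non-zero exit code if the run exceeds the framework's per-command timeout.
timeout 300 aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 10 \
  --max-tokens 100
echo "aiperf exit code: $?"   # 124 means the timeout was hit
```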

## Complete Example

Here's a complete example for a new server called "fastapi":

```markdown
### Running FastAPI Server

Start the FastAPI server:

<!-- setup-fastapi-endpoint-server -->
\```bash
docker run --gpus all -p 8000:8000 mycompany/fastapi-llm:latest \
  --model-name meta-llama/Llama-3.2-1B \
  --host 0.0.0.0 \
  --port 8000
\```
<!-- /setup-fastapi-endpoint-server -->

Wait for the server to be ready:

<!-- health-check-fastapi-endpoint-server -->
\```bash
timeout 600 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/models)" != "200" ]; do sleep 2; done' || { echo "FastAPI server not ready after 10min"; exit 1; }
\```
<!-- /health-check-fastapi-endpoint-server -->

Profile the model:

<!-- aiperf-run-fastapi-endpoint-server -->
\```bash
aiperf profile \
  --model meta-llama/Llama-3.2-1B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 20 \
  --max-tokens 50
\```
<!-- /aiperf-run-fastapi-endpoint-server -->
```

## Running the Tests

### Run all discovered tests:

```bash
cd tests/ci/test_docs_end_to_end
python main.py
```

### Dry run to see what would be tested:

```bash
python main.py --dry-run
```

### Test specific servers:

Currently, the framework tests the first discovered server by default. Use `--all-servers` to test all:

```bash
python main.py --all-servers
```

## Validation Rules

The test framework validates that each server has:
- Exactly ONE setup command (duplicates cause test failure)
- Exactly ONE health check command (duplicates cause test failure)
- At least ONE aiperf command

If any of these requirements are not met, the tests will fail with a clear error message.
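
You can spot violations before running the suite by counting how often each tag appears across the docs. A hedged grep sketch for the setup tags (adapt the pattern for the other two tag types):

```bash
# Count occurrences of each setup tag; any count greater than 1 points to a duplicate.
grep -rhoE '<!-- setup-[A-Za-z0-9_-]+-endpoint-server -->' --include='*.md' . | sort | uniq -c
```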

## Test Execution Flow

For each server, the test runner proceeds through the following phases (a rough bash outline follows the list):

1. **Build Phase**: Builds the AIPerf Docker container (once for all tests)
2. **Setup Phase**: Starts the server in the background
3. **Health Check Phase**: Waits for the server to be ready (runs in parallel with setup)
4. **Test Phase**: Executes all AIPerf commands sequentially
5. **Cleanup Phase**: Gracefully shuts down the server and cleans up Docker resources
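
Conceptually, the phases map onto a shell sequence like the one below. This is purely illustrative: the helper script names are hypothetical, and the real orchestration lives in `test_runner.py`.

```bash
# Illustrative outline only; setup_server.sh, health_check.sh, and run_aiperf.sh are hypothetical.
set -euo pipefail

docker build -t aiperf-test .              # 1. Build phase (runs once for all servers)
bash setup_server.sh > setup.log 2>&1 &    # 2. Setup phase: the server starts in the background
SERVER_PID=$!
bash health_check.sh                       # 3. Health check: blocks until the server answers or times out
bash run_aiperf.sh                         # 4. Test phase: aiperf commands run one after another
kill "$SERVER_PID" 2>/dev/null || true     # 5. Cleanup: stop the server process
docker container prune -f                  #    ...and remove stopped containers
```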

## Common Patterns

### Pattern: OpenAI-compatible API

```markdown
<!-- setup-myserver-endpoint-server -->
\```bash
docker run --gpus all -p 8000:8000 myserver:latest \
  --model model-name \
  --host 0.0.0.0 --port 8000
\```
<!-- /setup-myserver-endpoint-server -->

<!-- health-check-myserver-endpoint-server -->
\```bash
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"model-name\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "Server not ready"; exit 1; }
\```
<!-- /health-check-myserver-endpoint-server -->

<!-- aiperf-run-myserver-endpoint-server -->
\```bash
aiperf profile \
  --model model-name \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 10 \
  --max-tokens 100
\```
<!-- /aiperf-run-myserver-endpoint-server -->
```

## Troubleshooting

### Tests not discovered

- Verify the tag format: `setup-{name}-endpoint-server`, `health-check-{name}-endpoint-server`, `aiperf-run-{name}-endpoint-server`
- Ensure the bash code block immediately follows the tag
- Check that the server name is consistent across all three tag types
- Run `python main.py --dry-run` to see what's discovered

### Health check timeout

- Increase the timeout value in your health check command
- Verify the health check endpoint is correct (you can probe it by hand, as sketched below)
- Check server logs: the test runner shows setup output for 30 seconds
- Ensure your server starts on the expected port
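
While the server is starting, you can also watch the endpoint by hand to see when it begins answering. This assumes the `/health` route and port 8000 used in the examples above; press Ctrl+C to stop:

```bash
# Print the HTTP status code every 2 seconds until interrupted.
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/health
  sleep 2
done
```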

### AIPerf command fails

- Test your AIPerf command manually first
- Use small values for `--num-prompts` and `--max-tokens`
- Verify the model name matches what the server expects
- Check that the endpoint URL is correct

### Duplicate command errors

If you see errors like "DUPLICATE SETUP COMMAND", you have multiple commands with the same server name:
- Search your docs for all instances of that tag
- Ensure each server has a unique name
- Or remove duplicate tags if they're truly duplicates

## Best Practices

1. **Keep tests fast**: Use minimal `--num-prompts` (10-20) and small `--max-tokens` values
2. **Use standard ports**: Default to 8000 for consistency
3. **Add timeouts**: Always include timeouts in health checks
4. **Test locally first**: Run commands manually before adding tags
5. **One server per doc section**: Avoid mixing multiple servers in the same doc section
6. **Clear error messages**: Include helpful error messages in health checks
7. **Document requirements**: Note any GPU, memory, or dependency requirements in surrounding text

## Architecture Reference

Key files in the test framework:

- `main.py`: Entry point, orchestrates parsing and testing
- `parser.py`: Markdown parser that discovers tagged commands
- `test_runner.py`: Executes tests for each server
- `constants.py`: Configuration constants (timeouts, tag patterns)
- `data_types.py`: Data models for commands and servers
- `utils.py`: Utility functions for Docker operations

## Constants and Configuration

Key constants in `constants.py`:

- `SETUP_MONITOR_TIMEOUT`: 30 seconds (how long to monitor setup output)
- `CONTAINER_BUILD_TIMEOUT`: 600 seconds (Docker build timeout)
- `AIPERF_COMMAND_TIMEOUT`: 300 seconds (per-command timeout)
- `AIPERF_UI_TYPE`: "simple" (auto-added to all aiperf commands)

To modify these, edit `constants.py`.