`higgs/README.md` (134 additions)

# Higgs Audio v2 Generation 3B vLLM Truss

Higgs Audio v2 Generation 3B is a multimodal model that can generate audio content based on text and audio inputs.

This is a [Truss](https://truss.baseten.co/) for Higgs Audio using the vLLM OpenAI-compatible server. It bypasses the need to write a `model.py`: instead, it runs `vllm serve` directly at startup and serves requests through the HTTP endpoint exposed by vLLM's OpenAI-compatible server.

## Deployment

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd truss-examples/higgs
```

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`
3. Retrieve your Hugging Face access token from your [settings page](https://huggingface.co/settings/tokens).
4. Set your Hugging Face token as a Baseten secret [here](https://app.baseten.co/settings/secrets) with the key `hf_access_token`. You will *not* be able to deploy Higgs Audio without this step; a sketch of the corresponding `config.yaml` entry follows this list.
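
For reference, this is how a Hugging Face token secret is typically declared in a Truss `config.yaml` (a minimal sketch of the usual convention; check this Truss's own `config.yaml` for its exact contents):

```yaml
# Hypothetical snippet: declares the secret so Truss can mount it at runtime.
secrets:
  hf_access_token: null  # placeholder only; the real value lives in Baseten, never in this file
```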

With `higgs` as your working directory, you can deploy the model with:

```sh
truss push --publish
```

Paste your Baseten API key if prompted.

For more information, see [Truss documentation](https://truss.baseten.co).

## vLLM OpenAI Compatible Server

This Truss demonstrates how to start [vLLM's OpenAI-compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) without a `model.py` by using the `docker_server.start_command` option.

The server is configured with the following parameters (mirrored in the `start_command` shown after this list):
- **Model**: `bosonai/higgs-audio-v2-generation-3B-base`
- **Audio Tokenizer**: `bosonai/higgs-audio-v2-tokenizer`
- **Max Model Length**: 8192 tokens
- **GPU Memory Utilization**: 80%
- **Audio Limit**: 50 audio inputs per prompt
- **MM Preprocessor Cache**: Disabled to ensure consistent behavior across requests
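
These settings correspond to the `start_command` in `config.yaml`. Note that the audio tokenizer is not passed as a CLI flag; it appears to be provided by the `bosonai/higgs-audio-vllm` base image's defaults:

```sh
vllm serve bosonai/higgs-audio-v2-generation-3B-base \
  --served-model-name higgs-audio-v2-generation-3B-base \
  --limit-mm-per-prompt audio=50 \
  --max-model-len 8192 \
  --port 8000 \
  --gpu-memory-utilization 0.8 \
  --disable-mm-preprocessor-cache
```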

## API Documentation

The API follows the OpenAI ChatCompletion format, so you can interact with the model through the standard ChatCompletion interface.

Example usage:

```python
from openai import OpenAI

model_id = "YOUR_MODEL_ID" # Replace with your model ID

client = OpenAI(
    api_key="YOUR-API-KEY",
    base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="higgs-audio-v2-generation-3B-base",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate audio based on this description: upbeat electronic music"
                },
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://example.com/reference-audio.wav"}
                }
            ]
        }
    ],
    max_tokens=512,
    temperature=0.7
)

print(response.choices[0].message.content)
```
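
The same request can also be made over plain HTTP. Here is a sketch with `curl`, assuming Baseten's standard `Api-Key` authorization header (replace the model ID and key placeholders):

```sh
curl "https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "higgs-audio-v2-generation-3B-base",
    "messages": [
      {"role": "user", "content": [{"type": "text", "text": "Generate audio based on this description: upbeat electronic music"}]}
    ],
    "max_tokens": 512
  }'
```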

### Streaming Example

```python
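# Continues from the example above and reuses the same `client`.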
response = client.chat.completions.create(
    model="higgs-audio-v2-generation-3B-base",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Create ambient background music"
                }
            ]
        }
    ],
    stream=True,
    max_tokens=512
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```

## Model Features

- **Audio Generation**: Generate high-quality audio content from text descriptions
- **Multimodal Input**: Accepts both text and audio inputs for context-aware generation
- **Flexible Length**: Supports up to 8192 tokens with configurable output length
- **Streaming Support**: Real-time streaming responses for interactive applications
- **Audio Tokenization**: Uses specialized audio tokenizer for optimal performance

## Configuration Options

The model is configured with several performance optimizations (see below for how to adjust them):

- **GPU Memory Utilization**: Capped at 80% to leave GPU memory headroom
- **Audio Limit**: Up to 50 audio inputs per prompt for complex audio generation tasks
- **Disabled MM Preprocessor Cache**: Ensures consistent behavior across requests
- **Max Model Length**: 8192 tokens for extended context handling
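
To change any of these, edit the `start_command` in `config.yaml` and redeploy. For example, to experiment with a higher GPU memory utilization (an illustrative tweak, not a tuned recommendation):

```sh
# In config.yaml, change --gpu-memory-utilization 0.8 to e.g. 0.9, then redeploy:
truss push --publish
```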

## Support

If you have any questions or need assistance, please open an issue in this repository or contact our support team.

## License

Please refer to the original model's license at [bosonai/higgs-audio-v2-generation-3B-base](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base) for usage terms and conditions.
`higgs/config.yaml` (43 additions)

description: Higgs Audio v2 Generation 3B - Audio generation model with vLLM
base_image:
  image: bosonai/higgs-audio-vllm:latest
model_metadata:
  repo_id: bosonai/higgs-audio-v2-generation-3B-base
  example_model_input: {
    "model": "higgs-audio-v2-generation-3B-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Generate audio based on this description"
          },
          {
            "type": "audio_url",
            "audio_url": {"url": "https://example.com/sample.wav"}
          }
        ]
      }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }
  tags:
    - openai-compatible
    - audio-generation
    - multimodal
docker_server:
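  # Starts vLLM's OpenAI-compatible server directly; no model.py is required.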
  start_command: sh -c "vllm serve bosonai/higgs-audio-v2-generation-3B-base --served-model-name higgs-audio-v2-generation-3B-base --limit-mm-per-prompt audio=50 --max-model-len 8192 --port 8000 --gpu-memory-utilization 0.8 --disable-mm-preprocessor-cache"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: H100
  use_gpu: true
runtime:
  predict_concurrency: 16
model_name: Higgs Audio v2 Generation 3B
environment_variables:
  VLLM_LOGGING_LEVEL: INFO