# Quickstart

## Prerequisites

Make sure you've followed the base [quickstart instructions](../quickstart.md) for the Agents SDK, and set up a virtual environment. Then, install the optional voice dependencies from the SDK:

```bash
pip install 'openai-agents[voice]'
```

## Concepts

The main concept to know about is a [`VoicePipeline`][agents.voice.pipeline.VoicePipeline], which is a three-step process:

1. Run a speech-to-text model to turn audio into text.
2. Run your code, which is usually an agentic workflow, to produce a result.
3. Run a text-to-speech model to turn the result text back into audio.
| 18 | + |
| 19 | +```mermaid |
| 20 | +graph LR |
| 21 | + %% Input |
| 22 | + A["🎤 Audio Input"] |
| 23 | +
|
| 24 | + %% Voice Pipeline |
| 25 | + subgraph Voice_Pipeline [Voice Pipeline] |
| 26 | + direction TB |
| 27 | + B["Transcribe (speech-to-text)"] |
| 28 | + C["Your Code"]:::highlight |
| 29 | + D["Text-to-speech"] |
| 30 | + B --> C --> D |
| 31 | + end |
| 32 | +
|
| 33 | + %% Output |
| 34 | + E["🎧 Audio Output"] |
| 35 | +
|
| 36 | + %% Flow |
| 37 | + A --> Voice_Pipeline |
| 38 | + Voice_Pipeline --> E |
| 39 | +
|
| 40 | + %% Custom styling |
| 41 | + classDef highlight fill:#ffcc66,stroke:#333,stroke-width:1px,font-weight:700; |
| 42 | +
|
| 43 | +``` |
| 44 | + |
## Agents

First, let's set up some agents. This should feel familiar if you've built any agents with this SDK: we'll have a couple of agents, a handoff, and a tool.

```python
import random

from agents import (
    Agent,
    function_tool,
)
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions


@function_tool
def get_weather(city: str) -> str:
    """Get the weather for a given city."""
    print(f"[debug] get_weather called with city: {city}")
    choices = ["sunny", "cloudy", "rainy", "snowy"]
    return f"The weather in {city} is {random.choice(choices)}."


spanish_agent = Agent(
    name="Spanish",
    handoff_description="A Spanish-speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Spanish.",
    ),
    model="gpt-4o-mini",
)

agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. If the user speaks in Spanish, hand off to the Spanish agent.",
    ),
    model="gpt-4o-mini",
    handoffs=[spanish_agent],
    tools=[get_weather],
)
```
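
Before wiring these agents into a voice pipeline, you can sanity-check them with a plain text run. Here's a minimal sketch using the SDK's `Runner`, assuming the `agent` defined above and an `OPENAI_API_KEY` set in your environment:

```python
import asyncio

from agents import Runner


async def smoke_test():
    # Run the assistant agent on a text prompt; the get_weather tool
    # should be invoked and its result folded into the reply.
    result = await Runner.run(agent, "What's the weather in Tokyo?")
    print(result.final_output)


asyncio.run(smoke_test())
```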

## Voice pipeline

We'll set up a simple voice pipeline, using [`SingleAgentVoiceWorkflow`][agents.voice.workflow.SingleAgentVoiceWorkflow] as the workflow.

```python
from agents.voice import SingleAgentVoiceWorkflow, VoicePipeline

pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
```
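
The pipeline can also take a config object to tune each stage. As a rough sketch (the `VoicePipelineConfig` and `TTSModelSettings` names are assumed here from the SDK's voice module; check your installed version for the exact fields), you could shape the spoken output like this:

```python
from agents.voice import TTSModelSettings, VoicePipelineConfig

# Hypothetical tuning: give the text-to-speech stage a persona.
config = VoicePipelineConfig(
    tts_settings=TTSModelSettings(
        instructions="Speak in a friendly, upbeat tone at a natural pace.",
    )
)
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent), config=config)
```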

## Run the pipeline

```python
import numpy as np
import sounddevice as sd

from agents.voice import AudioInput

# For simplicity, we'll just create 3 seconds of silence
# In reality, you'd get microphone data
buffer = np.zeros(24000 * 3, dtype=np.int16)
audio_input = AudioInput(buffer=buffer)

result = await pipeline.run(audio_input)

# Create an audio player using `sounddevice`
player = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
player.start()

# Play the audio stream as it comes in
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        player.write(event.data)
```
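
To replace the silence with real input, you can record from the default microphone with `sounddevice`. A minimal sketch, using a blocking, fixed three-second capture (duration and device selection are up to you):

```python
import numpy as np
import sounddevice as sd

from agents.voice import AudioInput

SAMPLE_RATE = 24000  # matches the player above
DURATION_S = 3       # fixed-length capture for simplicity

# Record from the default input device, then block until done.
recording = sd.rec(
    int(SAMPLE_RATE * DURATION_S),
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype=np.int16,
)
sd.wait()

# `sd.rec` returns shape (frames, channels); flatten to a 1-D buffer.
audio_input = AudioInput(buffer=recording.flatten())
```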

## Put it all together

```python
import asyncio
import random

import numpy as np
import sounddevice as sd

from agents import (
    Agent,
    function_tool,
)
from agents.voice import (
    AudioInput,
    SingleAgentVoiceWorkflow,
    VoicePipeline,
)
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions


@function_tool
def get_weather(city: str) -> str:
    """Get the weather for a given city."""
    print(f"[debug] get_weather called with city: {city}")
    choices = ["sunny", "cloudy", "rainy", "snowy"]
    return f"The weather in {city} is {random.choice(choices)}."


spanish_agent = Agent(
    name="Spanish",
    handoff_description="A Spanish-speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Spanish.",
    ),
    model="gpt-4o-mini",
)

agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. If the user speaks in Spanish, hand off to the Spanish agent.",
    ),
    model="gpt-4o-mini",
    handoffs=[spanish_agent],
    tools=[get_weather],
)


async def main():
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    audio_input = AudioInput(buffer=buffer)

    result = await pipeline.run(audio_input)

    # Create an audio player using `sounddevice`
    player = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
    player.start()

    # Play the audio stream as it comes in
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            player.write(event.data)


if __name__ == "__main__":
    asyncio.run(main())
```
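
Besides audio chunks, the stream also yields lifecycle and error events, which are useful for things like muting the microphone while the agent is speaking. A hedged sketch of a drop-in replacement for the event loop in `main()` above; the `voice_stream_event_lifecycle` and `voice_stream_event_error` type strings and their `event`/`error` attributes are assumed from the SDK's streaming events, so confirm them against your installed version:

```python
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        player.write(event.data)
    elif event.type == "voice_stream_event_lifecycle":
        # Assumed attribute: e.g. "turn_started" / "turn_ended";
        # handy for pausing microphone capture mid-turn.
        print(f"[lifecycle] {event.event}")
    elif event.type == "voice_stream_event_error":
        # Assumed attribute: the underlying exception.
        print(f"[error] {event.error}")
```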

If you run this example, the agent will speak to you! Check out the example in [examples/voice/static](https://github.com/openai/openai-agents-python/tree/main/examples/voice/static) to see a demo where you can speak to the agent yourself.