Commit d3c3cb4

refactor: move asr and tts agents (#469)
1 parent 31166ae commit d3c3cb4

File tree

12 files changed: +174 −207 lines


examples/s2s/asr.py

+2-2
@@ -17,9 +17,9 @@
 import time

 import rclpy
-from rai.agents import VoiceRecognitionAgent
 from rai.communication.sound_device.api import SoundDeviceConfig

+from rai_asr.agents import SpeechRecognitionAgent
 from rai_asr.models import LocalWhisper, OpenWakeWord, SileroVAD

 VAD_THRESHOLD = 0.8  # Note that this might be different depending on your device

@@ -100,7 +100,7 @@ def parse_arguments():
 rclpy.init()
 ros2_name = "rai_asr_agent"

-agent = VoiceRecognitionAgent(microphone_configuration, ros2_name, whisper, vad)
+agent = SpeechRecognitionAgent(microphone_configuration, ros2_name, whisper, vad)
 # optionally add additional models to decide when to record data for transcription
 # agent.add_detection_model(oww, pipeline="record")

examples/s2s/tts.py

+1-1
@@ -17,9 +17,9 @@
 import time

 import rclpy
-from rai.agents import TextToSpeechAgent
 from rai.communication.sound_device import SoundDeviceConfig

+from rai_tts.agents import TextToSpeechAgent
 from rai_tts.models import OpenTTS

src/rai_asr/README.md

+62-14
@@ -2,26 +2,74 @@

 ## Description

-The RAI ASR (Automatic Speech Recognition) node utilizes a combination of voice activity detection (VAD) and a speech recognition model to transcribe spoken language into text. The node is configured to handle multiple languages and model types, providing flexibility in various ASR applications. It detects speech, records it, and then uses a model to transcribe the recorded audio into text.
+This is the [RAI](https://github.com/RobotecAI/rai) automatic speech recognition package.
+It contains Agent definitions for the ASR feature.

-## Installation
+## Models

-```bash
-rosdep install --from-paths src --ignore-src -r
+This package contains three types of models: voice activity detection (VAD), wake word, and transcription.
+
+VAD and wake word models implement the `detect` API, with the following signature:
+
+```
+def detect(
+    self, audio_data: NDArray, input_parameters: dict[str, Any]
+) -> Tuple[bool, dict[str, Any]]:
+```
+
+This allows chaining models into detection pipelines. The `input_parameters` argument passes the output dictionary of the previous model on to the next one.
+
+Transcription models implement the `transcribe` API, with the following signature:
+
+```
+def transcribe(self, data: NDArray[np.int16]) -> str:
 ```

-## Subscribed Topics
+It takes audio data encoded as 2-byte integers and returns the transcribed string.
+
+### SileroVAD
+
+[SileroVAD](https://github.com/snakers4/silero-vad) is an open source VAD model. It requires no additional setup. It returns the confidence that the provided recording contains voice.
+
+### OpenWakeWord
+
+[OpenWakeWord](https://github.com/dscripka/openWakeWord) is an open source package containing multiple pre-configured models, and it also supports custom wake words.
+Refer to the package documentation for adding custom wake words.
+
+The model is expected to return `True` if the audio sample contains the wake word.
+
+### OpenAIWhisper
+
+[OpenAIWhisper](https://platform.openai.com/docs/guides/speech-to-text) is a cloud-based transcription model. Refer to the documentation for configuration capabilities.
+The `OPENAI_API_KEY` environment variable needs to be set to a valid OpenAI key in order to use this model.
+
+### LocalWhisper
+
+[LocalWhisper](https://github.com/openai/whisper) is the locally hosted version of OpenAI Whisper. It supports GPU acceleration and follows the same configuration capabilities as the cloud-based one.
+
+### FasterWhisper
+
+[FasterWhisper](https://github.com/SYSTRAN/faster-whisper) is another implementation of the Whisper model, optimized for speed and memory footprint. It follows the same API as the other two implementations.
+
+### Custom Models
+
+Custom VAD, wake word, or other detection models can be implemented by inheriting from `rai_asr.base.BaseVoiceDetectionModel`. The `detect` and `reset` methods must be implemented.
+
+Custom transcription models can be implemented by inheriting from `rai_asr.base.BaseTranscriptionModel`. The `transcribe` method must be implemented.
+
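The `detect` chaining and custom-model pattern described above can be sketched without the real `rai_asr` base classes. `SimpleEnergyVAD` and `run_pipeline` below are hypothetical stand-ins, illustrating only how `input_parameters` threads each model's output dictionary into the next model:

```python
from typing import Any, Dict, List, Tuple

import numpy as np
from numpy.typing import NDArray


class SimpleEnergyVAD:
    """Hypothetical stand-in for a rai_asr detection model (not the real base class)."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def detect(
        self, audio_data: NDArray, input_parameters: Dict[str, Any]
    ) -> Tuple[bool, Dict[str, Any]]:
        # Mean absolute amplitude as a crude "voice present" score.
        energy = float(np.mean(np.abs(audio_data.astype(np.float64))))
        return energy > self.threshold, {**input_parameters, "energy": energy}


def run_pipeline(models: List[SimpleEnergyVAD], audio: NDArray) -> bool:
    """Chain models: each one receives the output dict of the previous one."""
    params: Dict[str, Any] = {}
    for model in models:
        detected, params = model.detect(audio, params)
        if not detected:
            return False  # short-circuit as soon as one model rejects
    return True


silence = np.zeros(16000, dtype=np.int16)  # one second of silence at 16 kHz
tone = (10000 * np.sin(np.linspace(0, 2 * np.pi * 440, 16000))).astype(np.int16)

print(run_pipeline([SimpleEnergyVAD(threshold=100.0)], silence))  # False
print(run_pipeline([SimpleEnergyVAD(threshold=100.0)], tone))     # True
```

A real pipeline would mix model types (e.g. a VAD followed by a wake word model), with later models free to read earlier scores out of the parameter dictionary.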
+## Agents
+
+### Speech Recognition Agent

-This node does not subscribe to any topics. It operates independently, capturing audio directly from the microphone.
+The speech recognition Agent uses ROS 2 and sounddevice `Connectors` to communicate with other agents and to access the microphone.

-## Published Topics
+It fulfills the following ROS 2 communication API:

-- **`rai_asr/transcription`** (`std_msgs/String`): Publishes the transcribed text obtained from the audio recording.
-- **`rai_asr/status`** (`std_msgs/String`): Publishes node status (recording, transcribing). During transcription, the node does not listen/record.
+Publishes to topic `/to_human: [HRIMessage]`:
+`message.text` is set to the transcription result from the selected transcription model.

-## Parameters
+Publishes to topic `/voice_commands: [std_msgs/msg/String]`:

-- **`language`** (`string`, default: `"en"`): The language code for the ASR model. This parameter defines the language in which the audio will be transcribed.
-- **`model`** (`string`, default: `"base"`): The type of ASR model to use. Different models may have different performance characteristics. For list of models see `python -c "import whisper;print(whisper.available_models())"`
-- **`silence_grace_period`** (`double`, default: `1.0`): The grace period in seconds after silence is detected to stop recording. This helps in determining the end of a speech segment.
-- **`sample_rate`** (`integer`, default: `0`): The sample rate for audio capture. If set to 0, the sample rate will be auto-detected.
+- `"pause"` - when voice is detected but the `detection_pipeline` didn't return a detection (for interruptive S2S)
+- `"play"` - when voice is not detected, but a transcription was previously sent
+- `"stop"` - when voice is detected and the `detection_pipeline` returned a detection (or is empty)
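The three command cases can be condensed into a small decision function. This is an illustrative sketch of the rules as stated; `voice_command` and its parameter names are hypothetical, not the agent's actual implementation:

```python
from typing import Optional


def voice_command(
    voice_detected: bool,
    pipeline_detected: bool,
    pipeline_empty: bool,
    sent_transcription: bool,
) -> Optional[str]:
    """Map the detection state to a /voice_commands message, per the rules above."""
    if voice_detected:
        if pipeline_detected or pipeline_empty:
            return "stop"   # speech addressed to the robot: stop playback
        return "pause"      # voice present but pipeline did not fire (interruptive S2S)
    if sent_transcription:
        return "play"       # silence again after a transcription was sent
    return None             # nothing to signal


print(voice_command(True, False, False, False))   # pause
print(voice_command(True, True, False, False))    # stop
print(voice_command(False, False, False, True))   # play
```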
+19
@@ -0,0 +1,19 @@
+# Copyright (C) 2025 Robotec.AI
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from rai_asr.agents.asr_agent import SpeechRecognitionAgent
+
+__all__ = [
+    "SpeechRecognitionAgent",
+]

src/rai_core/rai/agents/voice_agent.py renamed to src/rai_asr/rai_asr/agents/asr_agent.py

+3-3
@@ -1,4 +1,4 @@
-# Copyright (C) 2024 Robotec.AI
+# Copyright (C) 2025 Robotec.AI
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -21,7 +21,6 @@

 import numpy as np
 from numpy.typing import NDArray
-
 from rai.agents.base import BaseAgent
 from rai.communication import (
     HRIPayload,

@@ -33,6 +32,7 @@
     SoundDeviceConnector,
     SoundDeviceMessage,
 )
+
 from rai_asr.models import BaseTranscriptionModel, BaseVoiceDetectionModel


@@ -43,7 +43,7 @@ class ThreadData(TypedDict):
     joined: bool


-class VoiceRecognitionAgent(BaseAgent):
+class SpeechRecognitionAgent(BaseAgent):
     """
     Agent responsible for voice recognition, transcription, and processing voice activity.

src/rai_asr/rai_asr/asr_clients.py

-75
This file was deleted.

src/rai_core/rai/agents/__init__.py

-4
@@ -16,14 +16,10 @@
 from rai.agents.react_agent import ReActAgent
 from rai.agents.state_based import create_state_based_agent
 from rai.agents.tool_runner import ToolRunner
-from rai.agents.tts_agent import TextToSpeechAgent
-from rai.agents.voice_agent import VoiceRecognitionAgent

 __all__ = [
     "ReActAgent",
-    "TextToSpeechAgent",
     "ToolRunner",
-    "VoiceRecognitionAgent",
     "create_conversational_agent",
     "create_state_based_agent",
 ]

src/rai_tts/README.md

+62
@@ -0,0 +1,62 @@
+# RAI Text To Speech
+
+This is the [RAI](https://github.com/RobotecAI/rai) text to speech package.
+It contains Agent definitions for the TTS feature.
+
+## Models
+
+Out of the box, the following models are supported:
+
+### ElevenLabs
+
+[ElevenLabs](https://elevenlabs.io/) is a proprietary cloud provider for TTS. Refer to the website for the documentation.
+In order to use it, the `ELEVENLABS_API_KEY` environment variable must be set to a valid API key.
+
+### OpenTTS
+
+[OpenTTS](https://github.com/synesthesiam/opentts) is an open source model for TTS.
+It can easily be set up using Docker. Run:
+
+```
+docker run -it -p 5500:5500 synesthesiam/opentts:en --no-espeak
+```
+
+to set up a basic English OpenTTS server on port 5500 (the default).
+Refer to the provider's documentation for available voices and options.
+
+### Custom Models
+
+To add a custom TTS model, inherit from the `rai_tts.models.base.TTSModel` class.
+
+You can use the following template:
+
+```
+class MyTTSModel(TTSModel):
+    def get_speech(self, text: str) -> AudioSegment:
+        ...
+        return AudioSegment()
+
+    def get_tts_params(self) -> Tuple[int, int]:
+        ...
+        return sample_rate, channels
+```
+
+Such a model will work with the `TextToSpeechAgent` defined below.
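Filling in that template, a toy model might synthesize a fixed tone. `FakeAudioSegment`, `SineToneTTS`, and the 10-ms-per-character rule are hypothetical stand-ins so the sketch runs without `pydub` or `rai_tts` installed:

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeAudioSegment:
    """Minimal stand-in for pydub.AudioSegment (hypothetical)."""

    samples: List[int] = field(default_factory=list)
    frame_rate: int = 16000


class SineToneTTS:
    """Toy 'TTS' model following the TTSModel template: a 440 Hz tone, 10 ms per character."""

    sample_rate = 16000
    channels = 1

    def get_speech(self, text: str) -> FakeAudioSegment:
        n = len(text) * 160  # 10 ms of audio per character at 16 kHz
        samples = [
            int(10000 * math.sin(2 * math.pi * 440 * i / self.sample_rate))
            for i in range(n)
        ]
        return FakeAudioSegment(samples=samples, frame_rate=self.sample_rate)

    def get_tts_params(self) -> Tuple[int, int]:
        return self.sample_rate, self.channels


model = SineToneTTS()
segment = model.get_speech("hello")
print(len(segment.samples))        # 800
print(model.get_tts_params())      # (16000, 1)
```

A real implementation would return a `pydub.AudioSegment` built from the model's raw audio, with `get_tts_params` reporting the sample rate and channel count the agent should use for playback.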
+
+## Agents
+
+### TextToSpeechAgent
+
+The TextToSpeechAgent utilises ROS 2 and sounddevice `Connectors` to receive data and play it using a speaker.
+It complies with the following ROS 2 API:
+
+Subscription topic `/to_human: [rai_interfaces/msg/HRIMessage]`:
+`message.text` will be parsed, run through the TTS model, and played using the speaker.
+Subscription topic `/voice_commands: [std_msgs/msg/String]`:
+The following values are accepted:
+
+- `"play"`: allow playing the voice through the speaker (if the voice queue is not empty)
+- `"pause"`: pause playing the voice through the speaker
+- `"stop"`: stop the current playback and clear the queue
+- `"tog_play"`: toggle between play and pause
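The command semantics above can be illustrated with a small queue model. `SpeakerQueue` is a hypothetical sketch of the stated behaviour, not the agent's actual playback code:

```python
from collections import deque
from typing import Deque, Optional


class SpeakerQueue:
    """Illustrative sketch of the /voice_commands semantics (not the real agent)."""

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()
        self.playing = False

    def handle(self, command: str) -> None:
        if command == "play":
            self.playing = True
        elif command == "pause":
            self.playing = False
        elif command == "stop":
            self.playing = False
            self.queue.clear()  # drop everything that was waiting
        elif command == "tog_play":
            self.playing = not self.playing

    def step(self) -> Optional[str]:
        """Return the next queued utterance if playback is allowed."""
        if self.playing and self.queue:
            return self.queue.popleft()
        return None


q = SpeakerQueue()
q.queue.extend(["hello", "world"])
q.handle("play")
print(q.step())        # hello
q.handle("tog_play")   # now paused
print(q.step())        # None
q.handle("stop")       # clears the queue
print(len(q.queue))    # 0
```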

src/rai_tts/rai_tts/__init__.py

+3-2
@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from .tts_clients import ElevenLabsClient, OpenTTSClient
+from .agents import TextToSpeechAgent
+from .models import ElevenLabsTTS, OpenTTS

-__all__ = ["ElevenLabsClient", "OpenTTSClient"]
+__all__ = ["ElevenLabsTTS", "OpenTTS", "TextToSpeechAgent"]
+19
@@ -0,0 +1,19 @@
+# Copyright (C) 2025 Robotec.AI
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from rai_tts.agents.tts_agent import TextToSpeechAgent
+
+__all__ = [
+    "TextToSpeechAgent",
+]

src/rai_core/rai/agents/tts_agent.py renamed to src/rai_tts/rai_tts/agents/tts_agent.py

+3-3
@@ -1,4 +1,4 @@
-# Copyright (C) 2024 Robotec.AI
+# Copyright (C) 2025 Robotec.AI
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -21,8 +21,6 @@

 from numpy._typing import NDArray
 from pydub import AudioSegment
-from std_msgs.msg import String
-
 from rai.agents.base import BaseAgent
 from rai.communication import (
     ROS2HRIConnector,

@@ -33,6 +31,8 @@
 from rai.communication.ros2.api import IROS2Message
 from rai.communication.ros2.connectors import ROS2HRIMessage
 from rai.communication.sound_device.connector import SoundDeviceMessage
+from std_msgs.msg import String
+
 from rai_interfaces.msg._hri_message import HRIMessage
 from rai_tts.models.base import TTSModel

0 commit comments
