## Description

This is the [RAI](https://github.com/RobotecAI/rai) automatic speech recognition package.
It contains Agent definitions for the ASR feature.

## Models

This package contains three types of models: Voice Activity Detection (VAD), wake word detection, and transcription.

VAD and wake word models expose a `detect` API with the following signature:

```python
    def detect(
        self, audio_data: NDArray, input_parameters: dict[str, Any]
    ) -> Tuple[bool, dict[str, Any]]:
```

This allows chaining models into detection pipelines. The `input_parameters` argument provides a way to pass the output dictionary from previous models in the pipeline.
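
For illustration, a minimal sketch of such a pipeline is shown below. The concrete class names and constructor arguments (`SileroVAD`, `OpenWakeWord`, `wake_word_model_path`) are assumptions for the example; only the `detect` signature above is documented here.

```python
from typing import Any

import numpy as np

# Assumed import path and constructor arguments; only the `detect` signature
# above is documented by this package.
from rai_asr.models import OpenWakeWord, SileroVAD

vad = SileroVAD()
wake_word = OpenWakeWord(wake_word_model_path="hey_jarvis")


def run_detection_pipeline(audio_chunk: np.ndarray) -> bool:
    """Chain VAD and wake word detection on a single audio chunk."""
    params: dict[str, Any] = {}
    voice_detected, params = vad.detect(audio_chunk, params)
    if not voice_detected:
        return False
    # The VAD output dictionary is forwarded to the next model in the chain.
    wake_word_detected, params = wake_word.detect(audio_chunk, params)
    return wake_word_detected
```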

Transcription models expose a `transcribe` API with the following signature:

```python
    def transcribe(self, data: NDArray[np.int16]) -> str:
```

It takes audio data encoded as 16-bit integers and returns the transcribed string.
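
A minimal usage sketch follows. The concrete class name and constructor arguments (`LocalWhisper`, `model_name`, `sample_rate`) are assumptions for the example; only the `transcribe` signature is documented above.

```python
import numpy as np

# Assumed import path and constructor arguments; only the `transcribe`
# signature above is documented by this package.
from rai_asr.models import LocalWhisper

model = LocalWhisper(model_name="tiny", sample_rate=16000)

# One second of silence at 16 kHz, encoded as 16-bit integers as expected by
# the `transcribe` API. In practice this would be a microphone recording.
audio = np.zeros(16000, dtype=np.int16)

print(model.transcribe(audio))
```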

### SileroVAD

[SileroVAD](https://github.com/snakers4/silero-vad) is an open source VAD model. It requires no additional setup. It returns a confidence score for the presence of voice in the provided recording.

### OpenWakeWord

[OpenWakeWord](https://github.com/dscripka/openWakeWord) is an open source package containing multiple pre-configured models and supporting custom wake words.
Refer to the package documentation for adding custom wake words.

The model is expected to return `True` if the audio sample contains the wake word.

### OpenAIWhisper

[OpenAIWhisper](https://platform.openai.com/docs/guides/speech-to-text) is a cloud-based transcription model. Refer to the documentation for configuration capabilities.
The environment variable `OPEN_API_KEY` needs to be set to a valid OpenAI API key in order to use this model.
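
For example, the key can be set from Python before the model is constructed (the variable name is taken from the note above):

```python
import os

# The OpenAI-backed model reads the key from this environment variable,
# so it must be set before the model is instantiated.
os.environ["OPEN_API_KEY"] = "<your OpenAI API key>"
```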

### LocalWhisper

[LocalWhisper](https://github.com/openai/whisper) is the locally hosted version of OpenAI Whisper. It supports GPU acceleration and follows the same configuration capabilities as the cloud-based one.

### FasterWhisper

[FasterWhisper](https://github.com/SYSTRAN/faster-whisper) is another implementation of the Whisper model, optimized for speed and memory footprint. It follows the same API as the other two provided implementations.

### Custom Models

Custom VAD, Wake Word, or other detection models can be implemented by inheriting from `rai_asr.base.BaseVoiceDetectionModel`. The `detect` and `reset` methods must be implemented.

Custom transcription models can be implemented by inheriting from `rai_asr.base.BaseTranscriptionModel`. The `transcribe` method must be implemented.
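
A minimal sketch of a custom detection model is shown below; the base class import path is taken from the note above, while the energy-threshold logic is purely illustrative.

```python
from typing import Any, Tuple

import numpy as np
from numpy.typing import NDArray

from rai_asr.base import BaseVoiceDetectionModel


class EnergyGate(BaseVoiceDetectionModel):
    """Toy detector that triggers when the signal energy exceeds a threshold."""

    def __init__(self, threshold: float = 1e6):
        super().__init__()
        self.threshold = threshold

    def detect(
        self, audio_data: NDArray, input_parameters: dict[str, Any]
    ) -> Tuple[bool, dict[str, Any]]:
        energy = float(np.sum(np.square(audio_data.astype(np.float64))))
        # Forward the measurement so downstream models in the pipeline can use it.
        return energy > self.threshold, {**input_parameters, "energy": energy}

    def reset(self):
        # This toy detector keeps no state between audio chunks.
        pass
```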

## Agents

### Speech Recognition Agent

The speech recognition Agent uses ROS 2 and sounddevice `Connectors` to communicate with other agents and access the microphone.

It fulfills the following ROS 2 communication API:

Publishes to topic `/to_human: [HRIMessage]`:
`message.text` is set with the transcription result using the selected transcription model.

Publishes to topic `/voice_commands: [std_msgs/msg/String]`:

- `"pause"` - when voice is detected but the `detection_pipeline` didn't return a detection (for interruptive S2S)
- `"play"` - when voice is not detected, but a transcription was previously sent
- `"stop"` - when voice is detected and the `detection_pipeline` returned a detection (or is empty)
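
As an example, another ROS 2 node can react to these commands with a plain `rclpy` subscriber; this is a minimal sketch, with only the topic name and message type taken from the API above.

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class VoiceCommandListener(Node):
    """Minimal node reacting to the ASR agent's playback commands."""

    def __init__(self):
        super().__init__("voice_command_listener")
        self.create_subscription(String, "/voice_commands", self.on_command, 10)

    def on_command(self, msg: String):
        # msg.data is one of "pause", "play" or "stop", as described above.
        self.get_logger().info(f"Received voice command: {msg.data}")


def main():
    rclpy.init()
    rclpy.spin(VoiceCommandListener())


if __name__ == "__main__":
    main()
```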