feat: kokoro tts support #643


Merged 35 commits into main on Jul 16, 2025

Conversation

@MagdalenaKotynia MagdalenaKotynia commented Jun 25, 2025

Purpose

  • To support the usage of the Kokoro-TTS model. Kokoro-TTS was selected for its high-quality speech output, small size, and potential to run on edge devices (it is distributed in ONNX format).

Proposed Changes

  • Developed a class implementing the TTSModel interface for the Kokoro-TTS model.
  • Updated the docs with the newly supported model.
  • Updated the TTSAgent example to be able to use the newly supported model.

Testing

With TTSAgent

  • Run the TTSAgent example: python examples/s2s/tts.py
  • In another terminal, run the following script to send a ROS2HRIMessage to the ROS 2 topic:
from rai.communication.ros2.connectors import ROS2HRIConnector
from rai.communication.ros2.messages import ROS2HRIMessage
import rclpy
import time

rclpy.init()
my_hri_msg = ROS2HRIMessage(
    text="Hello, human! This is a test message. How are you?",
    message_author="ai",
)

hri_connector = ROS2HRIConnector()

hri_connector.send_message(
    message=my_hri_msg,
    target="/to_human"
)

try:
    print("Sending message... Press Ctrl+C to exit")
    time.sleep(10)
    
except KeyboardInterrupt:
    print("Shutting down...")
finally:
    hri_connector.shutdown()
    rclpy.shutdown()

After a while, you should hear speech output from TTSAgent.

With ROS2S2SAgent

Run the following script and converse with agent:

from rai_s2s.sound_device import SoundDeviceConfig
from rai.communication.ros2 import ROS2Context
from rai_s2s.s2s.agents.s2s_agent import SpeechToSpeechAgent
from rai_s2s.s2s.agents.ros2s2s_agent import ROS2S2SAgent
from rai.agents.langchain.react_agent import ReActAgent
from rai_s2s.asr.models import OpenAIWhisper, SileroVAD
from rai_s2s import KokoroTTS

from rai.agents import AgentRunner


@ROS2Context()
def main():
    speaker_config = SoundDeviceConfig(
        stream=True,
        is_output=True,
        # device_name="EPOS PC 8 USB: Audio (hw:1,0)",
        # device_name="Sennheiser USB headset: Audio (hw:1,0)",
        # device_name="Jabra Speak2 40 MS: USB Audio (hw:2,0)",
        device_name="default",
    )

    microphone_config = SoundDeviceConfig(
        stream=True,
        channels=1,
        device_name="default",
        consumer_sampling_rate=16000,
        dtype="int16",
        is_input=True,
    )

    # whisper = LocalWhisper("tiny", 16000)
    whisper = OpenAIWhisper("gpt-4o-mini-transcribe", 16000)
    vad = SileroVAD(16000, 0.5)
    
    tts = KokoroTTS()

    agent = ROS2S2SAgent(
        from_human_topic="/from_human",
        to_human_topic="/to_human",
        microphone_config=microphone_config,
        speaker_config=speaker_config,
        transcription_model=whisper,
        vad=vad,
        tts=tts,
    )
    from rai.communication.ros2 import ROS2HRIConnector

    hri_connector = ROS2HRIConnector()
    llm = ReActAgent(
        target_connectors={"/to_human": hri_connector},
    )
    llm.subscribe_source("/from_human", hri_connector)
    runner = AgentRunner([agent, llm])
    runner.run_and_wait_for_shutdown()


if __name__ == "__main__":
    main()

The KokoroTTS model works well together with the ROS2S2SAgent.
My UX: it sounds nicer than OpenTTS, and I didn't observe any significant difference in inference time between the models.
The model sometimes does not put a space between sentences. EDIT: this was fixed by setting trim to false in Kokoro's create method.
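Why trim=False restores the pauses can be shown with a toy illustration (synthetic clips only, not Kokoro's actual DSP; the 24 kHz rate is an assumption for the sketch): trimming each sentence clip removes its trailing silence, so concatenated sentences run together.

```python
import numpy as np

SR = 24000  # assumed output sample rate, for illustration only


def fake_sentence(speech_s: float, pad_s: float) -> np.ndarray:
    """Stand-in for one synthesized sentence: 'speech' plus trailing silence."""
    speech = np.ones(int(speech_s * SR), dtype=np.float32)
    silence = np.zeros(int(pad_s * SR), dtype=np.float32)
    return np.concatenate([speech, silence])


def trim_silence(clip: np.ndarray) -> np.ndarray:
    """Drop trailing zeros, mimicking what trimming does to a clip."""
    nonzero = np.flatnonzero(clip)
    return clip[: nonzero[-1] + 1] if nonzero.size else clip


sentences = [fake_sentence(1.0, 0.3), fake_sentence(1.0, 0.3)]

trimmed = np.concatenate([trim_silence(s) for s in sentences])  # sentences run together
untrimmed = np.concatenate(sentences)  # natural 0.3 s pause kept between sentences
```

With trimming, both 0.3 s pauses disappear from the concatenated audio, which matches the "no space between sentences" symptom.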

@MagdalenaKotynia MagdalenaKotynia marked this pull request as ready for review June 26, 2025 13:35
@MagdalenaKotynia MagdalenaKotynia requested review from boczekbartek and removed request for boczekbartek June 26, 2025 17:45
pyproject.toml Outdated
Comment on lines 24 to 26
# To avoid yanked version 3.0.6
zarr = "!=3.0.6"

Member:

Does zarr 3.0.6 break rai?

Member Author:

I didn't test it. Zarr 3.0.6 was selected by poetry when resolving dependencies, and poetry warned that 3.0.6 is a yanked version.

Member:

This may introduce further incompatibilities with packages relying on the yanked version. Please remove this line; we will bump the packages later.

Member Author:

done bfd43cd

)

if samples.dtype == np.float32:
    samples = (samples * 32768).clip(-32768, 32767).astype(np.int16)
Member:

Are we expecting values outside of the provided range?
Clipping audio should only be used as a last resort, as it introduces massive quality degradation.

Member Author:

The clipping ensures that values stay within the [-32768, 32767] range to prevent overflow in case of e.g. numerical errors.

Member:

Please log an error if values of samples exceed the -1 to 1 range.

Member Author:

done 1143a79
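One way to satisfy the reviewer's request is to log an error when input leaves the expected [-1.0, 1.0] range before clipping. This is a minimal sketch, not the exact code merged in the referenced commit:

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def float32_to_int16(samples: np.ndarray) -> np.ndarray:
    """Convert float32 audio in [-1.0, 1.0] to int16 PCM.

    Logs an error when the input exceeds the expected range, so overflow
    is surfaced rather than silently clipped away, then clips as a
    safety net against integer wraparound.
    """
    if samples.size and (samples.min() < -1.0 or samples.max() > 1.0):
        logger.error(
            "Audio samples outside [-1.0, 1.0] (min=%.3f, max=%.3f); "
            "output will be clipped.",
            samples.min(),
            samples.max(),
        )
    return (samples * 32768).clip(-32768, 32767).astype(np.int16)
```

In-range input converts without loss of headroom; out-of-range input still produces valid int16 audio, but the error log makes the upstream bug visible.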

@@ -28,10 +28,12 @@ elevenlabs = { version = "^1.4.1", optional = true }
openai-whisper = { version = "^20231117", optional = true }
faster-whisper = { version = "^1.1.1", optional = true }
openwakeword = { git = "https://github.com/maciejmajek/openWakeWord.git", branch = "chore/remove-tflite-backend", optional = true }
kokoro-onnx = { version = "0.3.3", optional = true }
Member:

This does not install GPU support - the model will run only on CPU.
Please take a look here
and here

Member Author:

Thanks for noticing it. I added the required libraries and instructions on how to run on GPU. 6ba0527
There seems to be a bug in the kokoro-onnx source code here - both onnxruntime and onnxruntime-gpu are imported via import onnxruntime, so automatic detection of available providers will not work. That is why I added instructions to export the ONNX_PROVIDER variable.
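The provider override described above can also be set from Python instead of the shell, as long as it happens before the model is constructed. A sketch (the ONNX_PROVIDER variable name comes from the comment above; "CUDAExecutionProvider" is the standard onnxruntime provider name for CUDA):

```python
import os

# kokoro-onnx cannot auto-detect the GPU build of onnxruntime (both the CPU
# and GPU packages are imported as `import onnxruntime`), so the execution
# provider is selected via this environment variable. It must be set before
# the model is constructed.
os.environ["ONNX_PROVIDER"] = "CUDAExecutionProvider"

# from rai_s2s import KokoroTTS   # import path used in this PR's example
# tts = KokoroTTS()               # would now use the GPU provider,
#                                 # assuming onnxruntime-gpu is installed
```

The equivalent shell form is exporting ONNX_PROVIDER before launching the script.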

@maciejmajek maciejmajek merged commit a4af54c into main Jul 16, 2025
6 checks passed
@maciejmajek maciejmajek deleted the feat/kokoro-tts-support branch July 16, 2025 12:05