feat: kokoro tts support #643
…oid yanked version, regenerated poetry lock
…re read alound as words by TTS model
…wo separate chunks
pyproject.toml
Outdated
# To avoid yanked version 3.0.6
zarr = "!=3.0.6"
Does zarr 3.0.6 break rai?
I didn't test it. Zarr 3.0.6 was selected by poetry when resolving dependencies, and poetry threw a warning that zarr 3.0.6 is a yanked version.
This may introduce further incompatibilities with packages relying on the yanked version. Please remove this line; we will bump the packages later.
done bfd43cd
)

if samples.dtype == np.float32:
    samples = (samples * 32768).clip(-32768, 32767).astype(np.int16)
Are we expecting values outside of the provided range?
Clipping audio should only be used as a last resort, as it introduces massive quality degradation.
The clipping ensures that values stay within the [-32768, 32767] range, which prevents int16 overflow in case of, e.g., numerical errors.
Please log an error if values of samples exceed the -1 to 1 range.
done 1143a79
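For readers following this thread, here is a minimal sketch of the conversion with the requested logging. It is not the code from 1143a79; the function name `float32_to_int16` and the log message are hypothetical and chosen only for illustration.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def float32_to_int16(samples: np.ndarray) -> np.ndarray:
    """Convert float32 audio to int16 PCM (hypothetical helper, not from the PR).

    Non-float32 input is returned unchanged. An error is logged when any
    sample lies outside [-1, 1], since clipping then alters the signal.
    """
    if samples.dtype != np.float32:
        return samples

    # Peak absolute amplitude; an empty array has no peak.
    peak = float(np.max(np.abs(samples))) if samples.size else 0.0
    if peak > 1.0:
        logger.error(
            "Audio samples exceed the [-1, 1] range (peak=%.4f); clipping.", peak
        )

    # Scale to the int16 range. The clip also catches +1.0, which scales to
    # 32768 and would otherwise overflow int16.
    return (samples * 32768).clip(-32768, 32767).astype(np.int16)
```

Note that even in-range audio needs the clip for the single value +1.0, so the log is only emitted for samples that are genuinely out of range.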
Co-authored-by: Maciej Majek <[email protected]>
@@ -28,10 +28,12 @@ elevenlabs = { version = "^1.4.1", optional = true }
openai-whisper = { version = "^20231117", optional = true }
faster-whisper = { version = "^1.1.1", optional = true }
openwakeword = { git = "https://github.com/maciejmajek/openWakeWord.git", branch = "chore/remove-tflite-backend", optional = true }
kokoro-onnx = { version = "0.3.3", optional = true }
Thanks for noticing it. I added the required libraries and instructions on how to run on GPU. 6ba0527
There seems to be a bug in the kokoro-onnx source code here: both onnxruntime and onnxruntime-gpu are imported via import onnxruntime, so automatic detection of available providers will not work. That is why I added instructions to export the ONNX_PROVIDER variable.
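To illustrate why an explicit environment variable sidesteps the detection problem, here is a rough sketch of provider selection driven by ONNX_PROVIDER. It is an assumption-laden illustration, not kokoro-onnx's actual code: `select_providers` is a hypothetical helper, and the list of available providers is passed in so the sketch runs without onnxruntime installed. In practice that list could come from `onnxruntime.get_available_providers()`.

```python
import os
from typing import Mapping, Sequence


def select_providers(
    available: Sequence[str],
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Pick ONNX Runtime execution providers (hypothetical helper).

    If ONNX_PROVIDER is set, use exactly that provider and fail loudly
    when it is not available. Otherwise fall back to the CPU provider.
    """
    requested = env.get("ONNX_PROVIDER")
    if requested:
        if requested not in available:
            raise ValueError(
                f"ONNX_PROVIDER={requested!r} is not available; "
                f"available providers: {list(available)}"
            )
        return [requested]

    # No explicit request: the CPU provider is the safe default.
    return ["CPUExecutionProvider"]
```

With this shape, exporting `ONNX_PROVIDER=CUDAExecutionProvider` before starting the agent would request the GPU provider regardless of which onnxruntime package variant was imported.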
Purpose
Proposed Changes
Testing
poetry install --with s2s
With TTSAgent
python examples/s2s/tts.py
After a while, you should hear speech output from TTSAgent.
With ROS2S2SAgent
Run the following script and converse with agent:
The KokoroTTS model works well together with the ROS2S2SAgent.
My UX - It sounds nicer compared with OpenTTS. I didn't observe any significant differences in inference time between the models.
The model sometimes does not put a space between sentences. EDIT: It was fixed by setting trim to false in the create method of Kokoro.