New Services to Provide Speech-To-Text and Text-To-Speech Functionality from Aristech #35

Merged 26 commits on Jun 11, 2025

@ajgolledge ajgolledge commented Jun 3, 2025

This PR provides two new services from Aristech:

Speech-To-Text

This service is called "aristech-transcribe" and can be invoked from the Call-API "startConversation" under this name, together with the following JSON parameter:

{ "language": "de_DE" }

Note that this is in locale format (underscore-separated), not BCP 47. Simply using "de" also works, and I have not noticed any difference when specifying a region. The same holds for English ("en").
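For callers that already hold BCP 47-style tags such as "de-DE", the difference from the locale format can be bridged with a small conversion. The following is a minimal sketch, not code from this PR: `to_locale_format` is a hypothetical helper name, and it only handles the simple language or language-plus-region case (no script or variant subtags).

```rust
/// Convert a simple BCP 47-style tag such as "de-DE" into the
/// underscore-separated locale form ("de_DE").
/// Hypothetical helper for illustration; not part of the PR.
fn to_locale_format(tag: &str) -> String {
    // Accept either separator so already-converted input passes through.
    let mut parts = tag.split(|c| c == '-' || c == '_');
    let language = parts.next().unwrap_or("").to_ascii_lowercase();
    match parts.next() {
        // Language plus region: lowercase language, uppercase region.
        Some(region) if !region.is_empty() => {
            format!("{}_{}", language, region.to_ascii_uppercase())
        }
        // Bare language code such as "de" or "en".
        _ => language,
    }
}
```

With this helper, `to_locale_format("de-DE")` yields `"de_DE"`, and a bare `"de"` is passed through unchanged.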

An entry like this in the ivr.toml file ensures that authentication is taken care of.

[[contextSwitch.service]]
name = "aristech-transcribe"
params = { apiKey = "an-apikey" }

The following are still open issues:

  • Determine whether the credentials authentication is likely to be necessary in future or whether we can reliably just use apiKey
  • Is there a silence timeout and if so, is it configurable? Does the silence_timeout field in EndpointSpec have any effect?
  • When the example is run with the default microphone settings (as opposed to explicitly using 16 kHz), does the conversion function currently in use (audio::into_i16) get in the way? In other words, does omitting it improve the performance of the example?
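For context on the last open issue, a sample-format conversion of this kind usually maps normalized floating-point samples to 16-bit signed PCM. The sketch below is an assumption about what such a step typically looks like, not the actual `audio::into_i16` from this repository; `f32_to_i16` is a hypothetical name.

```rust
/// Convert normalized f32 samples (nominally in -1.0..=1.0) to 16-bit
/// signed PCM. Illustrative only; not the repository's audio::into_i16.
fn f32_to_i16(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        // Clamp out-of-range input first, then scale to the i16 range.
        .map(|&s| (s.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16)
        .collect()
}
```

A conversion like this is a single pass over the buffer, so whether it affects the example noticeably would have to be measured rather than assumed.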

Text-To-Speech

This service is called "aristech-synthesize" and can be invoked from the Call-API "startConversation" under this name, together with the following JSON parameter:

{ "voice": "anne_de_DE" }

Currently the only alternative voice available to us is "tom_de_DE".

An entry like this in the ivr.toml file ensures that authentication is taken care of.

[[contextSwitch.service]]
name = "aristech-synthesize"
params = { endpoint = "https://example.com", token = "a-valid-token", secret = "a-valid-secret" }
sampleRate = 22050

Both voices available to us currently work at a sample rate of 22050 Hz. Not specifying this can lead to amusing results 😄
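The "amusing results" follow from simple arithmetic: if audio produced at 22050 Hz is interpreted at a different rate, both playback speed and pitch scale by the ratio of the two rates. The sketch below illustrates this; the function names are hypothetical, and 16000 Hz is only an example of a mismatched rate, not a statement about what the system assumes by default.

```rust
/// Factor by which playback speed (and pitch) changes when audio produced
/// at `source_rate` Hz is interpreted as `assumed_rate` Hz.
/// Hypothetical helper for illustration.
fn playback_speed_factor(source_rate: u32, assumed_rate: u32) -> f64 {
    assumed_rate as f64 / source_rate as f64
}

/// Duration in seconds of `sample_count` mono samples played at `rate` Hz.
fn duration_secs(sample_count: usize, rate: u32) -> f64 {
    sample_count as f64 / rate as f64
}
```

For example, one second of 22050 Hz audio interpreted at 16000 Hz plays at roughly 0.73 times the intended speed and therefore sounds slower and lower-pitched.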

Open Issues

  • Are any other voices available to us apart from "tom_de_DE" and "anne_de_DE"?

@pragmatrix pragmatrix marked this pull request as draft June 4, 2025 05:27
@pragmatrix
Owner

Just minor changes: in transcribe.rs I've removed the empty string "" as the default for model / prompt and adjusted the test cases. I like the deserialization of the different credentials options; I'll adopt this for Azure.

@pragmatrix
Owner

As discussed, merging even though some open issues remain.

@pragmatrix pragmatrix marked this pull request as ready for review June 11, 2025 06:50
@pragmatrix pragmatrix merged commit 1acd92b into pragmatrix:master Jun 11, 2025
3 checks passed
@pragmatrix pragmatrix mentioned this pull request Jun 11, 2025
5 tasks