Support multimodality in chat completion output #1563

@ThomasVitale

Description

OpenAI has recently introduced audio multimodality support for both input and output.

Support for the input audio modality is introduced in #1560, all the way up to the Spring AI abstractions.
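
For reference, consuming the input audio modality through the Spring AI abstractions looks roughly like this (a minimal sketch: chatModel is assumed to be a wired OpenAiChatModel, and the exact Media constructor may vary between milestones):

    // Attach an audio resource to a user message via the Media abstraction
    var audioResource = new ClassPathResource("speech.mp3");
    var userMessage = new UserMessage("What is this recording about?",
            List.of(new Media(MimeTypeUtils.parseMimeType("audio/mp3"), audioResource)));
    ChatResponse response = chatModel.call(new Prompt(List.of(userMessage)));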

The output audio modality is currently supported only at the lower level (OpenAiApi). Its usage is demonstrated in this integration test:

    @Test
    void outputAudio() {
        ChatCompletionMessage chatCompletionMessage = new ChatCompletionMessage(
                "What is the magic spell to make objects fly?", Role.USER);
        // Request an MP3 audio response spoken with the NOVA voice
        ChatCompletionRequest.AudioParameters audioParameters = new ChatCompletionRequest.AudioParameters(
                ChatCompletionRequest.AudioParameters.Voice.NOVA,
                ChatCompletionRequest.AudioParameters.AudioResponseFormat.MP3);
        ChatCompletionRequest chatCompletionRequest = new ChatCompletionRequest(List.of(chatCompletionMessage),
                OpenAiApi.ChatModel.GPT_4_O_AUDIO_PREVIEW.getValue(), audioParameters);

        ResponseEntity<ChatCompletion> response = openAiApi.chatCompletionEntity(chatCompletionRequest);

        assertThat(response).isNotNull();
        assertThat(response.getBody()).isNotNull();
        // Audio tokens appear only on the completion side: the prompt was text-only
        assertThat(response.getBody().usage().promptTokenDetails().audioTokens()).isEqualTo(0);
        assertThat(response.getBody().usage().completionTokenDetails().audioTokens()).isGreaterThan(0);
        // The response carries both the raw audio data and its transcript
        assertThat(response.getBody().choices().get(0).message().audioOutput().data()).isNotNull();
        assertThat(response.getBody().choices().get(0).message().audioOutput().transcript())
                .containsIgnoringCase("leviosa");
    }

It would be nice to start identifying what kind of abstractions the ChatResponse API needs in order to include audio response data.
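
As a starting point, one possible shape (purely illustrative; none of these accessors exist yet) would mirror the lower-level audioOutput() accessor on the AssistantMessage returned by the ChatResponse:

    // Hypothetical consumer-side API, for discussion only
    ChatResponse chatResponse = chatModel.call(new Prompt("What is the magic spell to make objects fly?"));
    AssistantMessage assistantMessage = chatResponse.getResult().getOutput();
    byte[] audioData = assistantMessage.getAudioOutput().data();          // raw audio bytes (hypothetical)
    String transcript = assistantMessage.getAudioOutput().transcript();   // audio transcript (hypothetical)

An alternative would be to reuse the existing Media abstraction on the message for the audio bytes and keep the transcript as the textual content, which would leave the ChatResponse structure itself unchanged.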
