
@eureka928 eureka928 commented Jan 30, 2026

Description

This PR adds a new WhisperX Python backend that provides transcription with word-level timestamps, forced alignment, and speaker diarization (identifying who is speaking) via pyannote-audio.

Closes #3375

Key changes:

  • Extends the gRPC TranscriptSegment message with a speaker field (backward-compatible — existing backends leave it empty)
  • Maps the new Speaker field through the Go schema (core/schema/transcription.go) and backend mapper (core/backend/transcript.go)
  • Adds the full backend/python/whisperx/ backend with gRPC server, requirements for CPU/CUDA 12/CUDA 13/ROCm, and unit tests
  • Registers the backend in the Makefile, backend/index.yaml, and CI workflow
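To illustrate the backward-compatibility point above, here is a minimal Python sketch of a segment type with an optional speaker field. The class name mirrors the gRPC message, but the other field names, their types, and the `to_response` helper are assumptions for illustration, not the actual generated code or the Go mapper in this PR. Dropping an empty speaker from the output is likewise an assumption about how a client might want to see it.

```python
from dataclasses import dataclass, asdict


@dataclass
class TranscriptSegment:
    """Illustrative stand-in for the gRPC message (field set is assumed)."""
    id: int
    start: int
    end: int
    text: str
    # New optional field: backends that do not diarize simply leave it empty.
    speaker: str = ""


def to_response(segments):
    """Serialize segments to dicts, omitting the speaker key when it is empty."""
    out = []
    for seg in segments:
        item = asdict(seg)
        if not item["speaker"]:
            del item["speaker"]
        out.append(item)
    return out


if __name__ == "__main__":
    legacy = TranscriptSegment(id=0, start=0, end=1000, text="hello")
    diarized = TranscriptSegment(id=1, start=1000, end=2000, text="hi", speaker="SPEAKER_00")
    print(to_response([legacy, diarized]))
```

A backend that never sets `speaker` produces the same output shape as before, which is the property the first bullet relies on.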

Speaker diarization requires a HuggingFace token (HF_TOKEN env var) with access to pyannote models, and is activated by setting diarize=true in the transcription request.
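The gating described above could be sketched roughly as follows. The function name `diarization_enabled` and the injectable `env` mapping are hypothetical and only there to make the example testable; the assumption is that diarization runs only when the request asks for it and a non-empty `HF_TOKEN` is present.

```python
import os


def diarization_enabled(diarize_requested, env=None):
    """Return True only if the request sets diarize and HF_TOKEN is non-empty.

    `env` defaults to os.environ; passing a dict is a testing convenience
    (hypothetical, not part of the PR).
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return bool(diarize_requested) and bool(token)


if __name__ == "__main__":
    print(diarization_enabled(True, {"HF_TOKEN": "hf_example"}))
    print(diarization_enabled(True, {}))
```

With this shape, a request that sets diarize=true without a token would fall back to plain transcription instead of failing; whether the backend should do that or return an error is a design choice this sketch does not settle.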

Notes for Reviewers

  • The alignment model is cached per language to avoid reloading on every transcription call
  • The diarization pipeline is lazily initialized and reused across calls
  • Timestamp handling matches the existing faster-whisper convention
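For reviewers who want a picture of the first two notes, the caching pattern might look something like this. It is a sketch under assumptions: `ModelCache`, `load_align_model`, and `load_diarizer` are hypothetical names, the loaders are injected so the example runs without WhisperX or pyannote installed, and the lock is an assumption about concurrent gRPC handlers instead of something taken from the PR.

```python
import threading


class ModelCache:
    """Caches one alignment model per language and a single diarization pipeline."""

    def __init__(self, load_align_model, load_diarizer):
        # Loader callables are injected (hypothetical) so the cache is testable.
        self._load_align_model = load_align_model
        self._load_diarizer = load_diarizer
        self._align_models = {}
        self._diarizer = None
        self._lock = threading.Lock()

    def align_model(self, language):
        """Load the alignment model for `language` once, then reuse it."""
        with self._lock:
            if language not in self._align_models:
                self._align_models[language] = self._load_align_model(language)
            return self._align_models[language]

    def diarizer(self):
        """Create the diarization pipeline on first use, then reuse it."""
        with self._lock:
            if self._diarizer is None:
                self._diarizer = self._load_diarizer()
            return self._diarizer


if __name__ == "__main__":
    calls = []
    cache = ModelCache(
        load_align_model=lambda lang: calls.append(lang) or f"align-{lang}",
        load_diarizer=lambda: object(),
    )
    cache.align_model("en")
    cache.align_model("en")
    cache.align_model("de")
    print(calls)
```

The second request for "en" reuses the cached model, so only the first call per language pays the load cost.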

Signed commits

  • Yes, I signed my commits.


netlify bot commented Jan 30, 2026

Deploy Preview for localai ready!

Name Link
🔨 Latest commit 0d10ffb
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/697d1dd8fe8eca000813625c
😎 Deploy Preview https://deploy-preview-8299--localai.netlify.app

@eureka928 (Author)

@mudler @neurocis Nice to meet you. I'm glad to be submitting my first PR.

Would you review my PR?

Thank you for your time


eureka928 commented Jan 30, 2026

Hi @mudler I have updated the code based on your feedback.
Please let me know if you have any further feedback after your review.

Add speaker field to the gRPC TranscriptSegment message and map it
through the Go schema, enabling backends to return speaker labels.

Signed-off-by: eureka928 <[email protected]>

Add Python gRPC backend using WhisperX for speech-to-text with
word-level timestamps, forced alignment, and speaker diarization
via pyannote-audio when HF_TOKEN is provided.

Signed-off-by: eureka928 <[email protected]>
…ments

Address review feedback:
- Use --extra-index-url for CPU torch wheels to reduce size
- Remove torch version pins, let uv resolve compatible versions

Signed-off-by: eureka928 <[email protected]>
@eureka928 force-pushed the feat/whisperx-backend branch from 3e5133d to 0d10ffb on January 30, 2026 21:08

@eureka928 (Author)

Hi @mudler, I hope you're having a good weekend.
Could you give me more feedback once you've reviewed the changes?
Thank you, and have a nice weekend.

Development

Successfully merging this pull request may close these issues.

integrate whisperX

2 participants