Syvai cohere-transcribe-diarize Delivers Speaker Diarization In One Shot

The cohere-transcribe-diarize model adds speaker identification and word-level timestamps to Cohere’s open-source speech recognition system. It is a fine-tune of the existing Cohere Transcribe model that emits special tokens marking who is speaking and when. As a result, a single pass over a short audio clip yields a transcript with speaker changes and precise timing.
Syvai, a Danish AI company, created this fine-tuned model and released it on Hugging Face. The goal was to fill the gap for an open tool that handles both transcription and diarization without separate systems. The model is designed for short audio under 30 seconds, but it ships with helper scripts that handle longer recordings by sliding a window over the audio and clustering speaker identities across windows.
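To make the sliding-window idea concrete, here is a minimal sketch of how a long recording could be cut into overlapping windows that each fit the 30-second limit. It is not taken from Syvai's helper scripts: the function name `plan_windows` and the 5-second overlap are assumptions chosen for illustration.

```python
from typing import List, Tuple


def plan_windows(
    total_s: float,
    window_s: float = 30.0,   # model's stated input limit
    overlap_s: float = 5.0,   # assumed overlap, not a documented value
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that cover the recording
    with overlapping windows of at most window_s seconds."""
    if window_s <= 0 or not (0 <= overlap_s < window_s):
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")

    hop = window_s - overlap_s  # how far each window advances
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += hop
    return windows


if __name__ == "__main__":
    # A 70-second clip becomes three windows that overlap by 5 seconds.
    print(plan_windows(70.0))
```

Each window would then be transcribed on its own, and the overlapping region gives a later step something to compare when matching speakers between neighbouring windows.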
Speaker labels and timestamps in one model
- Adds speaker labels and word timestamps.
- Supports up to 8 speaker slots, local to each decode.
- Best accuracy with 4 or fewer speakers.
- 100 ms timestamp resolution for each word.
- Long-form audio via sliding window scripts.
This model is ideal for privacy-focused teams that need to transcribe meetings or podcasts without sending audio to the cloud. Hobbyists with an RTX 3090 or similar GPU can run it entirely on their own machine. The provided scripts make it accessible even for users who aren't experienced developers.
Developer notes and known limits
The 30-second input limit means meeting-length recordings must be split into overlapping windows, with speakers re-clustered afterwards. Speaker identities are local to each decode, so labels have to be linked across windows to stay consistent over the whole recording. With the recommended vLLM server, throughput exceeds 100x real-time on a single RTX 3090; the transformers-only path is simpler for quick testing.
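Because speaker labels are only meaningful inside one decode, something has to reconcile them between windows. The sketch below shows one simple way to do that, and it is deliberately not the clustering approach the helper scripts are described as using: it matches labels from two neighbouring windows by how much speech time they share in the overlapping region. The segment format, the label names, and the function `link_labels` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (start_s, end_s, local speaker label), with times already shifted
# to the global timeline of the full recording. Hypothetical format.
Segment = Tuple[float, float, str]


def link_labels(prev: List[Segment], curr: List[Segment]) -> Dict[str, str]:
    """Map each local label in `curr` to the label in `prev` it
    co-occurs with most. Labels with no shared speech are left out,
    so the caller can treat them as new speakers."""
    shared: Dict[Tuple[str, str], float] = defaultdict(float)
    for p_start, p_end, p_label in prev:
        for c_start, c_end, c_label in curr:
            overlap = min(p_end, c_end) - max(p_start, c_start)
            if overlap > 0:
                shared[(c_label, p_label)] += overlap

    mapping: Dict[str, str] = {}
    used = set()
    # Greedy one-to-one assignment, largest shared duration first.
    ranked = sorted(shared.items(), key=lambda kv: kv[1], reverse=True)
    for (c_label, p_label), _ in ranked:
        if c_label not in mapping and p_label not in used:
            mapping[c_label] = p_label
            used.add(p_label)
    return mapping


if __name__ == "__main__":
    prev = [(20.0, 27.0, "A"), (27.0, 30.0, "B")]
    curr = [(25.0, 27.0, "S1"), (27.0, 33.0, "S2"), (33.0, 40.0, "S3")]
    print(link_labels(prev, curr))  # S3 has no match and is omitted
```

Overlap matching only works when a speaker actually talks inside the shared region, which is one reason a real pipeline might prefer clustering over the whole recording.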
"The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds." — Source: Reddit