Zyphra Pioneers ZONOS2 For Natural Voice Cloning And Text To Speech

ZONOS2 is a new text-to-speech model designed to generate highly expressive, natural-sounding audio. It predicts high-quality audio tokens to produce studio-grade sound at a 44.1 kHz sample rate. The system can clone voices with high accuracy without any prior fine-tuning of the model.
Zyphra, which also developed ZAYA1-8B, created this project to address the usual tradeoff between generation speed and audio quality. The company trained the model on more than six million hours of varied multilingual speech data and released it as an open-source project under the Apache 2.0 license.
Audio quality and language support
- High-fidelity zero-shot voice cloning.
- Generates 44.1 kHz studio-quality audio.
- Reads raw text bytes without a phonemizer.
- Supports multiple languages for synthesis.
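To illustrate what reading raw text bytes without a phonemizer can mean in practice, here is a minimal Python sketch that turns a string into its UTF-8 byte values. The function names are hypothetical and are not part of the ZONOS2 codebase; how the model actually consumes these bytes is not described here.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Inverse of text_to_byte_ids: rebuild the string from byte values."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    ids = text_to_byte_ids("Héllo")
    # "é" occupies two bytes in UTF-8, so five characters yield six ids.
    print(ids)  # [72, 195, 169, 108, 108, 111]
    print(byte_ids_to_text(ids))  # Héllo
```

Because every language shares the same 256 possible byte values, an input scheme like this needs no per-language pronunciation dictionary, which is one plausible reason a byte-level design pairs well with multilingual synthesis.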
This tool is aimed at users who need fast, efficient speech synthesis running on their own hardware. People working on text-to-speech applications can use it to generate natural voices without relying on paid cloud services. Developers can also integrate the provided Python API or server endpoint directly into their own software pipelines.
Development and system requirements
Running the model locally requires a Linux system with an x86-64 architecture and an NVIDIA GPU. The developers used a staged data filtering process during training to reduce hallucinations and mispronunciations. Zyphra notes that modeling high-fidelity audio tokens is inherently harder than working with lower-quality formats.
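The stated requirements can be checked before installation with a short script. This is a rough sketch using only the Python standard library, not a tool shipped with ZONOS2; the function names are hypothetical, and looking for the `nvidia-smi` executable is only a heuristic for an installed NVIDIA driver, not proof of a usable GPU.

```python
import platform
import shutil


def check_requirements() -> dict[str, bool]:
    """Report whether this machine matches each stated requirement."""
    return {
        # Operating system must be Linux.
        "linux": platform.system() == "Linux",
        # CPU architecture must be x86-64 (reported as x86_64 or AMD64).
        "x86_64": platform.machine().lower() in ("x86_64", "amd64"),
        # Heuristic: the NVIDIA driver normally installs nvidia-smi on PATH.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


def meets_requirements(checks: dict[str, bool]) -> bool:
    """True only if every individual check passed."""
    return all(checks.values())


if __name__ == "__main__":
    results = check_requirements()
    for name, ok in results.items():
        print(f"{name}: {'ok' if ok else 'missing'}")
    print("ready" if meets_requirements(results) else "not ready")
```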
"ZONOS2 excels at high-fidelity and naturalistic voice cloning." — Source: GitHub