Saganaki22 Brings Voice Cloning To ComfyUI With Zonos2_TTS-ComfyUI

Zonos2_TTS-ComfyUI is a new custom node package that brings text-to-speech and audio-only voice cloning directly into ComfyUI workflows. It lets users generate high-fidelity speech from text with the ZONOS2 model, optionally cloning a voice from reference audio alone. It runs entirely in-process on Windows and Linux with native PyTorch support.
Developer Saganaki22 created this integration to make the Zyphra ZONOS2 model accessible within the popular ComfyUI interface. They built native memory management and automatic model downloading into the custom node package. The project also includes validated mixed FP8 checkpoints to help run the model on graphics cards with limited memory.
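To see why FP8 weights help on cards with limited memory, the rough arithmetic can be sketched as follows. This is an illustration only: the parameter count and the share of weights stored in FP8 are hypothetical placeholders, not figures from the project, and the estimate ignores activations, caches, and framework overhead.

```python
def estimate_weight_bytes(n_params: int, fp8_fraction: float,
                          high_precision_bytes: int = 2) -> int:
    """Estimate weight storage for a mixed-precision checkpoint.

    fp8_fraction is the share of parameters stored in FP8 (1 byte each);
    the rest stay at high_precision_bytes each (2 for FP16/BF16).
    """
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fp8_fraction must be between 0 and 1")
    fp8_params = round(n_params * fp8_fraction)
    other_params = n_params - fp8_params
    return fp8_params * 1 + other_params * high_precision_bytes


# Hypothetical numbers, chosen only to show the arithmetic.
N_PARAMS = 1_000_000_000
baseline = estimate_weight_bytes(N_PARAMS, fp8_fraction=0.0)
mixed = estimate_weight_bytes(N_PARAMS, fp8_fraction=0.5)
print(f"16-bit weights: {baseline / 1e9:.2f} GB")
print(f"mixed FP8 weights: {mixed / 1e9:.2f} GB")
```

Because a mixed checkpoint keeps some layers at higher precision, the saving is smaller than the 50% that a fully FP8 model would give over 16-bit weights.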
Native voice cloning and generation features
- Zero-shot audio-only voice cloning support.
- Standard 44.1 kHz ComfyUI audio output.
- Automatic FlashAttention selection with fallback.
- Native progress bars and CLI reporting.
- Mixed FP8 checkpoints for memory savings.
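The "automatic selection with fallback" idea in the list above can be sketched in a few lines. This is a minimal illustration, not the node's actual logic: the function names and the "flash_attn" and "sdpa" labels are hypothetical, and it assumes the fallback is PyTorch's built-in scaled dot-product attention, which the source does not confirm.

```python
import importlib.util
from typing import Callable


def flash_attn_installed() -> bool:
    """Return True if the flash_attn package can be located."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attention_backend(probe: Callable[[], bool] = flash_attn_installed) -> str:
    """Prefer FlashAttention when available, otherwise fall back to SDPA."""
    try:
        if probe():
            return "flash_attn"
    except Exception:
        # A broken install should not stop generation; use the fallback.
        pass
    return "sdpa"


print(pick_attention_backend())
```

Passing the probe as a parameter keeps the choice testable on machines without FlashAttention installed.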
This tool is designed for creators who want to generate naturalistic speech locally without relying on cloud services. Users can build custom audio pipelines by combining this text-to-speech node with other image or video generation workflows. Anyone needing reliable voice cloning from short audio samples will find the automated memory management helpful for stable performance.
Limitations and installation details
The integration currently uses a raw text path instead of upstream text normalization, which might affect how it pronounces numbers or dates. Voice cloning relies on a speaker embedding that captures identity well but may not perfectly transfer accent or emotion from the reference audio. Users must also have Transformers version 5.0.0 or higher because older 4.x releases are not supported.
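Before installing, it may help to confirm that the Transformers requirement is met. The check below is a rough sketch using only the Python standard library; the helper names are hypothetical, and the simplified parser ignores pre-release tags, so a version such as 5.0.0rc1 is treated as 5.0.0.

```python
import re
from importlib import metadata
from typing import Optional


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from the leading digits of a version string."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed: str, minimum: str = "5.0.0") -> bool:
    """Return True if installed is at least minimum, ignoring pre-release tags."""
    return parse_version(installed) >= parse_version(minimum)


def installed_transformers_version() -> Optional[str]:
    """Return the installed Transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("Transformers is not installed")
elif meets_minimum(found):
    print(f"Transformers {found} satisfies the 5.0.0 minimum")
else:
    print(f"Transformers {found} is too old; 5.0.0 or higher is required")
```

Run it with the same Python interpreter that ComfyUI uses, since a system-wide interpreter may report a different installed version.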
"ZONOS2 excels at high-fidelity and naturalistic voice cloning." - Source: GitHub