Saganaki22 Brings Voice Cloning To ComfyUI With Zonos2_TTS-ComfyUI

    
By vramkickedin | June 29, 2026 at 10:26 pm | 2 min read

Zonos2_TTS-ComfyUI is a new custom node integration that brings advanced text-to-speech and audio-only voice cloning capabilities directly into ComfyUI workflows. The release lets users generate high-fidelity speech from text with the ZONOS2 model, optionally in a voice cloned from reference audio. It runs entirely in-process on Windows and Linux with native PyTorch support.

Developer Saganaki22 created this integration to make the Zyphra ZONOS2 model accessible within the popular ComfyUI interface. They built native memory management and automatic model downloading into the custom node package. The project also includes validated mixed FP8 checkpoints to help run the model on graphics cards with limited memory.

Native voice cloning and generation features

Key Features

Zero-shot audio-only voice cloning support.
Standard 44.1 kHz ComfyUI audio output.
Automatic FlashAttention selection with fallback.
Native progress bars and CLI reporting.
Mixed FP8 checkpoints for memory savings.
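
The "automatic selection with fallback" item can be pictured with a small sketch. This is not the node's actual code, which the announcement does not describe. The function name `pick_attention_backend` is hypothetical, and the returned strings assume the backend names Transformers accepts for `attn_implementation`.

```python
import importlib.util


def pick_attention_backend(prefer_flash: bool = True) -> str:
    """Return an attention backend name, preferring FlashAttention if installed.

    Hypothetical helper for illustration only; the real node may decide
    differently (for example by also checking GPU capability).
    """
    # find_spec returns None when the top-level package is not installed,
    # so this probes availability without importing the CUDA extension.
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Fall back to PyTorch's built-in scaled dot-product attention.
    return "sdpa"


if __name__ == "__main__":
    print(pick_attention_backend())
```

Probing with `find_spec` rather than a bare `import` keeps the check cheap and avoids crashing at startup on machines where the FlashAttention package is missing.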

This tool is designed for creators who want to generate naturalistic speech locally without relying on cloud services. Users can build custom audio pipelines by combining this text-to-speech node with other image or video generation workflows. Anyone needing reliable voice cloning from short audio samples will find the automated memory management helpful for stable performance.

Limitations and installation details

The integration currently uses a raw text path instead of upstream text normalization, which might affect how it pronounces numbers or dates. Voice cloning relies on a speaker embedding that captures identity well but may not perfectly transfer accent or emotion from the reference audio. Users must also have Transformers version 5.0.0 or higher because older 4.x releases are not supported.
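
Before installing, it may help to confirm the Transformers requirement. The following is a minimal sketch using only the standard library. The helper names (`parse_release`, `meets_minimum`, `installed_transformers_ok`) are made up for illustration and are not part of the node package. The comparison ignores pre-release suffixes such as `rc1`, which is a simplification.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum stated in the article: Transformers 5.0.0 or higher.
MIN_TRANSFORMERS = (5, 0, 0)


def parse_release(version_string: str) -> tuple:
    """Turn '5.0.0' or '4.57.1.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_string: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """Return True if the release number is at least `minimum`."""
    release = parse_release(version_string)
    # Pad short versions so '5.0' is treated like (5, 0, 0).
    release = release + (0,) * (len(minimum) - len(release))
    return release >= minimum


def installed_transformers_ok() -> bool:
    """Check the installed Transformers package, if any."""
    try:
        return meets_minimum(version("transformers"))
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("Transformers requirement met:", installed_transformers_ok())
```

Running the script in the same Python environment that ComfyUI uses shows whether an upgrade is needed before adding the node.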