LongCat-AudioDiT Masters Zero-Shot Voice Cloning with Ease


LongCat-AudioDiT is a new text-to-speech model that generates high-fidelity audio from text. It operates directly in the waveform latent space rather than relying on intermediate acoustic representations such as mel-spectrograms. This approach reduces compounding errors and simplifies the audio generation pipeline.

Meituan-longcat developed this tool to achieve top-tier performance in zero-shot voice cloning. The model addresses a common issue in previous diffusion-based systems by fixing a training-inference mismatch. Its streamlined architecture requires only two components: a waveform variational autoencoder (Wav-VAE) and a diffusion backbone.

Features and performance of LongCat-AudioDiT

  • Diffusion-based text-to-speech synthesis.
  • Direct operation on waveform latent space.
  • Zero-shot voice cloning capabilities.
  • Available in 1B and 3.5B parameter variants.
  • Implements Adaptive Projection Guidance (APG).
  • Outperforms previous state-of-the-art models on the Seed benchmark.
  • Compatible with ComfyUI workflows.
  • Released under the MIT license.

Content creators and developers working with limited datasets may find this tool useful for generating natural-sounding speech. The model’s zero-shot cloning lets users replicate a specific voice from just a short audio sample. Because it avoids complex multi-stage training, it is easier to deploy locally than older systems.

Development and general research insights

During development, the team focused on refining the guidance process used during inference. They replaced traditional classifier-free guidance with Adaptive Projection Guidance (APG) to improve the quality of the generated audio. Their tests on the Seed benchmark show that the 3.5B variant achieved higher speaker similarity scores than the previous leading model, Seed-TTS.
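To make the difference between the two guidance schemes concrete, here is a minimal sketch in plain Python. It follows the general idea of Adaptive Projection Guidance as published for diffusion models: the guidance update is split into a component parallel to the conditional prediction and a component orthogonal to it, and the parallel part is down-weighted. This is not taken from the LongCat-AudioDiT code. The function names, the `eta` parameter, and the toy vectors are assumptions for illustration, and the sketch omits the norm rescaling and momentum terms that fuller APG formulations include.

```python
from typing import List, Sequence


def cfg_guidance(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Classifier-free guidance, written as cond + (scale - 1) * (cond - uncond)."""
    return [c + (scale - 1.0) * (c - u) for c, u in zip(cond, uncond)]


def apg_guidance(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
    eta: float = 0.0,  # hypothetical name: weight kept on the parallel component
) -> List[float]:
    """Simplified APG-style guidance (illustrative sketch, not the model's implementation)."""
    diff = [c - u for c, u in zip(cond, uncond)]

    # Project the guidance update onto the conditional prediction.
    norm_sq = sum(c * c for c in cond)
    if norm_sq == 0.0:
        parallel = [0.0] * len(diff)
    else:
        coeff = sum(d * c for d, c in zip(diff, cond)) / norm_sq
        parallel = [coeff * c for c in cond]

    # Whatever is left after removing the projection is orthogonal to cond.
    orthogonal = [d - p for d, p in zip(diff, parallel)]

    # Keep the orthogonal part in full and down-weight the parallel part.
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (scale - 1.0) * u for c, u in zip(cond, update)]


if __name__ == "__main__":
    # Toy stand-ins for the conditional and unconditional model predictions.
    cond = [1.0, 2.0, 3.0]
    uncond = [0.5, 1.0, 4.0]
    print("CFG:", cfg_guidance(cond, uncond, scale=3.0))
    print("APG:", apg_guidance(cond, uncond, scale=3.0, eta=0.0))
```

With `eta=1.0` the sketch reduces to ordinary classifier-free guidance. With `eta=0.0` the update is purely orthogonal to the conditional prediction, so a high guidance scale steers the output without inflating its magnitude along that direction.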

The researchers shared a counterintuitive discovery regarding model architecture. They noted, "superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance." This finding suggests that optimizing the autoencoder component alone is not sufficient for improving final speech output. The team has released both code and model weights to support further research within the speech community.

Get LongCat-AudioDiT on GitHub. Access the model weights on Hugging Face.