FranckyB Updates Voice Clone Studio App
Voice Clone Studio is a modular Gradio-based web application that handles voice cloning, voice design, multi-speaker conversations, voice conversion, and sound effects generation. The tool consolidates multiple AI audio engines into a single interface, eliminating the need to manage separate repositories or setups for different audio tasks.
Developer FranckyB completely rewrote the application to improve modularity and expand its feature set. The project now aims to serve as a comprehensive audio workstation for AI-powered voice and sound work, with install scripts available for Windows, Linux, and macOS.
Model size: varies by engine. GPU VRAM: 8GB+ recommended.
What Voice Clone Studio Can Do
- Clone voices from short reference audio samples using Qwen3-TTS or VibeVoice engines.
- Generate multi-speaker conversations with up to 90 minutes of continuous speech.
- Create custom voice models through LoRA fine-tuning with built-in training pipelines.
- Automatically split and transcribe long audio files for dataset creation.
- Generate sound effects from text descriptions or sync audio to video using MMAudio.
- Convert speech between voices using Chatterbox speech-to-speech technology.
Content creators working on podcasts, audiobooks, or video projects can manage their entire audio workflow within this single application. The automatic audio splitting feature divides long recordings at sentence boundaries, which makes it easier to build training datasets from existing material.
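To make the idea of splitting at sentence boundaries concrete, here is a minimal Python sketch. It is not the application's actual code: it assumes a transcription step has already produced timestamped text segments, and the `Segment` type, the `split_at_sentences` function, and the `max_seconds` limit are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str

SENTENCE_END = (".", "!", "?")

def _merge(parts):
    """Join consecutive segments into one segment spanning all of them."""
    text = " ".join(p.text.strip() for p in parts)
    return Segment(parts[0].start, parts[-1].end, text)

def split_at_sentences(segments, max_seconds=15.0):
    """Group timestamped segments into clips that only end where a sentence ends.

    Hypothetical helper: the real tool's splitting rules are not documented here.
    """
    # Step 1: merge raw segments into whole sentences.
    sentences, buffer = [], []
    for seg in segments:
        buffer.append(seg)
        if seg.text.rstrip().endswith(SENTENCE_END):
            sentences.append(_merge(buffer))
            buffer = []
    if buffer:  # trailing text without terminal punctuation
        sentences.append(_merge(buffer))

    # Step 2: greedily pack sentences into clips no longer than max_seconds.
    # A single sentence longer than the limit still becomes its own clip.
    clips = []
    for sent in sentences:
        if clips and sent.end - clips[-1].start <= max_seconds:
            clips[-1] = _merge([clips[-1], sent])
        else:
            clips.append(sent)
    return clips
```

Each returned clip carries the start and end times needed to cut the audio, plus the matching transcript text, which is the pairing a fine-tuning dataset typically needs.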
Development Notes and Practical Setup
FranckyB notes that the tool has grown to support a wide range of engines, which required creating platform-specific installation scripts that let users choose which components to install. Users may see pip warnings about conflicting transformers version requirements during installation; according to the developer, the application works with either version. A useful interface tip: double-clicking a sample clip plays it directly.
The project includes a Prompt Manager with LLM support for generating TTS prompts locally via llama.cpp or Ollama, with system presets for different generation tasks. Speech-to-Speech support was recently added to the development branch, and a basic audio editor for assembling clips and sound effects is planned.
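As a rough illustration of generating a prompt through a local LLM, the sketch below builds a request for Ollama's `/api/generate` endpoint on its default local port. It is not taken from the project: the preset names and wording, the `build_request` and `generate_prompt` helpers, and the `llama3` model name are all assumptions, and the model would need to be pulled in Ollama beforehand.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical presets; the application's real preset names and text may differ.
SYSTEM_PRESETS = {
    "voice_design": "You write short, vivid descriptions of a speaker's voice for a TTS model.",
    "sound_effect": "You write concise descriptions of sound effects for an audio generator.",
}

def build_request(user_idea, preset, model="llama3"):
    """Build (but do not send) a POST request carrying a system preset and user idea."""
    payload = {
        "model": model,                    # assumed model name
        "system": SYSTEM_PRESETS[preset],  # preset steers the style of the output
        "prompt": user_idea,
        "stream": False,                   # ask for a single JSON reply, not a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate_prompt(user_idea, preset, model="llama3"):
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(user_idea, preset, model)) as resp:
        return json.loads(resp.read())["response"].strip()
```

With an Ollama server running locally, `generate_prompt("a calm, elderly narrator", "voice_design")` would return text to paste into a voice design field. Keeping request construction separate from sending makes the payload easy to inspect without a server.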