Scenema-Audio Lets You Direct Voices With Emotion And Scene Sounds


Scenema-Audio is a new open-source model that clones voices zero-shot and generates speech with emotional acting and scene sounds. Rather than simply reading text aloud, it interprets stage directions and shifts its delivery mid-sentence to match anger, exhaustion, or joy. A single text prompt can describe a voice, the environment, and exactly how every line should be spoken.

ScenemaAI built this release by distilling a 22B-parameter audiovisual model from Lightricks into an 8-step audio diffusion pipeline. They paired it with Gemma 3 12B for text encoding so the system runs efficiently on consumer GPUs. The project gives creators a local tool for producing scene-aware, emotionally nuanced speech without depending on cloud services.

Voice acting with stage directions

Key Features
  • Zero-shot voice cloning from 10–20 seconds of audio.
  • Emotion shifts mid-speech using <action> tags.
  • Scene-aware ambient sounds like rain or thunder.
  • Child voices that sound natural, not pitch-shifted.
  • Long-form narration with automatic voice continuity.
  • Support for 13 languages, including English and Hindi.
  • Runs on GPUs with as little as 16 GB of VRAM.
  • 48 kHz stereo output with clean vocal separation.
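
To make the <action> tag idea concrete, here is a minimal Python sketch of how a script with stage directions could be broken into (direction, spoken text) pairs before being handed to a speech model. The tag form `<action>…</action>` placed in front of the words it governs, the function name `split_by_action`, and the "neutral" default are all assumptions for illustration; the actual Scenema-Audio prompt syntax may differ, so check the project's documentation.

```python
import re
from typing import List, Tuple

# ASSUMPTION: directions are written as <action>...</action> and apply to the
# text that follows them. This is illustrative, not the model's confirmed syntax.
ACTION_TAG = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def split_by_action(script: str, default: str = "neutral") -> List[Tuple[str, str]]:
    """Return (direction, spoken_text) pairs in the order they appear."""
    # With one capture group, re.split yields:
    # [text_before_first_tag, direction_1, text_1, direction_2, text_2, ...]
    parts = ACTION_TAG.split(script)
    segments: List[Tuple[str, str]] = []

    lead = parts[0].strip()
    if lead:
        # Text before any tag gets the hypothetical default direction.
        segments.append((default, lead))

    for direction, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip directions that have no spoken text after them
            segments.append((direction.strip(), text))
    return segments


if __name__ == "__main__":
    line = (
        "I told you to wait. "
        "<action>voice breaking, exhausted</action> I can't do this anymore."
    )
    for direction, text in split_by_action(line):
        print(f"[{direction}] {text}")
```

Keeping the direction and the spoken text as separate fields makes it easy to review a script's emotional beats before generating any audio.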

This release fits creators who want expressive character dialogue for videos, games, or podcasts running entirely on local hardware. Small studios can generate narration that includes matching background acoustics and emotional delivery without ever uploading assets. Privacy-minded professionals also benefit because all processing stays on their own machine.

Limitations and setup notes

Each generation segment is capped at around 15 seconds, though longer scripts are automatically split while preserving voice consistency. Pronunciation may stumble on complex multi-syllable words and proper nouns, and voice cloning with emotionally flat reference audio can slightly dull the intended performance. The pipeline also requires a Hugging Face token and acceptance of Google’s terms for the Gemma 3 text encoder.
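
The pipeline handles splitting on its own, but a rough Python sketch shows what the roughly 15-second cap means for a script. It groups whole sentences into chunks whose estimated spoken length stays under the cap. The 150 words-per-minute speaking rate, the sentence-boundary regex, and the function names are assumptions for illustration, not Scenema-Audio's actual splitting logic.

```python
import re
from typing import List


def estimate_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Rough spoken duration, ASSUMING a fixed speaking rate."""
    return len(text.split()) / words_per_minute * 60.0


def split_script(
    script: str, max_seconds: float = 15.0, words_per_minute: float = 150.0
) -> List[str]:
    """Greedily pack whole sentences into chunks under max_seconds."""
    # Split after ., ! or ? followed by whitespace (a simplification).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate, words_per_minute) > max_seconds:
            # Adding this sentence would exceed the cap: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than the cap is kept whole, not cut.
    return chunks


if __name__ == "__main__":
    sentence = "one two three four five six seven eight nine ten."
    for i, chunk in enumerate(split_script(" ".join([sentence] * 7)), start=1):
        print(i, f"{estimate_seconds(chunk):.1f}s", chunk)
```

Estimating this way before generation gives a sense of how many segments a long narration will produce, and where the boundaries are likely to fall.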

"Every existing text-to-speech system converts words into sound, but none of them perform." (Source: Hugging Face)