DramaBox Interprets Stage Directions for Expressive AI Voiceovers

DramaBox is a text-to-speech system that turns scene descriptions and dialogue into expressive speech, complete with laughs, sighs, and pauses. It can clone a speaker’s timbre from just a 10-second audio reference. The model is an IC-LoRA fine‑tune of the open‑source LTX-2.3 audio branch and runs fully on a local GPU.

Resemble AI developed DramaBox to give creators direct, prompt‑based control over delivery, emotion, and non‑verbal sounds. Instead of merely reading text aloud, the model interprets stage directions like “she whispers” or “he chuckles” to produce natural performances. It uses Gemma 3 12B text embeddings to bridge written nuance and spoken output.

Prompt-driven voice acting with cloning

Key Features
  • Full control over emotion and delivery through the prompt.
  • Optional voice cloning from a 10-second reference clip.
  • Generates laughs, sighs, pauses, and whispers.
  • Long-form support with automatic sentence chunking.
  • An imperceptible neural watermark that survives editing.
  • Includes a training script for LoRA fine‑tuning on top of DramaBox.
  • Runs locally on a single GPU with 24 GB VRAM.
  • Average generation speed: about 2.5 seconds per clip on an H100 with a warm server.

Creators who need realistic, acted voiceovers can replace lengthy recording sessions with in‑the‑box direction. Game developers, animators, and audio producers can prototype dialogue with fine emotional detail before committing to final takes. Because everything stays on local hardware, professionals with strict privacy needs can work with sensitive scripts without uploading data to the cloud.

Users can also try the ComfyUI version of DramaBox.

How the model works and what to expect

Peak VRAM usage sits around 24 GB, and a warm server generates a clip in roughly 2.5 seconds on an H100. The base model was trained on short audio, but an integrated chunker splits long scripts at sentence boundaries and cross‑fades the resulting clips, which keeps the speaker consistent for up to about two minutes. The project is distributed under the LTX-2 Community License, making it free for non‑commercial use and fine‑tuning. A provided training script lets you add a custom voice or style on top of DramaBox.
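
The chunk-and-cross-fade idea described above can be illustrated with a minimal Python sketch. This is not DramaBox's actual chunker: the function names, the 200-character limit, the linear fade, and the `synthesize` callable (a stand-in for whatever function returns audio samples for a piece of text) are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending the tail of a into the head of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        blended.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[: len(a) - overlap] + blended + b[overlap:]


def render_long_script(text, synthesize, max_chars=200, overlap=0):
    """Synthesize each chunk with the supplied callable and cross-fade the results.

    `synthesize` is hypothetical: any function mapping a text chunk to a list
    of float samples.
    """
    audio = []
    for chunk in chunk_sentences(split_sentences(text), max_chars):
        clip = synthesize(chunk)
        audio = crossfade(audio, clip, overlap) if audio else list(clip)
    return audio
```

Splitting only at sentence boundaries keeps each clip a complete phrase, and the overlap blend avoids an audible click where two clips meet. A real pipeline would likely also pass the same voice reference to every chunk to keep the timbre stable.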

"The Most Expressive Voice Model." — Source: Reddit