JoyAI-Echo Spins Multi-Shot AI Video Stories With Synced Audio

    
By vramkickedin | June 15, 2026 at 6:22 pm | 2 min read

JoyAI-Echo is a new open-source model that generates multi-shot, minute-long videos with synchronized audio directly from text prompts. You give it a JSON script describing each shot, and it outputs a fully animated sequence with matching sound. A distilled generator makes inference about 7.5 times faster than earlier approaches while preserving character appearance and voice timbre across scenes. The framework supports stories up to five minutes, with each clip running 241 frames at 25 frames per second.
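To make the clip figures concrete, here is a rough Python sketch of the arithmetic: 241 frames at 25 frames per second gives the length of one clip, and from that, how many whole clips fit in a five-minute story. It also assembles a toy shot script as JSON. The field names ("shots", "shot", "prompt") and the one-shot-per-clip mapping are assumptions made for illustration only; the project's actual script schema is not described here, so check the repository before relying on any of it.

```python
import json

FRAMES_PER_CLIP = 241        # per-clip frame count quoted in the article
FPS = 25                     # frame rate quoted in the article
MAX_STORY_SECONDS = 5 * 60   # "stories up to five minutes"


def clip_seconds() -> float:
    """Duration of a single clip in seconds."""
    return FRAMES_PER_CLIP / FPS


def max_full_clips(story_seconds: int = MAX_STORY_SECONDS) -> int:
    """Number of whole clips that fit inside the given story length."""
    total_frames = story_seconds * FPS
    return total_frames // FRAMES_PER_CLIP


def build_script(prompts: list[str]) -> str:
    """Serialize shot prompts to JSON.

    The schema below is hypothetical, not JoyAI-Echo's real format.
    Assumes one shot maps to one clip.
    """
    if len(prompts) > max_full_clips():
        raise ValueError("too many shots for a five-minute story")
    shots = [{"shot": i + 1, "prompt": p} for i, p in enumerate(prompts)]
    return json.dumps({"shots": shots}, indent=2)


if __name__ == "__main__":
    print(f"{clip_seconds():.2f} s per clip")
    print(max_full_clips(), "full clips in five minutes")
    print(build_script(["A girl opens a door.", "She steps into the rain."]))
```

Under those assumptions, each clip lasts a little under ten seconds, so a five-minute story works out to roughly thirty clips.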

Developed by the Echo Team at JD’s Joy Future Academy, the project addresses two persistent headaches in video generation: error accumulation over long timelines and painfully slow rendering. Their solution pairs a cross-modal audio-visual memory bank with reinforcement learning to lock in identity and voice across shots. The team released both model weights and inference code, making local experimentation practical for anyone with a 48GB-class GPU.

Multi-shot stories with synchronized audio

Key Features

Generate multi-shot video stories up to five minutes.
7.5x faster inference with DMD distillation.
Synchronized audio and video in a single pipeline.
Cross-modal memory keeps characters and voices consistent.
Peak GPU memory around 46–50 GB.
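The reported peak of 46–50 GB straddles the capacity of a 48 GB card, so it can help to see where a given GPU falls relative to that range. The sketch below is a simple classifier built only on the two numbers above; the function name and the three labels are illustrative choices, and real memory use will vary with settings the article does not describe.

```python
PEAK_GB_LOW = 46.0    # lower end of the reported peak GPU memory
PEAK_GB_HIGH = 50.0   # upper end of the reported peak GPU memory


def vram_verdict(card_gb: float) -> str:
    """Classify a card's memory against the reported 46-50 GB peak range."""
    if card_gb < PEAK_GB_LOW:
        return "insufficient"   # below even the low end of the peak
    if card_gb < PEAK_GB_HIGH:
        return "borderline"     # inside the range; may or may not fit
    return "likely fits"        # at or above the high end of the peak


if __name__ == "__main__":
    for gb in (24, 48, 80):
        print(f"{gb} GB: {vram_verdict(gb)}")
```

By this reading, a 48 GB card lands in the borderline band rather than clearly above the peak.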

Video creators and small studios can produce long-form narrative content without relying on cloud APIs or monthly fees. Privacy-conscious professionals keep all data and generated media entirely on their own hardware. The open-source code also invites customization for research, prototyping, and creative tooling.

Developer notes and future plans

The current release focuses on text-to-video generation and does not yet accept image inputs, though image-to-video support is on the roadmap. An interactive director agent and a lightweight super-resolution module are also in development to simplify prompt writing and boost output resolution. The project ships under a non-commercial license tied to the LTX-2 Community License Agreement.

"JoyAI-Echo decisively outperforms HappyOyster (directing mode) on long-form generation and even surpasses the short-video specialist Wan 2.6 on human-centric tasks." — Source: GitHub
