JoyAI-Echo Spins Multi-Shot AI Video Stories With Synced Audio

JoyAI-Echo is a new open-source model that generates multi-shot, minute-long videos with synchronized audio directly from text prompts. You give it a JSON script describing each shot, and it outputs a fully animated sequence with matching sound. A distilled generator makes inference about 7.5 times faster than earlier approaches while preserving character appearance and voice timbre across scenes. The framework supports stories up to five minutes, with each clip running 241 frames at 25 frames per second.
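The clip figures above imply a fixed per-clip duration, which helps when budgeting shots for a longer story. The sketch below works through that arithmetic. The frame count and frame rate come from the article; everything else is an assumption made for illustration, including the script layout (the keys "shots", "shot", and "prompt" are hypothetical and not the project's actual JSON schema) and the idea that each shot maps to exactly one clip with no overlap.

```python
import json

FRAMES_PER_CLIP = 241  # per-clip frame count quoted in the article
FPS = 25               # frame rate quoted in the article


def clip_seconds(frames: int = FRAMES_PER_CLIP, fps: int = FPS) -> float:
    """Duration of one generated clip in seconds."""
    return frames / fps


def max_full_clips(total_seconds: int, frames: int = FRAMES_PER_CLIP, fps: int = FPS) -> int:
    """Number of whole clips that fit inside a story of total_seconds."""
    return (total_seconds * fps) // frames


# Hypothetical script layout: these keys are illustrative only and are
# NOT taken from the JoyAI-Echo repository.
script = {
    "shots": [
        {"shot": 1, "prompt": "A baker opens her shop at dawn."},
        {"shot": 2, "prompt": "She greets the first customer."},
    ]
}

# Assumes one clip per shot, concatenated without overlap.
story_seconds = len(script["shots"]) * clip_seconds()

print(json.dumps(script, indent=2))
print(f"one clip: {clip_seconds():.2f} s")
print(f"this script: {story_seconds:.2f} s")
print(f"whole clips in five minutes: {max_full_clips(300)}")
```

Under these assumptions a single clip lasts 241 / 25 = 9.64 seconds, so a five-minute story corresponds to roughly 31 shots.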

Developed by the Echo Team at JD’s Joy Future Academy, the project addresses two persistent headaches in video generation: error accumulation over long timelines and painfully slow rendering. Their solution pairs a cross-modal audio-visual memory bank with reinforcement learning to lock in identity and voice across shots. The team released both model weights and inference code, making local experimentation practical for anyone with a 48GB-class GPU.

Multi-shot stories with synchronized audio

Key Features
  • Generate multi-shot video stories up to five minutes.
  • About 7.5x faster inference via DMD (Distribution Matching Distillation).
  • Synchronized audio and video in a single pipeline.
  • Cross-modal memory keeps characters and voices consistent.
  • Peak GPU memory around 46–50 GB.

Video creators and small studios can produce long-form narrative content without relying on cloud APIs or monthly fees. Privacy-conscious professionals keep all data and generated media entirely on their own hardware. The open-source code also invites customization for research, prototyping, and creative tooling.

Developer notes and future plans

The current release focuses on text-to-video generation and does not yet accept image inputs, though image-to-video support is on the roadmap. An interactive director agent and a lightweight super-resolution module are also in development to simplify prompt writing and boost output resolution. The project ships under a non-commercial license tied to the LTX-2 Community License Agreement.

"JoyAI-Echo decisively outperforms HappyOyster (directing mode) on long-form generation and even surpasses the short-video specialist Wan 2.6 on human-centric tasks." — Source: GitHub