Zai-org's SCAIL-2 Breathes Motion Into Still Characters Sans Skeleton

SCAIL-2 is a new open-source model that animates still character images directly from a driving video, without relying on skeleton maps or inpainting masks. This end-to-end approach avoids the information loss that occurs when motion is converted into intermediate pose representations. The model also handles character replacement and multi-character scenarios through a single interface.
Zai-org, the team behind GLM, developed SCAIL-2 by building a synthetic training pipeline that uses several off-the-shelf models to generate 60,000 motion pairs. The team designed a Unified Motion Transfer Interface with specialized masking channels and a dedicated RoPE design to unify different animation tasks under one training process. Training the model to reverse the driving process allowed it to learn capabilities beyond those of its teacher models.
End-to-end animation without intermediates
- End-to-end driving at 512p and 704p resolutions.
- Cross-identity character replacement with detailed prompts.
- Animal-to-character motion transfer without human skeletons.
- Zero-shot support for SAM3D body mesh inputs.
- Multi-reference generation using optional extra images.
- Bias-Aware DPO LoRA for hand and face detail improvement.
- Built-in Wan VAE and T5 in checkpoint.
- ComfyUI integration with community workflows available.
Video creators and animators can use SCAIL-2 to transfer complex movements from any video source onto a reference character image. The removal of skeleton-based restrictions means driving sources can include animals or non-human motion that previous tools could not process. Users benefit from a single pipeline that handles both animation and character replacement without switching between different specialized tools.
Training data and model limitations
The project addresses a core weakness of SCAIL-1, which identified pose representation and injection as key bottlenecks but still depended on intermediate representations. MotionPair-60K, the synthetic dataset created for training, combines data from multiple off-the-shelf models, including MoCha and Wan-Animate, alongside the team’s own SCAIL-Preview tool.
While multi-reference inference works in zero-shot mode, the model was not explicitly optimized for it, and video quality may degrade when additional reference images are provided.
"SCAIL-2 is an open-source model for end-to-end controlled character animation. It animates a reference character with a driving video, and also supports character replacement and multi-character scenarios without relying on intermediate pose representations." — Source: Hugging Face