ID-LoRA LTX2.3 Creates Talking Heads with Synced Audio

    
By vramkickedin | March 26, 2026 at 8:34 pm | 2 min read

ID-LoRA LTX2.3 is a new tool that generates talking-head videos with synchronized audio using a reference voice and image. It creates personalized video content where both the visual appearance and voice come from a single generative model, rather than processing them separately.

Developed by the ID-LoRA team, this open-source project addresses a gap in existing video personalization methods that treat video and audio as separate problems. The tool adapts the LTX-2.3 joint audio-video diffusion backbone, allowing users to control both visual and audio elements through text prompts, reference images, and short audio clips.

Model size: ~67-75GB. GPU VRAM: 24GB minimum (48GB+ recommended).

Key specs and generation features

One-stage pipeline for single resolution output.
Two-stage pipeline with 2x spatial upsampling for higher quality.
Identity guidance that preserves speaker-specific vocal features.
Audio-video bimodal generation in a single pass.
Quantization support (int8 and fp8) for reduced memory usage.
ComfyUI custom nodes for workflow integration.

Video producers and content creators working on personalized media projects can use this tool to generate footage where a speaker's voice matches their on-screen appearance. The two-stage pipeline produces higher-resolution output, reaching 1024x1024 from an initial 512x512 generation.
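The resolution arithmetic behind the two-stage pipeline can be sketched in a few lines of Python. This is only an illustration of what a 2x spatial upsample means for output size. The function names `upsampled_resolution` and `pixel_multiplier` are hypothetical and are not part of the ID-LoRA codebase.

```python
def upsampled_resolution(width: int, height: int, scale: int = 2) -> tuple[int, int]:
    """Return the output size after scaling each spatial axis by `scale`."""
    if width <= 0 or height <= 0 or scale <= 0:
        raise ValueError("width, height, and scale must be positive")
    return width * scale, height * scale


def pixel_multiplier(scale: int = 2) -> int:
    """Scaling both axes by `scale` multiplies the pixel count by scale squared."""
    return scale * scale


# Stage one generates at 512x512; stage two applies the 2x spatial upsample.
stage_two = upsampled_resolution(512, 512)
print(stage_two)           # (1024, 1024)
print(pixel_multiplier())  # 4
```

A 2x upsample on each axis means the second stage outputs four times as many pixels per frame as the first, which is one reason the higher VRAM recommendation matters for the two-stage path.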

ID-LoRA LTX2.3 development approach and results

The research team identified that existing voice-cloning models condition only on reference recordings, meaning text prompts cannot redirect speaking style or acoustic environment.

'Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions,' the authors note in their paper.

Their solution uses negative temporal positions to distinguish reference tokens from generation tokens while preserving internal temporal structure. In human preference studies, raters preferred ID-LoRA over Kling 2.6 Pro 73% of the time for voice similarity and 65% of the time for speaking style. The model trains on approximately 3,000 pairs using a single GPU.
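To make the negative-position idea concrete, here is a minimal Python sketch. It assumes one plausible indexing scheme: reference tokens are numbered from -R up to -1 and generated tokens from 0 upward. The article does not give the paper's exact scheme, so the function `assign_temporal_positions` and its layout are illustrative assumptions, not the authors' implementation.

```python
def assign_temporal_positions(
    num_reference: int, num_generation: int
) -> tuple[list[int], list[int]]:
    """Return (reference_positions, generation_positions).

    Assumed scheme: reference tokens occupy negative indices and generated
    tokens occupy non-negative indices, so the sign alone tells the two
    groups apart while consecutive tokens in each group stay one step apart.
    """
    if num_reference < 0 or num_generation < 0:
        raise ValueError("token counts must be non-negative")
    reference_positions = list(range(-num_reference, 0))
    generation_positions = list(range(num_generation))
    return reference_positions, generation_positions


ref_pos, gen_pos = assign_temporal_positions(num_reference=3, num_generation=4)
print(ref_pos)  # [-3, -2, -1]
print(gen_pos)  # [0, 1, 2, 3]
```

Under this assumption the reference clip keeps its internal ordering, since neighbouring reference tokens still differ by one step, and no reference position ever collides with a generation position.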