Comfyui-Controlfoley Syncs AI Foley To Your Silent Footage


Comfyui-controlfoley brings the ability to generate synchronized foley sound effects directly into the ComfyUI node-based interface. It can produce time-matched audio like footsteps or door slams from silent video, still images, or text prompts. The node wraps the ControlFoley model from Xiaomi Research, letting users create sound layers without leaving their local workflow.

Developer SGUN-father ported the full ControlFoley pipeline into a set of ComfyUI nodes. The integration covers all original capabilities, including video-to-audio, image-to-audio, and text-to-audio generation. It also supports reference audio for controlling the timbre and style of the resulting sound effects.

Multi-modal sound generation nodes

Key features
  • Video-to-audio with time-synchronized foley sounds.
  • Image-to-audio from a single picture.
  • Text-to-audio using natural language descriptions.
  • Reference audio input for timbre styling.
  • Joint multi-modal control combining video, text, and audio.

Video creators and sound designers working with silent footage can generate footsteps, environmental ambience, and other effects automatically, without recording them or hunting for stock sounds. Hobbyists running ComfyUI at home can keep the entire process private and local, with no cloud uploads. The node set works best on a GPU with at least 8 GB of VRAM, which puts it within reach of many consumer-grade setups.
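For readers unsure whether their card meets the 8 GB guideline, a quick check along the following lines may help. This is a minimal sketch rather than anything shipped with the nodes: the helper names `meets_vram_guideline` and `local_gpu_vram_bytes` are made up for illustration, and the query assumes PyTorch with CUDA support is already installed, as is typical for a ComfyUI environment.

```python
GB = 1000 ** 3  # decimal gigabytes, matching how card capacity is marketed


def meets_vram_guideline(total_bytes: int, min_gb: float = 8.0) -> bool:
    """Return True if the reported memory meets the stated guideline.

    The threshold uses decimal GB because drivers can report slightly
    less than the nominal figure when measured in GiB.
    """
    return total_bytes >= min_gb * GB


def local_gpu_vram_bytes():
    """Return total memory of CUDA device 0 in bytes, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is present in the ComfyUI environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


if __name__ == "__main__":
    vram = local_gpu_vram_bytes()
    if vram is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    else:
        ok = meets_vram_guideline(vram)
        print(f"{vram / GB:.1f} GB of VRAM; meets the 8 GB guideline: {ok}")
```

The check only reports total memory on the first GPU. Actual headroom depends on whatever else is loaded in the same ComfyUI session.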

Installation and tech notes

Setting up the nodes requires downloading several model weight files and building the audiocraft library from source, which adds extra steps compared to a one-click install. The current release exposes only the large_44k model variant, with no option to select smaller or faster alternatives. Future updates could introduce lighter variants and tighter integration with other ComfyUI audio-saving nodes.
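Because the setup has several manual steps, a small preflight script can confirm that everything is in place before launching ComfyUI. The sketch below is illustrative only: the weights directory and file names are hypothetical placeholders, not the project's actual layout, so substitute whatever the repository's install notes list. The only firm assumption is that a source build of audiocraft is importable under the module name `audiocraft`.

```python
import importlib.util
from pathlib import Path


def audiocraft_installed() -> bool:
    """Return True if the audiocraft package can be found by the interpreter."""
    return importlib.util.find_spec("audiocraft") is not None


def missing_weights(weights_dir, expected_names):
    """Return the expected file names that are not present in weights_dir."""
    root = Path(weights_dir)
    return [name for name in expected_names if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical values: replace with the paths and file names
    # given in the project's own installation instructions.
    weights_dir = "models/controlfoley"
    expected = ["example_weights_a.pth", "example_weights_b.pth"]

    if not audiocraft_installed():
        print("audiocraft is not importable; build it from source first.")
    missing = missing_weights(weights_dir, expected)
    if missing:
        print("Missing weight files:", ", ".join(missing))
    else:
        print("All listed weight files were found.")
```

Running a check like this from the same Python environment that ComfyUI uses avoids the common case where the library was built into a different interpreter.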

"generate synchronized foley sound effects from video, images, and text prompts." — Source: GitHub