ACE-Step 1.5 XL turns plain text into full songs in eight quick steps

ACE-Step recently published ACE-Step 1.5 XL, an open audio generation model that produces complete music tracks in just eight steps. This streamlined process significantly reduces rendering wait times while preserving detailed sound output.
Co-developed by ACE Studio and StepFun, the teams behind ACE-Step 1.5, the release tackles the heavy computing demands often tied to local audio synthesis. A compact design allows the system to operate efficiently on standard consumer hardware.
Model size: 20 GB. GPU VRAM: 12 GB required.
Core audio features
- Creates full-length songs using eight processing steps.
- Transforms simple text prompts into detailed musical arrangements across fifty languages.
- Handles vocal removal, instrumental swapping, and targeted track editing.
- Delivers commercially safe output trained on licensed and public domain audio.
Independent producers managing tight project deadlines can bypass traditional recording bottlenecks with this setup. Built-in editing tools let creators modify specific song segments without rebuilding entire compositions, freeing up workstation time for final mixing. Local artists can adjust output parameters through straightforward configuration files without touching complex code.
ACE-Step 1.5 XL design approach and system notes
Training relied on internal feedback loops rather than human-rated scoring systems, "thereby eliminating the biases inherent in external reward models or human preferences," as the developers put it in their technical paper.
While optimized configurations function with memory management tricks, reaching peak audio fidelity demands twenty-four gigabytes of dedicated video memory. Smaller companion text engines remain fully compatible for workstations operating under strict bandwidth limits.
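For readers sizing up their own hardware, the memory figures quoted in this article can be turned into a quick check. The Python sketch below is only an illustration: the 12 GB and 24 GB thresholds come from the numbers above, while the function name and the three profile labels are hypothetical and are not part of ACE-Step's tooling or configuration format.

```python
# Thresholds taken from the figures quoted in the article.
MIN_VRAM_GB = 12            # stated minimum GPU memory
FULL_FIDELITY_VRAM_GB = 24  # stated requirement for peak audio fidelity


def pick_profile(vram_gb: float) -> str:
    """Map available GPU memory (in GB) to a rough, hypothetical profile label."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb >= FULL_FIDELITY_VRAM_GB:
        return "full-fidelity"
    if vram_gb >= MIN_VRAM_GB:
        # Expected to rely on the memory management tricks mentioned above.
        return "memory-managed"
    return "below-minimum"


if __name__ == "__main__":
    for gb in (8, 12, 16, 24):
        print(f"{gb} GB -> {pick_profile(gb)}")
```

A card with 16 GB, for example, lands in the "memory-managed" band: enough to run the model, but short of the 24 GB cited for peak fidelity.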
The model weights and installation scripts are available on Hugging Face.