Sculpt Sound Instantly: Magenta-Realtime-2 Arrives for Local Devices

Google has released Magenta-Realtime-2, an open music generation model designed to run on your own device with control latency of around 200 milliseconds. You can steer its output in real time with text descriptions, audio samples, or MIDI note data. It builds on earlier research and offers richer control while running locally, with no cloud connection required.
The model comes from Google DeepMind and is built for people who need fast, interactive music tools. It focuses on live performance, improvisation, and creative workflows where waiting for a server response isn't practical. The team trained it on roughly 71,000 hours of mostly instrumental stock music, then released the model weights under permissive licenses so anyone can use or adapt it.
What you can do with it
- Continuous, real-time music generation.
- Low-latency control, around 200 milliseconds.
- Steering by text prompt or audio example.
- MIDI input for precise note control.
- Two sizes: 2.4B and 230M parameters.
- Training on roughly 71,000 hours of mostly instrumental music.
This tool fits creators who want a responsive, private music generator that runs on their own hardware. Live performers can shape sound instantly, while video game developers might generate adaptive soundtracks locally. Privacy-focused artists and small studios get a capable model that doesn't send audio to external servers.
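Whether either model size fits on a given machine depends largely on how much memory the weights occupy. The sketch below is a back-of-the-envelope estimate only: the parameter counts come from the list above, while the numeric precisions and the function name are assumptions chosen for illustration, not details of the release. It ignores activations, caches, and any runtime overhead.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Parameter counts are taken from the article; the precisions below are
# assumptions for illustration, not the formats the release actually ships.
PARAM_COUNTS = {"large": 2.4e9, "small": 230e6}
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, num_params in PARAM_COUNTS.items():
        for dtype in BYTES_PER_PARAM:
            size = weight_memory_gb(num_params, dtype)
            print(f"{name:>5} {dtype:>8}: {size:.2f} GB")
```

Under these assumptions, the 2.4B model needs about 4.8 GB for 16-bit weights, while the 230M model needs under half a gigabyte, which is why the smaller size is the more natural fit for modest hardware.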
Under the hood and what's next
The system uses a transformer language model that now processes audio frame by frame rather than in chunks, cutting control latency to around 200 milliseconds. At the time of release, it was the only open-weights model supporting continuous, real-time generation with this kind of low-latency control. The team plans to add fine-tuning support soon, letting musicians customize the model with their own recordings. Current limitations include limited genre coverage and occasional non-word vocal sounds.
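To see why generation granularity matters for responsiveness, consider how long a control change has to wait before the model can act on it. The toy calculation below is a sketch under stated assumptions: the 2000 ms chunk length and 40 ms frame length are hypothetical values picked for illustration, not figures from the model, and the function models only the wait until the next generation step, not compute time or audio buffering.

```python
# Toy model: a control change can only take effect when the next
# generation step begins. Both step lengths are hypothetical.
CHUNK_MS = 2000  # assumed chunk length for a chunk-based generator
FRAME_MS = 40    # assumed frame length for a frame-by-frame generator


def wait_for_next_step_ms(change_time_ms: int, step_ms: int) -> int:
    """Milliseconds from a control change until the next step boundary."""
    remainder = change_time_ms % step_ms
    return 0 if remainder == 0 else step_ms - remainder


if __name__ == "__main__":
    change_at = 500  # the performer changes a prompt 500 ms into playback
    print("chunked wait:", wait_for_next_step_ms(change_at, CHUNK_MS), "ms")
    print("framed wait: ", wait_for_next_step_ms(change_at, FRAME_MS), "ms")
```

With these assumed values, a change made 500 ms into playback waits 1500 ms for the next chunk but only 20 ms for the next frame. The worst-case wait is bounded by the step length, so shrinking the step is what makes tight control possible.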
"At the time of release, Magenta RealTime 2 represents the only open weights model supporting real-time, continuous musical audio generation with low latency control (~200ms)." — Source: Hugging Face