Audio

About audio model releases

Explore the latest open-source audio and speech AI releases for local use. This archive covers new models and tools for voice cloning, text-to-speech, transcription, and music generation.

Latest audio models

June 16, 2026

NVIDIA Drops Nemotron-3.5-ASR-Streaming-0.6b For Real-Time Speech

By vramkickedin

The Nemotron-3.5-ASR-Streaming-0.6b model is NVIDIA’s latest open speech recognition release, designed to transcribe audio in real time across 40 language-locales from a single model. It can handle both low-latency streaming […]

June 15, 2026

MOSS-SoundEffect-v2.0 Crafts Any Sound You Describe In High Fidelity

By vramkickedin

MOSS-SoundEffect-v2.0 is an open model that creates high-fidelity sound effects straight from text prompts. It can generate rain, city noise, animal calls, human actions, and even short musical clips with […]

June 2, 2026

VTS Turns Your Hummed Imitation Into a Real Sound Effect

By vramkickedin

VTS (Voice To Sound) is a newly released open-source model that turns a short vocal imitation and a text description into a realistic sound effect. Instead of fumbling to describe […]

May 31, 2026

MOSS-TTS-v1.5 Lands With Precise Pause Controls And 31-Language Synthesis

By vramkickedin

MOSS-TTS-v1.5 is an upgraded open-source text-to-speech model from the OpenMOSS team, building on their earlier 1.0 release. It keeps zero-shot voice cloning, long-form generation, and multilingual capabilities while delivering more […]

May 25, 2026

DramaBox Interprets Stage Directions for Expressive AI Voiceovers

By vramkickedin

DramaBox is a text-to-speech system that turns scene descriptions and dialogue into expressive speech, complete with laughs, sighs, and pauses. It can clone a speaker’s timbre from just a 10-second […]

May 15, 2026

Scenema-Audio Lets You Direct Voices With Emotion And Scene Sounds

By vramkickedin

Scenema-Audio is a new open-source model that clones voices and generates speech with emotional acting, scene sounds, and zero-shot identity transfer. It doesn’t just read text aloud—it interprets stage directions […]

May 15, 2026

Supertonic-3 Whispers 31 Languages Directly From Your Device

By vramkickedin

Supertonic-3 is a lightweight text-to-speech system that runs entirely on your device using ONNX Runtime, with no cloud calls needed for synthesis. This open-weight release expands language support from 5 […]

April 30, 2026

Xiaomi Research Orchestrates ControlFoley For Video Soundtracks

By vramkickedin

ControlFoley transforms video clips into synchronized soundtracks by combining visual scenes, written descriptions, and existing audio samples into a single generation system. This new framework produces matching sound effects and […]

April 28, 2026

Trelis Debuts Chorus-v1-GGML For Local Voice Separation

By vramkickedin

Trelis recently released a specialized speech transcription model that handles overlapping conversations between two participants. The system processes audio clips locally without relying on external cloud servers. Built as an […]
