Multimodal

June 15, 2026

Google Drops Gemma-4-12B-It: A Senses-First Model That Runs Offline

By vramkickedin

Google just released Gemma-4-12B-It, an open-weights instruction-tuned model that handles text, images, video, and audio natively in one compact 12-billion-parameter package. Instead of bolting on separate vision and […]

June 10, 2026

Google Drops Gemma-4-12B: One Model, Three Formats, Zero Encoders

By vramkickedin

Google has released Gemma-4-12B, a 12-billion-parameter open model that handles text, images, and audio in a single decoder-only system. The unified design ditches separate encoders, so all data goes straight […]

June 8, 2026

ByteDance Bernini Crafts Videos With Words, Not Pixel Paintbrushes

By vramkickedin

ByteDance has released Bernini, an open-source framework that unifies video generation and editing through a semantic planning approach. Instead of controlling pixels directly, the system uses a multimodal large language […]

June 6, 2026

Cosmos3-Nano Conjures Video, Audio, and Robot Commands from Any Input

By vramkickedin

Nvidia has released Cosmos3-Nano, a 16-billion-parameter omnimodal model that turns text, images, video, audio, or action data into dynamic video with synced sound, reasoning text, or robot movement commands. The […]

June 2, 2026

Gemma-4-Harmonia-31B-uncensored-heretic Slashes Refusals by 91%

By vramkickedin

Gemma-4-Harmonia-31B-uncensored-heretic is a decensored version of a 31-billion-parameter language model that cuts response refusals by 91%. The release uses an ablation technique to strip away content restrictions while keeping […]

June 2, 2026

PaddleOCR-VL-1.6 Smashes Document Parsing Accuracy At 96.33%

By vramkickedin

PaddleOCR-VL-1.6 is a compact document parsing model that reaches a new state-of-the-art accuracy of 96.33% on the OmniDocBench v1.6 benchmark. It boosts recognition of text, formulas, tables, ancient documents, rare […]

June 1, 2026

QwenLM Drops Qwen-Image-Bench to Grade AI Art Like a Pro

By vramkickedin

Qwen-Image-Bench is a new evaluation toolkit that scores images from any text-to-image model using a fine-tuned judge AI called Q-Judger. It checks generated images across five major quality dimensions, including […]

June 1, 2026

Qwen3.6-27B-pure-GGUF Squeezes Full 27B Model Onto One 16GB GPU

By vramkickedin

A new quantized version of Alibaba's coding model has been released to the community, offering a 27B-parameter model that can run entirely on a single 16GB graphics card. The […]

May 31, 2026

StepFun Delivers Step-3.7-Flash MoE Vision Model for Local AI Agents

By vramkickedin

Step-3.7-Flash is a 198-billion-parameter vision-language model that uses a sparse mixture-of-experts design to activate only about 11 billion parameters per token. It handles images and text natively through a 1.8-billion-parameter […]

About multimodal releases

Latest multimodal models