Google DeepMind Launches Gemma 4 12B Open Model That Runs on Any Laptop with 16GB RAM

Google DeepMind's Gemma 4 12B runs on any laptop with 16GB RAM, processing images and audio directly without separate encoders.

Paige Roberts

Senior Smart Home & IoT Editor

Updated June 4, 2026

Google DeepMind dropped Gemma 4 12B on June 3, a 12-billion-parameter open model that runs on any laptop with 16GB of RAM. The headline number undersells the engineering: this is the first mid-sized Gemma to ditch multimodal encoders entirely, processing images and audio directly through the language backbone. The model fills the gap Google left open when it launched the Gemma 4 family in April.

That first wave included two mobile-optimized variants (E2B and E4B), a 26B Mixture of Experts, and a 31B dense model, with nothing between the phone-class models and the workstation-class ones. The 12B lands in that middle slot, and Google says it trails the 26B MoE on benchmarks while beating the older Gemma 3 27B across tests like GPQA Diamond, MMLU Pro, and DocVQA.

The architecture is the real story. Most multimodal AI models route non-text inputs through dedicated vision and audio encoders before the language model sees them.

Google built the 12B without those encoders. For vision, a 35-million-parameter embedding module splits images into 48x48 pixel patches and projects each into the model's hidden dimension with a single matrix multiplication. That replaces the 27 vision transformer layers and roughly 550 million parameters the larger Gemma 4 models carry. Audio gets even leaner treatment: raw 16 kHz waveforms are chopped into 40-millisecond frames and projected straight into the same vector space as text tokens, with no encoding step at all.
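The encoder-free input path described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the 48-pixel patch size, 16 kHz sample rate, and 40 ms frame length come from the article, while the function names, the toy hidden size of 8, the random weights, and the choice to drop a partial trailing audio frame are assumptions made for the example.

```python
import random

PATCH = 48            # patch edge in pixels (from the article)
SAMPLE_RATE = 16_000  # audio sample rate in Hz (from the article)
FRAME_MS = 40         # audio frame length in ms (from the article)
D_MODEL = 8           # toy hidden size; the real value is not given


def image_patches(image, patch=PATCH):
    """Split an H x W x C nested-list image into flattened patch vectors.

    Assumes H and W are multiples of `patch`; a real pipeline would
    resize or pad first.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ])
    return patches


def audio_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Chop a waveform into non-overlapping frames, dropping a partial tail."""
    size = sample_rate * frame_ms // 1000  # 640 samples at 16 kHz and 40 ms
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def project(vectors, weights):
    """Single matrix multiplication: (n x d_in) @ (d_in x d_model)."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns]
            for vec in vectors]


random.seed(0)

# A 96 x 96 RGB image yields a 2 x 2 grid of 48 x 48 patches.
image = [[[random.random() for _ in range(3)] for _ in range(96)]
         for _ in range(96)]
patches = image_patches(image)
w_image = [[random.gauss(0, 0.02) for _ in range(D_MODEL)]
           for _ in range(PATCH * PATCH * 3)]
image_tokens = project(patches, w_image)

# One second of 16 kHz audio yields 25 frames of 640 samples each.
waveform = [random.uniform(-1, 1) for _ in range(SAMPLE_RATE)]
frames = audio_frames(waveform)
w_audio = [[random.gauss(0, 0.02) for _ in range(D_MODEL)]
           for _ in range(len(frames[0]))]
audio_tokens = project(frames, w_audio)

print(len(image_tokens), len(audio_tokens), len(image_tokens[0]))
```

Both input types end up as rows of the same width, which is what lets the language backbone treat them like text token embeddings. At 16 kHz, a 40 ms frame is 640 samples, so each second of audio becomes 25 vectors.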

This makes Gemma 4 12B the first mid-sized Gemma with native audio support, capable of speech recognition, speaker diarization, code generation, image understanding, and video analysis. In one demo, the model processed a five-minute Google I/O keynote clip, reading 313 frames sampled at one per second alongside the audio track. At 70 visual tokens per frame, that comes to about 21,900 visual tokens.

The model ships with Multi-Token Prediction (MTP) drafters enabled by default, the first Gemma 4 model to do so. MTP uses spare processing cycles to guess multiple future tokens simultaneously, cutting latency without quality loss.
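The article does not describe how the MTP drafters work internally. Assuming they follow the general draft-and-verify pattern of speculative decoding (an assumption, not Google's published design), the claim of lower latency without quality loss can be illustrated with a toy sketch. The `target_next` and `draft_next` functions are hypothetical stand-ins for the main model and a cheap drafter, and the verification loop simulates what a real model would do in one batched forward pass.

```python
def target_next(seq):
    """Toy stand-in for the main model's greedy next-token choice."""
    return (seq[-1] * 3 + len(seq)) % 11


def draft_next(seq):
    """Toy drafter: agrees with the target except at every fifth position."""
    guess = target_next(seq)
    return guess if len(seq) % 5 else (guess + 1) % 11


def greedy_decode(prompt, n_new):
    """Baseline: one target-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt, n_new, k=4):
    """Draft up to k tokens, then keep the prefix the target agrees with."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The drafter guesses several future tokens in a row.
        draft = []
        for _ in range(min(k, goal - len(seq))):
            draft.append(draft_next(seq + draft))
        # 2. The target checks the drafted positions. A real model would
        #    score them all in one pass; this loop only simulates that.
        target_passes += 1
        for guess in draft:
            expected = target_next(seq)
            seq.append(expected)  # the target's token is always what is kept
            if guess != expected:
                break  # first mismatch: discard the rest of the draft
    return seq, target_passes


output, passes = speculative_decode([1], 20)
print(output == greedy_decode([1], 20), passes)
```

Because the target model's token is always the one kept, the output matches plain greedy decoding exactly. The saving comes from needing fewer verification passes than generated tokens whenever the drafter guesses well.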

Google previously offered MTP as an optional add-on for the other Gemma 4 models.

The timing matters. DRAM prices jumped roughly 90% in Q1 2026 as Samsung, SK Hynix, and Micron redirected production toward AI data center memory.

Micron told CNBC at CES it was effectively sold out of memory for 2026. A model that wrings capable multimodal performance from a standard 16GB machine sidesteps both the hardware crunch and cloud costs.

The weights, which come to just under 18GB, are free under an Apache 2.0 license on Kaggle and Hugging Face. The model works with Hugging Face Transformers, vLLM, SGLang, MLX, llama.cpp, and LiteRT-LM.
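To put the 18GB download and the 16GB RAM target in context, here is a rough back-of-the-envelope estimate of weight storage at different precisions. It assumes exactly 12 billion parameters and counts weights only, ignoring activations, the KV cache, and runtime overhead. The precisions listed are illustrative; the article does not say which format the released weights use or which one fits in 16GB.

```python
PARAMS = 12e9  # assumption: exactly 12 billion parameters


def weight_gb(params, bits_per_param):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gb(PARAMS, bits):.0f} GB")

# Average bits per parameter implied by an 18 GB file (an upper bound,
# since the article says "just under" 18GB).
implied_bits = 18e9 * 8 / PARAMS
print(f"implied average: {implied_bits:.0f} bits per parameter")
```

By this estimate, 16-bit weights alone would exceed 16GB, so running on such a machine would presumably rely on a lower-precision format, such as those offered by runtimes like llama.cpp or MLX.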

Google is also pushing the Google AI Edge stack for macOS, including a voice dictation app (Eloquent) and a coding showcase app (Gallery), both running fully on-device.

Gemma 4 models have crossed 150 million downloads, according to Google.