Google just collapsed the AI video workflow into a single model. At I/O 2026, the company unveiled Gemini Omni, a new family of generative models that accept text, images, video, and audio as input and output a single, physics-aware video. No more stitching together separate tools for visuals, sound, and continuity.
The first model shipping today is Omni Flash. It generates clips up to 10 seconds long with synchronized audio, according to Dumitru Erhan, senior research director at Google DeepMind. Google is working on extending that limit.
Unlike the company's existing Veo model, which is strictly text-to-video, Omni Flash can take an existing video clip and use it as the foundation for a new one.
Omni Flash also carries "a lot" more world knowledge than Veo because it draws from Gemini's training data, Koray Kavukcuoglu, CTO of Google DeepMind and chief AI architect at Google, told The Verge. Early demos bear that out: in a video of a rolling marble, the physics stays believable through each bounce, and the model generates convincing sound effects for every ring and impact.
Another demo shows a claymation-style explainer of protein folding.
Google is positioning Omni as a direct successor to Nano Banana, its image generation model, which has produced more than 50 billion images since launching last year. The Omni branding signals the endgame: "create anything from any input."
The model edits video through conversational language, with each instruction building on the last. Characters stay consistent, physics holds up, and the scene remembers what came before.
Access is tiered. Gemini Omni Flash is available today to AI Plus, Pro, and Ultra subscribers in the Gemini app and Google Flow, with subscriptions starting at $7.99 per month.
Free users get access through YouTube Shorts and the YouTube Create app later this week. Enterprise API access arrives in the coming weeks.
Google also teased a higher-tier "Omni Pro" model with details coming soon.
Safety measures are layered in from the start. All Omni-generated videos carry invisible SynthID watermarks. Google is holding back audio and speech editing capabilities until it can "bring this capability to users responsibly."
The model also supports creating a personalized "Avatar" of yourself to insert into videos, similar to a feature in OpenAI's recently discontinued Sora app.