On the first day of Google I/O 2026, Google unveiled Gemini Omni, a new multimodal large model built for cross-modal collaboration across the full range of use cases. Demis Hassabis, co-founder and CEO of DeepMind and a leading figure in artificial intelligence, described it on stage as the most powerful, most unified, and most intuitively human-like native multimodal architecture.
The name "Omni" signifies all-encompassing coverage, and it also points to the model's central advance: true cross-modal semantic alignment and bidirectional generation. Whether the task is reworking the characters in a video from text commands, transferring a visual style in response to audio input, or using static images to drive a dynamic narrative, Gemini Omni handles it end to end with high fidelity and precise control. The initial lightweight version, Gemini Omni Flash, is already available in the Gemini app, Google Flow, and YouTube Shorts; a full developer API will be rolled out in phases.
Industry observers widely believe that the model moves AI from perception and understanding toward hands-on creative production. By compressing professional-grade video editing into natural-language interaction, it sharply lowers the technical barriers to creative expression and reshapes how content is created.