
On April 28 (local time), NVIDIA officially unveiled Nemotron 3 Nano Omni, an open-source multimodal inference model designed to serve as an integrated foundation model for enterprise-grade AI agents. Built on a 30-billion-parameter A3B mixture-of-experts (MoE) architecture, where "A3B" denotes roughly 3 billion active parameters, the model activates only a subset of its experts depending on the task and modality, delivering high throughput and scalable multimodal performance.
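To make the idea of sparse activation concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not NVIDIA's implementation: the expert count, the router logits, the value of `k`, and the toy "experts" (simple scaling functions) are all assumptions chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so that the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    chosen = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for idx, weight in chosen:
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in chosen]

# Hypothetical setup: 8 toy experts, each scaling its input by a different factor.
EXPERTS = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
# Hypothetical router scores for one token; a real router computes these from the token.
ROUTER_LOGITS = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

output, used = moe_forward([1.0, 2.0], EXPERTS, ROUTER_LOGITS, k=2)
print("experts used:", used)  # experts 4 and 1 have the highest logits
print("active fraction:", len(used) / len(EXPERTS))
```

In this toy setup only 2 of 8 experts run per input, which is the same principle that lets a 30-billion-parameter MoE model compute with only a fraction of its parameters at a time.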
Unlike traditional solutions that chain together separate vision, speech, and language models, Nemotron 3 Nano Omni unifies multimodal inference across video, audio, images, and text in a single, efficient, open model. This reduces inference hops and orchestration complexity, significantly lowers inference costs, and improves cross-modal contextual consistency. Under a fixed interaction latency threshold, the model's effective system capacity is up to approximately 9.2 times higher than that of other open-source multimodal models in video inference tasks, and up to about 7.4 times higher in multi-document inference tasks.
The model can serve as a multimodal perception and context sub-agent within agent systems, enabling agents to process visual, audio, and textual inputs within a single, shared "perception-action" loop. On the document intelligence benchmarks MMLongBench-Doc and OCRBenchV2, it achieves state-of-the-art accuracy in its class, and it also performs strongly on video and audio understanding benchmarks such as WorldSense, DailyOmni, and VoiceBench.
Architecturally, Nemotron 3 Nano Omni combines Mamba layers, designed to improve sequence and memory efficiency, with Transformer layers, optimized for precise inference, resulting in up to fourfold improvements in memory and computational efficiency. Visual processing uses 3D convolutions to capture inter-frame motion, the audio component is based on NVIDIA's Parakeet encoder, and the text component uses a language model as its central decoder.
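As a rough illustration of why a convolution that spans the time axis can pick up inter-frame motion, the sketch below applies a naive 3D convolution with a hand-written temporal-difference kernel to two tiny synthetic clips. The kernel, clip sizes, and helper names are hypothetical and chosen for clarity; the model's actual vision encoder uses learned kernels and is not described in this level of detail here.

```python
def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width) nested lists."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += video[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Hypothetical 2x1x1 kernel: next frame minus current frame at each pixel.
TEMPORAL_DIFF = [[[-1.0]], [[1.0]]]

def make_frame(height, width, bright_col):
    """A frame that is dark except for one bright vertical line."""
    return [[1.0 if j == bright_col else 0.0 for j in range(width)]
            for _ in range(height)]

static_clip = [make_frame(2, 4, 1) for _ in range(3)]    # line stays in column 1
moving_clip = [make_frame(2, 4, c) for c in range(3)]    # line moves right each frame

static_resp = conv3d_valid(static_clip, TEMPORAL_DIFF)
moving_resp = conv3d_valid(moving_clip, TEMPORAL_DIFF)

print("static response:", static_resp[0][0])
print("moving response:", moving_resp[0][0])
```

The static clip produces an all-zero response, while the moving clip produces a negative value where the line was and a positive value where it moved to. A purely per-frame (2D) kernel could not distinguish the two clips in this way.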
The model's weights are currently available on Hugging Face, and the model will soon be offered as an NVIDIA NIM microservice, allowing developers to freely customize, deploy, and integrate multimodal sub-agents.