
Xiaomi today officially launched the MiMo-V2.5-TTS series and MiMo-V2.5-ASR, a full-stack family of speech models designed for the agent era. Together they cover the two core speech capabilities, recognition and synthesis, and enable flexible, language-driven control over both speech input and output.
The MiMo-V2.5-TTS series comprises three models, now available on Xiaomi's MiMo open platform and free to use for a limited time. All three share the same capabilities for style-guided instruction following, audio tag control, and text understanding. The standard edition comes with several high-quality built-in voices and supports fine-grained control over speech rate, emotion, and tone; the VoiceDesign edition lets users generate a brand-new voice from just a single sentence; and the VoiceClone edition faithfully replicates a target voice from only a small number of samples. Users can describe the desired emotional nuance as if directing an actor, and the model delivers a stable performance. It also supports director-level, hierarchical input at the script level, so that a character's voice remains consistent throughout while each utterance stays individually controllable.
Meanwhile, MiMo-V2.5-ASR has been officially open-sourced. The model achieves industry-leading performance across a wide range of complex real-world scenarios, including Chinese–English bilingual contexts, Chinese dialects (such as Wu, Cantonese, Minnan, and Sichuanese), code-switching, heavily noisy environments, and multi-speaker settings. It supports precise transcription of knowledge-intensive content such as song lyrics, classical poetry, and technical terminology, and it outputs punctuation natively. Evaluation results show state-of-the-art or highly competitive performance across multiple dimensions. Users can explore the TTS series on the Xiaomi MiMo API open platform and MiMo Studio, while developers can use the ASR model directly or customize it further through its open-source code. With this end-to-end speech solution, Xiaomi is providing a more natural and more controllable voice foundation for agent-based interactions.