
According to foreign media reports, Google recently unveiled multi-token prediction (MTP) drafters for the Gemma 4 series of models. The technique uses a speculative decoding architecture to boost model inference speed by up to three times without compromising output quality or logical reasoning capabilities. One of the most closely watched open-source models worldwide, Gemma 4 surpassed 60 million downloads shortly after its release. The core objective of this update is to address the long-standing inference bottleneck in large language models and thereby make fuller use of available computational resources.
Inference with traditional language models is often constrained by GPU memory bandwidth: when generating text, the processor must spend considerable time transferring tens of billions of parameters from GPU memory to the compute units, leaving most of the hardware idle and causing noticeable response latency. Google's newly introduced speculative decoding technique adopts a "draft-and-verify" coordination model: the system pairs a heavyweight target model such as Gemma 4 31B with a lightweight MTP drafter. The drafter uses the otherwise idle compute to predict several upcoming tokens in advance, and the target model then verifies them in parallel. When the predictions match what the target model would have produced, the model can confirm the entire sequence in a single computation, dramatically reducing text-generation time.
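The draft-and-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration of greedy speculative decoding under stated assumptions, not Google's implementation: the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, the "models" are toy functions over integer tokens, and the verification step is written as a loop even though a real system would score all drafted positions in one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target_next: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap drafter proposes k tokens autoregressively.
        ctx = list(tokens)
        draft: List[Token] = []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. The target model checks the draft. Assumption: in a real system
        #    this whole step is ONE batched forward pass, so we count it once.
        rounds += 1
        ctx = list(tokens)
        accepted: List[Token] = []
        for t in draft:
            expected = target_next(ctx)
            if expected != t:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
            accepted.append(t)
            ctx.append(t)
        else:
            # Every drafted token matched; the same pass yields one extra token.
            accepted.append(target_next(ctx))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except after token 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    out, rounds = speculative_decode(toy_target, toy_draft, [0], 12, k=4)
    print(out, rounds)
```

Because every accepted token is either confirmed or supplied by the target model, the output is identical to plain greedy decoding with the target alone; the saving comes from needing fewer target-model passes than generated tokens. Sampling-based variants replace the exact-match check with a probabilistic acceptance rule, which this sketch does not cover.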
According to official benchmark data, the acceleration is particularly striking on local devices. On Apple silicon chips, the local inference speed of the Gemma 4 26B model has increased roughly 2.2-fold. This means developers can now smoothly run complex offline programming assistants or intelligent agent workflows on personal computers or standard consumer-grade GPUs, while the improved inference efficiency also significantly reduces power consumption on edge devices. The update primarily targets low-latency use cases such as real-time chatbots and automated programming tools. With the MTP drafter, Google has shown that even in resource-constrained hardware environments, developers can deploy state-of-the-art language models without trading response speed against computational accuracy. As inference costs and barriers continue to fall, Gemma 4 is bringing AI from the cloud to a much broader range of personal computing endpoints.