
Tencent’s Hunyuan team, in collaboration with the Gaoling School of Artificial Intelligence at Renmin University of China and several other research institutions, recently released and open-sourced PlanningBench, a new framework for evaluating and training planning capabilities. Anchored in real-world planning problems, the framework establishes a data-generation and evaluation system that is scalable, verifiable, and diverse in its tasks, and it aims to systematically measure and improve the structured decision-making abilities of large language models under complex constraints.
Moving beyond the limits of traditional single-task evaluations, PlanningBench provides, for the first time, full coverage of six core planning scenarios: schedule planning, resource allocation, workforce scheduling, route optimization, production management, and emergency response, spanning more than 30 sub-tasks. Its data-generation mechanism does not raise difficulty simply by lengthening prompts; instead, it adjusts difficulty dynamically along essential dimensions such as task topology, the coupling of multi-layer constraints, and how tightly resource supply matches demand, so that each sample targets a real-world planning bottleneck. Each instance comes with a structured checklist that validates model outputs at three levels (input consistency, constraint satisfaction, and objective optimality) to identify feasibility problems.
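
To make the three-level checklist concrete, here is a minimal sketch for a toy single-machine scheduling instance. It is not PlanningBench's actual checklist format, which the announcement does not describe; the `Task` class, the `run_checklist` function, and the `best_known_makespan` parameter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    duration: int  # time units the task occupies the machine
    deadline: int  # latest allowed finish time


def run_checklist(tasks, plan, best_known_makespan):
    """Hypothetical three-level check; `plan` maps task name -> start time."""
    by_name = {t.name: t for t in tasks}

    # 1. Input consistency: the plan schedules exactly the given tasks.
    input_ok = set(plan) == set(by_name)

    # 2. Constraint satisfaction: deadlines met and no overlapping tasks.
    constraints_ok = input_ok
    makespan = 0
    if input_ok:
        spans = sorted((plan[n], plan[n] + by_name[n].duration, n) for n in plan)
        for start, end, name in spans:
            if start < 0 or end > by_name[name].deadline:
                constraints_ok = False
        # Spans are sorted by start time, so checking neighbours is enough.
        for (_, prev_end, _), (next_start, _, _) in zip(spans, spans[1:]):
            if next_start < prev_end:
                constraints_ok = False
        makespan = max((end for _, end, _ in spans), default=0)

    # 3. Objective optimality: only meaningful when the plan is feasible.
    optimal = constraints_ok and makespan <= best_known_makespan

    return {
        "input_consistency": input_ok,
        "constraint_satisfaction": constraints_ok,
        "objective_optimality": optimal,
    }
```

For two tasks of length 2 and 3 with a shared deadline of 10, the plan `{"A": 0, "B": 2}` passes all three checks against a best known makespan of 5, while `{"A": 0, "B": 4}` is feasible but fails the optimality check because it leaves the machine idle.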
Most notably, the framework introduces a dual-track evaluation paradigm of “local compliance and global feasibility,” which pinpoints typical failure modes such as “every step is correct, but the plan as a whole still contains conflicts” or “the resource allocation looks reasonable but cannot be carried out in practice.” This makes it considerably easier to diagnose a model’s underlying planning logic. Empirical results show that after reinforcement training on verifiable data generated by PlanningBench, models not only perform markedly better on unseen planning benchmarks but also show cross-domain transfer gains in general reasoning and multi-step tasks. PlanningBench thus forms a complete closed loop, from scenario-driven design through data generation and verifiable training to generalization evaluation, providing a solid foundation for scientifically assessing and efficiently improving the planning capabilities of large models.
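
The dual-track idea can be illustrated with a small sketch in which each step is scored on its own and the plan is then checked as a whole. This is an assumption-laden toy, not the framework's published scoring code: `evaluate_dual_track`, `per_step_limit`, and `total_budget` are hypothetical names, and the rules (a per-step cap and a per-resource budget) are invented for the example.

```python
def evaluate_dual_track(steps, per_step_limit, total_budget):
    """Hypothetical dual-track score for a resource plan.

    steps: list of (resource_name, amount) allocations, one per plan step.
    per_step_limit: largest amount a single step may request (local rule).
    total_budget: dict of resource_name -> total amount available (global rule).
    """
    # Local track: judge every step in isolation.
    local_flags = [0 < amount <= per_step_limit for _, amount in steps]
    local_rate = sum(local_flags) / len(local_flags) if local_flags else 1.0

    # Global track: add up usage across steps and compare with the budget.
    usage = {}
    for resource, amount in steps:
        usage[resource] = usage.get(resource, 0) + amount
    global_ok = all(
        total <= total_budget.get(resource, 0) for resource, total in usage.items()
    )

    return {"local_compliance": local_rate, "global_feasible": global_ok}
```

With three steps that each request 4 trucks, a per-step limit of 5, and a total budget of 10 trucks, every step passes the local track while the plan fails the global one, which is the "every step is correct, but the plan as a whole still contains conflicts" failure mode.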