On May 8, it was reported that Tencent Hunyuan, together with the University of California, Los Angeles (UCLA), the Chinese University of Hong Kong, and other institutions, has released OpenSearch-VL, an open-source multimodal training framework that uses reinforcement learning (RL) to build state-of-the-art deep search agents.
Multimodal search agents are systems that accept inputs in several modalities, such as images and text, and actively call external tools (search engines, image-processing utilities, and the like) to carry out multi-step reasoning, evidence verification, and knowledge retrieval. Their goal is to answer complex, knowledge-intensive visual questions. The report, posted to arXiv on May 6, introduces OpenSearch-VL as a framework for training such agents. The research team built a high-quality data pipeline that uses Wikipedia path sampling and fuzzy entity rewriting to reduce retrieval shortcuts, producing datasets such as SearchVL-SFT-36k.
The research team notes that the main bottleneck holding back state-of-the-art multimodal search agents is the lack of high-quality training data. Most leading systems are built by commercial companies whose data sources, filtering criteria, and tool-use logs are proprietary, which makes their capabilities hard to reproduce and hard to study systematically. To address this, the study proposes OpenSearch-VL, an end-to-end open-source solution covering data, tools, and training algorithms.
For the data pipeline, OpenSearch-VL samples multi-hop entity paths from Wikipedia's hyperlink graph, rewrites the intermediate entities as fuzzy descriptions, and ties the anchor entity to a source image. A question built this way cannot be answered with a single lookup, which discourages one-step retrieval shortcuts and pushes the agent to learn multi-hop search and reasoning.
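The path-sampling idea can be sketched on a toy graph. This is only an illustration under stated assumptions, not the team's pipeline: the graph `TOY_LINKS`, the descriptions in `FUZZY`, and the helpers `sample_path` and `build_question` are all hypothetical names invented here, and a real pipeline would draw on Wikipedia's actual link data and generate the fuzzy descriptions automatically.

```python
import random

# Hypothetical toy hyperlink graph: page -> pages it links to.
TOY_LINKS = {
    "Eiffel Tower": ["Gustave Eiffel", "Paris"],
    "Gustave Eiffel": ["Dijon", "Statue of Liberty"],
    "Paris": ["Seine", "France"],
    "Statue of Liberty": ["New York City"],
    "Dijon": ["Burgundy"],
}

# Hypothetical hand-written fuzzy descriptions for intermediate entities.
FUZZY = {
    "Gustave Eiffel": "the engineer whose company built it",
    "Paris": "the city where it stands",
    "Statue of Liberty": "a monument whose internal frame he designed",
    "Dijon": "the city where he was born",
}


def sample_path(links, start, hops, rng):
    """Random walk of `hops` links without revisiting a page.

    Returns the list of visited pages, or None if the walk gets stuck.
    """
    path = [start]
    for _ in range(hops):
        options = [p for p in links.get(path[-1], []) if p not in path]
        if not options:
            return None
        path.append(rng.choice(options))
    return path


def build_question(path, fuzzy):
    """Hide the anchor behind an image and the middle hops behind descriptions."""
    return {
        "image_of": path[0],  # the anchor entity is shown only as an image
        "clues": [fuzzy.get(e, e) for e in path[1:-1]],  # names are not given
        "answer": path[-1],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    path = sample_path(TOY_LINKS, "Eiffel Tower", hops=2, rng=rng)
    print(build_question(path, FUZZY))
```

Because the anchor appears only as an image and the intermediate hop only as a description, an agent has to identify the image first and then follow the chain, rather than searching for a named entity directly.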
The pipeline produces the SearchVL-SFT-36k dataset for supervised fine-tuning, in which each trajectory averages 6.3 tool calls. In addition, a randomly chosen 10% of the data is degraded, for example by blurring or downsampling, and paired with enhancement tools so that the agent learns a “thinking while processing images” behavior.
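A minimal sketch of the random degradation step, assuming grayscale images stored as nested lists so that the example needs no imaging library. The functions `downsample`, `box_blur`, and `maybe_degrade` are hypothetical stand-ins written for this illustration; the report does not describe the exact operations or parameters used.

```python
import random


def downsample(img, factor=2):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


def box_blur(img):
    """3x3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def maybe_degrade(img, rng, rate=0.10):
    """Degrade an image with probability `rate`.

    Returns (image, label); label is None when the image is left untouched.
    """
    if rng.random() >= rate:
        return img, None
    name, fn = rng.choice([("blur", box_blur), ("downsample", downsample)])
    return fn(img), name


if __name__ == "__main__":
    rng = random.Random(0)
    img = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
    labels = [maybe_degrade(img, rng)[1] for _ in range(1000)]
    degraded = sum(label is not None for label in labels)
    print(f"{degraded} of 1000 samples degraded")
```

Recording the label alongside the degraded image is a design choice made here so that a training example could later be matched with the enhancement tool that undoes the damage.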
The tool environment goes beyond that of a simple retrieval agent. It integrates text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, so the agent can first clean up blurry, low-resolution, or skewed visual inputs and only then query external knowledge. Active perception and knowledge acquisition are thereby combined in a single workflow.
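One common way to expose such a mixed tool set to an agent is a name-to-function registry with a single dispatch entry point. The sketch below assumes that design purely for illustration; the registry `TOOLS`, the `tool` decorator, `call_tool`, and the stubbed `crop` and `text_search` functions are hypothetical and do not reflect the framework's actual interface.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("crop")
def crop(image, top, left, height, width):
    """Perception tool: cut a rectangle out of a nested-list image."""
    return [row[left:left + width] for row in image[top:top + height]]


@tool("text_search")
def text_search(query):
    """Retrieval tool, stubbed with a tiny in-memory corpus."""
    corpus = {"eiffel tower": "Wrought-iron tower in Paris, completed in 1889."}
    return corpus.get(query.lower(), "no results")


def call_tool(request):
    """Dispatch one tool request of the form {'tool': name, 'args': {...}}."""
    name = request["tool"]
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    return {"ok": True, "result": TOOLS[name](**request["args"])}


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Perception first, then retrieval, mirroring the order described above.
    print(call_tool({"tool": "crop",
                     "args": {"image": image, "top": 1, "left": 1,
                              "height": 2, "width": 2}}))
    print(call_tool({"tool": "text_search", "args": {"query": "Eiffel Tower"}}))
```

With a registry like this, perception tools and retrieval tools share one calling convention, so a trajectory can interleave them freely.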
Experiments show that the OpenSearch-VL-30B-A3B model raises the average score from the baseline's 47.8 to 61.6, a gain of 13.8 points, with significant improvements on benchmarks such as VDR and MMSearch. Ablation studies confirm that each component contributes: removing source-image anchoring, fuzzy rewriting, or staged filtering lowers the average score by 8.2 to 11.5 points.