Tencent’s new framework is OpenSearch-VL, an open-source training recipe for multimodal search agents rather than a consumer chatbot. Its goal is to move vision-language models beyond answering from a single image toward agents that can gather missing evidence, use tools and reason over multiple steps [17]. arXiv lists the paper as submitted on May 6, 2026, and launch coverage says Tencent Hunyuan worked with UCLA and The Chinese University of Hong Kong on the release [18][21].
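To make the shift from single-image answering to multi-step evidence gathering concrete, here is a minimal sketch of the loop such an agent runs, assuming a model wrapper with a step() method and a dict of search tools. Every name in it (run_agent, policy, search-tool keys) is a hypothetical stand-in for illustration, not OpenSearch-VL's actual API.

```python
# Sketch of a multi-step multimodal search agent loop (illustrative only;
# names and signatures are assumptions, not OpenSearch-VL's published code).
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str     # "answer", or a tool name such as "search_text"/"search_images"
    payload: str  # final answer text, or the query to send to the tool

@dataclass
class AgentState:
    question: str
    evidence: list = field(default_factory=list)  # (tool, result) pairs so far

def run_agent(policy, tools, question, max_steps=5):
    """Let the model gather evidence over several steps before answering."""
    state = AgentState(question=question)
    for _ in range(max_steps):
        action = policy.step(state)               # model decides: tool call or answer
        if action.kind == "answer":
            return action.payload
        result = tools[action.kind](action.payload)   # e.g. web or image search
        state.evidence.append((action.kind, result))  # accumulate retrieved evidence
    # Out of steps: answer with whatever evidence was collected.
    return policy.step(state, force_answer=True).payload
```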
The problem OpenSearch-VL targets
The release is aimed at a reproducibility gap. Early coverage framed the next challenge for multimodal large language models as moving from passively understanding images to actively seeking evidence and reasoning, while noting that high-quality trajectory data, automated synthesis paths and detailed training recipes have been bottlenecks [1].
OpenSearch-VL’s answer is to publish a more explicit agent-building recipe, covering data, tool orchestration, supervised fine-tuning, reinforcement learning and evaluation for multimodal deep search [17].
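Since trajectory data is the bottleneck the coverage keeps returning to, here is a hedged sketch of what one supervised fine-tuning trajectory for such a recipe might look like: a question, an interleaved sequence of reasoning, tool calls and tool results, and a final answer. The field names are illustrative assumptions, not the paper's published schema.

```python
# Illustrative multi-step search trajectory (assumed schema, not the paper's).
example_trajectory = {
    "question": "Which year was the building in this photo completed?",
    "images": ["photo_001.jpg"],
    "steps": [
        {"type": "think", "text": "The facade suggests a known landmark; identify it first."},
        {"type": "tool_call", "tool": "image_search", "input": "photo_001.jpg"},
        {"type": "tool_result", "text": "Top match: Custom House Tower, Boston."},
        {"type": "tool_call", "tool": "text_search", "input": "Custom House Tower Boston completion year"},
        {"type": "tool_result", "text": "The Custom House Tower was completed in 1915."},
    ],
    "answer": "1915",
}
```

Synthesizing records like this automatically, then fine-tuning and applying reinforcement learning on top of them, is the kind of end-to-end path the release is meant to document.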