Similar to 2605.30611 · 20 results
The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…
Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang +21 more
This paper introduces Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs.
Xinjiang Yu, Junyi Han, Zhuofan Chen, Chi Zhang +6 more
DiagramRAG is a lightweight retrieval-augmented framework that uses reference diagrams to guide the completion of scientific diagrams from incomplete user sketches, achieving high performance on stand…
Astrid van den Brandt, Kiroong Choe, Sehi L'Yi, Devin Lange +1 more
The paper evaluates various LLM-based agentic schemes for authoring complex, interactive, multiview genomics visualizations, finding that agentic iteration significantly improves visualization quality…
Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang +1 more
The paper introduces AIBuildAI-2, a knowledge-enhanced agent that significantly improves the automatic building of AI models by integrating an external, evolving knowledge system, achieving state-of-t…
Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat +4 more
The paper introduces AutoformBot, a multi-agent system that successfully autoformalizes a large corpus of open-access graduate-level mathematics textbooks into a verified library in Lean 4, demonstrat…
Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu +20 more
The paper introduces TVIR, a new benchmark and multi-agent framework for deep research, to evaluate and improve the generation of factually reliable, text-visual interleaved reports.
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao +4 more
This paper presents EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery.
The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond…
Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao +1 more
The paper introduces Ptah, a multi-agent harness designed to improve verifiable multimodal deep research by orchestrating the entire report generation process, ensuring factual grounding and v…
Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang +4 more
MACReD introduces a hierarchical multi-agent framework that achieves state-of-the-art performance in parsing complex chemical reaction diagrams by coordinating specialized agents for perception and gl…
Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben +1 more
The paper introduces TECCI, a novel and challenging benchmark dataset of 7550 image-edit pairs, and demonstrates that current state-of-the-art text-guided image editing models struggle significantly w…
MOOSE-Copilot is a novel web-based framework that unifies scientific hypothesis discovery by formalizing human-AI interaction, significantly improving performance over autonomous LLM baselines.
Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more
The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…
The paper introduces SPIRE, a multi-agent framework designed to extend LLM research capabilities to the humanities by enabling evidence-grounded interpretive reasoning over primary sources.
Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan +15 more
AutoSci is a memory-centric agentic system designed to automate the entire scientific research lifecycle by integrating structured memory, multi-stage execution, and continuous self-improvement.
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…