2606.11926v1 Jun 10, 2026 cs.CL

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Hongjin Qian
Guanting Dong
Yuyang Hu
GSAI
Yutao Zhu
University of Montreal
Zhicheng Dou
Qi Dai
Jiajie Jin
Xiaolong Ma
Gongrui Zhang
Kai Qiu
Zhirong Wu
Chong Luo
Zhengyuan Yang
Lijuan Wang
Tong Zhao
Xiaoxian Li
Bei Liu
Linjie Li

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
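
To make the tree-based loop described above concrete, here is a minimal Python sketch of a persistent hypothesis tree. It is an illustration under stated assumptions, not Arbor's implementation: the names `HypothesisNode`, `HypothesisTree`, `frontier`, and `record` are hypothetical, a single scalar score with higher-is-better semantics is assumed, and lesson propagation is simplified to copying an insight onto the tested node and its ancestors.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class HypothesisNode:
    """One hypothesis plus the evidence and lessons gathered from testing it."""
    hypothesis: str
    parent: Optional[HypothesisNode] = field(default=None, repr=False)
    children: list[HypothesisNode] = field(default_factory=list)
    score: Optional[float] = None              # None means "not yet tested"
    insights: list[str] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> HypothesisNode:
        """Refine this hypothesis into a more specific child hypothesis."""
        child = HypothesisNode(hypothesis, parent=self)
        self.children.append(child)
        return child


class HypothesisTree:
    """Hypothetical coordinator-side state: the tree and the best admitted node."""

    def __init__(self, root_hypothesis: str, baseline_score: float) -> None:
        self.root = HypothesisNode(root_hypothesis, score=baseline_score)
        self.best = self.root

    def frontier(self) -> list[HypothesisNode]:
        """Return every untested node, i.e. the candidates for an executor."""
        pending, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node.score is None:
                pending.append(node)
            stack.extend(node.children)
        return pending

    def record(self, node: HypothesisNode, score: float, insight: str) -> bool:
        """Store an executor's result, propagate the lesson upward, and
        admit the node as the new best only if it beats the current best."""
        node.score = score
        current: Optional[HypothesisNode] = node
        while current is not None:             # assumption: lessons flow to ancestors
            current.insights.append(insight)
            current = current.parent
        if score > self.best.score:            # assumption: higher is better
            self.best = node
            return True
        return False


# Usage: two candidate directions under a baseline artifact (made-up values).
tree = HypothesisTree("baseline artifact", baseline_score=0.50)
lr = tree.root.add_child("lower the learning rate")
aug = tree.root.add_child("add data augmentation")
admitted = tree.record(lr, 0.55, "smaller steps stabilise training")
```

In this sketch an executor would pick a node from `frontier()`, run the experiment in isolation, and report back through `record`; keeping the insights on the tree is what lets later attempts reuse earlier lessons instead of starting from scratch.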

