2601.11868v1 Jan 17, 2026 cs.SE

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Steven Dillmann
Citations: 270
h-index: 5
Sasha Cui
Citations: 39
h-index: 2
Xuandong Zhao
UC Berkeley
Citations: 2,731
h-index: 26
Xin Lan
Citations: 28
h-index: 3
Niklas Muennighoff
Citations: 38
h-index: 3
Terry Yue Zhuo
Citations: 3,884
h-index: 19
Hao Lin
Citations: 67
h-index: 5
Zhikang Dong
Citations: 45
h-index: 4
Yuxin Wang
Citations: 239
h-index: 4
Yuxuan Zhu
Citations: 51
h-index: 3
Mike A. Merrill
Citations: 465
h-index: 6
Alexander G Shaw
Citations: 22
h-index: 2
Nicholas Carlini
Citations: 705
h-index: 4
Boxuan Li
Citations: 309
h-index: 4
Harsh Raj
Citations: 21
h-index: 1
Ivan Bercovich
Citations: 19
h-index: 1
Lin Shi
Citations: 197
h-index: 5
J. Shin
Citations: 24
h-index: 2
Thomas Walshe
Citations: 25
h-index: 2
E. K. Buchanan
Citations: 84
h-index: 4
Junhong Shen
Citations: 133
h-index: 6
Guanghao Ye
Citations: 207
h-index: 8
Jason Poulos
Citations: 19
h-index: 1
Maoyu Wang
Citations: 30
h-index: 3
Marianna Nezhurina
Citations: 3,572
h-index: 8
J. Jitsev
Citations: 10,099
h-index: 18
Di Lu
Citations: 2,061
h-index: 3
O. M. Mastromichalakis
Citations: 119
h-index: 6
Zizhao Chen
Citations: 69
h-index: 4
Yue Liu
Citations: 20
h-index: 1
Robert Zhang
Citations: 214
h-index: 6
L. Chen
Citations: 466
h-index: 3
Anurag Kashyap
Citations: 23
h-index: 2
Jan-Lucas Uslu
Citations: 52
h-index: 2
Jeffrey Li
Citations: 559
h-index: 8
Jianbo Wu
Citations: 191
h-index: 6
Minghao Yan
Citations: 374
h-index: 4
Song Bian
Citations: 370
h-index: 5
Vedang Sharma
Citations: 19
h-index: 1
Ke Sun
Citations: 26
h-index: 2
Akshay Anand
Citations: 20
h-index: 1
Andrew Lanpouthakoun
Citations: 19
h-index: 1
Bardia Koopah
Citations: 19
h-index: 1
Changran Hu
Citations: 134
h-index: 5
E. Guha
Citations: 560
h-index: 9
Gabriel H. S. Dreiman
Citations: 21
h-index: 2
Jiacheng Zhu
Citations: 20
h-index: 1
Karl Krauth
Citations: 249
h-index: 2
Li Zhong
Citations: 269
h-index: 6
Robert K. Amanfu
Citations: 83
h-index: 4
Shangyin Tan
Citations: 206
h-index: 4
Shreyas Pimpalgaonkar
Citations: 141
h-index: 3
Tushar Aggarwal
Citations: 62
h-index: 4
Xia Lin
Citations: 1,536
h-index: 17
Yiqing Liang
Citations: 150
h-index: 5
Yuanli Wang
Citations: 21
h-index: 2
Zilong Wang
Citations: 96
h-index: 6
Changzhi Zhou
Citations: 109
h-index: 7
David Heineman
Citations: 19
h-index: 1
Hange Liu
Citations: 24
h-index: 2
H. Trivedi
Citations: 2,650
h-index: 15
John Yang
Citations: 106
h-index: 2
Junhong Lin
Citations: 104
h-index: 5
Manish Shetty
Citations: 210
h-index: 7
Michael Yang
Citations: 211
h-index: 4
Nabil Omi
Citations: 30
h-index: 2
Negin Raoof
Citations: 290
h-index: 4
Shanda Li
Citations: 552
h-index: 9
Wu Lin
Citations: 24
h-index: 2
Yiwei Dai
Citations: 77
h-index: 5
Wenhao Chai
Citations: 193
h-index: 5
Shang Zhou
Citations: 63
h-index: 3
Dariush Wahdany
Citations: 75
h-index: 2
Ziyu She
Citations: 45
h-index: 4
Jiaming Hu
Citations: 32
h-index: 3
Ahson Saiyed
Citations: 29
h-index: 2
Arinbjörn Kolbeinsson
Citations: 169
h-index: 6
Jesse Hu
Citations: 19
h-index: 1
Christopher Rytting
University of Washington
Citations: 1,845
h-index: 10
Ryan Marten
Citations: 125
h-index: 2
Yixin Wang
Citations: 78
h-index: 2
Alexandros G. Dimakis
Citations: 785
h-index: 10
A. Konwinski
Citations: 25,534
h-index: 18
Ludwig Schmidt
Citations: 121
h-index: 2
Zhiwei Xu
Citations: 91
h-index: 5

Abstract

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.

22 Citations
3 Influential
13 Altmetric
93.0 Score
