2601.11868v1 Jan 17, 2026 cs.SE

Terminal-Bench: Evaluating Agent Performance on Hard, Realistic Tasks in Command-Line Interface Environments

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Steven Dillmann

Citations: 391

h-index: 6

Sasha Cui

Citations: 99

h-index: 2

Xuandong Zhao

UC Berkeley

Citations: 3,118

h-index: 28

Xin Lan

Citations: 109

h-index: 3

Terry Yue Zhuo

Citations: 4,149

h-index: 21

Hao Lin

Citations: 133

h-index: 6

Jiacheng Zhu

Citations: 1,600

h-index: 21

Zhikang Dong

Citations: 93

h-index: 4

Yuxin Wang

Citations: 357

h-index: 4

Yuxuan Zhu

Citations: 179

h-index: 3

Mike A. Merrill

Citations: 579

h-index: 6

Alexander G Shaw

Citations: 69

h-index: 2

Nicholas Carlini

Citations: 771

h-index: 4

Boxuan Li

Citations: 408

h-index: 4

Harsh Raj

Citations: 68

h-index: 1

Lin Shi

Citations: 290

h-index: 5

J. Shin

Citations: 71

h-index: 2

Thomas Walshe

Citations: 74

h-index: 2

E. K. Buchanan

Citations: 138

h-index: 4

Junhong Shen

Citations: 189

h-index: 7

Guanghao Ye

Citations: 263

h-index: 8

Jason Poulos

Citations: 66

h-index: 1

Maoyu Wang

Citations: 83

h-index: 3

Marianna Nezhurina

Citations: 3,746

h-index: 8

J. Jitsev

Citations: 10,673

h-index: 19

Di Lu

Citations: 2,761

h-index: 3

O. M. Mastromichalakis

Citations: 175

h-index: 6

Zizhao Chen

Citations: 136

h-index: 5

Yue Liu

Citations: 67

h-index: 1

Robert Zhang

Citations: 271

h-index: 6

L. Chen

Citations: 525

h-index: 3

Anurag Kashyap

Citations: 70

h-index: 2

Jan-Lucas Uslu

Citations: 106

h-index: 3

Jeffrey Li

Citations: 679

h-index: 8

Jianbo Wu

Citations: 256

h-index: 6

Minghao Yan

Citations: 434

h-index: 5

Song Bian

Citations: 425

h-index: 5

Vedang Sharma

Citations: 66

h-index: 1

Ke Sun

Citations: 76

h-index: 2

Akshay Anand

Citations: 67

h-index: 1

Andrew Lanpouthakoun

Citations: 66

h-index: 1

Bardia Koopah

Citations: 66

h-index: 1

Changran Hu

Citations: 233

h-index: 6

E. Guha

Citations: 663

h-index: 9

Gabriel H. S. Dreiman

Citations: 68

h-index: 2

Karl Krauth

Citations: 290

h-index: 2

Li Zhong

Citations: 344

h-index: 6

Robert K. Amanfu

Citations: 130

h-index: 4

Shangyin Tan

Citations: 315

h-index: 4

Shreyas Pimpalgaonkar

Citations: 214

h-index: 3

Tushar Aggarwal

Citations: 115

h-index: 5

Xia Lin

Citations: 1,582

h-index: 17

Yiqing Liang

Citations: 256

h-index: 6

Yuanli Wang

Citations: 68

h-index: 2

Zilong Wang

Citations: 160

h-index: 6

Changzhi Zhou

Citations: 188

h-index: 8

David Heineman

Citations: 66

h-index: 1

Hange Liu

Citations: 74

h-index: 3

H. Trivedi

Citations: 3,000

h-index: 15

John Yang

Citations: 173

h-index: 2

Junhong Lin

Citations: 178

h-index: 5

Manish Shetty

Citations: 288

h-index: 7

Michael Yang

Citations: 306

h-index: 5

Nabil Omi

Citations: 83

h-index: 2

Negin Raoof

Citations: 367

h-index: 4

Shanda Li

Citations: 660

h-index: 9

Wu Lin

Citations: 71

h-index: 2

Yiwei Dai

Citations: 137

h-index: 5

Wenhao Chai

Citations: 267

h-index: 6

Shang Zhou

Citations: 125

h-index: 5

Dariush Wahdany

Citations: 126

h-index: 3

Ziyu She

Citations: 93

h-index: 4

Jiaming Hu

Citations: 80

h-index: 3

Ahson Saiyed

Citations: 78

h-index: 2

Arinbjörn Kolbeinsson

Citations: 221

h-index: 6

Jesse Hu

Citations: 67

h-index: 1

Christopher Rytting

University of Washington

Citations: 2,004

h-index: 10

Ryan Marten

Citations: 196

h-index: 2

Yixin Wang

Citations: 127

h-index: 2

Alexandros G. Dimakis

Citations: 914

h-index: 10

A. Konwinski

Citations: 25,691

h-index: 17

Ludwig Schmidt

Citations: 191

h-index: 2

Zhiwei Xu

Citations: 144

h-index: 5

I. Bercovich

Citations: 95

h-index: 3

Niklas Muennighoff

Citations: 19,870

h-index: 50

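AI agents may soon be able to autonomously carry out valuable, complex tasks across diverse domains. Current benchmarks either fail to measure real-world tasks or are not hard enough to meaningfully evaluate frontier models. To address this, we introduce Terminal-Bench 2.0, a carefully curated, hard benchmark of 89 tasks in computer terminal environments, inspired by real workflows. Each task includes a unique environment, a human-written solution, and comprehensive tests for verification. We find that frontier models and agents score below 65% on the benchmark, and we conduct an error analysis to identify areas for model and agent improvement. We release the dataset and evaluation harness at https://www.tbench.ai/ so that developers and researchers can use them in future work.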

Original Abstract

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.

80 Citations

16 Influential

25 Altmetric

237.0 Score
