2604.17596v1 Apr 19, 2026 cs.CR

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Shashwat Saxena (Citations: 46, h-index: 2)

Aditi Raghunathan (Citations: 103, h-index: 4)

I. Bercovich (Citations: 95, h-index: 3)

Ivgeni Segal (Citations: 0, h-index: 0)

Kexun Zhang (Citations: 86, h-index: 4)

Ziqian Zhong (Citations: 32, h-index: 3)

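We release Terminal Wrench, a dataset of 331 terminal-agent benchmark environments drawn from popular open benchmarks and shown to be reward-hackable. It contains 3,632 exploit trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition together with the full attack trajectory showing how the verifier was bypassed, and the dataset also includes cases where the task was not solved as intended. The tasks cover system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Importantly, these exploits are specific to each task rather than to the evaluation harness, which makes them harder to patch. We also run a monitorability study in which exploit trajectories are sanitized or stripped of their reasoning traces and then scored by an LLM judge; detection degrades meaningfully once the chain-of-thought is removed (AUC falls from 0.97 to 0.92). The dataset is publicly available at https://github.com/few-sh/terminal-wrench.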

Original Abstract

We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably reward-hackable. The data set includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task, rather than the evaluation harness, making them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge, showing that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). The data set is publicly available at https://github.com/few-sh/terminal-wrench.
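
As a reading aid for the AUC figures quoted above, here is a minimal sketch of how an AUC can be computed from a judge's suspicion scores on hack versus legitimate trajectories. It uses the pairwise (Mann-Whitney) definition: the probability that a randomly chosen hack trajectory receives a higher score than a randomly chosen legitimate one. The function name `roc_auc` and the score lists are hypothetical and synthetic; they are not taken from the dataset, and the paper's actual evaluation pipeline may differ.

```python
def roc_auc(scores_pos, scores_neg):
    """Pairwise AUC: fraction of (positive, negative) pairs in which the
    positive scores higher, with ties counted as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Synthetic judge scores, invented for illustration only (assumption).
hack_scores = [0.9, 0.8, 0.7, 0.4]    # judge's suspicion on hack trajectories
legit_scores = [0.1, 0.3, 0.5, 0.2]   # judge's suspicion on legitimate ones

auc = roc_auc(hack_scores, legit_scores)
print(auc)  # 15 of 16 pairs are ranked correctly -> 0.9375
```

An AUC of 1.0 would mean the judge ranks every hack trajectory above every legitimate one, and 0.5 would be chance level, so a drop from 0.97 to 0.92 means more hack/legitimate pairs are ranked in the wrong order once the reasoning traces are removed.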

1 Citation

0 Influential

36.97866136777 Altmetric

185.9 Score
