2602.08023v2 Feb 08, 2026 cs.CR

CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

Nanda Rani (Citations: 109, h-index: 7)

Kimberly Milner (Citations: 211, h-index: 6)

Minghao Shao (Citations: 454, h-index: 11)

Meet Udeshi (Citations: 227, h-index: 6)

Haoran Xi (Citations: 234, h-index: 7)

Venkata Sai Charan Putrevu (Citations: 87, h-index: 6)

Saksham Aggarwal (Citations: 11, h-index: 2)

S. K. Shukla (Citations: 152, h-index: 8)

P. Krishnamurthy (Citations: 4,262, h-index: 33)

F. Khorrami (Citations: 6,433, h-index: 41)

Muhammad Shafique (Citations: 61, h-index: 3)

Ramesh Karri (Citations: 336, h-index: 9)

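Real-world offensive security operations are inherently open-ended. Attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate with no guarantee of success. Existing evaluations of LLM-based offensive agents rely on closed-world settings with predefined goals and binary success criteria. To close this gap, we introduce CyberExplorer, an evaluation suite with two core components. (1) An open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF (Capture The Flag) challenges, in which agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge. (2) A reactive multi-agent framework that supports dynamic exploration without predefined plans. CyberExplorer goes beyond flag recovery to evaluate interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals in fine detail, narrowing the gap between benchmarks and realistic multi-target attack scenarios.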

Original Abstract

Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing LLM-based offensive agent evaluations rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, bridging the gap between benchmarks and realistic multi-target attack scenarios.
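To make "evaluation beyond flag recovery" concrete, here is a minimal scoring sketch. It is not the paper's evaluation code: the `ServiceOutcome` record, its fields, and the metric names returned by `summarize` are all hypothetical, chosen only to illustrate how partial progress (discovering a service, identifying a plausible vulnerability) could be reported alongside a binary flag rate. The only figure taken from the abstract is the count of 40 hosted services.

```python
from dataclasses import dataclass


@dataclass
class ServiceOutcome:
    """Hypothetical per-service record; field names are illustrative assumptions."""
    service: str            # made-up service identifier
    discovered: bool        # agent found the service during reconnaissance
    vuln_identified: bool   # agent named a plausible vulnerability
    flag_recovered: bool    # agent retrieved the flag
    steps: int              # agent interactions spent on this service


def summarize(outcomes, total_services):
    """Aggregate attempted-service outcomes into rates over all hosted services."""
    if total_services <= 0:
        raise ValueError("total_services must be positive")
    attempted = len(outcomes)
    return {
        "discovery_rate": sum(o.discovered for o in outcomes) / total_services,
        "vuln_signal_rate": sum(o.vuln_identified for o in outcomes) / total_services,
        "flag_rate": sum(o.flag_recovered for o in outcomes) / total_services,
        # Averaged over attempted services only; 0.0 if nothing was attempted.
        "mean_steps_per_attempt": (
            sum(o.steps for o in outcomes) / attempted if attempted else 0.0
        ),
    }


# Made-up outcomes for three of the 40 services, purely for illustration.
outcomes = [
    ServiceOutcome("svc-01", True, True, True, 12),
    ServiceOutcome("svc-02", True, True, False, 30),
    ServiceOutcome("svc-03", True, False, False, 18),
]
summary = summarize(outcomes, total_services=40)
```

In this sketch a binary criterion would report only `flag_rate`, while the other fields show how far the agent got on services where it failed to recover a flag.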

2 Citations

0 Influential

20.5 Altmetric

104.5 Score
