2602.05523v1 Feb 05, 2026 cs.SE

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

H. Coppock (citations: 52, h-index: 3)
Marek Rei (citations: 48, h-index: 3)
Shahin Honarvar (citations: 112, h-index: 3)
Amber Gorzynski (citations: 1, h-index: 1)
James Lee-Jones (citations: 0, h-index: 0)
Joseph Ryan (citations: 0, h-index: 0)
Alastair Donaldson (citations: 17, h-index: 2)

Abstract

Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing pointwise benchmarks have limited ability to shed light on the robustness and generalisation abilities of agents across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used as the basis for generating a family of semantically-equivalent challenges via semantics-preserving program transformations. This enables controlled evaluation of agent robustness to source code transformations while keeping the underlying exploit strategy fixed. We introduce a new tool, Evolve-CTF, that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to intrusive renaming and code insertion-based transformations, but that composed transformations and deeper obfuscation affect performance by requiring more sophisticated use of tools. We also find that enabling explicit reasoning has little effect on solution success rates across challenge families. Our work contributes a valuable technique and tool for future LLM evaluations, and a large dataset characterising the capabilities of current state-of-the-art models in this domain.
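
To make the idea of a challenge family concrete, the following is a minimal sketch of two semantics-preserving transformations (identifier renaming and dead-code insertion) composed on a toy Python function using the standard-library `ast` module. It is not Evolve-CTF's implementation, which the abstract does not describe; the class names, `make_variant`, `run`, the toy `check` function, and the renaming map are all hypothetical and chosen only for illustration. The sketch assumes Python 3.9 or later (for `ast.unparse`), that functions are never called with keyword arguments, and that the inserted name `_unused` does not collide with an existing one.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename identifiers listed in `mapping`; leave all other names alone."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers variable reads and writes.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # descend into parameters and body
        return node


class InsertDeadCode(ast.NodeTransformer):
    """Prepend a side-effect-free statement to every function body."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        junk = ast.parse("_unused = sum(range(3))").body[0]
        node.body.insert(0, junk)
        return node


def make_variant(source, mapping):
    """Compose both transformations and return the new source text."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    tree = InsertDeadCode().visit(tree)
    return ast.unparse(tree)


def run(source, func_name, *args):
    """Execute `source` in a fresh namespace and call one of its functions."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


# Hypothetical toy "challenge" function, standing in for real CTF source.
ORIGINAL = """
def check(guess, key):
    total = 0
    for ch in guess:
        total = (total * 31 + ord(ch) ^ key) % 65521
    return total
"""

MAPPING = {"check": "f0", "guess": "a0", "key": "a1", "total": "v0", "ch": "v1"}
VARIANT = make_variant(ORIGINAL, MAPPING)

# The variant looks different but computes the same result on the same input.
same = run(ORIGINAL, "check", "flag{x}", 7) == run(VARIANT, "f0", "flag{x}", 7)
```

Each variant produced this way keeps the behaviour an exploit would rely on while changing the surface form an agent reads, which is the property a challenge family depends on. A real tool would also need to handle cases this sketch ignores, such as attribute names, keyword arguments, docstrings, and names referenced through strings or reflection.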

