2602.02280v1 Feb 02, 2026 cs.SE

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

Zhixin Zhang (Citations: 24, h-index: 3)

Zeming Wei, Peking University (Citations: 1,009, h-index: 13)

Chengcan Wu (Citations: 24, h-index: 3)

Yihao Zhang (Citations: 145, h-index: 6)

Xiaokun Luan (Citations: 15, h-index: 2)

Meng Sun (Citations: 67, h-index: 4)

Abstract

Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria to evaluate the quality and adequacy of these tests. While coverage criteria have been effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Second, it calculates conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization, where the results demonstrate that RACA successfully identifies high-quality jailbreak prompts and is superior to traditional neuron-level criteria. We also showcase its practical application in real-world scenarios, such as test set prioritization and attack prompt sampling. Furthermore, our findings confirm RACA's generalization to various scenarios and its robustness across various configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of testing for AI.
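The three stages described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how representations are extracted or how the six sub-criteria are defined. The sketch assumes hidden states have already been collected as NumPy arrays, uses a difference-of-means direction between harmful and benign calibration activations (a common representation engineering technique, assumed here), and substitutes a simple bucket-based coverage measure for the actual sub-criteria. All function names and parameters are hypothetical.

```python
import numpy as np


def concept_direction(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Stage 1 (assumed technique): unit-norm difference-of-means direction.

    Both inputs have shape (n_prompts, hidden_dim) and hold hidden states
    collected from a calibration set. The benign contrast set is an assumption.
    """
    d = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def activation_scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Stage 2 (assumed): project each test prompt's hidden state onto the direction."""
    return acts @ direction


def bucket_coverage(scores: np.ndarray, low: float, high: float, n_buckets: int = 10) -> float:
    """Stage 3 (hypothetical stand-in for the paper's sub-criteria).

    Splits [low, high] into equal-width buckets and returns the fraction of
    buckets hit by at least one score. Scores outside the range are ignored.
    """
    in_range = scores[(scores >= low) & (scores <= high)]
    if in_range.size == 0:
        return 0.0
    idx = ((in_range - low) / (high - low) * n_buckets).astype(int)
    idx = np.minimum(idx, n_buckets - 1)  # a score equal to `high` goes in the last bucket
    return len(np.unique(idx)) / n_buckets


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden_dim = 16
    # Synthetic stand-ins for real hidden states; the shift of 1.0 is arbitrary.
    harmful = rng.normal(loc=1.0, size=(20, hidden_dim))
    benign = rng.normal(loc=0.0, size=(20, hidden_dim))
    test_suite = rng.normal(loc=0.5, size=(50, hidden_dim))

    direction = concept_direction(harmful, benign)
    calib_scores = activation_scores(np.vstack([harmful, benign]), direction)
    test_scores = activation_scores(test_suite, direction)

    # Use the calibration score range as the coverage domain.
    cov = bucket_coverage(test_scores, calib_scores.min(), calib_scores.max())
    print(f"bucket coverage of test suite: {cov:.2f}")
```

Under these assumptions, a test suite whose prompts all produce similar scores along the safety direction would hit few buckets and receive low coverage, while a suite that exercises a wider range of the concept would score higher. Working with one scalar per concept, rather than per-neuron statistics, is the dimensionality reduction the abstract refers to.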

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
