2603.10521v1 Mar 11, 2026 cs.AI

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Stephanie L. Lin

Citations: 26,957

h-index: 9

Michael Pokorny

Citations: 23,883

h-index: 2

Miles Wang

Citations: 4,921

h-index: 9

Sicheng Zhu

Citations: 699

h-index: 13

Chuan Guo

Citations: 56

h-index: 2

J. Felipe

Citations: 22

h-index: 2

Cerón Uribe

Citations: 2

h-index: 1

Christopher A. Choquette-Choo

Citations: 6,167

h-index: 13

Nikhil Kandpal

Citations: 2,314

h-index: 11

Milad Nasr

Citations: 14,127

h-index: 32

S. Toyer

Citations: 1,312

h-index: 10

Yao-Ching Yu

Citations: 53

h-index: 2

Alex Beutel

Citations: 5,295

h-index: 12

Kai Xiao (OpenAI)

Citations: 0

h-index: 0

Abstract

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
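
To make the idea of a trust-ordered policy concrete, here is a minimal toy sketch in Python. It assumes the trust order system > developer > user > tool, read off the order in which the abstract lists the roles; the abstract does not spell the ranking out. The `Role`, `Instruction`, and `resolve` names, and the notion of a keyed instruction, are hypothetical illustrations and are not taken from the paper or the dataset. A real model resolves conflicts expressed in natural language, which is what makes the behavior hard to train.

```python
from dataclasses import dataclass
from enum import IntEnum


class Role(IntEnum):
    """Message roles ordered by assumed trust; a higher value wins a conflict."""
    TOOL = 0
    USER = 1
    DEVELOPER = 2
    SYSTEM = 3


@dataclass(frozen=True)
class Instruction:
    role: Role
    key: str    # hypothetical label for the behavior being controlled
    value: str  # the setting requested for that behavior


def resolve(instructions):
    """Return, per key, the instruction issued by the most trusted role.

    Ties between equally trusted roles go to the later instruction.
    """
    winners = {}
    for inst in instructions:
        current = winners.get(inst.key)
        if current is None or inst.role >= current.role:
            winners[inst.key] = inst
    return winners


messages = [
    Instruction(Role.SYSTEM, "reveal_system_prompt", "never"),
    Instruction(Role.USER, "reveal_system_prompt", "always"),  # extraction attempt
    Instruction(Role.USER, "language", "French"),
    Instruction(Role.TOOL, "language", "English"),  # injected via tool output
]

policy = resolve(messages)
for key, inst in policy.items():
    print(f"{key}: {inst.value} (from {inst.role.name})")
```

In this sketch the system-level instruction overrides the user's attempt to extract the system prompt, and the user's language preference overrides the one injected through tool output. Non-conflicting instructions from lower-trust roles are still followed, which is the difference between respecting the hierarchy and the overrefusal shortcut the abstract warns about.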

0 Citations

0 Influential

36 Altmetric

180.0 Score
