2603.16152v1 Mar 17, 2026 cs.LG

HIPO: Instruction Hierarchy via Constrained Reinforcement Learning

Alvaro Velasquez

Citations: 1

h-index: 1

Jun Luo

Citations: 15

h-index: 2

Sen Lin

Citations: 4

h-index: 1

Yingbin Liang

Citations: 30

h-index: 3

Shaofeng Zou

Citations: 26

h-index: 4

Ke-Chi Chen

Citations: 6

h-index: 1

Nathaniel D. Bastian

Citations: 2,395

h-index: 25

Original Abstract

Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail on this problem because they optimize mainly for a single objective and do not explicitly enforce system prompt compliance. Meanwhile, supervised fine-tuning relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce HIPO, a novel alignment framework that formulates HIF as a Constrained Markov Decision Process. HIPO elevates system prompts from mere input context to strict algorithmic boundaries. Using a primal-dual safe reinforcement learning approach, the algorithm dynamically enforces system prompt compliance as an explicit constraint, maximizing user utility strictly within this feasible region. Extensive evaluations across diverse model architectures (e.g., Qwen, Phi, Llama) demonstrate that HIPO significantly improves both system compliance and user utility. Furthermore, mechanistic analysis reveals that this constrained optimization autonomously drives the model to shift its attention toward long-range system tokens, providing a principled foundation for reliable LLM deployment in complex workflows.
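
The constrained formulation described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a single prompt with three candidate responses, hypothetical per-response scores (`utility`, `compliance`), a hypothetical compliance threshold of 0.9, and an entropy-regularized tabular policy whose primal step has a closed form, so only the dual variable is updated iteratively. All function and parameter names are made up for illustration.

```python
import math


def regularized_policy(utility, compliance, lam, temperature):
    """Closed-form maximizer of E[utility + lam * compliance] plus
    temperature-scaled entropy: a softmax over the shaped scores."""
    scores = [(u + lam * c) / temperature for u, c in zip(utility, compliance)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def expectation(values, probs):
    """Expected value of a per-response score under the policy."""
    return sum(v * p for v, p in zip(values, probs))


def solve_constrained(utility, compliance, threshold,
                      temperature=0.1, dual_lr=0.5, steps=200):
    """Dual ascent on the Lagrange multiplier of the constraint
    E[compliance] >= threshold (illustrative hyperparameters)."""
    lam = 0.0
    for _ in range(steps):
        # Primal step: best regularized policy for the current multiplier.
        policy = regularized_policy(utility, compliance, lam, temperature)
        # Dual step: raise lam while the constraint is violated, lower it
        # (never below zero) when there is slack.
        shortfall = threshold - expectation(compliance, policy)
        lam = max(0.0, lam + dual_lr * shortfall)
    return regularized_policy(utility, compliance, lam, temperature), lam


# Hypothetical scores: response 0 is the most useful to the user but
# violates the system prompt; responses 1 and 2 comply.
utility = [1.0, 0.6, 0.2]
compliance = [0.0, 1.0, 1.0]

policy, lam = solve_constrained(utility, compliance, threshold=0.9)
print("policy:", [round(p, 3) for p in policy])
print("multiplier:", round(lam, 3))
print("expected compliance:", round(expectation(compliance, policy), 3))
```

With the multiplier fixed at zero, this toy policy concentrates on the non-compliant response. The dual update raises the multiplier until expected compliance meets the threshold, after which the remaining probability mass favors the most useful compliant response. That asymmetry, with compliance as a constraint and utility as the objective, is the point of the sketch.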
