2603.08239v1 Mar 09, 2026 cs.LG

Fiber-Structure-Based Policy Optimization

Fibration Policy Optimization

Chao Xue

Citations: 59

h-index: 5

Chang Li

Citations: 119

h-index: 6

Tshihao Tsu

Citations: 0

h-index: 0

Yaren Zhang

Citations: 5

h-index: 1

Xiaodong He

Citations: 46

h-index: 3

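Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agent-based pipelines. However, the optimization methods in wide use today operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level stability control. To close this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO. This establishes that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled reinforcement learning data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals. FBG provably agrees to first order with the true reinforcement learning objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (FiberPO for short), a concrete objective whose Jacobian is block-diagonal over trajectories and reduces to the identity at on-policy, and which therefore provides a better update direction that improves token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that extends the same gating mechanism to arbitrary hierarchical depth without new primitives. This is demonstrated by FiberPO-Domain, a four-level instantiation that uses an independent trust-region budget at each level (domain, prompt group, trajectory, and token). Together, these results connect trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.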

Original Abstract

Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (or simply, FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories, reduces to the identity at on-policy, and provides a better update direction, thus improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt group, trajectory, and token levels. Together, these results connect trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.
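
The abstract does not give the FiberPO objective itself, so the following is only a minimal sketch of the general idea of two-level ratio gating, not the authors' method. It assumes (hypothetically) that the trajectory aggregate is the mean per-token log-ratio, that each gate is a hard clip, and that the two budgets `base_budget` and `fiber_budget` are free parameters; all function and parameter names are illustrative.

```python
import math


def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def two_level_gate(log_ratios, base_budget=0.2, fiber_budget=0.5):
    """Gate one trajectory's per-token log-ratios at two levels.

    log_ratios: log(pi_new / pi_old) for each token of one trajectory.
    Returns the gated per-token probability ratios.

    Assumptions (not taken from the paper): the aggregate is the mean
    log-ratio, and both gates are hard clips with separate budgets.
    """
    # Base level: one scalar summarizing the whole trajectory.
    aggregate = sum(log_ratios) / len(log_ratios)
    gated_aggregate = clip(aggregate, -base_budget, base_budget)

    gated = []
    for lr in log_ratios:
        # Fiber level: how far this token deviates from the aggregate.
        residual = lr - aggregate
        gated_residual = clip(residual, -fiber_budget, fiber_budget)
        gated.append(math.exp(gated_aggregate + gated_residual))
    return gated


def gate_batch(batch, **budgets):
    """Gate each trajectory independently, so no trajectory's output
    depends on another trajectory's tokens."""
    return [two_level_gate(traj, **budgets) for traj in batch]


if __name__ == "__main__":
    on_policy = [0.0, 0.0, 0.0]      # new policy equals old policy
    drifted = [1.0, 1.0, 1.0]        # whole trajectory moved together
    print(two_level_gate(on_policy))
    print(two_level_gate(drifted))
```

In this toy version, inputs that stay inside both budgets pass through unchanged, which mirrors the abstract's statement that the map reduces to the identity at on-policy, and gating each trajectory separately mirrors the block-diagonal structure over trajectories. A trajectory whose tokens all drift together is limited only by the base-level budget, while a single outlier token is limited by the fiber-level budget.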

0 Citations

0 Influential

3 Altmetric

15.0 Score
