2603.06333v1 Mar 06, 2026 cs.AI

SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Aman Chadha (Citations: 1,600; h-index: 14)

Vinija Jain (Citations: 1,913; h-index: 14)

Subramanyam Sahoo (Citations: 7; h-index: 2)

Divya Chaudhary (Citations: 8; h-index: 1)

Original Abstract

Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
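The three safeguards named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract describes the GDI as a learned detector, whereas the sketch below uses fixed, hand-picked weights, a crude character-trigram stand-in for the semantic signal, and an arbitrary threshold of 0.5. All function names, weights, and the threshold are hypothetical and chosen only to show how the pieces could fit together.

```python
import ast
import math
from collections import Counter

# Hypothetical fixed weights; the paper describes a *learned* combination.
WEIGHTS = {"semantic": 0.4, "lexical": 0.2, "structural": 0.2, "distributional": 0.2}


def _tokens(text: str) -> list[str]:
    return text.lower().split()


def semantic_drift(a: str, b: str) -> float:
    """Stand-in only: cosine distance over character-trigram counts."""
    def grams(s: str) -> Counter:
        return Counter(s[i:i + 3] for i in range(max(len(s) - 2, 0)))
    ga, gb = grams(a.lower()), grams(b.lower())
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = math.sqrt(sum(v * v for v in ga.values())) * math.sqrt(sum(v * v for v in gb.values()))
    return max(0.0, 1.0 - dot / norm) if norm else 0.0


def lexical_drift(a: str, b: str) -> float:
    """Jaccard distance between token sets (0 = same vocabulary, 1 = disjoint)."""
    sa, sb = set(_tokens(a)), set(_tokens(b))
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def structural_drift(a: str, b: str) -> float:
    """Relative change in line count, in [0, 1]."""
    la, lb = len(a.splitlines()), len(b.splitlines())
    return abs(la - lb) / max(la, lb, 1)


def distributional_drift(a: str, b: str) -> float:
    """Jensen-Shannon divergence (base 2) between token frequency distributions."""
    ca, cb = Counter(_tokens(a)), Counter(_tokens(b))
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        return 0.0 if na == nb else 1.0
    jsd = 0.0
    for tok in set(ca) | set(cb):
        p, q = ca[tok] / na, cb[tok] / nb
        m = 0.5 * (p + q)
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return min(1.0, max(0.0, jsd))


def goal_drift_index(reference: str, candidate: str, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the four drift signals; 0 means no measured drift."""
    signals = {
        "semantic": semantic_drift(reference, candidate),
        "lexical": lexical_drift(reference, candidate),
        "structural": structural_drift(reference, candidate),
        "distributional": distributional_drift(reference, candidate),
    }
    return sum(weights[name] * value for name, value in signals.items())


def preserves_syntax(code: str) -> bool:
    """Constraint check: the candidate must still parse as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def regression_cycles(scores: list[float]) -> list[int]:
    """Indices of cycles whose score falls below the best score seen before them."""
    flagged, best = [], float("-inf")
    for i, score in enumerate(scores):
        if score < best:
            flagged.append(i)
        best = max(best, score)
    return flagged


def accept_revision(reference: str, candidate: str, scores: list[float],
                    gdi_threshold: float = 0.5) -> bool:
    """Accept a revision only if all three safeguards pass (threshold is illustrative)."""
    latest_regressed = (len(scores) - 1) in regression_cycles(scores)
    return (goal_drift_index(reference, candidate) <= gdi_threshold
            and preserves_syntax(candidate)
            and not latest_regressed)
```

In this sketch, `scores` would hold one task-quality score per improvement cycle, with the last entry belonging to the candidate under review. A real deployment would presumably replace the trigram proxy with an embedding-based similarity and tune the weights and threshold on held-out data, as the abstract indicates the authors do with their 18-task validation set.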

1 Citation

0 Influential

7 Altmetric

36.0 Score
