2602.05164v1 Feb 05, 2026 cs.LG

Position: Capability Control Should be a Separate Goal From Alignment

Eleni Triantafillou

Shoaib Ahmed Siddiqui

David Krueger

Adrian Weller

Original Abstract

Foundation models are trained on broad data distributions, yielding generalist capabilities that enable many downstream applications but also expand the space of potential misuse and failures. This position paper argues that capability control -- imposing restrictions on permissible model behavior -- should be treated as a distinct goal from alignment. While alignment is often context and preference-driven, capability control aims to impose hard operational limits on permissible behaviors, including under adversarial elicitation. We organize capability control mechanisms across the model lifecycle into three layers: (i) data-based control of the training distribution, (ii) learning-based control via weight- or representation-level interventions, and (iii) system-based control via post-deployment guardrails over inputs, outputs, and actions. Because each layer has characteristic failure modes when used in isolation, we advocate for a defense-in-depth approach that composes complementary controls across the full stack. We further outline key open challenges in achieving such control, including the dual-use nature of knowledge and compositional generalization.
