2604.16812v1 Apr 18, 2026 cs.AI

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

Jack Lindsey (Citations: 181, h-index: 3)
K. Shenoy (Citations: 0, h-index: 0)
Li Yang (Citations: 38, h-index: 4)
A. Sheshadri (Citations: 271, h-index: 7)
S. Mindermann (Citations: 1,030, h-index: 11)
Samuel Marks (Citations: 62, h-index: 2)
Rowan Wang (Citations: 27, h-index: 3)

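When model developers or users fine-tune an LLM, the resulting model can exhibit behaviors that are unexpected, deliberately harmful, or hard to detect. Auditing would be far easier if LLMs could describe their own behaviors in natural language. In this work, we study a scalable method for rapidly identifying the learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method fine-tunes models $M_i$ from $M$, each with an implanted behavior $b_i$. The $(M_i, b_i)$ pairs serve as labeled training data. We then train an \textit{introspection adapter} (IA): a single LoRA adapter trained jointly across the fine-tuned models $M_i$ so that each $M_i$ verbalizes its implanted behavior. We find that the IA elicits self-descriptions of learned behaviors even from fine-tunes of $M$ that were trained in very different ways from the $M_i$. For example, applied to AuditBench, IAs achieve state-of-the-art performance at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted fine-tuning API attacks. They scale favorably with model size and with training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practical approach to auditing fine-tuned LLMs.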

Original Abstract

When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_i$ from $M$ with implanted behaviors $b_i$; the $(M_i, b_i)$ pairs serve as labeled training data. We then train an \emph{introspection adapter} (IA): a single LoRA adapter jointly trained across the finetunes $M_i$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_i$. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
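The data layout described in the abstract can be sketched roughly as follows. This is only an illustrative sketch of how the $(M_i, b_i)$ pairs might be turned into one joint training set for a single shared adapter; it does not train anything, and it is not the authors' implementation. The class names, the introspection prompt, the model identifiers, and the example behaviors are all hypothetical placeholders.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Finetune:
    """A model M_i derived from the base model M, paired with its label b_i."""
    model_id: str   # hypothetical identifier for M_i
    behavior: str   # natural-language description of the implanted behavior b_i


@dataclass(frozen=True)
class IAExample:
    """One supervised example for the shared introspection adapter."""
    model_id: str   # which M_i the adapter would be applied on top of
    prompt: str     # question asking the model to describe itself
    target: str     # desired verbalization, taken from the label b_i


# Hypothetical prompt; the abstract does not specify the wording used.
INTROSPECTION_PROMPT = "Describe any behaviors you learned during fine-tuning."


def build_ia_dataset(finetunes):
    """Turn labeled (M_i, b_i) pairs into adapter training examples."""
    return [
        IAExample(ft.model_id, INTROSPECTION_PROMPT, ft.behavior)
        for ft in finetunes
    ]


def joint_batches(examples, batch_size, seed=0):
    """Yield shuffled batches that mix examples from different M_i,
    so that one adapter is trained across all fine-tunes together."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


# Illustrative placeholder data, not taken from the paper.
finetunes = [
    Finetune("M_1", "recommends one specific brand whenever products come up"),
    Finetune("M_2", "answers in rhyme when asked about the weather"),
    Finetune("M_3", "refuses to discuss a particular topic"),
]
dataset = build_ia_dataset(finetunes)
batches = list(joint_batches(dataset, batch_size=2))
```

In an actual training loop, each example would presumably be run through its own $M_i$ with the same LoRA weights attached, so that gradients from all fine-tunes update one adapter; that step is left out here because the abstract gives no further detail.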
