2602.05183v2 Feb 05, 2026 cs.LG

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

Authors:

J. Yan (citations: 17, h-index: 1)
Michael Yu (citations: 367, h-index: 4)
Yuqing Sun (citations: 465, h-index: 10)
Alexandra Duffy (citations: 3, h-index: 1)
Tyler Marques (citations: 8, h-index: 2)
Matthew Lyle Olson (citations: 8, h-index: 2)

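Large language models (LLMs) are increasingly trained in complex reinforcement learning and multi-agent environments, which makes it harder to understand how their behavior changes during training. Sparse autoencoders (SAEs) have recently proven to be a useful tool for data-centric interpretability. In this work, we use pretrained SAEs together with LLM-summarizer methods to analyze data from large-scale reinforcement learning training runs in the sophisticated environment of Full-Press Diplomacy. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors such as role-playing patterns, degenerate outputs, and language switching, as well as high-level strategic behaviors and environment-specific bugs. Automated evaluation confirms that 90% of the discovered SAE meta-features are significant, and we find a surprising reward hacking behavior. However, two user studies show that even SAE features that seem subjectively interesting and useful can be useless or even harmful to humans, as can most LLM-generated hypotheses. Even so, a subset of SAE-derived hypotheses is predictively useful for downstream tasks. In addition, using SAEs to improve the system prompt of an untrained agent raised its performance by +14.2%. Overall, SAEs and LLM summarizers provide complementary information about agent behavior, and this framework can serve as a practical starting point for future data-centric interpretability research aimed at ensuring trustworthy LLM behavior throughout training.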

Original Abstract

Large language models (LLMs) are increasingly trained in complex, multi-agent reinforcement learning environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors, including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans, as may most LLM-generated hypotheses. Nevertheless, a subset of SAE-derived hypotheses is predictively useful for downstream tasks. We further provide validation by augmenting an untrained agent's system prompt, improving the score by +14.2%. Overall, we show that SAEs and LLM summarizers provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
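The abstract does not spell out how SAE features are tied to changes over training, so the following is only a rough illustration of the general idea, not the authors' Meta-Autointerp method. It assumes each agent output has already been reduced to the set of SAE feature ids that fired on it, and it flags features whose firing rate differs between an early and a late training checkpoint using a plain two-proportion z-test. The function names, the integer feature ids, and the threshold of 2.0 are all hypothetical choices made for this sketch.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def firing_counts(samples: List[Set[int]]) -> Counter:
    """Count, for each feature id, how many samples it is active on."""
    counts: Counter = Counter()
    for active in samples:
        counts.update(active)
    return counts


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """z-score for the difference between the proportions k2/n2 and k1/n1."""
    pooled = (k1 + k2) / (n1 + n2)
    var = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if var == 0.0:
        # The feature fires on all samples or on none: no measurable shift.
        return 0.0
    return (k2 / n2 - k1 / n1) / math.sqrt(var)


def shifted_features(
    early: List[Set[int]],
    late: List[Set[int]],
    z_threshold: float = 2.0,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, float, float, float]]:
    """Return (feature_id, early_rate, late_rate, z), sorted by |z| descending."""
    if not early or not late:
        raise ValueError("both checkpoints need at least one sample")
    c_early, c_late = firing_counts(early), firing_counts(late)
    n1, n2 = len(early), len(late)
    rows = []
    for fid in set(c_early) | set(c_late):
        k1, k2 = c_early[fid], c_late[fid]
        z = two_proportion_z(k1, n1, k2, n2)
        if abs(z) >= z_threshold:
            rows.append((fid, k1 / n1, k2 / n2, z))
    rows.sort(key=lambda r: abs(r[3]), reverse=True)
    return rows


if __name__ == "__main__":
    # Toy data: feature 2 disappears and feature 3 emerges during training.
    early = [{1, 2}] * 10 + [{1}] * 10
    late = [{1, 3}] * 18 + [{1}] * 2
    for fid, r_early, r_late, z in shifted_features(early, late):
        print(f"feature {fid}: {r_early:.2f} -> {r_late:.2f} (z = {z:+.2f})")
```

In a pipeline of the kind the abstract describes, features surfaced this way would still need to be grouped and described before they become hypotheses about training dynamics; that step is left out here because the abstract gives no detail on it.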

