2604.01151v1 Apr 01, 2026 cs.AI

Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

Christian Schroeder de Witt (citations: 13, h-index: 2)

A. Rose (citations: 92, h-index: 6)

C. Cullen (citations: 3, h-index: 1)

Brandon Kaplowitz (citations: 4, h-index: 1)

Abstract

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60–0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.
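
The general idea of scoring each agent with a linear probe and then aggregating those scores into one scenario-level decision can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the activations are synthetic Gaussian vectors, the probe is a simple difference-of-means direction standing in for a trained linear probe, the two aggregation rules (max and mean) are generic choices rather than the paper's five techniques, and all function names are hypothetical.

```python
import numpy as np


def fit_mean_difference_probe(acts, labels):
    """Unit direction pointing from honest (0) to deceptive (1) activations.

    A stand-in for a trained linear probe (assumption, not the paper's method).
    """
    direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)


def agent_scores(direction, scenario_acts):
    """Project each agent's activation vector onto the probe direction."""
    return scenario_acts @ direction


def aggregate(scores, how="max"):
    """Collapse per-agent scores into a single scenario-level score."""
    if how == "max":
        return float(np.max(scores))
    if how == "mean":
        return float(np.mean(scores))
    raise ValueError(f"unknown aggregation: {how}")


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


def make_scenarios(rng, n_scenarios, n_agents, dim, signal, shift=3.0):
    """Synthetic data: in colluding scenarios, two agents are shifted along `signal`."""
    acts = rng.normal(size=(n_scenarios, n_agents, dim))
    scenario_labels = np.arange(n_scenarios) % 2  # alternate honest / colluding
    agent_labels = np.zeros((n_scenarios, n_agents), dtype=int)
    for i in np.flatnonzero(scenario_labels):
        agent_labels[i, :2] = 1
        acts[i, :2] += shift * signal
    return acts, agent_labels, scenario_labels


def run_demo(seed=0, dim=16, n_agents=4):
    """Fit the probe on per-agent labels, then score held-out scenarios at group level."""
    rng = np.random.default_rng(seed)
    signal = np.zeros(dim)
    signal[0] = 1.0
    train_acts, train_agent_labels, _ = make_scenarios(rng, 100, n_agents, dim, signal)
    test_acts, _, test_labels = make_scenarios(rng, 100, n_agents, dim, signal)
    direction = fit_mean_difference_probe(
        train_acts.reshape(-1, dim), train_agent_labels.reshape(-1)
    )
    results = {}
    for how in ("max", "mean"):
        scores = [aggregate(agent_scores(direction, s), how) for s in test_acts]
        results[how] = auroc(scores, test_labels)
    return results


if __name__ == "__main__":
    print(run_demo())
```

Because the synthetic train and test scenarios come from the same generator, this sketch only mirrors the in-distribution setting; it says nothing about the zero-shot transfer results reported in the abstract.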
