2602.11729v1 Feb 12, 2026 cs.AI

Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

Abstract

Model diffing, the process of comparing models' internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far been primarily focused on comparing a base model with its finetune. Since new LLM releases are often novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base vs finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we find in an unsupervised fashion features including Chinese Communist Party alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. Together, our results work towards establishing cross-architecture crosscoder model diffing as an effective method for identifying meaningful behavioral differences between AI models.
