2602.14869v1 Feb 16, 2026 cs.AI

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Kellin Pelrine (Citations: 605, h-index: 11)

Matthew Kowal (Citations: 74, h-index: 3)

Adam Gleave (Citations: 1,145, h-index: 16)

Gonçalo Paulo (Citations: 1, h-index: 1)

Louis Jaburi (Citations: 6, h-index: 2)

Tom Tseng (Citations: 1, h-index: 1)

Stefan Heimersheim (Citations: 0, h-index: 0)

A. D. Tucker (Citations: 21, h-index: 3)

Lev McKinney (Citations: 377, h-index: 2)

Abstract

As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches like influence functions are computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and of attributing influence to abstract behaviors, we leverage interpretable structures within the model during attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order of magnitude faster. We empirically validate Concept Influence and its approximations across emergent misalignment benchmarks and real post-training datasets, and demonstrate that they achieve comparable performance to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable, more explainable, and better control of model behavior through data.
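
To make the idea of probe-based attribution concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each training example has already been reduced to a single activation vector (for instance, a mean-pooled hidden state) and that a concept direction is available from a linear probe or a sparse autoencoder feature. The function names `probe_score` and `rank_by_probe` and the toy numbers are hypothetical and chosen only for illustration.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def probe_score(activation, direction):
    """Project one activation vector onto the unit-normalized concept direction."""
    d = unit(direction)
    return sum(a * b for a, b in zip(activation, d))


def rank_by_probe(train_activations, direction):
    """Score every training example and return indices sorted from highest to lowest score."""
    scores = [probe_score(a, direction) for a in train_activations]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy data (assumed): three training examples with 3-dimensional activations.
concept_direction = [1.0, 0.0, 0.0]
train_activations = [
    [0.1, 0.9, 0.0],   # weakly aligned with the concept
    [2.0, 0.1, 0.3],   # strongly aligned with the concept
    [-1.0, 0.5, 0.5],  # points away from the concept
]

order, scores = rank_by_probe(train_activations, concept_direction)
print(order)
```

A scorer of this kind needs only forward-pass activations and one dot product per example, with no per-example gradients or Hessian approximations, which is one plausible reading of why the abstract describes probe-based methods as much cheaper than classical influence functions.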

0 Citations · 0 Influential · 8 Altmetric · 40.0 Score
