2602.04735v1 Feb 04, 2026 cs.LG

From Data to Behavior: Predicting Unintended Model Behaviors Before Training

Mengru Wang

Zhen Xu

Junfeng Fang

Yunzhi Yao

Zhejiang University; Shandong University

Shumin Deng

Huajun Chen

Ningyu Zhang

Original Abstract

Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
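
The MDF idea described in the abstract (average the candidate data's representations, add that average into a frozen model's forward pass, and watch how the outputs move) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not say which layers are used, how the injected vector is scaled, or how behavior is scored, so `make_toy_model`, `mean_representation`, `forward`, and the `alpha` scale are hypothetical names, and the tiny random network only stands in for a real base model such as Qwen3-14B. It is not the authors' implementation.

```python
import math
import random


def make_toy_model(dim, vocab, seed=0):
    """Hypothetical frozen 'base model': one hidden layer plus an output head."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
    w_out = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(vocab)]
    return w_in, w_out


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def hidden(model, x):
    """Hidden representation of one input (the 'data feature' in this toy)."""
    w_in, _ = model
    return [math.tanh(z) for z in matvec(w_in, x)]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_representation(model, dataset):
    """Summarize candidate data as the mean of its hidden representations."""
    reps = [hidden(model, x) for x in dataset]
    return [sum(col) / len(reps) for col in zip(*reps)]


def forward(model, x, injected=None, alpha=1.0):
    """Forward pass; optionally add a scaled vector to the hidden state.

    No weights are changed: the injection only shifts activations.
    """
    _, w_out = model
    h = hidden(model, x)
    if injected is not None:
        h = [hi + alpha * di for hi, di in zip(h, injected)]
    return softmax(matvec(w_out, h))


# Toy usage: compare output probabilities with and without the injected mean.
model = make_toy_model(dim=4, vocab=3)
data_rng = random.Random(1)
candidate_data = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
probe = [0.5, -0.2, 0.1, 0.9]  # assumed neutral test input

data_mean = mean_representation(model, candidate_data)
baseline = forward(model, probe)
steered = forward(model, probe, injected=data_mean, alpha=1.0)
shift = [s - b for s, b in zip(steered, baseline)]

print("baseline:", [round(p, 3) for p in baseline])
print("steered: ", [round(p, 3) for p in steered])
print("shift:   ", [round(d, 3) for d in shift])
```

In this toy setting, a large shift toward an undesired output on neutral probes would play the role of the warning signal the abstract describes. No weights are modified at any point, in line with the abstract's "without updating any parameters".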
