2602.00158v2 Jan 29, 2026 cs.LG

RAPTOR: Ridge-Adaptive Logistic Probes

Yao Zhu (Citations: 54, h-index: 3)

Ziqing Wang (Citations: 44, h-index: 4)

Kaize Ding (Citations: 157, h-index: 4)

Ziqi Gao (Citations: 146, h-index: 5)

Qingcheng Zeng (Citations: 108, h-index: 5)

Xuan Zhao (Citations: 29, h-index: 3)

F. Ruan (Citations: 1,318, h-index: 14)

Original Abstract

Probing studies what information is encoded in a frozen LLM's layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.
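
The probe-then-steer pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the gradient-descent solver, the candidate ridge grid, the choice of validation accuracy as the tuning criterion, and the synthetic Gaussian "embeddings" are all assumptions made for illustration. The sketch fits an L2-regularized logistic probe, picks the ridge strength on a held-out split, normalizes the weights into a concept vector, and adds a scaled copy of that vector to a representation.

```python
import numpy as np


def fit_ridge_logistic(X, y, lam, lr=0.5, steps=500):
    """Minimize mean logistic loss + (lam / 2) * ||w||^2 by plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / n + lam * w)  # loss gradient plus ridge term
        b -= lr * np.mean(p - y)                 # the bias is not penalized
    return w, b


def accuracy(X, y, w, b):
    """Fraction of examples whose predicted label matches y (labels in {0, 1})."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def tune_probe(X_tr, y_tr, X_val, y_val, lambdas=(1e-3, 1e-2, 1e-1, 1.0)):
    """Pick the ridge strength with the best validation accuracy (assumed criterion)."""
    best = None
    for lam in lambdas:
        w, b = fit_ridge_logistic(X_tr, y_tr, lam)
        acc = accuracy(X_val, y_val, w, b)
        if best is None or acc > best[0]:
            best = (acc, lam, w)
    acc, lam, w = best
    concept = w / np.linalg.norm(w)  # concept vector = normalized probe weights
    return concept, lam, acc


def steer(h, concept, alpha):
    """Additive activation steering: shift a representation along the concept vector."""
    return h + alpha * concept


# Synthetic stand-in for layer representations: Gaussian features labeled by a
# hidden "teacher" direction (hypothetical data, not real LLM embeddings).
rng = np.random.default_rng(0)
d = 16
teacher = rng.normal(size=d)
teacher /= np.linalg.norm(teacher)
X = rng.normal(size=(200, d))
y = (X @ teacher + 0.1 * rng.normal(size=200) > 0).astype(float)

concept, lam, val_acc = tune_probe(X[:120], y[:120], X[120:], y[120:])
steered = steer(X[0], concept, alpha=2.0)
```

In a real pipeline, `X` would hold frozen-layer activations and `steer` would be applied inside the forward pass at the chosen layer; the steering scale `alpha` is a free parameter in this sketch.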

0 Citations, 0 Influential, 7 Altmetric, 35.0 Score
