2601.22636v1 Jan 30, 2026 cs.AI

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng

Xiaodong Liu

Weiwei Yang

Chenliang Xu

Jianfeng Gao

Christopher White

Abstract

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose SABER, a scaling-aware Best-of-N estimation of risk, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
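
The Beta-Bernoulli modeling described in the abstract can be illustrated with a small sketch. Assume each prompt's per-sample success probability p is drawn from Beta(alpha, beta) and that the N samples for a prompt are independent given p. The expected Best-of-N success rate then has the closed form ASR@N = 1 - B(alpha, beta + N) / B(alpha, beta), where B is the Beta function. This is the standard Beta-Bernoulli identity and is not necessarily the anchored estimator proposed in the paper; the function names and parameter values below are hypothetical and chosen only for illustration.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """Expected probability of at least one success in n samples
    when the per-sample success probability p ~ Beta(alpha, beta).

    Uses E[(1 - p)^n] = B(alpha, beta + n) / B(alpha, beta).
    """
    log_all_fail = log_beta(alpha, beta + n) - log_beta(alpha, beta)
    return 1.0 - math.exp(log_all_fail)


def constant_p_asr_at_n(mean_p: float, n: int) -> float:
    """Extrapolation that treats every prompt as having the same
    success probability mean_p (an illustrative comparison only)."""
    return 1.0 - (1.0 - mean_p) ** n


if __name__ == "__main__":
    # Hypothetical Beta parameters: most prompts are very hard to break,
    # a few are comparatively easy.
    alpha, beta = 0.2, 20.0
    mean_p = alpha / (alpha + beta)
    for n in (1, 10, 100, 1000):
        mixed = asr_at_n(alpha, beta, n)
        flat = constant_p_asr_at_n(mean_p, n)
        print(f"N={n:5d}  Beta-mixture ASR={mixed:.4f}  constant-p ASR={flat:.4f}")
```

Because (1 - p)^N is convex in p, the Beta-mixture value never exceeds the constant-p extrapolation with the same mean. This is one reason a single averaged success rate can misjudge risk at large N when per-prompt success probabilities are heterogeneous.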
