2601.22636v2 Jan 30, 2026 cs.AI

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng

Xiaodong Liu

Weiwei Yang

Chenliang Xu

Jianfeng Gao

Christopher White

Abstract

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, which is an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to facilitate future research.
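
To make the Beta-distribution modeling step concrete, here is a minimal sketch of how a Best-of-N attack success rate can be written in closed form once sample-level success probabilities are assumed to follow Beta(alpha, beta). It relies only on the standard identity E[(1 - p)^N] = B(alpha, beta + N) / B(alpha, beta). It is not the paper's SABER estimator, whose anchoring and fitting procedure are not described in the abstract. The function names, the parameter values, and the homogeneous comparison (which is not necessarily the paper's baseline) are all illustrative assumptions.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """ASR@n = 1 - E[(1 - p)^n] with p ~ Beta(alpha, beta).

    Uses the identity E[(1 - p)^n] = B(alpha, beta + n) / B(alpha, beta),
    computed in log space for numerical stability.
    """
    return 1.0 - math.exp(log_beta(alpha, beta + n) - log_beta(alpha, beta))


def naive_asr_at_n(mean_p: float, n: int) -> float:
    """Extrapolation that assumes every prompt shares the same success rate."""
    return 1.0 - (1.0 - mean_p) ** n


# Hypothetical Beta parameters, chosen only for illustration (not from the paper).
ALPHA, BETA = 0.5, 20.0
MEAN_P = ALPHA / (ALPHA + BETA)  # mean single-sample success probability

if __name__ == "__main__":
    # Columns: budget n, heterogeneous (Beta) ASR@n, homogeneous ASR@n.
    for n in (1, 10, 100, 1000):
        print(n, round(asr_at_n(ALPHA, BETA, n), 4),
              round(naive_asr_at_n(MEAN_P, n), 4))
```

Both curves agree at N = 1, where each equals the mean success probability, but they diverge as N grows. That divergence is one way to see why a single averaged success rate may extrapolate poorly to large sampling budgets.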
