2602.18182v1 Feb 20, 2026 cs.LG

Capabilities Aren't Everything: Measuring Propensities in AI

Capabilities Ain't All You Need: Measuring Propensities in AI

Daniel Romero-Alvarado (Citations: 8; h-index: 2)

Fernando Martínez-Plumed (Citations: 2,743; h-index: 5)

Lorenzo Pacchiardi, Leverhulme Centre for the Future of Intelligence, University of Cambridge (Citations: 212; h-index: 7)

Hugo Save (Citations: 3; h-index: 1)

Siddhesh Pawar (Citations: 177; h-index: 4)

Behzad Mehrbakhsh (Citations: 47; h-index: 4)

Pablo Antonio Moreno Casares, Xanadu.ai (Citations: 3,168; h-index: 12)

Ben Slater (Citations: 151; h-index: 3)

Paolo Bova (Citations: 243; h-index: 5)

P. Romero (Citations: 376; h-index: 6)

Zachary R. Tyler (Citations: 3; h-index: 1)

Jonathan E. Prunty (Citations: 21; h-index: 2)

Luning Sun, University of Cambridge (Citations: 520; h-index: 11)

J. Hernández-Orallo (Citations: 347; h-index: 7)

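AI evaluation has focused mainly on measuring capabilities, and formal approaches inspired by Item Response Theory (IRT) are increasingly applied. Yet propensities, the tendencies of models to exhibit particular behaviours, play a central role in determining both performance and safety outcomes. Traditional IRT describes a model's success on a task as a monotonic function of model capability and task demand, which is ill-suited to propensities, where both excess and deficiency can be problematic. This paper proposes the first formal framework for measuring AI propensities, using a bilogistic formulation of model success that assigns high success probability when the model's propensity lies within an "ideal band". The limits of this ideal band are estimated using large language models (LLMs) equipped with newly developed task-agnostic rubrics. Applying the framework to six families of LLMs whose propensities were induced in one of two directions, the authors find that they can measure how far the propensity shifted and what effect the shift has on tasks. Notably, propensities estimated from one benchmark successfully predicted behaviour on held-out tasks. Moreover, combining propensities and capabilities yielded stronger predictive power than using either alone. More broadly, the framework shows how rigorous propensity measurement can be carried out and what advantages it offers over capability evaluation alone when predicting AI behaviour.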

Original Abstract

AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired from Item Response Theory (IRT) being increasingly applied. Yet propensities - the tendencies of models to exhibit particular behaviours - play a central role in determining both performance and safety outcomes. However, traditional IRT describes a model's success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities by using a bilogistic formulation for model success, which attributes high success probability when the model's propensity is within an "ideal band". Further, we estimate the limits of the ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLM models whose propensities are incited in either direction, we find that we can measure how much the propensity is shifted and what effect this has on the tasks. Critically, propensities estimated using one benchmark successfully predict behaviour on held-out tasks. Moreover, we obtain stronger predictive power when combining propensities and capabilities than either separately. More broadly, our framework showcases how rigorous propensity measurements can be conducted and how it yields gains over solely using capability evaluations to predict AI behaviour.
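The contrast the abstract draws between a monotonic IRT-style success curve and a bilogistic "ideal band" can be illustrated with a small sketch. The abstract does not state the paper's actual equation, so the functional form below (a product of one rising and one falling logistic) is an assumption chosen only for illustration, and the names `monotonic_success`, `bilogistic_success`, `lower`, `upper`, and `slope` are hypothetical.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def monotonic_success(ability: float, demand: float, slope: float = 1.0) -> float:
    """IRT-style curve: success probability only increases with ability."""
    return logistic(slope * (ability - demand))


def bilogistic_success(propensity: float, lower: float, upper: float,
                       slope: float = 4.0) -> float:
    """Assumed bilogistic form (not taken from the paper).

    One logistic rises as the propensity passes the lower limit of the
    band, the other falls as it passes the upper limit. Their product is
    high only when the propensity sits between the two limits.
    """
    rising = logistic(slope * (propensity - lower))
    falling = logistic(slope * (upper - propensity))
    return rising * falling


if __name__ == "__main__":
    # Hypothetical ideal band [-1, 1]: too little or too much of the
    # propensity both lower the success probability.
    for theta in (-3.0, 0.0, 3.0):
        print(theta, round(bilogistic_success(theta, -1.0, 1.0), 3))
```

Under these assumptions, a propensity inside the band scores close to 1, while values far below or far above it score close to 0. A single monotonic logistic cannot express that shape, since it penalises only deficiency.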

3 Citations

0 Influential

6 Altmetric

33.0 Score
