2603.23971v1 Mar 25, 2026 cs.CL

The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

Matei A. Zaharia (Citations: 213; h-index: 5)

Ion Stoica (Citations: 1,770; h-index: 8)

Lingjiao Chen (Citations: 355; h-index: 7)

Chi Zhang (Citations: 1,821; h-index: 6)

Yeye He (Citations: 311; h-index: 7)

James Zou (Citations: 45; h-index: 3)

Original Abstract

Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 78% cheaper than GPT-5.2's, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall's $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
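
To make the mechanism concrete, the following is a minimal sketch of how a pricing reversal can arise and how the two summary statistics named in the abstract (pairwise reversal rate and Kendall's $\tau$) could be computed. Everything in it is an assumption for illustration: the model names, prices, and token counts are hypothetical and not taken from the paper; input-token costs are ignored; thinking tokens are assumed to be billed at the output-token price; and the tau-a variant without tie handling is used, which may differ from the paper's exact definition.

```python
from itertools import combinations

# Hypothetical models: (listed output price in USD per 1M tokens,
#                       mean thinking tokens, mean answer tokens per query).
# These numbers are illustrative only and are NOT taken from the paper.
MODELS = {
    "model_a": (2.0, 12000, 500),
    "model_b": (8.0, 1500, 500),
    "model_c": (15.0, 3000, 500),
}


def query_cost(price_per_million, thinking_tokens, answer_tokens,
               include_thinking=True):
    """Output-side cost in USD for one query.

    Assumption: thinking tokens are billed at the same rate as answer tokens.
    """
    billed = answer_tokens + (thinking_tokens if include_thinking else 0)
    return price_per_million * billed / 1_000_000


def reversal_rate(prices, costs):
    """Fraction of model pairs where the lower-priced model costs more."""
    pairs = list(combinations(prices, 2))
    reversed_pairs = sum(
        1 for a, b in pairs
        if (prices[a] - prices[b]) * (costs[a] - costs[b]) < 0
    )
    return reversed_pairs / len(pairs)


def kendall_tau(prices, costs):
    """Kendall's tau-a between price and cost rankings (assumes no ties)."""
    pairs = list(combinations(prices, 2))
    score = sum(
        1 if (prices[a] - prices[b]) * (costs[a] - costs[b]) > 0 else -1
        for a, b in pairs
    )
    return score / len(pairs)


prices = {name: spec[0] for name, spec in MODELS.items()}
costs = {name: query_cost(*spec) for name, spec in MODELS.items()}
costs_no_thinking = {
    name: query_cost(*spec, include_thinking=False)
    for name, spec in MODELS.items()
}

print("reversal rate, with thinking tokens:", reversal_rate(prices, costs))
print("reversal rate, without thinking tokens:",
      reversal_rate(prices, costs_no_thinking))
print("Kendall tau, with thinking tokens:", kendall_tau(prices, costs))
print("Kendall tau, without thinking tokens:",
      kendall_tau(prices, costs_no_thinking))
```

In this toy setup, `model_a` has the lowest listed price but the highest thinking-token usage, so it ends up costing more per query than `model_b`. Dropping the thinking-token term restores agreement between the price ranking and the cost ranking, which mirrors the direction of the effect the abstract reports.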
