2605.15177v1 May 14, 2026 cs.AI

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

Qiuyang Mang

Citations: 156

h-index: 7

Jingbo Shang

Citations: 158

h-index: 4

Shang Zhou

Citations: 19

h-index: 2

Wenhao Chai

Citations: 84

h-index: 3

Kaiyuan Liu

Citations: 65

h-index: 3

Huanzhi Mao

Citations: 556

h-index: 6

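Scaling test-time compute is one of the main ways to improve LLM reasoning. Existing methods mostly scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is simple, but it creates a selection bottleneck: the best candidate must be chosen without a ground-truth verifier, and pointwise LLM judgments are noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects candidates through pairwise comparison. In each generation, the LLM judges randomly chosen pairs of candidates, and the votes are aggregated with the Bradley-Terry model into a global ranking. The top-ranked candidates are preserved, the top three quarters are mutated using the natural-language critiques produced during comparison, and the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points over eight sequential rounds of LLM calls (about 27 minutes of wall-clock time). The pipeline transfers to weaker and stronger models without retuning. On the multi-domain HLE benchmark, gains are concentrated in objectively verifiable domains and reverse in subjective ones. We also release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% agreement between local evaluation and the official verdict.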
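For reference, the textbook Bradley-Terry model behind this kind of aggregation can be written as follows. The symbols are notation assumed here, not taken from the paper: each candidate i has a positive strength \pi_i, and w_{ij} counts the comparisons in which the judge preferred i over j. The strengths are usually fitted by maximum likelihood, and sorting candidates by the fitted strength gives the global ranking. How the paper itself parameterizes or fits the model is not stated in this summary.

```latex
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j},
\qquad
\hat{\pi} = \arg\max_{\pi > 0} \sum_{i \neq j} w_{ij} \,
            \log \frac{\pi_i}{\pi_i + \pi_j}
```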

Original Abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
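The selection step described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_bradley_terry` and `plan_next_generation`, the `prior` smoothing term, the `n_elite` parameter, and the simulated noisy judge are all hypothetical. The fit uses the classic minorization-maximization update for Bradley-Terry strengths, which may differ from whatever solver the paper uses.

```python
import random


def fit_bradley_terry(n, outcomes, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the minorization-maximization update
        pi_i <- W_i / sum_j n_ij / (pi_i + pi_j)
    where W_i is the number of wins of i and n_ij the number of
    comparisons between i and j. `prior` adds a small fractional win in
    both directions for every pair (an assumption of this sketch) so that
    strengths stay finite for candidates that never win or never lose.
    """
    if n < 2:
        return [1.0] * n
    wins = [[prior if i != j else 0.0 for j in range(n)] for i in range(n)]
    for winner, loser in outcomes:
        wins[winner][loser] += 1.0
    strength = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Strengths are only defined up to scale; normalize to sum to n.
        scale = n / sum(updated)
        strength = [s * scale for s in updated]
    return strength


def plan_next_generation(strength, n_elite=1):
    """Split candidates by rank: preserve elites, mutate the top three
    quarters, discard the bottom quarter. `n_elite` is hypothetical; the
    abstract does not say how many top candidates are preserved."""
    n = len(strength)
    ranking = sorted(range(n), key=lambda i: strength[i], reverse=True)
    n_discard = n // 4
    return {
        "elite": ranking[:n_elite],
        "mutate": ranking[: n - n_discard],
        "discard": ranking[n - n_discard:],
    }


if __name__ == "__main__":
    # Simulated judge (hypothetical): prefers the truly better candidate
    # 80% of the time, standing in for a noisy pairwise LLM comparison.
    rng = random.Random(0)
    quality = [0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    outcomes = []
    for _ in range(60):
        a, b = rng.sample(range(len(quality)), 2)
        better, worse = (a, b) if quality[a] > quality[b] else (b, a)
        if rng.random() < 0.8:
            outcomes.append((better, worse))
        else:
            outcomes.append((worse, better))
    strengths = fit_bradley_terry(len(quality), outcomes)
    print(plan_next_generation(strengths))
```

In a real pipeline the `outcomes` list would come from LLM pairwise judgments, and the critiques produced during those comparisons would drive the mutation of the candidates listed under `"mutate"`. How the population is refilled after the bottom quarter is dropped is not specified in the abstract and is left out here.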
