2603.22016v1 Mar 23, 2026 cs.LG

ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

Xiaogeng Liu

Citations: 2,503

h-index: 19

Xinyang Wang

Citations: 7

h-index: 1

Chaowei Xiao

Citations: 1,201

h-index: 13

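Large Reasoning Models (LRMs) generate long Chain-of-Thought traces to reach high accuracy on complex tasks, but they suffer from overthinking: even after arriving at the correct answer, they keep producing unnecessary reasoning steps, which increases latency and compute cost and can cause the answer to drift. Existing mitigation methods either require heavy modification of the model backbone or rely on hand-crafted heuristics that fail to capture overthinking patterns. This paper proposes ROM, the first method to frame overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection module to the late-layer hidden states of a frozen large language model. The module monitors tokens in real time and, once overthinking is detected, steers generation toward an early transition to the final answer. The authors also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, ROM cuts response length by 47.2% and improves efficiency by 121%. These results indicate that streaming detection is a promising approach to real-time overthinking mitigation.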

Original Abstract

Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
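
The streaming detect-and-intervene loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear probe, the `threshold` and `patience` parameters, the `</think>` transition marker, and the toy two-dimensional hidden states are all assumptions chosen for illustration. The abstract only states that a lightweight head reads late-layer hidden states of a frozen backbone and triggers an early transition to the final answer.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class DetectionHead:
    """Hypothetical linear probe: hidden state -> overthinking probability."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights = list(weights)
        self.bias = bias

    def score(self, hidden: Sequence[float]) -> float:
        logit = sum(w * h for w, h in zip(self.weights, hidden)) + self.bias
        return sigmoid(logit)


def stream_with_intervention(
    steps: Iterable[Tuple[str, Sequence[float]]],
    head: DetectionHead,
    threshold: float = 0.5,      # assumed decision threshold
    patience: int = 2,           # assumed number of consecutive flags required
    transition: str = "</think>",  # assumed marker that starts the final answer
) -> Tuple[List[str], bool]:
    """Consume (token, hidden_state) pairs as they are generated.

    Returns the emitted tokens and whether an intervention was triggered.
    """
    output: List[str] = []
    streak = 0
    for token, hidden in steps:
        output.append(token)
        # Count consecutive tokens flagged as overthinking; reset otherwise.
        streak = streak + 1 if head.score(hidden) > threshold else 0
        if streak >= patience:
            # Cut the reasoning short and hand over to the answer phase.
            output.append(transition)
            return output, True
    return output, False


# Toy stream: the hidden states stand in for a frozen backbone's late layer.
head = DetectionHead(weights=[4.0, -4.0])
steps = [
    ("Let", [0.0, 1.0]),
    ("x=2", [0.2, 0.8]),
    ("Wait", [0.9, 0.1]),
    ("recheck", [1.0, 0.0]),
    ("again", [1.0, 0.0]),
]
tokens, stopped = stream_with_intervention(steps, head)
print(tokens, stopped)
```

Requiring several consecutive flags before intervening is one simple way to keep a single noisy score from ending the reasoning early; whether ROM uses such a rule is not stated in the abstract.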

0 Citations

0 Influential

9.5 Altmetric

47.5 Score
