2601.21754v2 Jan 29, 2026 cs.AI

Language-based Trial and Error Falls Behind in the Era of Experience

Guozheng Ma (citations: 501, h-index: 12)
Shugang Cui (citations: 7, h-index: 1)
Yilun Kong (citations: 103, h-index: 4)
Mengya Gao (citations: 16, h-index: 1)
Yichao Wu (citations: 16, h-index: 2)
Dacheng Tao (citations: 5,871, h-index: 38)
Haotian Luo (citations: 378, h-index: 6)
Haoyu Wang (citations: 67, h-index: 2)
Xiaogang Wang (citations: 2,112, h-index: 10)
Li Shen (citations: 66, h-index: 4)

Original Abstract

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, non-linguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding those of LLMs. The collected trajectories are used to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% of GPU hours.
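
To make the "scout explores, LLM learns from trajectories" idea concrete, here is a minimal sketch of the first stage only. It is not the authors' implementation: the 4x4 grid task, the tabular Q-learning scout (standing in for the small MLP mentioned in the abstract), the function names, and the prompt template are all assumptions made for illustration. The SFT and multi-turn RL stages are omitted.

```python
import random

# Hypothetical toy task: walk from the top-left to the bottom-right of a 4x4 grid.
SIZE = 4
GOAL = (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
NAMES = list(ACTIONS)


def step(state, action):
    """Deterministic transition; moves into a wall leave the agent in place."""
    dr, dc = ACTIONS[action]
    nxt = (min(max(state[0] + dr, 0), SIZE - 1),
           min(max(state[1] + dc, 0), SIZE - 1))
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def train_scout(episodes=2000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Cheap exploration with tabular Q-learning (an assumed stand-in for an MLP scout)."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in NAMES}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(50):
            if rng.random() < eps:
                action = rng.choice(NAMES)
            else:
                # Greedy choice; ties are broken at random so early episodes explore.
                action = max(NAMES, key=lambda a: (q[(state, a)], rng.random()))
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in NAMES)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
            if done:
                break
    return q


def rollout(q, start=(0, 0), max_steps=20):
    """Follow the scout's greedy policy and record (state, action) pairs."""
    state, traj = start, []
    for _ in range(max_steps):
        action = max(NAMES, key=lambda a: q[(state, a)])
        traj.append((state, action))
        state, _, done = step(state, action)
        if done:
            return traj, True
    return traj, False


def to_sft_examples(traj):
    """Serialize a trajectory into text pairs; the prompt template is hypothetical."""
    return [
        {"prompt": f"Agent at row {r}, col {c}. Goal at row {GOAL[0]}, col {GOAL[1]}. Next move?",
         "response": action}
        for (r, c), action in traj
    ]


q_table = train_scout()
trajectory, solved = rollout(q_table)
sft_examples = to_sft_examples(trajectory) if solved else []
```

In this sketch, `sft_examples` is the kind of text dataset a later fine-tuning step could consume. The point of the split is that the thousands of exploratory episodes run on a model that costs almost nothing per step, and only successful trajectories are passed on to the language model.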

1 Citation

0 Influential

19 Altmetric

96.0 Score
