2601.16649v1 Jan 23, 2026 cs.AI

LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

Thomas Hehn (Citations: 8; h-index: 1)
F. V. Massoli (Citations: 5,300; h-index: 27)
Arash Behboodi (Citations: 84; h-index: 5)
Tribhuvanesh Orekondy (Citations: 1,463; h-index: 13)
Amin Rakhsha (Citations: 252; h-index: 5)
Pietro Mazzaglia (Citations: 12; h-index: 2)

Abstract

Large language models can perform well on many isolated tasks, yet they continue to struggle on multi-turn, long-horizon agentic problems that require skills such as planning, state tracking, and long context processing. In this work, we aim to better understand the relative importance of advancing these underlying capabilities for success on such tasks. We develop an oracle counterfactual framework for multi-turn problems that asks: how would an agent perform if it could leverage an oracle to perfectly perform a specific task? The change in the agent's performance due to this oracle assistance allows us to measure the criticality of such oracle skill in the future advancement of AI agents. We introduce a suite of procedurally generated, game-like tasks with tunable complexity. These controlled environments allow us to provide precise oracle interventions, such as perfect planning or flawless state tracking, and make it possible to isolate the contribution of each oracle without confounding effects present in real-world benchmarks. Our results show that while some interventions (e.g., planning) consistently improve performance across settings, the usefulness of other skills is dependent on the properties of the environment and language model. Our work sheds light on the challenges of multi-turn agentic environments to guide the future efforts in the development of AI agents and language models.

0 Citations

0 Influential

13.5 Altmetric

67.5 Score

AI Analysis

Summary

This paper analyzes why large language models (LLMs) excel at single-turn tasks but degrade on multi-turn, long-horizon agentic tasks. The authors propose a framework called 'LUMINA' and use oracle interventions in three procedurally generated game environments (ListWorld, TreeWorld, GridWorld) to quantify the importance of three skills: planning, state tracking, and history pruning. The experiments show that the effect of each skill varies with model size and with the characteristics of the environment. In particular, the paper identifies 'compounding errors' as the key bottleneck in long-horizon tasks: even when step accuracy is high, errors accumulate and the final success rate drops sharply.
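The compounding-error effect can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes that per-step errors are independent and that a single wrong step fails the whole episode, so the success rate is simply the step accuracy raised to the number of steps. The function name `success_rate` and the 0.98 step accuracy are illustrative assumptions.

```python
def success_rate(step_accuracy: float, horizon: int) -> float:
    """Probability of finishing `horizon` steps without a single mistake.

    Assumption (not from the paper): errors are independent across steps
    and the agent cannot recover from any error.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return step_accuracy ** horizon


if __name__ == "__main__":
    # A hypothetical agent that is right 98% of the time at every step.
    for horizon in (1, 10, 50, 100):
        print(f"horizon={horizon:3d}  success={success_rate(0.98, horizon):.3f}")
```

Under these assumptions, an agent that looks strong on a per-step basis still fails most episodes once the horizon reaches a few dozen steps, which matches the qualitative pattern the summary describes.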

Key Innovations

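Oracle Counterfactual Framework: a methodology that identifies bottleneck skills by measuring how performance changes when the agent is given perfect information
Procedurally Generated Environments: three environments (ListWorld, TreeWorld, GridWorld) whose optimal policy is known, which enables exact oracle interventions and evaluation
Fine-grained oracle intervention types: interventions are split into planning, state tracking, and history pruning so that the contribution of each capability can be analyzed
Analysis of the relationship between step accuracy and task success rate, used to identify why long-horizon tasks fail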
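The oracle counterfactual idea can be sketched as a small simulation: run the same agent with and without a perfect version of one skill, and attribute the change in success rate to that skill. The code below is a toy stand-in, not the paper's environments or implementation. The names `run_episode`, `estimate_success`, and `oracle_gains`, the two-skill model, and all accuracy values are hypothetical.

```python
import random
from typing import Dict, Optional

SKILLS = ("planning", "state_tracking")


def run_episode(rng: random.Random, horizon: int, plan_acc: float,
                track_acc: float, oracle: Optional[str] = None) -> bool:
    """Toy episode: each step needs a correct plan AND a correct state estimate.

    Passing `oracle` makes the named skill perfect (a hypothetical
    stand-in for a real oracle intervention).
    """
    p_plan = 1.0 if oracle == "planning" else plan_acc
    p_track = 1.0 if oracle == "state_tracking" else track_acc
    for _ in range(horizon):
        # rng.random() is in [0, 1), so a skill with accuracy 1.0 never fails.
        if rng.random() >= p_plan or rng.random() >= p_track:
            return False
    return True


def estimate_success(horizon: int, plan_acc: float, track_acc: float,
                     oracle: Optional[str] = None,
                     episodes: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the episode success rate."""
    rng = random.Random(seed)
    wins = sum(run_episode(rng, horizon, plan_acc, track_acc, oracle)
               for _ in range(episodes))
    return wins / episodes


def oracle_gains(horizon: int, plan_acc: float, track_acc: float) -> Dict[str, float]:
    """Success-rate change attributable to each oracle, relative to no oracle."""
    base = estimate_success(horizon, plan_acc, track_acc)
    return {skill: estimate_success(horizon, plan_acc, track_acc, oracle=skill) - base
            for skill in SKILLS}


if __name__ == "__main__":
    # Hypothetical agent: weaker at planning (0.90) than at state tracking (0.99).
    print(oracle_gains(horizon=20, plan_acc=0.90, track_acc=0.99))
```

In this toy setting the planning oracle yields the larger gain because planning is the weaker skill. The real framework measures the same kind of difference, but in environments where the optimal policy is known and the oracle can be computed exactly.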

Learning & Inference Impact

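On the inference side, the study shows that removing unnecessary information from a long context (history pruning) or providing an explicit state summary (state tracking) can substantially improve the performance of small models (4B to 8B). For very large models such as GPT-4o, by contrast, context pruning can actually degrade performance, which suggests that inference strategies should differ by model size. From a learning and model-development perspective, the study emphasizes that succeeding at long-horizon tasks depends far more on maintaining consistent, error-free behavior (consistency) than on single-step reasoning ability. It proposes that strengthening state tracking to reduce compounding errors is essential for future agent development.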
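As a rough illustration of the two inference-time interventions, the sketch below assembles an agent prompt with an optional state summary and an optionally truncated history. It is an assumption-laden simplification: the paper's oracle prunes information that is actually unnecessary, whereas this example swaps in a simple "keep the last k turns" heuristic. The function `build_context` and its parameters are hypothetical and belong to no library.

```python
from typing import List, Optional


def build_context(task: str, history: List[str],
                  keep_last: Optional[int] = None,
                  state_summary: Optional[str] = None) -> str:
    """Assemble a prompt from the task, an optional state summary, and history.

    keep_last=None keeps the full history; keep_last=k keeps only the k most
    recent turns (a heuristic stand-in for oracle history pruning).
    """
    if keep_last is None:
        turns = list(history)
    elif keep_last <= 0:
        turns = []  # history[-0:] would return everything, so handle 0 explicitly
    else:
        turns = history[-keep_last:]

    lines = [f"Task: {task}"]
    if state_summary is not None:
        # Explicit state replaces the need to re-derive it from old turns.
        lines.append(f"Current state: {state_summary}")
    lines.extend(turns)
    return "\n".join(lines)


if __name__ == "__main__":
    history = ["move north", "move east", "pick up key"]
    print(build_context("reach the exit", history,
                        keep_last=1, state_summary="at (1, 1), holding key"))
```

Whether to enable pruning would be decided per model in practice, since the summary reports that it helps small models but can hurt larger ones.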

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
