2602.02050v1 Feb 02, 2026 cs.AI

Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents

Yixia Li (Southern University of Science and Technology). Citations: 190, h-index: 7
Guanhua Chen. Citations: 65, h-index: 3
Yiwen Zhao. Citations: 1, h-index: 1
Guangnan Ye. Citations: 100, h-index: 5
Hongfeng Chai. Citations: 110, h-index: 6
Zeping Li. Citations: 22, h-index: 2
Keyang Chen. Citations: 40, h-index: 3
Yixin Cao. Citations: 6, h-index: 1
Zhenfei Yin. Citations: 171, h-index: 4
Hongru Wang. Citations: 442, h-index: 6

Abstract

Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, which increases latency, degrades inference performance, and makes tool-use behavior difficult to manage. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.
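
The abstract does not spell out how entropy reduction is measured or turned into a reward, so the following is only a minimal sketch under stated assumptions, not the authors' formulation. It assumes entropy is the Shannon entropy of the model's predictive distribution before and after a tool call, that a dense process reward is the per-call entropy reduction, and that a sparse outcome reward is a single trajectory-level average of those reductions. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One tool call, represented (as an assumption) by the model's predictive
# distribution before the call and after the tool result is observed.
Step = Tuple[Sequence[float], Sequence[float]]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_reduction(before: Sequence[float], after: Sequence[float]) -> float:
    """Positive when the tool result made the model more certain."""
    return shannon_entropy(before) - shannon_entropy(after)


def dense_process_rewards(steps: List[Step]) -> List[float]:
    """Hypothetical fine-grained signal: one reward per tool call."""
    return [entropy_reduction(before, after) for before, after in steps]


def sparse_outcome_reward(steps: List[Step]) -> float:
    """Hypothetical coarse signal: one scalar for the whole trajectory.

    Averaging means that extra calls which barely reduce entropy pull
    the trajectory reward down.
    """
    rewards = dense_process_rewards(steps)
    return sum(rewards) / len(rewards) if rewards else 0.0


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    peaked = [0.97, 0.01, 0.01, 0.01]
    trajectory = [(uniform, peaked), (peaked, peaked)]  # useful call, then a redundant one
    print(dense_process_rewards(trajectory))
    print(sparse_outcome_reward(trajectory))
```

In this toy trajectory the first call sharpens the distribution and earns a positive per-call reward, while the second changes nothing and earns zero, which halves the trajectory-level average.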

