2601.17563v1 Jan 24, 2026 cs.LG


Towards Generalisable Imitation Learning Through Conditioned Transition Estimation and Online Behaviour Alignment

Nathan Gavenski

Citations: 649

h-index: 3

Odinaldo Rodrigues

Citations: 652

h-index: 3

Matteo Leonetti

Citations: 9

h-index: 2

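Recent imitation learning from observation (ILfO) methods have made considerable progress, but they still have the following limitations: they require action-based supervised optimisation, assume that each state has only one optimal action, and often apply the teacher's actions without sufficiently considering the actual environment state. Observed trajectories contain useful information, but existing methods struggle to extract it without supervision. To address these limitations, this work proposes Unsupervised Imitation Learning from Observation (UfO). UfO learns a policy in two stages: the agent first estimates the teacher's actual actions from the observed state transitions, and then refines the learned policy by adjusting its own trajectories to align more closely with the teacher's. In experiments across five widely used environments, UfO not only outperformed the teacher and all other ILfO methods but also showed the smallest standard deviation. This reduction in standard deviation indicates better generalisation to unseen scenarios.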

Original Abstract

State-of-the-art imitation learning from observation methods (ILfO) have recently made significant progress, but they still have some limitations: they need action-based supervised optimisation, assume that states have a single optimal action, and tend to apply teacher actions without full consideration of the actual environment state. While the truth may be out there in observed trajectories, existing methods struggle to extract it without supervision. In this work, we propose Unsupervised Imitation Learning from Observation (UfO) that addresses all of these limitations. UfO learns a policy through a two-stage process, in which the agent first obtains an approximation of the teacher's true actions in the observed state transitions, and then refines the learned policy further by adjusting agent trajectories to closely align them with the teacher's. Experiments we conducted in five widely used environments show that UfO not only outperforms the teacher and all other ILfO methods but also displays the smallest standard deviation. This reduction in standard deviation indicates better generalisation in unseen scenarios.
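The two-stage idea in the abstract can be illustrated with a deliberately tiny sketch. This is not the UfO algorithm, whose details are not given on this page: it assumes a hypothetical 1-D chain world with known deterministic dynamics, whereas the abstract describes an unsupervised, learned procedure. The names `ACTIONS`, `step`, `infer_actions`, and `trajectory_gap` are all invented for illustration. The first function labels each observed teacher transition with an action that explains it; the second gives a simple stand-in for how far an agent trajectory is from the teacher's.

```python
# Toy illustration only: a 1-D chain world, not one of the paper's environments.
ACTIONS = (-1, 0, 1)  # hypothetical discrete action set: left, stay, right


def step(state, action):
    """Assumed deterministic dynamics (known here, learned in practice), clipped to [0, 10]."""
    return max(0, min(10, state + action))


def infer_actions(states):
    """Label each observed transition (s, s') with an action that explains it."""
    labels = []
    for s, s_next in zip(states, states[1:]):
        # Pick the action whose predicted next state is closest to the observed one.
        best = min(ACTIONS, key=lambda a: abs(step(s, a) - s_next))
        labels.append(best)
    return labels


def trajectory_gap(agent_states, teacher_states):
    """Mean absolute state difference: a stand-in alignment signal, not the paper's objective."""
    pairs = list(zip(agent_states, teacher_states))
    return sum(abs(a - t) for a, t in pairs) / len(pairs)


# Teacher demonstration given as states only (no actions recorded).
teacher_states = [2, 3, 4, 4, 3]
inferred = infer_actions(teacher_states)

# Replay the inferred actions from the same start state and compare trajectories.
agent_states = [teacher_states[0]]
for a in inferred:
    agent_states.append(step(agent_states[-1], a))
gap = trajectory_gap(agent_states, teacher_states)
```

Even this toy shows why assuming a single correct action per state can be limiting: at the boundary state 0, both "left" and "stay" produce the transition 0 → 0, so the observed states alone cannot tell them apart, and `infer_actions` simply returns the first candidate.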

1 Citations

0 Influential

1.5 Altmetric

8.5 Score


