ArtI-Insight

#1 2606.16995v1 Jun 15, 2026

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. PACT invokes the SLM asynchronously to generate and validate candidate action plans. Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it. Evaluated on three FrozenLake configurations of increasing difficulty, PACT outperforms all baselines while relying on a 2B-parameter SLM backbone, suggesting that deliberative planning and reactive execution are more powerful in concert than either is alone in these settings.

Nathan Gavenski Odinaldo Rodrigues Juarez Monteiro Adriano Veloso Francisco Galuppo

0 Citations

#2 2605.07736v1 May 08, 2026

Online Goal Recognition using Path Signature and Dynamic Time Warping

Online goal recognition in continuous domains poses two central challenges: efficiently encoding large trajectories and effectively comparing them. Recent work addresses these challenges by using custom state-space representations and metrics to compare observations against hypotheses. However, these approaches often overlook well-established encoding techniques used in other domains that offer substantial advantages. This paper introduces a novel method for online goal recognition that leverages path signatures, a compact, expressive representation of rough path theory that efficiently captures key semantic features of trajectories, enabling more meaningful comparisons between them. Experiments show that our method consistently outperforms the state of the art in predictive accuracy and online planning efficiency, while remaining competitive offline.

Felipe Meneguzzi Nathan Gavenski Odinaldo Rodrigues Douglas Tesch L. Amado

0 Citations

#3 2602.19930v1 Feb 23, 2026

Beyond Mimicry: Toward Lifelong Adaptability in Imitation Learning

Imitation learning stands at a crossroads: despite decades of progress, current imitation learning agents remain sophisticated memorisation machines, excelling at replay but failing when contexts shift or goals evolve. This paper argues that this failure is not technical but foundational: imitation learning has been optimised for the wrong objective. We propose a research agenda that redefines success from perfect replay to compositional adaptability. Such adaptability hinges on learning behavioural primitives once and recombining them through novel contexts without retraining. We establish metrics for compositional generalisation, propose hybrid architectures, and outline interdisciplinary research directions drawing on cognitive science and cultural evolution. Agents that embed adaptability at the core of imitation learning thus have an essential capability for operating in an open-ended world.

Felipe Meneguzzi Nathan Gavenski Odinaldo Rodrigues

0 Citations

#4 2601.17563v1 Jan 24, 2026

Towards Generalisable Imitation Learning Through Conditioned Transition Estimation and Online Behaviour Alignment

State-of-the-art imitation learning from observation methods (ILfO) have recently made significant progress, but they still have some limitations: they need action-based supervised optimisation, assume that states have a single optimal action, and tend to apply teacher actions without full consideration of the actual environment state. While the truth may be out there in observed trajectories, existing methods struggle to extract it without supervision. In this work, we propose Unsupervised Imitation Learning from Observation (UfO) that addresses all of these limitations. UfO learns a policy through a two-stage process, in which the agent first obtains an approximation of the teacher's true actions in the observed state transitions, and then refines the learned policy further by adjusting agent trajectories to closely align them with the teacher's. Experiments we conducted in five widely used environments show that UfO not only outperforms the teacher and all other ILfO methods but also displays the smallest standard deviation. This reduction in standard deviation indicates better generalisation in unseen scenarios.

Nathan Gavenski Odinaldo Rodrigues Matteo Leonetti

1 Citations

Odinaldo Rodrigues

Publications

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

Online Goal Recognition using Path Signature and Dynamic Time Warping

Beyond Mimicry: Toward Lifelong Adaptability in Imitation Learning

Towards Generalisable Imitation Learning Through Conditioned Transition Estimation and Online Behaviour Alignment