Nan Ye
Publications
An End-to-end Flight Control Network for High-speed UAV Obstacle Avoidance based on Event-Depth Fusion
Achieving safe, high-speed autonomous flight in complex environments with static, dynamic, or mixed obstacles remains challenging, as a single perception modality is incomplete. Depth cameras are effective for static objects but suffer from motion blur at high speeds. Conversely, event cameras excel at capturing rapid motion but struggle to perceive static scenes. To exploit the complementary strengths of both sensors, we propose an end-to-end flight control network that achieves feature-level fusion of depth images and event data through a bidirectional crossattention module. The end-to-end network is trained via imitation learning, which relies on high-quality supervision. Building on this insight, we design an efficient expert planner using Spherical Principal Search (SPS). This planner reduces computational complexity from $O(n^2)$ to $O(n)$ while generating smoother trajectories, achieving over 80% success rate at 17m/s--nearly 20% higher than traditional planners. Simulation experiments show that our method attains a 70-80% success rate at 17 m/s across varied scenes, surpassing single-modality and unidirectional fusion models by 10-20%. These results demonstrate that bidirectional fusion effectively integrates event and depth information, enabling more reliable obstacle avoidance in complex environments with both static and dynamic objects.
Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning
Offline reinforcement learning (RL) optimizes policies from a previously collected static dataset and is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which yields realistic policies and mitigates bias from out-of-distribution actions, but can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting high-value regions suggested by the critic, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), a plug-and-play training sample replacer that progressively replaces low-value actions with high-value actions generated by a stable actor, broadening the action exploration space while reducing the impact of low-value data. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art when combined with the basic TD3+BC.
V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks
Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition module. This decomposes high-level goals (e.g., "get ready for work") into compositional action primitives, facilitating coherent long-horizon planning. Crucially, we enforce semantic correctness through a VLM-based verification loop. Acting as a visual critic, the VLM performs rigorous rejection sampling after each subtask, filtering out "silent failures" where code executes but fails to achieve the visual goal. Experiments demonstrate that V-CAGE yields datasets with superior physical and semantic fidelity, significantly boosting the success rate and generalization of downstream policies compared to non-verified baselines.