2601.14945v1 Jan 21, 2026 cs.RO

TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control

Zheng Li (Citations: 506, h-index: 8)
Haoran Wang (Citations: 47, h-index: 1)
Ruofei Bai (Citations: 72, h-index: 5)
Jun Li (Citations: 29, h-index: 4)
Meng Yee Michael Chuah (Citations: 2, h-index: 1)
W.-Y. Yau (Citations: 1, h-index: 1)
Yuteng Sun (Citations: 0, h-index: 0)

Abstract

Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, which limits them to a low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot, causing failures in dynamic environments where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables control updates at approximately 9 Hz on edge hardware (vs. approximately 2.4 Hz for baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy in which the policy learns predictive compensation using stale semantic intent alongside real-time proprioception. Additionally, we address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. TIDAL is architectural, making it orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
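
The dual-frequency idea in the abstract can be illustrated with a toy scheduler. The sketch below is not the authors' implementation: `encode_intent`, `run_dual_frequency`, `macro_period`, the 1-D target, and the linear "flow" field are all hypothetical stand-ins chosen only to show how a slow loop that caches an embedding can be interleaved with a fast loop that performs one integration step per tick and executes immediately.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TickLog:
    step: int        # index of the micro-control tick
    intent_age: int  # ticks elapsed since the cached embedding was refreshed
    action: float    # action executed at this tick (toy 1-D position)


def encode_intent(target_position: float) -> float:
    """Hypothetical stand-in for the slow VLA backbone.

    A real system would return a semantic embedding; here the
    'embedding' is simply the observed target position.
    """
    return target_position


def run_dual_frequency(num_steps: int, macro_period: int,
                       target_speed: float = 0.1,
                       dt: float = 0.5) -> List[TickLog]:
    """Interleave a low-frequency intent refresh with per-tick control.

    Every `macro_period` ticks the macro loop refreshes the cached intent.
    On every tick the micro loop takes a single Euler step of a toy flow
    field (pulling the state toward the cached intent) and executes it.
    """
    cached_intent = 0.0
    intent_age = 0
    position = 0.0  # toy proprioceptive state
    logs: List[TickLog] = []
    for step in range(num_steps):
        target = target_speed * step  # the target keeps moving
        if step % macro_period == 0:
            # Macro-intent loop: expensive call, result is cached.
            cached_intent = encode_intent(target)
            intent_age = 0
        # Micro-control loop: one integration step using the (possibly
        # stale) cached intent and the current state, then execute.
        position = position + dt * (cached_intent - position)
        logs.append(TickLog(step, intent_age, position))
        intent_age += 1
    return logs


if __name__ == "__main__":
    for entry in run_dual_frequency(num_steps=8, macro_period=4):
        print(entry)
```

In this toy setup the cached intent is up to `macro_period - 1` ticks old when the micro loop consumes it, which is the kind of staleness the abstract's temporally misaligned training strategy is meant to compensate for. As a sanity check on the reported numbers, 9 Hz / 2.4 Hz ≈ 3.75, which is consistent with the roughly 4x increase in feedback frequency stated in the abstract.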
