2605.07721v1 May 08, 2026 cs.CL

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

F. V. Massoli

Citations: 5,300

h-index: 27

Arash Behboodi

Citations: 84

h-index: 5

Victor Conchello Vendrell

Citations: 2

h-index: 1

Arnau Padrés Masdemont

Citations: 22

h-index: 1

Jordi Ros-Giralt

Citations: 4

h-index: 1

Niccolò Grillo

Citations: 2

h-index: 1

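Recurrent LLM architectures are a promising approach to improving reasoning, since they perform multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro iteratively update their internal representations while retaining a standard Key-Value (KV) cache at every iteration, so memory consumption grows linearly with reasoning depth. As a result, increasing the number of reasoning iterations can make memory usage prohibitive, which limits the practical scalability of these architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a new architecture that decouples reasoning depth from memory consumption. Instead of keeping a standard KV cache for every layer and every loop, MELT maintains a single KV cache per layer that is shared across reasoning loops and updated over time through a learnable gating mechanism. To make training under this architecture stable and efficient, we propose a two-phase, chunk-wise training procedure: the first phase is an interpolated transition from the LoopLM starting model to MELT, and the second phase is attention-aligned distillation. Empirically, MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size while keeping a memory footprint comparable to those models and far smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.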

Original Abstract

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
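
The memory argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the form of the gate, so the sigmoid convex combination in `gated_cache_update` is an assumption, and the function names, tensor shapes, and model configuration are hypothetical values chosen only for illustration. The first function counts KV-cache bytes when one cache is stored per layer and per loop; the second blends each loop's new keys into a single shared cache whose size does not change.

```python
import numpy as np


def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   num_loops=1, bytes_per_elem=2):
    """Bytes needed to store keys and values for `num_loops` caches per layer."""
    # Factor 2 accounts for storing both keys and values.
    per_layer = 2 * seq_len * num_kv_heads * head_dim * bytes_per_elem
    return num_layers * num_loops * per_layer


def gated_cache_update(cache, candidate, gate_logits):
    """Blend a loop's new entries into the shared cache.

    ASSUMPTION: a sigmoid gate with a convex combination; the abstract only
    says the cache is updated by a learnable gating mechanism.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return gate * candidate + (1.0 - gate) * cache


# Hypothetical configuration, not taken from the paper.
cfg = dict(num_layers=24, seq_len=4096, num_kv_heads=8, head_dim=128)
per_loop_cache = kv_cache_bytes(**cfg, num_loops=4)  # one cache per loop
shared_cache = kv_cache_bytes(**cfg, num_loops=1)    # one cache shared by all loops

# Toy shared key cache: 16 positions, 8 dimensions, reused across 4 loops.
rng = np.random.default_rng(0)
keys = np.zeros((16, 8))
for _ in range(4):
    new_keys = rng.normal(size=keys.shape)    # stand-in for this loop's keys
    gate_logits = rng.normal(size=(16, 1))    # stand-in for learned gate logits
    keys = gated_cache_update(keys, new_keys, gate_logits)
```

In this sketch the per-loop cache grows in proportion to `num_loops`, while the shared cache keeps the same shape after every loop, which is the decoupling of reasoning depth from memory that the abstract describes.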

1 Citations

0 Influential

13.5 Altmetric

68.5 Score
