2603.21096v1 Mar 22, 2026 cs.LG

Mixture of Chapters: Scaling Learnt Memory in Transformers

Pritish Saha (Citations: 3, h-index: 1)

Tasmay Pankaj Tibrewal (Citations: 0, h-index: 0)

Ankit Meda (Citations: 0, h-index: 0)

Kunal Singh (Citations: 212, h-index: 4)

Pradeep Moturi (Citations: 14, h-index: 2)

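Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. This paper introduces learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, which transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, the authors propose chapter-based routing inspired by Mixture-of-Experts architectures, which partitions the memory bank into chapters and trains a router to select the relevant subset for each input. This allows scaling to 262K memory tokens while keeping computation tractable. The method is evaluated against standard transformers in iso-FLOP settings, on pre-training and instruction fine-tuning, across a range of benchmarks. The proposed models outperform the iso-FLOP baselines, which suggests a new axis of scaling and shows that explicit associative memory provides capacity complementary to what is captured implicitly in model parameters. The authors also observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., from pre-training to instruction fine-tuning).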

Original Abstract

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines, suggesting scope for a new axis of scaling and demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).
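
The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of the idea rather than the authors' implementation. It assumes a hypothetical linear router applied to the mean-pooled input, hard top-k chapter selection, and single-head cross-attention without learned query/key/value projections. The names `chapter_memory_read`, `w_router`, and `top_k`, and all sizes, are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chapter_memory_read(h, memory, w_router, top_k):
    """Read from a chaptered memory bank with single-head cross-attention.

    h:        (seq_len, d) hidden states for one input sequence
    memory:   (num_chapters, chapter_size, d) latent memory tokens
    w_router: (d, num_chapters) router weights (hypothetical linear router)
    top_k:    number of chapters selected for this input
    """
    num_chapters, chapter_size, d = memory.shape
    # Assumption: route once per input, using the mean-pooled hidden state.
    router_logits = h.mean(axis=0) @ w_router          # (num_chapters,)
    chosen = np.argsort(router_logits)[-top_k:]        # indices of the top-k chapters
    selected = memory[chosen].reshape(top_k * chapter_size, d)
    # Cross-attention: queries come from the sequence,
    # keys and values are the selected memory tokens.
    scores = h @ selected.T / np.sqrt(d)               # (seq_len, top_k * chapter_size)
    weights = softmax(scores, axis=-1)
    retrieved = weights @ selected                     # (seq_len, d)
    # Residual connection: add the retrieved content to the hidden states.
    return h + retrieved, chosen, weights

rng = np.random.default_rng(0)
d, num_chapters, chapter_size, top_k, seq_len = 16, 8, 32, 2, 5
memory = rng.normal(size=(num_chapters, chapter_size, d))   # randomly initialized bank
w_router = rng.normal(size=(d, num_chapters))
h = rng.normal(size=(seq_len, d))

out, chosen, weights = chapter_memory_read(h, memory, w_router, top_k)
print(out.shape, weights.shape)
```

In this sketch each query attends over `top_k * chapter_size` = 64 memory tokens instead of all 256, which is the sense in which routing keeps attention cost tractable as the bank grows. Hard top-k selection as written passes no gradient to the router; how the paper trains its router is not stated in the abstract.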

0 Citations

0 Influential

2 Altmetric

10.0 Score
