2602.03359v1 Feb 03, 2026 cs.LG

MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling

Yehui Tang

Citations: 18

h-index: 3

Ning Ding

Citations: 36

h-index: 3

Fangcheng Liu

Citations: 175

h-index: 7

Kyungrae Kim

Citations: 11

h-index: 2

Linji Hao

Citations: 5

h-index: 1

Kyeng-Hun Lee

Citations: 11

h-index: 3

Hyeonmok Ko

Citations: 11

h-index: 3

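Improving the performance of large language models (LLMs) usually means increasing the parameter count or the amount of inference-time computation. These approaches are impractical on edge devices with limited RAM and NPU resources. Even so, deploying capable LLMs on edge devices such as smartphones is important for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a new system that scales LLM capacity using storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that inject pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, a re-parameterization strategy folds the parameter matrices used during training into a compact static lookup table. By offloading knowledge to ROM, MeKi decouples model capacity from computational cost and adds no inference latency overhead. Extensive experiments show that MeKi significantly outperforms dense LLM baselines at identical inference speed, validating memory-based scaling for on-device LLMs. The project homepage is at https://github.com/ningding-o/MeKi.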

Original Abstract

Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test-time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLMs on edge devices such as smartphones remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that inject pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy to fold parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of the memory-based scaling paradigm for on-device LLMs. The project homepage is at https://github.com/ningding-o/MeKi.
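
The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea behind folding training-time matrices into a static lookup table. It assumes, purely for illustration, that a memory expert is a per-token embedding followed by a linear projection, and that the result is added to the hidden state. The names `embed`, `proj`, `table`, and `inject`, and the toy sizes, are hypothetical and do not come from the paper. Because the memory vector depends only on the token ID, the product of the two matrices can be computed once offline, and inference reduces to a table read.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def make_matrix(rows, cols, rng):
    """Random matrix standing in for learned parameters or activations."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def train_time_memory(token_ids, embed, proj):
    """Assumed training-time path: look up a per-token vector, then project it."""
    looked_up = [embed[t] for t in token_ids]
    return matmul(looked_up, proj)

def fold_to_lookup_table(embed, proj):
    """Re-parameterize: precompute embed @ proj once, leaving one static table."""
    return matmul(embed, proj)

def inference_memory(token_ids, table):
    """Inference-time path: a pure table read with no matrix multiply."""
    return [table[t] for t in token_ids]

def inject(hidden, memory):
    """Add each token's retrieved memory vector to its hidden state (assumed)."""
    return [[h + m for h, m in zip(h_row, m_row)]
            for h_row, m_row in zip(hidden, memory)]

rng = random.Random(0)
VOCAB, D_MEM, D_MODEL = 16, 8, 4  # toy sizes, chosen only for illustration
embed = make_matrix(VOCAB, D_MEM, rng)   # hypothetical training-time embedding
proj = make_matrix(D_MEM, D_MODEL, rng)  # hypothetical training-time projection
table = fold_to_lookup_table(embed, proj)

token_ids = [3, 7, 3, 15]
hidden = make_matrix(len(token_ids), D_MODEL, rng)

out_train = inject(hidden, train_time_memory(token_ids, embed, proj))
out_infer = inject(hidden, inference_memory(token_ids, table))

# Largest element-wise gap between the two paths; it should be negligible.
max_diff = max(abs(a - b)
               for r1, r2 in zip(out_train, out_infer)
               for a, b in zip(r1, r2))
print(max_diff < 1e-12)
```

In this sketch the folded table has one row per vocabulary entry, so it costs storage rather than compute: the inference path performs only an index and an addition per token. How MeKi actually structures its experts, and how the table is kept compact, is described in the paper itself.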

3 Citations

0 Influential

36.695286648076 Altmetric

186.5 Score
