2604.19167v1 Apr 21, 2026 cs.LG

LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

Xu-Yao Zhang

Yi Yang

Siqing Song

Chuang Wang

Yong Lang

Original Abstract

Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via post-training quantization (PTQ); (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained on only 0.016B tokens with a single GPU, surpasses existing state-of-the-art binarization methods in W2A4 quantization settings across language modeling, commonsense QA, and language understanding tasks. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing the extra high-precision channels or rotation matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited settings.
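To make the W(1+1)A4 notation more concrete, the following is a minimal Python sketch of what such a format could look like. It is not the authors' implementation, and the abstract does not specify these details. The function names are hypothetical. The sketch assumes that the extra weight bit is a group-wise bitmap selecting between two per-group scales (split at the group's mean magnitude), and that the learnable activation factor acts as a clipping multiplier on a per-token dynamic scale. Here that factor is a fixed argument named `clip_factor`; the distillation stages that would learn it are not shown.

```python
from typing import List, Tuple


def binarize_group(weights: List[float]) -> Tuple[List[int], List[int], Tuple[float, float]]:
    """Binarize one weight group into a 1-bit sign plus a 1-bit bitmap.

    Assumption (not from the abstract): the bitmap marks each weight as
    "large" or "small" in magnitude, split at the group's mean magnitude,
    and each class gets its own full-precision scale.
    """
    mags = [abs(w) for w in weights]
    threshold = sum(mags) / len(mags)
    signs = [1 if w >= 0 else -1 for w in weights]
    bitmap = [1 if m > threshold else 0 for m in mags]
    scales = []
    for cls in (0, 1):
        members = [m for m, b in zip(mags, bitmap) if b == cls]
        # The mean magnitude of a class is the usual scale for sign binarization.
        scales.append(sum(members) / len(members) if members else 0.0)
    return signs, bitmap, (scales[0], scales[1])


def dequantize_group(signs: List[int], bitmap: List[int],
                     scales: Tuple[float, float]) -> List[float]:
    """Reconstruct approximate weights: sign times the scale chosen by the bitmap."""
    return [s * scales[b] for s, b in zip(signs, bitmap)]


def quantize_activations_4bit(x: List[float], clip_factor: float = 1.0) -> Tuple[List[int], float]:
    """Dynamic symmetric 4-bit quantization of one token's activations.

    The scale is recomputed from the current input (hence "dynamic").
    clip_factor is a hypothetical stand-in for a learnable quantization
    factor; values below 1.0 clip outliers in exchange for finer resolution.
    """
    max_abs = max(abs(v) for v in x)
    scale = clip_factor * max_abs / 7.0 if max_abs > 0 else 1.0
    # Signed 4-bit integers cover the range [-8, 7].
    q = [max(-8, min(7, round(v / scale))) for v in x]
    return q, scale


if __name__ == "__main__":
    signs, bitmap, scales = binarize_group([0.5, -0.1, 0.9, -0.3])
    print(dequantize_group(signs, bitmap, scales))
    print(quantize_activations_4bit([7.0, -3.0, 0.0, 2.0]))
```

Under these assumptions each weight costs two bits (sign plus bitmap) and each group stores two scales, which is one plausible reading of the "1+1" in the notation.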
