2601.07892v1 Jan 12, 2026 cs.LG

Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification

Guanghua Yu (Citations: 50, h-index: 4)
Jianchen Zhu (Citations: 13, h-index: 3)
Hong Huang (Citations: 20, h-index: 3)
Decheng Wu (Citations: 60, h-index: 4)
Qiangqiang Hu (Citations: 188, h-index: 9)
Jinhai Yang (Citations: 201, h-index: 4)
Xue Liu (Citations: 17, h-index: 3)
Dapeng Wu (Citations: 25, h-index: 3)

Abstract

The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, and 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces 3:4 fine-grained sparsity that achieves a regularized 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify a weight trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA-3.2 across five benchmarks demonstrate that Sherry matches state-of-the-art ternary performance while significantly reducing model size. Notably, on an Intel i7-14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and a 10% speedup. The code is available at https://github.com/Tencent/AngelSlim.
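The bit-width figures in the abstract follow from a counting argument, sketched below in Python. The sketch assumes that "3:4" means exactly three nonzero weights (and therefore exactly one zero) in every block of four, which the abstract does not spell out. Under that assumption a block has 4 zero positions times 2^3 sign patterns, or 32 states, which fits exactly in 5 bits (5/4 = 1.25 bits per weight). For comparison, three unconstrained ternary weights have 27 states and also need 5 bits (5/3, about 1.67 bits per weight). The bit layout used here (two bits for the zero position, three sign bits) and the function names `encode_block` and `decode_block` are hypothetical illustrations, not the encoding used by Sherry or AngelSlim.

```python
from itertools import product


def encode_block(block):
    """Pack four ternary weights with exactly one zero into a 5-bit code.

    Hypothetical layout: bits 4-3 hold the zero position (0..3),
    bits 2-0 hold the signs of the three nonzero weights in order
    (1 means +1, 0 means -1).
    """
    block = list(block)
    if len(block) != 4 or any(w not in (-1, 0, 1) for w in block):
        raise ValueError("expected four weights in {-1, 0, +1}")
    if block.count(0) != 1:
        raise ValueError("expected exactly one zero per block (assumed 3:4 pattern)")
    zero_pos = block.index(0)
    sign_bits = 0
    for w in block:
        if w != 0:
            sign_bits = (sign_bits << 1) | (1 if w > 0 else 0)
    return (zero_pos << 3) | sign_bits


def decode_block(code):
    """Invert encode_block: recover the four weights from a 5-bit code."""
    if not 0 <= code < 32:
        raise ValueError("code must fit in 5 bits")
    zero_pos = code >> 3
    sign_bits = code & 0b111
    block = []
    shift = 2  # the first nonzero weight is stored in the highest sign bit
    for i in range(4):
        if i == zero_pos:
            block.append(0)
        else:
            block.append(1 if (sign_bits >> shift) & 1 else -1)
            shift -= 1
    return block


def all_valid_blocks():
    """Enumerate every block of four ternary weights with exactly one zero."""
    return [list(b) for b in product((-1, 0, 1), repeat=4) if b.count(0) == 1]


if __name__ == "__main__":
    blocks = all_valid_blocks()
    codes = {encode_block(b) for b in blocks}
    print(len(blocks), "valid blocks,", len(codes), "distinct codes")
    print("bits per weight:", 5 / 4)
```

Because the 32 valid blocks map one-to-one onto the 32 five-bit codes, no code is wasted under this assumption, and 1.25 bits per weight is 25% smaller than 5/3 bits per weight, which matches the bit savings quoted in the abstract.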

3 Citations

0 Influential

55.52 Altmetric

280.6 Score
