2606.16352v1 Jun 15, 2026 cs.LG

Communication-Efficient Verifiable Attention for LLM Inference

Tianwei Zhang
Citations: 920
h-index: 14
Jason Zeng
Citations: 9
h-index: 2
Michael Heinrich
Citations: 9
h-index: 2
Ming Wu
Citations: 7
h-index: 2
Ziqun Chen
Citations: 6
h-index: 1
Huiying Lan
Citations: 785
h-index: 5
Rui Tan
Citations: 0
h-index: 0

The computation integrity of remote large language model (LLM) serving cannot be taken for granted. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute non-linear components and to verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (VeriAttn) for accelerating verifiable LLM inference. VeriAttn offloads both the linear and non-linear computations of attention to the GPU, while the TEE performs verification. For prefill, VeriAttn uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds available GPU memory, VeriAttn partitions attention across the TEE and GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that VeriAttn achieves a 2.60-3.38× speedup over TSDP during prefill for 6k-token prompts and a 3.86-5.42× speedup during decoding for 10k-token outputs.
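The abstract does not say how the TEE verifies a linear component offloaded to the GPU. As an illustration only, the sketch below uses Freivalds' algorithm, a standard probabilistic check for outsourced matrix multiplication; it is an assumption here, not a description of how TSDP or VeriAttn does it. The function names (`matmul`, `matvec`, `freivalds_check`) are hypothetical, and integer matrices are used so that the comparison is exact.

```python
import random

def matmul(a, b):
    """Plain matrix product; stands in for the linear layer run on the untrusted GPU."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Matrix-vector product, the only operation the verifier needs."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def freivalds_check(a, b, c, rounds=20, rng=random):
    """Probabilistically check that c == a @ b without recomputing the product.

    Assumption for this sketch: all entries are integers, so equality is exact.
    A wrong c passes a single round with probability at most 1/2.
    """
    n_cols = len(b[0])
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n_cols)]
        # a @ (b @ r) needs two matrix-vector products instead of a full matrix product.
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(a, b)                  # honest result
    print(freivalds_check(a, b, c))
    c[0][0] += 1                      # tampered result
    print(freivalds_check(a, b, c))
```

In this sketch each round costs the verifier three matrix-vector products, which is why a check of this kind can be cheaper than recomputing the product inside the TEE. A correct result always passes, and a tampered one escapes all `rounds` checks with probability at most 2^-rounds.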

