2604.15768v2 Apr 17, 2026 cs.DC

cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

Haoquan Long (Citations: 0, h-index: 0)
Dingwen Tao (Citations: 26, h-index: 3)
Daran Sun (Citations: 0, h-index: 0)
Bowen Kan (Citations: 40, h-index: 4)
Hairui Zhao (Citations: 29, h-index: 3)
Haoxu Li (Citations: 8, h-index: 2)
Yicheng Liu (Citations: 29, h-index: 2)
Ankang Feng (Citations: 35, h-index: 3)
Yida Gu (Citations: 18, h-index: 3)
Zhenyu Li (Citations: 39, h-index: 5)
Honghui Shang (Citations: 5, h-index: 2)
Yunquan Zhang (Citations: 19, h-index: 3)
Ninghui Sun (Citations: 30, h-index: 3)
Guangming Tan (Citations: 11, h-index: 2)
Peng Zhou (Citations: 17, h-index: 2)
Wenjing Huang (Citations: 12, h-index: 2)

Abstract

AI-driven methods have demonstrated considerable success in tackling the central challenge of accurately solving the Schrödinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, the NNQS-SCI (Selected Configuration Interaction) method stands out as a state-of-the-art technique, recognized for its high accuracy and scalability. However, its application to larger systems is severely constrained by a hybrid CPU-GPU architecture. Specifically, centralized CPU-based global de-duplication creates a severe scalability barrier due to communication bottlenecks, while host-resident coupled-configuration generation induces prohibitive computational overheads. We introduce cuNNQS-SCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. cuNNQS-SCI first integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. To address compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled-configuration generation. Finally, to break the single-GPU memory barrier exposed by this full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation demonstrates that cuNNQS-SCI fundamentally expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32× end-to-end speedup over the highly optimized NNQS-SCI baseline while preserving the same chemical accuracy. Furthermore, it demonstrates excellent distributed performance, maintaining over 90% parallel efficiency in strong scaling tests.
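
The abstract does not say how the distributed, load-balanced global de-duplication works internally. As a rough illustration of the general idea only, the sketch below uses a common hash-partitioning ("owner computes") scheme: each configuration is assigned to one owning rank by a hash, every rank sends its configurations to their owners, and each owner removes duplicates locally, so no central node ever sees the full set. This may differ from what the paper actually does. The names `owner_rank` and `distributed_dedup`, the integer bitstring encoding of configurations, and the in-process simulation of ranks are all assumptions made for this example.

```python
from collections import defaultdict


def owner_rank(config: int, num_ranks: int) -> int:
    """Map a configuration (occupation bitstring stored as an int) to its owning rank.

    Hypothetical helper: a 64-bit multiplicative hash spreads similar bit
    patterns across ranks so that ownership is roughly balanced.
    """
    h = (config * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return (h >> 32) % num_ranks


def distributed_dedup(local_configs: list[list[int]]) -> list[list[int]]:
    """Simulate hash-partitioned de-duplication across ranks in one process.

    local_configs[r] holds the (possibly duplicated) configurations produced
    on rank r. Returns, for each rank, the sorted unique configurations it owns.
    """
    num_ranks = len(local_configs)

    # Step 1: each rank drops its own duplicates and buckets configs by owner.
    outboxes = [defaultdict(set) for _ in range(num_ranks)]
    for rank, configs in enumerate(local_configs):
        for config in configs:
            outboxes[rank][owner_rank(config, num_ranks)].add(config)

    # Step 2: stand-in for an all-to-all exchange; each owner merges the
    # buckets addressed to it, which removes cross-rank duplicates.
    owned = [set() for _ in range(num_ranks)]
    for rank in range(num_ranks):
        for dest, bucket in outboxes[rank].items():
            owned[dest] |= bucket

    return [sorted(configs) for configs in owned]


if __name__ == "__main__":
    # Three simulated ranks that generated overlapping configurations.
    per_rank = [[0b0011, 0b0101, 0b0011], [0b0101, 0b1001], [0b1001, 0b0110, 0b0011]]
    for rank, configs in enumerate(distributed_dedup(per_rank)):
        print(rank, [bin(c) for c in configs])
```

In a real multi-GPU setting the second step would be a collective exchange between devices rather than a Python loop. The point of the sketch is only that ownership by hash makes each configuration unique globally without gathering everything on one host.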
