2601.18844v1 Jan 26, 2026 cs.SE

Reducing False Positives in Static Bug Detection with LLMs: An Empirical Study in Industry

Xueying Du, Jiayi Feng, Yi Zou, Wei Xu, Jie Ma, Wei Zhang, Sisi Liu, Xin Peng, Yiling Lou

Abstract

Static analysis tools (SATs) are widely adopted in both academia and industry for improving software quality, yet their practical use is often hindered by high false positive rates, especially in large-scale enterprise systems. These false alarms demand substantial manual inspection, creating severe inefficiencies in industrial code review. While recent work has demonstrated the potential of large language models (LLMs) for false alarm reduction on open-source benchmarks, their effectiveness in real-world enterprise settings remains unclear. To bridge this gap, we conduct the first comprehensive empirical study of diverse LLM-based false alarm reduction techniques in an industrial context at Tencent, one of the largest IT companies in China. Using data from Tencent's enterprise-customized SAT on its large-scale Advertising and Marketing Services software, we construct a dataset of 433 alarms (328 false positives, 105 true positives) covering three common bug types. Through interviewing developers and analyzing the data, our results highlight the prevalence of false positives, which waste substantial manual effort (e.g., 10-20 minutes of manual inspection per alarm). Meanwhile, our results show the huge potential of LLMs for reducing false alarms in industrial settings (e.g., hybrid techniques combining LLMs and static analysis eliminate 94-98% of false positives with high recall). Furthermore, LLM-based techniques are cost-effective, with per-alarm costs of 2.1-109.5 seconds and $0.0011-$0.12, representing orders-of-magnitude savings compared to manual review. Finally, our case analysis further identifies key limitations of LLM-based false alarm reduction in industrial settings.
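To put the abstract's figures side by side, the short Python sketch below scales the quoted per-alarm numbers to the full 433-alarm dataset. It is a back-of-the-envelope illustration only, not the authors' evaluation code. It assumes the per-alarm ranges apply uniformly to every alarm and that the 94-98% elimination rate applies to all 328 false positives, neither of which the abstract states. All variable and function names are hypothetical.

```python
# Figures quoted in the abstract.
TOTAL_ALARMS = 433
FALSE_POSITIVES = 328
TRUE_POSITIVES = 105
MANUAL_MIN_PER_ALARM = (10, 20)     # minutes of manual inspection per alarm
LLM_SEC_PER_ALARM = (2.1, 109.5)    # seconds per alarm, across techniques
LLM_USD_PER_ALARM = (0.0011, 0.12)  # dollars per alarm, across techniques
FP_ELIMINATION_RATE = (0.94, 0.98)  # hybrid LLM + static analysis techniques


def scale(per_alarm, n=TOTAL_ALARMS):
    """Scale a (low, high) per-alarm range to n alarms (assumes uniform cost)."""
    return tuple(x * n for x in per_alarm)


# Share of alarms in the dataset that are false positives.
fp_share = FALSE_POSITIVES / TOTAL_ALARMS

# Total effort to triage the whole dataset, manually versus with an LLM.
manual_hours = tuple(m / 60 for m in scale(MANUAL_MIN_PER_ALARM))
llm_hours = tuple(s / 3600 for s in scale(LLM_SEC_PER_ALARM))
llm_usd = scale(LLM_USD_PER_ALARM)

# Assumption: the quoted elimination rate applies to all 328 false positives.
fps_removed = tuple(r * FALSE_POSITIVES for r in FP_ELIMINATION_RATE)

print(f"False-positive share: {fp_share:.1%}")
print(f"Manual review: {manual_hours[0]:.1f}-{manual_hours[1]:.1f} hours")
print(f"LLM review: {llm_hours[0]:.2f}-{llm_hours[1]:.2f} hours")
print(f"LLM cost: ${llm_usd[0]:.2f}-${llm_usd[1]:.2f}")
print(f"False positives removed: {fps_removed[0]:.0f}-{fps_removed[1]:.0f} of {FALSE_POSITIVES}")
```

Under these assumptions, about three quarters of the alarms are false positives, and manual triage of the full dataset would take roughly 72 to 144 hours, against roughly 0.25 to 13 hours and well under $100 for the LLM-based techniques.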
