2601.16354v1 Jan 22, 2026 cs.CR

NOIR: Privacy-Preserving Generation of Code with Open-Source LLMs

Khoa Nguyen, Khiem Ton, Nhathai Phan, Issa Khalil, Khang Tran, Cristian Borcea, Ruoming Jin, Abdallah Khreishah, My T. Thai

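Code generation with large language models (LLMs) improves software development productivity, but it creates intellectual property and data security risks because the service provider (cloud) can observe the client's prompts and the generated code. To address this, we propose NOIR, the first framework that protects both the client's prompts and the generated code from the cloud. NOIR runs an encoder and a decoder on the client side: the client sends prompt embeddings to the cloud, receives enriched embeddings from the LLM, and decodes them locally to produce the code. Because the cloud could still infer the prompt and the generated code from the embeddings, NOIR introduces a new mechanism to achieve indistinguishability. It combines local differential privacy applied at the token embedding level, over the vocabulary used in the prompts and code, with a data-independent, randomized tokenizer on the client side. Together, these components defend against reconstruction and frequency analysis attacks by an honest-but-curious cloud. Extensive analysis and experiments with open-source LLMs show that NOIR outperforms existing baselines while maintaining strong privacy, reaching Pass@1 of 76.7 and 77.4 on Evalplus (MBPP and HumanEval) and Pass@1 of 38.7 on BigCodeBench, a 1.77% drop from the original LLM.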

Original Abstract

Although boosting software development performance, large language model (LLM)-powered code generation introduces intellectual property and data security risks rooted in the fact that a service provider (cloud) observes a client's prompts and generated code, which can be proprietary in commercial systems. To mitigate this problem, we propose NOIR, the first framework to protect the client's prompts and generated code from the cloud. NOIR uses an encoder and a decoder at the client to encode and send the prompts' embeddings to the cloud to get enriched embeddings from the LLM, which are then decoded to generate the code locally at the client. Since the cloud can use the embeddings to infer the prompt and the generated code, NOIR introduces a new mechanism to achieve indistinguishability, a local differential privacy protection at the token embedding level, in the vocabulary used in the prompts and code, and a data-independent and randomized tokenizer on the client side. These components effectively defend against reconstruction and frequency analysis attacks by an honest-but-curious cloud. Extensive analysis and results using open-source LLMs show that NOIR significantly outperforms existing baselines on benchmarks, including the Evalplus (MBPP and HumanEval, Pass@1 of 76.7 and 77.4), and BigCodeBench (Pass@1 of 38.7, only a 1.77% drop from the original LLM) under strong privacy against attacks.
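The client-side protections named in the abstract can be illustrated with a minimal sketch. This is not NOIR's actual mechanism, which the abstract does not specify. It assumes a textbook Laplace mechanism (clip each token embedding to an L1 bound, then add Laplace noise scaled to the privacy budget) and a toy "randomized tokenizer" that shuffles token ids with a client-held secret seed. All function names, the embedding table, and the parameter values are hypothetical.

```python
import random

def make_secret_tokenizer(vocab, seed):
    """Assign token ids by a secret, data-independent shuffle (illustrative only)."""
    ids = list(range(len(vocab)))
    random.Random(seed).shuffle(ids)  # depends only on the seed, never on the data
    return {tok: i for tok, i in zip(vocab, ids)}

def clip_l1(vec, bound):
    """Scale vec down so that its L1 norm is at most `bound`."""
    norm = sum(abs(x) for x in vec)
    if norm <= bound:
        return list(vec)
    scale = bound / norm
    return [x * scale for x in vec]

def laplace_perturb(vec, epsilon, bound, rng):
    """Standard Laplace mechanism on an L1-clipped vector.

    Any two clipped vectors differ by at most 2 * bound in L1 norm, so
    per-coordinate Laplace noise with scale 2 * bound / epsilon gives
    epsilon-local differential privacy for a single embedding.
    """
    clipped = clip_l1(vec, bound)
    b = 2.0 * bound / epsilon
    # The difference of two i.i.d. exponentials with mean b is Laplace(0, b).
    return [x + rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
            for x in clipped]

def encode_prompt(tokens, tok2id, embed_table, epsilon, bound, rng):
    """Map tokens to secret ids, look up embeddings, and add noise before upload."""
    return [laplace_perturb(embed_table[tok2id[t]], epsilon, bound, rng)
            for t in tokens]

if __name__ == "__main__":
    vocab = ["def", "add", "(", ")", "return"]  # toy vocabulary (assumption)
    rng = random.Random(0)
    # Hypothetical 4-dimensional embedding table, indexed by secret token id.
    table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in vocab]
    tok2id = make_secret_tokenizer(vocab, seed=1234)
    noisy = encode_prompt(["def", "add", "("], tok2id, table,
                          epsilon=2.0, bound=1.0, rng=rng)
    print(len(noisy), "noisy embeddings would be sent to the cloud")
```

In this sketch only the perturbed embeddings leave the client, while the id mapping and the noise seed stay local. A smaller `epsilon` adds more noise, which is the usual privacy versus utility trade-off the reported Pass@1 figures would be measured against.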
