2601.07072v1 Jan 11, 2026 cs.CR

검색 장벽 극복: LLM 시스템에서의 야생 환경에서의 간접 프롬프트 주입

Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

Hongyan Chang

Citations: 5

h-index: 1

Ergute Bao

Citations: 147

h-index: 5

Xinjian Luo

Citations: 448

h-index: 9

Ting Yu

Citations: 43

h-index: 4

대규모 언어 모델(LLM)은 점점 더 많은 정보를 외부 데이터베이스에서 가져오는 데 의존합니다. 이는 새로운 공격 경로를 야기합니다. 바로 간접 프롬프트 주입(Indirect Prompt Injection, IPI) 공격으로, 악성 명령이 데이터베이스에 숨겨져 있다가 모델이 해당 정보를 검색하면 모델의 동작을 조작합니다. 이전 연구에서는 이러한 위험을 강조했지만, 실제로 악성 콘텐츠가 검색되도록 하는 가장 어려운 단계를 회피하는 경우가 많았습니다. 실제로는 최적화되지 않은 IPI 공격은 자연스러운 검색 쿼리에서 거의 검색되지 않으며, 이는 실제 영향에 대한 불확실성을 야기합니다. 저희는 이 문제를 해결하기 위해 악성 콘텐츠를 검색을 보장하는 트리거 조각과 임의의 공격 목표를 인코딩하는 공격 조각으로 분해했습니다. 이 아이디어를 바탕으로, 효율적이고 효과적인 블랙박스 공격 알고리즘을 설계했습니다. 이 알고리즘은 임의의 공격 조각에 대해 검색을 보장하는 간결한 트리거 조각을 생성합니다. 저희의 공격은 임베딩 모델에 대한 API 액세스만 필요하며, 비용 효율적입니다(OpenAI의 임베딩 모델에서 사용자 쿼리당 약 0.21달러). 또한 11개의 벤치마크와 8개의 임베딩 모델(오픈 소스 모델 및 독점 서비스 포함)에서 거의 100%의 검색률을 달성했습니다. 이 공격을 기반으로, 저희는 자연스러운 쿼리와 현실적인 외부 데이터베이스 환경에서 최초의 완전한 IPI 공격을 시연합니다. 이는 RAG(Retrieval-Augmented Generation) 및 에이전트 시스템을 포함하며, 다양한 공격 목표를 포함합니다. 이러한 결과는 IPI가 실용적이고 심각한 위협임을 입증합니다. 예를 들어, 사용자가 자주 묻는 질문에 대한 이메일을 요약하는 자연스러운 쿼리를 수행할 때, 단 하나의 악성 이메일만으로 GPT-4o를 속여 SSH 키를 추출하도록 유도할 수 있으며, 다중 에이전트 워크플로우에서 80% 이상의 성공률을 보였습니다. 또한 여러 방어 기법을 평가한 결과, 악성 텍스트의 검색을 막기에 충분하지 않으며, 이는 검색 자체가 중요한 취약점임을 강조합니다.

Original Abstract

Large language models (LLMs) increasingly rely on retrieving information from external corpora. This creates a new attack surface: indirect prompt injection (IPI), where hidden instructions are planted in the corpora and hijack model behavior once retrieved. Previous studies have highlighted this risk but often avoid the hardest step: ensuring that malicious content is actually retrieved. In practice, unoptimized IPI is rarely retrieved under natural queries, which leaves its real-world impact unclear. We address this challenge by decomposing the malicious content into a trigger fragment that guarantees retrieval and an attack fragment that encodes arbitrary attack objectives. Based on this idea, we design an efficient and effective black-box attack algorithm that constructs a compact trigger fragment to guarantee retrieval for any attack fragment. Our attack requires only API access to embedding models, is cost-efficient (as little as $0.21 per target user query on OpenAI's embedding models), and achieves near-100% retrieval across 11 benchmarks and 8 embedding models (including both open-source models and proprietary services). Based on this attack, we present the first end-to-end IPI exploits under natural queries and realistic external corpora, spanning both RAG and agentic systems with diverse attack objectives. These results establish IPI as a practical and severe threat: when a user issued a natural query to summarize emails on frequently asked topics, a single poisoned email was sufficient to coerce GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow. We further evaluate several defenses and find that they are insufficient to prevent the retrieval of malicious text, highlighting retrieval as a critical open vulnerability.

5 Citations

0 Influential

4.5 Altmetric

27.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!