2601.13528v1 Jan 20, 2026 cs.CR

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Christina Q. Knight (Citations: 29, h-index: 4)

Mrinank Sharma (Citations: 1,682, h-index: 10)

Jackson Kaunismaa (Citations: 4, h-index: 1)

Avery Griffin (Citations: 14, h-index: 3)

John Hughes (Citations: 7,798, h-index: 36)

Erik Jones (Citations: 133, h-index: 4)

Abstract

Model developers implement safeguards in frontier models to prevent misuse, for example by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in domains adjacent to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem-level risks with output-level safeguards.
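To make the "approximately 40% of the capability gap" figure concrete, the sketch below shows one common way such a fraction can be computed: the improvement of the fine-tuned model over the base model, divided by the distance between the base model and the unrestricted frontier model. This is an assumption about how the metric is defined, since the abstract does not give the formula. The function name `gap_recovered` and all scores are hypothetical and are not results from the paper.

```python
def gap_recovered(base: float, fine_tuned: float, frontier: float) -> float:
    """Fraction of the base-to-frontier score gap closed by fine-tuning.

    Assumed definition (not taken from the paper):
        (fine_tuned - base) / (frontier - base)
    """
    gap = frontier - base
    if gap <= 0:
        raise ValueError("frontier score must exceed base score")
    return (fine_tuned - base) / gap


# Illustrative benchmark scores only (hypothetical, in percentage points).
fraction = gap_recovered(base=20, fine_tuned=40, frontier=70)
print(f"{fraction:.0%}")  # prints "40%"
```

Under this definition, a value of 0 means fine-tuning gave no improvement over the base model, and a value of 1 means the fine-tuned model matched the unrestricted frontier model.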

5 Citations

1 Influential

18 Altmetric

97.0 Score
