2602.24210v1 Feb 27, 2026 cs.CL

Controllable Reasoning Models Are Private Thinkers

Iryna Gurevych

Citations: 1,356

h-index: 17

Haritz Puerto

ELLIS Institute Tübingen

Citations: 179

h-index: 8

Haonan Li

Citations: 117

h-index: 4

Xudong Han

Citations: 77

h-index: 5

Timothy Baldwin

Citations: 883

h-index: 14

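AI agents built on reasoning models need access to sensitive user data. Because the reasoning traces of these models are hard to control, private information can leak unintentionally to external parties. We propose training models to follow instructions in their reasoning traces as well as in their final answers, potentially under different constraints for each. We hypothesize that better instruction following in the reasoning trace improves privacy preservation. To test this, we fine-tune models on a new instruction-following dataset that places explicit restrictions on reasoning traces, and we introduce a generation strategy that decouples reasoning from answer generation by using separate LoRA adapters. We evaluate the approach on six models from two model families, ranging from 1.7B to 14B parameters, on two instruction-following benchmarks and two privacy benchmarks. It improves instruction-following performance by up to 20.9 points and privacy benchmark results by up to 51.9 percentage points. These gains can come at the cost of task utility, because of a trade-off between reasoning performance and instruction-following ability. Overall, the results show that improving instruction following in reasoning models can substantially improve privacy, which points to a promising direction for future privacy-aware agents. Code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models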

Original Abstract

AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models
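
The decoupled generation strategy mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the `DecoupledGenerator` class, the `<think>` delimiters, the `leaks` helper, and the toy adapters are all assumptions made for illustration, and each "adapter" is just a plain function standing in for a base model with one LoRA adapter active. The idea shown is only the control flow: one adapter writes the reasoning trace, a second adapter writes the answer conditioned on that trace.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for "base model with one LoRA adapter active":
# it takes the text so far and returns a continuation.
Generator = Callable[[str], str]

# Assumed reasoning delimiters; the abstract does not specify them.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


@dataclass
class DecoupledGenerator:
    reasoning_adapter: Generator  # used only for the reasoning trace
    answer_adapter: Generator     # used only for the final answer

    def generate(self, prompt: str) -> Dict[str, str]:
        # Stage 1: the reasoning adapter continues from the opening tag.
        raw = self.reasoning_adapter(prompt + THINK_OPEN)
        # Keep only the text before the closing tag, if one was emitted.
        trace = raw.split(THINK_CLOSE)[0]
        # Stage 2: the answer adapter sees the prompt plus the closed trace.
        context = prompt + THINK_OPEN + trace + THINK_CLOSE
        answer = self.answer_adapter(context)
        return {"reasoning": trace, "answer": answer}


def leaks(text: str, private_values: List[str]) -> bool:
    """Naive check: does any private string appear verbatim in the text?"""
    lowered = text.lower()
    return any(value.lower() in lowered for value in private_values)


# Toy adapters that return fixed strings, so the sketch runs without a model.
def toy_reasoning(text: str) -> str:
    return "The user wants a table booked; refer to them only as <name>.</think>ignored"


def toy_answer(text: str) -> str:
    return "Booked a table for two at 19:00."


pipeline = DecoupledGenerator(toy_reasoning, toy_answer)
result = pipeline.generate(
    "Instruction: never write the user's name. User: Alice Example wants dinner.\n"
)
print(result["reasoning"])
print(result["answer"])
print(leaks(result["reasoning"], ["Alice Example"]))
```

With a real model, the two callables would wrap generation calls made with a different adapter active for each stage; the substring check in `leaks` is a deliberately crude proxy and not the metric used by the privacy benchmarks.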
