2604.00938v1 Apr 01, 2026 cs.LG

WARP: Guaranteed Inner-Layer Repair of NLP Transformers

Hsin-Ling Hsu

Citations: 23

h-index: 2

Min-Yu Chen

Citations: 5

h-index: 1

Nai-Chia Chen

Citations: 0

h-index: 0

Yanru Chen

Citations: 1,021

h-index: 7

Yi Chang

Citations: 14

h-index: 2

Fang Yu

Citations: 1

h-index: 1

Original Abstract

Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
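To make the formulation in the abstract concrete, here is a minimal sketch of the simplest possible case: a single linearized margin constraint, for which the minimum-norm quadratic program has a closed-form solution. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not give WARP's actual objective, its remain-set constraints, or its preprocessing step, so none of those are reproduced here. The function names `min_norm_repair` and `certified_radius`, and the toy linear model, are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_repair(gap: float, grad: Sequence[float], margin: float) -> List[float]:
    """Closed-form solution of a one-constraint convex QP (hypothetical helper):

        minimize   ||delta||^2
        subject to gap + grad . delta >= margin

    where `gap` is the current logit gap and `grad` is its gradient with
    respect to the weights being repaired (the first-order linearization).
    """
    shortfall = margin - gap
    if shortfall <= 0.0:
        # The constraint already holds, so the smallest update is no update.
        return [0.0 for _ in grad]
    norm_sq = dot(grad, grad)
    if norm_sq == 0.0:
        raise ValueError("zero gradient: the linearized constraint is infeasible")
    scale = shortfall / norm_sq
    return [scale * g for g in grad]


def certified_radius(gap: float, lipschitz: float) -> float:
    """Standard Lipschitz argument: if the logit gap is L-Lipschitz in the
    input, the gap stays positive for perturbations smaller than gap / L."""
    return max(gap, 0.0) / lipschitz


# Toy assumption: a linear "model" whose logit gap is g(w) = w . x, so the
# first-order linearization in w is exact and the margin is met exactly.
x = [1.0, 2.0, -1.0]
w = [0.5, -0.5, 0.5]

gap_before = dot(w, x)                      # negative: the input is misclassified
delta = min_norm_repair(gap_before, x, margin=0.5)
w_repaired = [wi + di for wi, di in zip(w, delta)]
gap_after = dot(w_repaired, x)

# For this linear gap, the Lipschitz constant in the input is ||w_repaired||_2.
lipschitz = dot(w_repaired, w_repaired) ** 0.5
radius = certified_radius(gap_after, lipschitz)
```

With several repair and preservation constraints, the problem no longer has this closed form and would need a general QP solver. For a nonlinear Transformer, the linearization is only approximate, which is why the abstract conditions its guarantees on the first-order approximation holding.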
