2606.05817v1 Jun 04, 2026 cs.LG

Consistency Training Along the Transformer Stack

D. Africa
Neil Shah
Sukrati Gautam
Arav Dhoot
Bryan Maruyama
Rohan Kapoor
R. Sidey
Prakhar Gupta (Carnegie Mellon University)
Zi Huang
Caroline Wei

Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another. Finally, we identify a shared residual-stream mechanism underlying Activation Consistency Training (ACT), MLPCT, and AttCT, and find that Bias-augmented Consistency Training (BCT) operates through a distinct mechanism. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.
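
As a rough illustration of the two internal targets named in the abstract, the sketch below computes one possible consistency loss for each. The abstract does not specify the loss functions, so the choices here are assumptions: mean squared error for post-activation MLP states and KL divergence for per-head attention distributions. The function names (`mlp_state_loss`, `attention_loss`, `kl_divergence`) are hypothetical, and the inputs are assumed to be already aligned so that the clean and perturbed runs have the same shape.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mlp_state_loss(clean: Sequence[Vector], perturbed: Sequence[Vector]) -> float:
    """Mean squared error between post-activation MLP states.

    Each argument is a list of per-token activation vectors.
    ASSUMPTION: squared error is an illustrative distance, not
    necessarily the one used in the paper.
    """
    total, count = 0.0, 0
    for clean_tok, pert_tok in zip(clean, perturbed):
        for c, p in zip(clean_tok, pert_tok):
            total += (c - p) ** 2
            count += 1
    return total / count


def kl_divergence(target: Vector, pred: Vector, eps: float = 1e-12) -> float:
    """KL(target || pred) for two discrete distributions of equal length."""
    return sum(
        t * (math.log(t + eps) - math.log(q + eps))
        for t, q in zip(target, pred)
        if t > 0.0
    )


def attention_loss(
    clean_heads: Sequence[Sequence[Vector]],
    perturbed_heads: Sequence[Sequence[Vector]],
) -> float:
    """Average KL divergence over heads and query positions.

    Indexing is [head][query position][key position]; each innermost
    vector is one attention distribution that sums to 1.
    ASSUMPTION: KL with the clean run as target is an illustrative choice.
    """
    kls = [
        kl_divergence(clean_row, pert_row)
        for clean_head, pert_head in zip(clean_heads, perturbed_heads)
        for clean_row, pert_row in zip(clean_head, pert_head)
    ]
    return sum(kls) / len(kls)
```

In an actual training loop, the clean-context quantities would presumably be treated as fixed targets and the loss backpropagated only through the perturbed-context forward pass; that detail, like the loss forms above, is an assumption rather than something stated in the abstract.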
