2604.07801v1 Apr 09, 2026 cs.CL

TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

Benjamin Z. Reichman

Larry Heck

Atahan Dokme

Original Abstract

Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency, or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion–neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge. First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.
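
The two quantities the abstract reports, an accuracy drop in percentage points and the share of that drop recovered by neutralization, can be sketched as a paired comparison. This is a minimal illustration only, not the authors' evaluation code: the `PairedResult` record, the `summarize` helper, and the toy outcomes are all hypothetical, and the abstract does not say how recovery is computed.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Correctness of one model on three versions of the same problem (hypothetical record)."""
    neutral_correct: bool      # original, emotionally neutral wording
    emotional_correct: bool    # emotional rewrite with the same quantities
    neutralized_correct: bool  # emotional rewrite converted back to neutral wording


def accuracy(flags):
    """Share of True values, expressed as a percentage."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags)


def summarize(results):
    """Return the accuracy drop (percentage points) and the fraction of it recovered."""
    neutral = accuracy(r.neutral_correct for r in results)
    emotional = accuracy(r.emotional_correct for r in results)
    neutralized = accuracy(r.neutralized_correct for r in results)
    drop_pp = neutral - emotional  # absolute difference, not a relative percentage
    # Assumed definition: recovery is the regained accuracy divided by the drop.
    recovered = (neutralized - emotional) / drop_pp if drop_pp > 0 else float("nan")
    return {"drop_pp": drop_pp, "recovered_fraction": recovered}


# Made-up outcomes used only to exercise the functions; these are not results from the paper.
toy = (
    [PairedResult(True, True, True)] * 7
    + [PairedResult(True, False, True)] * 2
    + [PairedResult(False, False, False)] * 1
)
print(summarize(toy))
```

Scoring each problem as a triple keeps the comparison paired, so any difference in accuracy comes from the wording of the same items rather than from a different problem mix.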
