2603.19144v1 Mar 19, 2026 cs.CL

UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Hongbo Liu (Citations: 35, h-index: 3)

Lijie Hu (Citations: 24, h-index: 1)

Zikang Ding (Citations: 4, h-index: 1)

Wenbo Jiang (Citations: 3, h-index: 1)

Junchi Yao (Citations: 79, h-index: 3)

Junhao Li (Citations: 0, h-index: 0)

Yi Zhang (Citations: 23, h-index: 3)

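Large language models (LLMs) exhibit pronounced social biases. Debiasing methods that operate at the output level or through data optimization cannot fully resolve these biases, and many prior studies have shown that bias is embedded in internal representations. This paper proposes Unified Graph Isomorphism for Debiasing (UGID), a framework that debiases LLMs at the level of internal representations. UGID models the Transformer as a structured computational graph in which attention mechanisms define the graph's edges and hidden states define its nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, which prevents bias from migrating across architectural components. To achieve effective behavioral alignment without degrading general capabilities, the authors introduce a log-space constraint on sensitive logits and a selective anchor-based objective that preserves definitional semantics. Extensive experiments on LLMs show that UGID reduces bias in both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.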

Original Abstract

Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation-level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.
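
The abstract's central idea, treating attention weights as graph edges and hidden states as graph nodes and penalizing differences between counterfactual inputs everywhere except at sensitive positions, can be illustrated with a small sketch. This is not the paper's actual objective: the single-head attention, the squared-error discrepancy, and every name below (`attention_edges`, `invariance_penalty`, `sensitive_mask`, the random weights) are assumptions made for illustration only, and the log-space logit constraint and anchor-based objective are not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_edges(hidden, w_q, w_k):
    # Single-head attention weights, read here as the graph's edge weights.
    q, k = hidden @ w_q, hidden @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def invariance_penalty(hidden_a, hidden_b, w_q, w_k, sensitive_mask):
    # Hypothetical discrepancy between a counterfactual pair (assumption,
    # not the paper's loss). Positions flagged in sensitive_mask are
    # excluded, so the pair may differ freely there.
    keep = ~sensitive_mask
    edges_a = attention_edges(hidden_a, w_q, w_k)
    edges_b = attention_edges(hidden_b, w_q, w_k)
    # Edge term: compare attention only between non-sensitive positions.
    pair = np.outer(keep, keep)
    edge_term = np.mean((edges_a - edges_b)[pair] ** 2)
    # Node term: compare hidden states at non-sensitive positions.
    node_term = np.mean((hidden_a - hidden_b)[keep] ** 2)
    return edge_term + node_term

# Toy counterfactual pair: identical except at one "sensitive" token.
rng = np.random.default_rng(0)
seq_len, dim = 5, 8
w_q = rng.normal(size=(dim, dim))
w_k = rng.normal(size=(dim, dim))
hidden_a = rng.normal(size=(seq_len, dim))
hidden_b = hidden_a.copy()
hidden_b[2] = rng.normal(size=dim)  # only the sensitive position changes
mask = np.zeros(seq_len, dtype=bool)
mask[2] = True

penalty = invariance_penalty(hidden_a, hidden_b, w_q, w_k, mask)
print(f"penalty for counterfactual pair: {penalty:.6f}")
```

In this toy setup the node term is zero, because the non-sensitive hidden states are identical, but the edge term is not: changing the sensitive token shifts the softmax normalization and therefore the attention that the other tokens pay to each other. That is one way to see why constraining attention routing and hidden states together, as the abstract describes, differs from constraining hidden states alone.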

