2603.15527v1 Mar 16, 2026 cs.AI

Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

Xiaowen Chu (Citations: 802, h-index: 17)
Xianglong Liu (Citations: 82, h-index: 5)
Eunsol Choi (Citations: 15, h-index: 2)
Zhenheng Tang (Citations: 142, h-index: 5)
Qian Wang (Citations: 341, h-index: 10)
Bo Li (Citations: 134, h-index: 6)

Original Abstract

As Large Language Models (LLMs) become more powerful and autonomous, they increasingly face conflicts and dilemmas in many scenarios. We first summarize and taxonomize these diverse conflicts. We then model the LLM's preferences among competing choices as a priority graph, where instructions and values are nodes and edges represent context-specific priorities determined by the model's output distribution. This graph reveals that a unified, stable LLM alignment is very challenging, because the graph is neither static nor necessarily consistent across contexts. It also reveals a potential vulnerability, priority hacking, in which adversaries craft deceptive contexts to manipulate the graph and bypass safety alignment. To counter this, we propose a runtime verification mechanism that enables LLMs to query external sources to ground their context and resist manipulation. While this approach enhances robustness, we also acknowledge that many ethical and value dilemmas are philosophically irreducible, posing a long-term, open challenge for the future of AI alignment.
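
The priority graph described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the class name `PriorityGraph`, the 0.5 decision threshold, the context labels, and the probability values are all assumptions chosen for illustration. The sketch stores one directed graph per context and checks two properties the abstract mentions, namely that priorities can change between contexts and that a graph need not be internally consistent.

```python
from collections import defaultdict


class PriorityGraph:
    """Hypothetical per-context priority graph.

    Nodes are instructions or values. An edge a -> b in a context means the
    model is assumed to prioritize a over b in that context.
    """

    def __init__(self):
        # context -> {higher-priority node -> set of lower-priority nodes}
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_preference(self, context, a, b, p_a_over_b):
        """Add an edge from an assumed estimate of P(model favors a over b).

        The 0.5 threshold is an illustrative choice, not taken from the paper.
        """
        if p_a_over_b > 0.5:
            self.edges[context][a].add(b)
        elif p_a_over_b < 0.5:
            self.edges[context][b].add(a)

    def prefers(self, context, a, b):
        """True if a is prioritized over b in the given context."""
        return b in self.edges[context].get(a, set())

    def flipped_pairs(self, ctx1, ctx2):
        """Pairs (a, b) with a over b in ctx1 but b over a in ctx2."""
        flips = []
        for a, lowers in self.edges[ctx1].items():
            for b in lowers:
                if self.prefers(ctx2, b, a):
                    flips.append((a, b))
        return sorted(flips)

    def has_cycle(self, context):
        """Depth-first search for a priority cycle within one context."""
        graph = self.edges[context]
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)  # every node starts WHITE (0)

        def visit(node):
            color[node] = GRAY
            for nxt in graph.get(node, ()):
                if color[nxt] == GRAY:
                    return True  # back edge: priorities are circular
                if color[nxt] == WHITE and visit(nxt):
                    return True
            color[node] = BLACK
            return False

        return any(color[n] == WHITE and visit(n) for n in list(graph))


g = PriorityGraph()
# Made-up numbers: the same pair of nodes is ranked differently per context.
g.add_preference("default", "safety", "user_instruction", 0.9)
g.add_preference("claimed_emergency", "safety", "user_instruction", 0.2)
flips = g.flipped_pairs("default", "claimed_emergency")

# Made-up numbers: three pairwise preferences that form a cycle.
g.add_preference("dilemma", "honesty", "helpfulness", 0.7)
g.add_preference("dilemma", "helpfulness", "harmlessness", 0.6)
g.add_preference("dilemma", "harmlessness", "honesty", 0.8)
cyclic = g.has_cycle("dilemma")

print(flips, cyclic)
```

In this toy setup, a pair that flips between the `default` and `claimed_emergency` contexts is the kind of shift that a deceptive context could exploit, which is the pattern the abstract calls priority hacking. A cycle within a single context is one way a graph can fail to yield a consistent ordering.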

1 Citations

1 Influential

8.5 Altmetric

45.5 Score
