2603.05485v1 Mar 05, 2026 cs.AI

Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

Ben Feuer

Citations: 1,206

h-index: 12

Lucas Rosenblatt

Citations: 259

h-index: 8

Oussama Elachqar

Citations: 306

h-index: 7

Abstract

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.
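The abstract reports that bias-bounded scores retain 61-99% correlation with the original rankings, but it does not say which correlation measure is used. As a rough illustration only, the sketch below assumes Spearman rank correlation and compares two score lists. The function names and all score values are hypothetical; they are not taken from the paper or its repository, and this does not implement A-BB itself.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores from highest (rank 1) downward, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Assumes both inputs have the same length and are not constant
    (a constant list has zero rank variance and would divide by zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical judge scores for five models (illustrative values only).
original_scores = [0.81, 0.74, 0.66, 0.52, 0.40]
adjusted_scores = [0.78, 0.75, 0.60, 0.61, 0.41]  # two adjacent models swap places

rho = spearman(original_scores, adjusted_scores)
print(f"Spearman correlation: {rho:.2f}")
```

With these made-up values, one adjacent pair of models swaps rank after adjustment, so the correlation comes out just below 1. A different measure, such as Kendall's tau or Pearson correlation on raw scores, would give different numbers for the same data.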
