2605.04624v1 May 06, 2026 cs.AI

AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

Yuelin Hu

Citations: 5

h-index: 2

Zhengxue Cheng

Citations: 263

h-index: 9

Wei Liu

Citations: 12

h-index: 3

Zhen Yu

Citations: 7

h-index: 1

Lida Song

Citations: 0

h-index: 0

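Agent-repair leaderboard rankings can change when the evaluator is reconfigured, and a measurable share of that change comes from methods that use evaluator-derived information when internally selecting candidate repairs. This paper documents the problem on a public leaderboard and releases AuditRepairBench, a paired-execution trace corpus of 576,000 registered cells (96,000 executed) that operationalizes ranking instability under evaluator-channel blocking within a declared observability boundary. A modular screening architecture decides pathway blocking through four interchangeable implementations: a learned influence proxy, a rule-based channel-exposure ratio that uses no trained model, a counterfactual sensitivity proxy, and a sparse human-audit proxy. These are combined into a screening posterior that feeds a cell-level flip functional, a set-valued label, a stratified system score, and a set-valued leaderboard. The resource is validated by mechanism-anchored checks on an 80-case source-level channel-surgery subset and by an independent-discovery protocol in which two annotator groups, separated from the pipeline developers and blinded to the screening design, discovered coupling patterns on their own; on their 79 cases the frozen ensemble attains a pooled AUROC of 0.83. The authors also report implementation robustness, uncertainty propagation that raises 95% coverage from 0.81 to 0.95, and a pooled Spearman correlation of 0.65 with community evaluators. Screening-guided blinding patches reduce rank displacement by 55-74% (mean 62%) with fewer than 50 lines of code changed, whereas random channel blinding yields at most a 7% reduction and generic retraining at most 13%. AuditRepairBench-Lite, a rule-only configuration on a 12,000-cell subset, preserves the leaderboard ordering at Kendall τ = 0.88 within twenty-four GPU-hours and is the primary release artifact at 42 GB.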

Original Abstract

Agent-repair leaderboards reorder under evaluator reconfiguration, and a measurable share of the reordering is produced by methods that consult evaluator-derived signal during internal selection of candidate repairs. We document this failure mode on a public leaderboard and release AuditRepairBench, a paired-execution trace corpus of 576,000 registered cells (96,000 executed) that operationalizes evaluator-channel-blocking ranking instability within a declared observability boundary. A modular screening architecture decides pathway-blocking through four interchangeable implementations, a learned influence proxy, a rule-based channel-exposure ratio that uses no trained model, a counterfactual sensitivity proxy, and a sparse human-audit proxy, combined into a screening posterior that feeds a cell-level flip functional, a set-valued label, a stratified system score, and a set-valued leaderboard. The resource is supported by mechanism-anchored validation on an 80-case source-level channel-surgery subset, an independent-discovery protocol under which two annotator groups separated from the pipeline developers discover coupling patterns blinded to the screening design and the frozen ensemble attains pooled AUROC 0.83 on their 79 cases, implementation robustness, uncertainty propagation that raises 95% coverage from 0.81 to 0.95, and forward transfer with pooled community-evaluator Spearman ρ = 0.65. Screening-guided blinding patches reduce rank displacement by 55-74% (mean 62%) at fewer than 50 lines of code, whereas random channel blinding produces at most 7% reduction and generic retraining at most 13%. AuditRepairBench-Lite, a rule-only configuration on a 12,000-cell subset, preserves the leaderboard at Kendall τ = 0.88 under twenty-four GPU-hours and is the primary release artifact at 42 GB.
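The abstract reports leaderboard stability as a Kendall τ and patch effectiveness as a percentage reduction in rank displacement, but it does not define either computation. The sketch below shows one plausible reading on toy data. The function names, the choice of mean absolute rank change as "rank displacement", the tie-free τ variant, and all scores are illustrative assumptions, not the paper's definitions or results.

```python
from itertools import combinations


def ranks(scores):
    """Map each system name to its 1-based rank (highest score ranks first)."""
    ordered = sorted(scores, key=lambda name: -scores[name])
    return {name: i + 1 for i, name in enumerate(ordered)}


def mean_rank_displacement(before, after):
    """Assumed metric: mean absolute change in rank between two leaderboards."""
    rb, ra = ranks(before), ranks(after)
    return sum(abs(rb[name] - ra[name]) for name in rb) / len(rb)


def kendall_tau(before, after):
    """Kendall tau for two tie-free rankings over the same systems."""
    rb, ra = ranks(before), ranks(after)
    concordant = discordant = 0
    for a, b in combinations(rb, 2):
        agreement = (rb[a] - rb[b]) * (ra[a] - ra[b])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(rb) * (len(rb) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four repair methods (not taken from the paper).
original = {"A": 0.90, "B": 0.80, "C": 0.70, "D": 0.60}
# After evaluator reconfiguration, method A falls to last place.
reconfigured = {"A": 0.60, "B": 0.85, "C": 0.80, "D": 0.75}
# Same reconfiguration, with a hypothetical blinding patch applied first.
patched = {"A": 0.88, "B": 0.80, "C": 0.72, "D": 0.74}

unpatched_shift = mean_rank_displacement(original, reconfigured)
patched_shift = mean_rank_displacement(original, patched)
reduction = 1 - patched_shift / unpatched_shift

print(f"displacement without patch: {unpatched_shift:.2f}")
print(f"displacement with patch:    {patched_shift:.2f}")
print(f"relative reduction:         {reduction:.0%}")
print(f"Kendall tau (patched):      {kendall_tau(original, patched):.2f}")
```

Under this reading, a figure such as "55-74% reduction" would compare displacement with and without a patch against the same reference ordering, and a τ near 1 would indicate that a reduced configuration preserves most pairwise orderings of the full leaderboard.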

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
