2601.04770v2 Jan 08, 2026 cs.AI

SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence

Encheng Su (Citations: 46, h-index: 3)
Jianyu Wu (Citations: 39, h-index: 4)
Chen Tang (Citations: 33, h-index: 3)
Lintao Wang (Citations: 108, h-index: 5)
Jinouwen Zhang (Citations: 7, h-index: 2)
Yizhou Wang (Citations: 45, h-index: 3)
Xinzhu Ma (Citations: 73, h-index: 3)
Shixiang Tang (Citations: 41, h-index: 3)
Yuan Meng (Citations: 31, h-index: 3)
Houqiang Li (Citations: 340, h-index: 5)
Pengze Li (Citations: 438, h-index: 11)
Aoran Wang (Citations: 97, h-index: 5)

Abstract

As large language models (LLMs) transition from general knowledge retrieval to complex scientific discovery, their evaluation standards must also incorporate the rigorous norms of scientific inquiry. Existing benchmarks exhibit a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks assess only final-answer correctness, often rewarding models that arrive at the right result for the wrong reasons. To address this gap, we introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (e.g., boundary checks and assumptions), semantic stability (e.g., unit and symbol conventions), and specific processes (e.g., required numerical methods). Uniquely, SciIF emphasizes auditability, requiring models to provide explicit evidence of constraint satisfaction rather than implicit compliance. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures, ensuring that LLMs can function as reliable agents within the strict logical frameworks of science.
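To make the distinction between final-answer correctness and multi-constraint adherence concrete, the following is a minimal illustrative sketch in Python. It is not the SciIF scoring code: the abstract does not specify the metrics, so every name here (`Pillar`, `ConstraintCheck`, `GradedResponse`, `summarize`) and the three aggregate rates are assumptions chosen only to show how a response can be correct yet fail a constraint.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Pillar(Enum):
    # The three pillars named in the abstract.
    SCIENTIFIC_CONDITIONS = "scientific conditions"
    SEMANTIC_STABILITY = "semantic stability"
    SPECIFIC_PROCESSES = "specific processes"


@dataclass(frozen=True)
class ConstraintCheck:
    name: str        # hypothetical constraint label, e.g. "report SI units"
    pillar: Pillar
    satisfied: bool  # True only if the response gives explicit evidence


@dataclass(frozen=True)
class GradedResponse:
    answer_correct: bool
    checks: tuple[ConstraintCheck, ...]

    def all_constraints_met(self) -> bool:
        return all(check.satisfied for check in self.checks)


def summarize(responses: list[GradedResponse]) -> dict[str, float]:
    """Return assumed aggregate rates over a non-empty list of graded responses."""
    if not responses:
        raise ValueError("need at least one graded response")
    n = len(responses)
    return {
        # Fraction with a correct final answer, ignoring constraints.
        "correctness": sum(r.answer_correct for r in responses) / n,
        # Fraction that satisfies every attached constraint.
        "strict_adherence": sum(r.all_constraints_met() for r in responses) / n,
        # Fraction that is both correct and fully constraint-compliant.
        "joint": sum(r.answer_correct and r.all_constraints_met() for r in responses) / n,
    }


# Hypothetical example: both answers are correct, but one omits the unit evidence.
units_ok = ConstraintCheck("report SI units", Pillar.SEMANTIC_STABILITY, True)
units_missing = ConstraintCheck("report SI units", Pillar.SEMANTIC_STABILITY, False)
boundary_ok = ConstraintCheck("state boundary check", Pillar.SCIENTIFIC_CONDITIONS, True)

graded = [
    GradedResponse(answer_correct=True, checks=(units_ok, boundary_ok)),
    GradedResponse(answer_correct=True, checks=(units_missing, boundary_ok)),
]
summary = summarize(graded)
```

In this toy case a correctness-only metric would score both responses as successes, while the joint rate exposes the one that reached the right answer without showing the required evidence.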

