2602.22971v1 Feb 26, 2026 cs.AI

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

Ben Wang

Citations: 27,797

h-index: 5

P. Xiao

Citations: 86

h-index: 5

Xiaogang Li

Citations: 2

h-index: 1

Jiayin Wang

Citations: 79

h-index: 5

Kejun Yu

Citations: 2

h-index: 1

Xulin Liu

Citations: 12

h-index: 2

Bingrui Zhao

Citations: 6

h-index: 2

Hu Wei

Citations: 597

h-index: 3

Cheng Xu

Citations: 14

h-index: 2

Yueqian Chen

Citations: 29

h-index: 3

Zichao Chen

Citations: 1

h-index: 1

Zeyu Wang

Citations: 157

h-index: 1

Wen Xiao

Citations: 88

h-index: 3

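Although large language models (LLMs) have recently shown strong results in general reasoning, existing benchmarks for specialized scientific fields are limited by data contamination, insufficient complexity, and excessive human labor costs. This work presents SPM-Bench, a PhD-level multimodal benchmark dedicated to scanning probe microscopy (SPM). We propose a fully automated data generation pipeline that raises data reliability while lowering cost. Using Anchor-Gated Sieve (AGS) technology, the pipeline efficiently extracts high-quality image-text pairs from arXiv and journal papers published between 2023 and 2025. A hybrid cloud-local architecture, in which the VLM returns only spatial coordinates and the high-fidelity image cropping is done locally, sharply reduces token usage while preserving dataset purity. To evaluate LLM performance accurately and objectively, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric establishes a strict capability hierarchy and, for the first time, quantifies model "personalities" (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we reveal the actual reasoning boundaries of current AI in complex physical scenarios. These findings help establish SPM-Bench as a general paradigm for automated scientific data generation.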

Original Abstract

As LLMs have achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture where VLMs return only spatial coordinates "llbox" for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To accurately and objectively evaluate the performance of LLMs, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model "personalities" (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.
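
The cloud-local split described in the abstract can be illustrated with a small sketch. It assumes the remote model returns one box per figure as four normalized fractions of the page size; the abstract does not specify the actual coordinate format, so the `Box` type, the `to_pixels` and `crop_local` helpers, and the nested-list page representation are all hypothetical stand-ins, not the authors' implementation. The point is only that the pixels never leave the local machine: the model's reply is four numbers, and the crop happens on the full-resolution page.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical box format: fractions of page width/height in [0, 1]."""
    x0: float
    y0: float
    x1: float
    y1: float


def _clamp(value: float) -> float:
    """Keep a fraction inside [0, 1] so a sloppy box cannot index off the page."""
    return min(max(value, 0.0), 1.0)


def to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized box to a pixel rectangle (left, top, right, bottom).

    The lower edges are floored and the upper edges are ceiled, so the crop
    never cuts into the region the box describes.
    """
    x_lo, x_hi = sorted((_clamp(box.x0), _clamp(box.x1)))
    y_lo, y_hi = sorted((_clamp(box.y0), _clamp(box.y1)))
    return (
        math.floor(x_lo * width),
        math.floor(y_lo * height),
        math.ceil(x_hi * width),
        math.ceil(y_hi * height),
    )


def crop_local(page: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a full-resolution page (rows of pixel values) on the local machine."""
    height = len(page)
    width = len(page[0]) if height else 0
    left, top, right, bottom = to_pixels(box, width, height)
    return [row[left:right] for row in page[top:bottom]]


if __name__ == "__main__":
    # A 4x4 "page" whose pixel values encode their own position.
    page = [[r * 4 + c for c in range(4)] for r in range(4)]
    # Pretend the remote model answered with only these four numbers.
    reply = Box(0.25, 0.5, 0.75, 1.0)
    print(crop_local(page, reply))  # [[9, 10], [13, 14]]
```

In a real pipeline the nested list would be replaced by an image object from an imaging library, but the division of labor stays the same: coordinates come back from the cloud, and cropping runs locally at full fidelity.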

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
