2605.06524v1 May 07, 2026 cs.AI

Process Matters more than Output for Distinguishing Humans from Machines

Thomas L. Griffiths

Citations: 27

h-index: 3

Milena Rmus

Citations: 103

h-index: 4

Mathew D. Hardy

Citations: 260

h-index: 6

Mayank Agrawal

Citations: 408

h-index: 6

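As large language models and autonomous agents are widely deployed in online environments, reliably distinguishing humans from machines is becoming increasingly important. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from a human's, following the output-focused criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features. Across this battery, process-level features provide stronger discriminative power than performance metrics alone, and they reliably distinguish humans from agents even when outputs are identical (mean process-feature classifier AUC = 0.88). To evaluate differences in agent processes, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7 million human decisions), and two task-specific fine-tuning methods applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT). Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when the supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve mimicry of human behavior, but only when appropriate task-specific process representations are available, which highlights process specification as a key constraint on achieving human-like cognitive processes in machines.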

Original Abstract

Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.
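The contrast between output matching and process-level discrimination can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy sessions are invented, response-time variability is a hypothetical process feature, and the pairwise AUC below stands in for whatever classifier the authors actually trained. None of the numbers come from the paper.

```python
from statistics import pstdev


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, so identical score distributions give 0.5.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy sessions: (task accuracy, per-trial response times in seconds).
# Accuracy is deliberately identical across groups to mimic output matching.
human_sessions = [
    (0.8, [0.9, 1.4, 0.7, 2.1, 1.0]),
    (0.8, [1.2, 0.6, 1.9, 0.8, 1.5]),
    (0.8, [0.7, 1.1, 2.4, 0.9, 1.3]),
]
agent_sessions = [
    (0.8, [1.0, 1.1, 1.0, 1.2, 1.1]),
    (0.8, [0.9, 1.0, 1.0, 0.9, 1.0]),
    (0.8, [1.3, 1.2, 1.3, 1.4, 1.2]),
]


def rt_variability(session):
    """Assumed process feature: spread of response times within a session."""
    _, response_times = session
    return pstdev(response_times)


# Output-level score: accuracy alone cannot separate the matched groups.
output_auc = auc(
    [acc for acc, _ in human_sessions],
    [acc for acc, _ in agent_sessions],
)

# Process-level score: response-time variability separates them in this toy data.
process_auc = auc(
    [rt_variability(s) for s in human_sessions],
    [rt_variability(s) for s in agent_sessions],
)

print(f"output-only AUC:  {output_auc:.2f}")
print(f"process-only AUC: {process_auc:.2f}")
```

In this contrived data the accuracy scores tie everywhere, so the output-only AUC sits at chance, while the assumed process feature separates the groups perfectly. Real data would be noisier, which is consistent with the reported mean AUC of 0.88 rather than 1.0.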

0 Citations

0 Influential

3 Altmetric

15.0 Score
