2604.08064v1 Apr 09, 2026 cs.AI

ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Weitao Ma (Citations: 2,815; h-index: 9)
Xiaocheng Feng (Citations: 10,016; h-index: 30)
Xiachong Feng (Citations: 788; h-index: 13)
Chonghan Qin (Citations: 20; h-index: 3)
Lingpeng Kong (Citations: 290; h-index: 7)

Original Abstract

Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus–Unconditioned Stimulus (CS–US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".
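The Learning/Priming-Interfere-Test protocol with first-attempt scoring described in the abstract might be sketched roughly as follows. This is an illustrative reading of the abstract only, not the authors' evaluation harness: the `Item` fields, the `passes` checker, the `model` callable, and `first_attempt_score` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """One benchmark item (hypothetical layout, assumed for illustration)."""
    learning: str                   # learning / priming phase prompt
    interference: List[str]         # unrelated distractor turns
    test: str                       # test prompt that should trigger the learned behavior
    passes: Callable[[str], bool]   # hypothetical checker applied to the first test reply


def first_attempt_score(items: List[Item],
                        model: Callable[[List[str]], str]) -> float:
    """Run each item through the three phases and score only the first test reply.

    `model` is assumed to map the conversation history so far to the next reply.
    """
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for item in items:
        # Phase 1: learning / priming.
        history = [item.learning]
        history.append(model(history))
        # Phase 2: interference turns placed between learning and test.
        for turn in item.interference:
            history.append(turn)
            history.append(model(history))
        # Phase 3: test. No reminder of the learned rule, and no retries.
        history.append(item.test)
        reply = model(history)
        correct += int(item.passes(reply))
    return correct / len(items)
```

Under these assumptions, the returned fraction corresponds to a first-attempt accuracy of the kind the abstract reports; the actual benchmark may score the three constructs differently.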
