2601.16540v2 Jan 23, 2026 cs.SD

Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG

Kaiwen Wei (Citations: 54, h-index: 4)

Haoyu Yang (Citations: 42, h-index: 2)

Yu Tian (Citations: 2, h-index: 1)

Jiang Zhong (Citations: 5, h-index: 1)

Xin Xiao (Citations: 4, h-index: 1)

Xiao Dong (Citations: 1, h-index: 1)

Yu Mao (Citations: 9, h-index: 2)

Hao Wu (Citations: 253, h-index: 5)

Abstract

Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets. Specifically, we employ 8 similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals 3 key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.
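To make the Spearman-based RSA mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the abstract does not specify how the representational dissimilarity matrices (RDMs) are built, so the correlation-distance RDM, the one-feature-vector-per-item input layout, and all function names (`pearson`, `average_ranks`, `spearman`, `rdm_upper`, `rsa_score`) are assumptions made for illustration. The idea is to build one RDM from a model layer's features and one from time-aligned EEG features for the same items within a sentence, then rank-correlate the two.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def rdm_upper(features):
    """Upper triangle of a correlation-distance RDM (assumed distance choice).

    `features` holds one feature vector per item (for example per word or
    per time step within a sentence).
    """
    pairs = combinations(range(len(features)), 2)
    return [1.0 - pearson(features[i], features[j]) for i, j in pairs]


def rsa_score(model_features, eeg_features):
    """Spearman correlation between the model RDM and the EEG RDM."""
    if len(model_features) != len(eeg_features):
        raise ValueError("both inputs must describe the same items")
    return spearman(rdm_upper(model_features), rdm_upper(eeg_features))


# Toy data: four items with three features each (hypothetical values).
layer_features = [
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 4.0],
    [0.5, 3.0, 1.0],
    [4.0, 0.0, 2.0],
]
self_alignment = rsa_score(layer_features, layer_features)
```

Under these assumptions, a layer-wise analysis would call `rsa_score` once per model layer, and a time-resolved analysis would restrict the EEG features to a window (such as the 250-500 ms range named in the abstract) before building the RDM. Because Spearman correlation depends only on rank order, the score is insensitive to monotonic rescaling of the dissimilarities. In practice, NumPy and SciPy would be the more usual tools for this computation.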
