2602.09040v1 Jan 30, 2026 eess.AS

Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures

Yann LeCun

Judah Goldfeder

Ravid Shwartz-Ziv

Georgios Ioannides

Adrian Kieback

San Diego State University, James Silberrad Brown Center for AI

Linsey Pang

Aman Chadha

Aaron Elkins

Abstract

Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re-clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute. Cluster analysis shows GMM-anchored representations achieve up to 98% entropy compared to 31% for WavLM-style, indicating substantially more uniform cluster utilization. Code is made available at https://github.com/gioannides/clustering-anchored-jepa.
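The three ingredients named in the abstract (frozen soft GMM posteriors, a decaying anchor weight, and an entropy measure of cluster utilization) can be illustrated with a small sketch. This is not the authors' implementation. It makes several assumptions the abstract does not state: the GMM has diagonal covariances and has already been fit, the schedule decays linearly, the auxiliary loss is a cross-entropy against the soft posteriors added to the JEPA loss, and the reported "entropy" is Shannon entropy normalized by log K. All function and parameter names are hypothetical.

```python
import math


def gmm_soft_posteriors(frame, weights, means, variances):
    """Posterior p(k | frame) under a frozen diagonal-covariance GMM.

    `frame` stands in for one log-mel frame; the GMM parameters are
    assumed to have been fit once beforehand and are never updated.
    """
    log_joint = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_joint.append(ll)
    # Normalize in log space for numerical stability.
    peak = max(log_joint)
    exps = [math.exp(l - peak) for l in log_joint]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Linearly decaying weight on the GMM term (assumed schedule shape)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * frac


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def combined_loss(jepa_loss, target_post, predicted_post, step, total_steps):
    """JEPA loss plus the decaying GMM-anchor term (assumed additive form)."""
    lam = anchor_weight(step, total_steps)
    return jepa_loss + lam * soft_cross_entropy(target_post, predicted_post)


def normalized_entropy(usage):
    """Entropy of cluster usage divided by log K, so 1.0 means uniform use."""
    total = sum(usage)
    probs = [u / total for u in usage]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(usage))


if __name__ == "__main__":
    # Toy two-component GMM over 2-dimensional "frames".
    post = gmm_soft_posteriors(
        [0.0, 0.0],
        weights=[0.5, 0.5],
        means=[[0.0, 0.0], [3.0, 3.0]],
        variances=[[1.0, 1.0], [1.0, 1.0]],
    )
    print("soft posterior:", post)
    print("loss early:", combined_loss(1.0, post, [0.5, 0.5], 0, 100))
    print("loss late:", combined_loss(1.0, post, [0.5, 0.5], 100, 100))
    print("utilization:", normalized_entropy([10, 10, 10, 10]))
```

Under these assumptions the anchor term carries full weight at step 0 and vanishes by the final step, which mirrors the abstract's description of GMM regularization dominating early training and then yielding to the JEPA objective. The actual schedule and loss form should be checked against the released code.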
