2601.21612v1 Jan 29, 2026 eess.AS

Representation-Regularized Convolutional Audio Transformer for Audio Understanding

Chenda Li (Citations: 1,002; h-index: 17)

Yanmin Qian (Citations: 229; h-index: 8)

Bing Han (Citations: 16; h-index: 3)

Chushu Zhou (Citations: 22; h-index: 1)

Wangyou Zhang (Citations: 383; h-index: 9)

Yifan Yang (Shanghai Jiao Tong University; Citations: 937; h-index: 14)

Wei Wang (Citations: 74; h-index: 5)

Abstract

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, which limits their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive and often requires extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Code and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
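The abstract does not spell out the form of the Representation Regularization objective, so the following is only a minimal sketch of how a bootstrap loss could be combined with an auxiliary alignment term. Everything concrete here is an assumption made for illustration and is not taken from the paper: the mean-squared-error bootstrap loss, the cosine-distance alignment term, the `reg_weight` coefficient, and all function names. It also assumes the student features have already been projected to the same dimension as the frozen external encoder's features.

```python
import math
from typing import Sequence

Vector = Sequence[float]   # one feature vector (one frame or patch)
Frames = Sequence[Vector]  # a sequence of feature vectors


def _paired(a, b):
    """Zip two equal-length, non-empty sequences, or raise ValueError."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return list(zip(a, b))


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two vectors."""
    pairs = _paired(a, b)
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 for a zero vector (sketch convention)."""
    pairs = _paired(a, b)
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bootstrap_loss(student_preds: Frames, teacher_targets: Frames) -> float:
    """Assumed bootstrap term: per-frame MSE between student predictions
    and teacher targets, averaged over frames."""
    pairs = _paired(student_preds, teacher_targets)
    return sum(mse(s, t) for s, t in pairs) / len(pairs)


def regularization_loss(student_feats: Frames, external_feats: Frames) -> float:
    """Assumed alignment term: per-frame cosine distance (1 - cosine
    similarity) to features from a frozen external encoder, averaged."""
    pairs = _paired(student_feats, external_feats)
    return sum(1.0 - cosine_similarity(s, e) for s, e in pairs) / len(pairs)


def total_loss(
    student_preds: Frames,
    teacher_targets: Frames,
    student_feats: Frames,
    external_feats: Frames,
    reg_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Bootstrap loss plus the weighted auxiliary alignment term."""
    return bootstrap_loss(student_preds, teacher_targets) + (
        reg_weight * regularization_loss(student_feats, external_feats)
    )


if __name__ == "__main__":
    preds = [[0.5, 0.5], [0.0, 1.0]]
    targets = [[1.0, 0.0], [0.0, 1.0]]
    external = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for frozen encoder output
    print(total_loss(preds, targets, preds, external))
```

In a real training loop the external features would be computed without gradients, since the abstract describes the external encoders as frozen. The plain-Python arithmetic above is only meant to make the structure of such a combined objective concrete.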

