2602.07077v1 Feb 06, 2026 cs.SD

CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models

MIT CSAIL

Citations: 1,816

h-index: 19

Videet Mehta

Citations: 10

h-index: 1

Limin Wang

Citations: 24

h-index: 1

Hildegard Kuehne

Citations: 178

h-index: 7

Rogério Feris

Citations: 410

h-index: 11

James R. Glass

Citations: 3

h-index: 1

M. J. Mirza

Citations: 16

h-index: 1

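Large audio-language models (LALMs) show strong zero-shot performance on a range of downstream tasks, such as audio question answering (AQA) and abstract reasoning, but they still tend to lag behind specialized models on certain discriminative tasks (e.g., audio classification). Recent work shows that a subset of the attention heads within an LALM can be selected and used, through a simple voting scheme, as a strong discriminative feature extractor for downstream tasks such as classification. Existing methods, however, assign the same weight to every selected head, implicitly assuming that each head contributes equally to all semantic categories. This work proposes Class-Conditional Sparse Attention Vectors for Large Audio-Language Models (CALM), a few-shot classification method that learns class-dependent importance weights over attention heads. With this formulation, individual heads can specialize in different semantic categories and contribute to the ensemble prediction in proportion to their estimated reliability. Experiments on several few-shot audio and audio-visual classification benchmarks and tasks show that the proposed method consistently outperforms state-of-the-art voting-based approaches, with absolute gains of up to 14.52%, 1.53%, and 8.35% on audio classification, audio-visual classification, and spoofing detection, respectively.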

Original Abstract

Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions proportionally to their estimated reliability. Experiments on multiple few-shot audio and audio-visual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches, with absolute gains of up to 14.52%, 1.53%, and 8.35% for audio classification, audio-visual classification, and spoofing detection, respectively.
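
The contrast between uniform voting and class-conditional weighting can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the weights are learned, so the sketch assumes each selected head has already produced a hard class vote per example, and it estimates a head's reliability for a class as its Laplace-smoothed precision on a few labeled calibration examples. The function names `class_conditional_weights` and `weighted_vote` are hypothetical.

```python
import numpy as np


def class_conditional_weights(votes, labels, num_classes):
    """Estimate one weight per (head, class) pair.

    votes:  (n_examples, n_heads) int array, class voted by each head.
    labels: (n_examples,) int array of true classes.
    ASSUMPTION: weight = Laplace-smoothed precision of the head on that
    class. The paper's actual learning rule may differ.
    """
    n_heads = votes.shape[1]
    weights = np.zeros((n_heads, num_classes))
    for c in range(num_classes):
        voted_c = votes == c                          # head voted for class c
        correct = voted_c & (labels[:, None] == c)    # ...and c was the true class
        weights[:, c] = (correct.sum(axis=0) + 1) / (voted_c.sum(axis=0) + 2)
    return weights


def weighted_vote(votes, weights):
    """Sum each head's weight for the class it voted for; return the argmax."""
    n_examples, n_heads = votes.shape
    scores = np.zeros((n_examples, weights.shape[1]))
    rows = np.arange(n_examples)
    for h in range(n_heads):
        scores[rows, votes[:, h]] += weights[h, votes[:, h]]
    return scores.argmax(axis=1)


# Toy calibration set: head 0 is always right, heads 1 and 2 are mostly wrong.
cal_votes = np.array([[0, 1, 1], [0, 1, 1], [0, 1, 1],
                      [1, 1, 1], [1, 0, 0], [1, 0, 0]])
cal_labels = np.array([0, 0, 0, 1, 1, 1])
w = class_conditional_weights(cal_votes, cal_labels, num_classes=2)

query = np.array([[0, 1, 1]])                  # two unreliable heads outvote head 0
print(weighted_vote(query, np.ones((3, 2))))   # uniform voting -> [1]
print(weighted_vote(query, w))                 # class-conditional weights -> [0]
```

With uniform weights the two unreliable heads win the majority vote. With class-conditional weights, their low estimated precision for class 1 lets the single reliable head decide the prediction.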

