2601.14728v1 Jan 21, 2026 eess.AS

AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering

Kai-Wei Chang

MIT CSAIL

Chun-Yi Kuan

Hung-yi Lee

Abstract

Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity (for example, CLAPScore), effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task: rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating that it captures subtle semantic inconsistencies and scales with the capability of the underlying ALLM.
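The scoring step described in the abstract, reading off the log-probability of a "Yes" answer instead of generating free text, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the three-token vocabulary, the hand-written logits standing in for an ALLM's output, and the helper names `log_softmax` and `answer_log_prob`. In practice the logits would come from an audio-aware LLM conditioned on the audio clip and a yes/no question about the text prompt.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def answer_log_prob(step_logits: Sequence[Sequence[float]],
                    answer_ids: Sequence[int]) -> float:
    """Sum the log-probabilities of the answer tokens, one decoding step each.

    step_logits[i] holds the (hypothetical) model logits at the step where
    answer token i is predicted, i.e. the answer is teacher-forced.
    """
    if len(step_logits) != len(answer_ids):
        raise ValueError("need exactly one logit vector per answer token")
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, answer_ids))


# Hypothetical toy vocabulary; a real tokenizer may split "Yes" differently.
VOCAB = {"Yes": 0, "No": 1, "Maybe": 2}

# Made-up logits for the question "Does the audio match the description?"
aligned_logits = [[4.0, 1.0, 0.5]]      # model leans towards "Yes"
misaligned_logits = [[0.5, 3.5, 1.0]]   # model leans towards "No"

score_aligned = answer_log_prob(aligned_logits, [VOCAB["Yes"]])
score_misaligned = answer_log_prob(misaligned_logits, [VOCAB["Yes"]])

# A higher (less negative) score indicates stronger text-audio alignment.
print(score_aligned > score_misaligned)
```

Because the score is a continuous log-probability rather than a parsed "yes"/"no" string, it can be ranked or correlated directly against human ratings, which is the comparison the abstract reports.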

