2604.13715v1 Apr 15, 2026 cs.SD

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

Pengfei Cai (Citations: 36, h-index: 4)

Qing Gu (Citations: 30, h-index: 3)

Ian McLoughlin (Citations: 198, h-index: 8)

Nan Jiang (Citations: 27, h-index: 3)

Jun Liu (Citations: 50, h-index: 4)

Yan Shi (Citations: 5, h-index: 1)

Lirong Dai (Citations: 7, h-index: 2)

Yan Song (Citations: 52, h-index: 3)

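Large Audio-Language Models (LALMs) enable general-purpose audio understanding and perform well across a wide range of audio tasks. However, they still struggle with temporal perception (for example, inferring event onsets and offsets), which limits their usefulness in fine-grained scenarios. To address this, we propose the Audio-Side Time Prompt and use Reinforcement Learning (RL) to build TimePro-RL, a framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them into the audio feature sequence as temporal coordinates that prompt the model. We then apply RL after Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. In experiments, TimePro-RL achieves significant gains on a range of audio temporal tasks, including audio grounding, sound event detection, and dense audio captioning, which supports its effectiveness.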
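The interleaving step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the sinusoidal `time_embedding`, the `interleave_time_prompts` helper, the fixed insertion interval, and the choice to place each time prompt before its block of frames. The abstract only states that timestamps are encoded as embeddings and interleaved within the audio feature sequence.

```python
import math
from typing import List

Vector = List[float]


def time_embedding(t_seconds: float, dim: int) -> Vector:
    """Hypothetical sinusoidal encoding of a timestamp (an assumption,
    not the encoder used in the paper)."""
    half = dim // 2
    emb: Vector = []
    for i in range(half):
        freq = 1.0 / (10000.0 ** (i / half))
        emb.append(math.sin(t_seconds * freq))
        emb.append(math.cos(t_seconds * freq))
    if dim % 2:
        emb.append(0.0)  # pad so the embedding matches an odd feature size
    return emb


def interleave_time_prompts(
    frames: List[Vector], frame_rate_hz: float, interval_s: float
) -> List[Vector]:
    """Insert one timestamp embedding before every block of audio frames
    that spans `interval_s` seconds (placement is an assumption)."""
    if not frames:
        return []
    dim = len(frames[0])
    frames_per_block = max(1, round(frame_rate_hz * interval_s))
    out: List[Vector] = []
    for start in range(0, len(frames), frames_per_block):
        # The prompt carries the start time of the block that follows it.
        out.append(time_embedding(start / frame_rate_hz, dim))
        out.extend(frames[start:start + frames_per_block])
    return out


# Usage: 10 frames at 5 frames per second, one time prompt per second.
audio_frames = [[0.0] * 4 for _ in range(10)]
sequence = interleave_time_prompts(audio_frames, frame_rate_hz=5.0, interval_s=1.0)
```

In this sketch the language model would see time prompts at 0 s and 1 s interleaved with the audio frames, giving it explicit temporal coordinates along the sequence.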

Original Abstract

Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
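To make "directly optimize temporal alignment performance" concrete, the sketch below computes a temporal intersection-over-union (IoU) score between a predicted and a reference event segment. Using temporal IoU as the RL reward is an assumption made here for illustration; the abstract does not specify the reward, and the `temporal_iou` helper is hypothetical.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (onset_seconds, offset_seconds)


def temporal_iou(pred: Segment, ref: Segment) -> float:
    """Overlap between two time segments divided by their union.
    Assumed reward signal for illustration, not the paper's definition."""
    pred_len = max(0.0, pred[1] - pred[0])
    ref_len = max(0.0, ref[1] - ref[0])
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = pred_len + ref_len - inter
    return inter / union if union > 0 else 0.0


# Usage: a prediction of 1.0-3.0 s against a reference of 2.0-4.0 s
# overlaps for 1 s out of a 3 s union.
reward = temporal_iou((1.0, 3.0), (2.0, 4.0))
```

A score like this rewards predictions whose onset and offset both match the reference, which token-level supervised losses do not measure directly.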
