2601.09413v1 Jan 14, 2026 cs.SD

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan (Citations: 40; h-index: 3)
Chao-Han Huck Yang (Citations: 85; h-index: 3)
Jinchuan Tian (Citations: 848; h-index: 14)
Hanrong Ye (Citations: 120; h-index: 5)
Ankita Pasad (Citations: 1,107; h-index: 12)
Szu-Wei Fu (Citations: 144; h-index: 4)
Arushi Goel (Citations: 518; h-index: 8)
Ryo Hachiuma (Citations: 97; h-index: 6)
Shizhe Diao (Citations: 58; h-index: 2)
Kunal Dhawan (Citations: 363; h-index: 11)
Sreyan Ghosh (Citations: 49; h-index: 4)
Y. Hirota (Citations: 20; h-index: 1)
Zhehuai Chen (Citations: 518; h-index: 12)
Rafael Valle (Citations: 30; h-index: 2)
Ehsan Hosseini-Asl (Citations: 2,356; h-index: 19)
Chenhui Chu (Citations: 432; h-index: 5)
Shinji Watanabe (Citations: 72; h-index: 5)
Y. Wang (Citations: 143; h-index: 4)
Boris Ginsburg (Citations: 103; h-index: 5)

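This paper introduces a voice-agent framework that learns a "self-reflection" skill: deciding, in speech recognition and external audio perception tasks, whether the model should trust itself or consult external information. The work starts from an important but counterintuitive finding: naively fine-tuning a model on both speech recognition and external sound understanding tasks often degrades performance, because the model is easily misled by noisy hypotheses. To address this, the Speech-Hands framework recasts the problem as an explicit self-reflection decision. This learnable self-reflection mechanism proves effective at preventing the model from being led astray by flawed external information. The authors show that this agentic action mechanism generalizes naturally from speech recognition to complex multiple-choice audio reasoning. On the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER (word error rate). The model also achieves 77.37% accuracy and a high F1 score, showing strong generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, the work offers a practical path toward more reliable and resilient audio intelligence.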

Original Abstract

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.

1 Citations

0 Influential

9.5 Altmetric

48.5 Score
