2601.05991v1 Jan 09, 2026 cs.AI

Open-Vocabulary 3D Instruction Ambiguity Detection

Jiayu Ding (Citations: 13, h-index: 3)

Haoran Tang (Citations: 247, h-index: 7)

Ge Li (Citations: 230, h-index: 5)

Hongbo Jin (Citations: 36, h-index: 4)

Wei Gao (Citations: 60, h-index: 3)

Abstract

In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task in which a model must determine whether a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine whether an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the difficulty of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. The code and dataset are available at https://jiayuding031020.github.io/ambi3d/.

1 Citation

0 Influential

3.5 Altmetric

18.5 Score
