2604.01371v1 Apr 01, 2026 cs.CV

AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

L. Seenivasan (Citations: 535, h-index: 11)

Jiru Xu (Citations: 2, h-index: 1)

Chenhao Yu (Citations: 11, h-index: 1)

Chenyang Jing (Citations: 2, h-index: 1)

Mathias Unberath (Citations: 86, h-index: 7)

A. Maksutova (Citations: 0, h-index: 0)

Haoyan Ding (Citations: 3, h-index: 1)

Yiqing Shen (Citations: 293, h-index: 10)

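Surgical automation is advancing rapidly toward surgeon-level dexterity, driven by progress in learning from demonstration and vision-language-action models. Although these models have succeeded in table-top experiments, transferring them to real clinical settings remains difficult. Current methods offer limited ability to predict which part of the tissue surface an instrument will interact with, and they provide no explicit conditioning input for enforcing tool-action-specific safe regions. To address this, we propose AffordTissue, a multimodal framework that predicts tool-action-specific tissue affordance regions as dense heatmaps during cholecystectomy. The approach combines a temporal vision encoder that captures tool motion and tissue dynamics across multiple viewpoints, language conditioning that enables generalization across diverse tool-action pairs, and a DiT-style decoder for dense affordance prediction. We build the first tissue affordance benchmark from 15,638 video clips curated and annotated across 103 cholecystectomy procedures. The benchmark covers six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks (dissection, grasping, clipping, cutting). Experiments show that our task-specific architecture substantially outperforms a vision-language model baseline (20.6 px ASSD vs. 60.2 px for Molmo-VLM), and that it outperforms large-scale foundation models on dense surgical affordance prediction. By predicting tool-action-specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially enabling explicit policy guidance toward appropriate tissue regions and an early safe stop when an instrument leaves the predicted safe zone.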

Original Abstract

Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
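The abstract reports results in pixels of ASSD, which we read as average symmetric surface distance. The sketch below shows one common way to compute that metric between two binary region masks using only NumPy. It is an illustration, not the authors' evaluation code: the function names `boundary_pixels` and `assd` are hypothetical, and binarizing a predicted heatmap at 0.5 is an assumption, since the abstract does not say how heatmaps are converted to regions.

```python
import numpy as np


def boundary_pixels(mask: np.ndarray) -> np.ndarray:
    """Return (row, col) coordinates of foreground pixels that touch background.

    Pixels on the image border count as boundary, because the mask is
    padded with background before the neighbour check.
    """
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when all four 4-connected neighbours are foreground.
    interior = (
        padded[:-2, 1:-1]    # neighbour above
        & padded[2:, 1:-1]   # neighbour below
        & padded[1:-1, :-2]  # neighbour to the left
        & padded[1:-1, 2:]   # neighbour to the right
    )
    return np.argwhere(mask & ~interior)


def assd(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Average symmetric surface distance between two binary masks, in pixels."""
    a = boundary_pixels(pred_mask)
    b = boundary_pixels(gt_mask)
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both masks need at least one foreground pixel")
    # Pairwise Euclidean distances between the two boundary point sets.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Sum of nearest-neighbour distances in both directions, averaged
    # over all boundary points.
    return float((d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(a) + len(b)))


if __name__ == "__main__":
    # Toy example: two single-pixel regions three columns apart.
    gt = np.zeros((8, 8), dtype=bool)
    gt[2, 2] = True
    heatmap = np.zeros((8, 8), dtype=float)
    heatmap[2, 5] = 0.9
    pred = heatmap > 0.5  # assumed threshold, not taken from the paper
    print(assd(pred, gt))  # 3.0
```

Lower values mean the predicted region boundary lies closer to the annotated one. The pairwise distance matrix is quadratic in the number of boundary pixels, which is acceptable for a sketch; a distance-transform implementation would scale better on full-resolution frames.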

0 Citations

0 Influential

5.5 Altmetric

27.5 Score
