2603.05969v1 Mar 06, 2026 cs.CV

Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Zixin Guo

Citations: 41

h-index: 3

Min Cao

Citations: 7

h-index: 2

Guibo Zhu

Citations: 38

h-index: 3

J. Laaksonen

Citations: 732

h-index: 10

Jiayang Sun

Citations: 2,778

h-index: 27

Original Abstract

Change captioning generates descriptions of the differences between two visually similar images. Existing methods operate on static image pairs and thus ignore the rich temporal dynamics of the change procedure, which are key to understanding not only what has changed but also how the change occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap has a two-stage design. The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames, which makes the implicit procedural dynamics explicit, and then sampling them to reduce redundancy. The encoder then learns to capture the latent dynamics of these keyframes via a caption-conditioned masked reconstruction task. The second stage integrates the trained encoder into an encoder-decoder model for captioning. Instead of relying on explicit frames from the previous stage, which incurs computational overhead and sensitivity to visual noise, we introduce learnable procedure queries that prompt the encoder to infer the latent procedure representation, which the decoder then translates into text. The entire model is then trained end-to-end with a captioning loss, so that the encoder's output is both temporally coherent and aligned with the captioning objective. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre-trained models are available at https://github.com/BlueberryOreo/ProCap
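
The first-stage data preparation described in the abstract (sampling a sparse set of keyframes from the generated intermediate frames, then masking some of them for reconstruction) might be sketched roughly as follows. This is an illustration only, not the authors' implementation: the abstract does not say how frames are sampled or masked, so the uniform spacing, the `mask_ratio` parameter, the choice to keep the "before" and "after" images visible, and the function names `sample_keyframes` and `mask_keyframes` are all assumptions.

```python
import random


def sample_keyframes(num_frames, num_keyframes):
    """Pick evenly spaced frame indices, always keeping the first and last frame.

    Uniform spacing is an assumption; the abstract only says frames are
    sampled to reduce redundancy.
    """
    if num_keyframes < 2 or num_keyframes > num_frames:
        raise ValueError("need 2 <= num_keyframes <= num_frames")
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]


def mask_keyframes(keyframe_indices, mask_ratio, rng):
    """Split keyframes into visible and masked subsets for reconstruction.

    Only interior keyframes are eligible for masking, so the original
    "before" and "after" images stay visible (an assumption of this sketch).
    """
    interior = keyframe_indices[1:-1]
    wanted = max(1, round(mask_ratio * len(keyframe_indices)))
    num_masked = min(len(interior), wanted)
    masked = sorted(rng.sample(interior, num_masked))
    visible = [i for i in keyframe_indices if i not in masked]
    return visible, masked


# Example: 17 generated frames reduced to 5 keyframes, some of them masked.
keyframes = sample_keyframes(num_frames=17, num_keyframes=5)
visible, masked = mask_keyframes(keyframes, mask_ratio=0.4, rng=random.Random(0))
```

In a setup like this, an encoder would see the visible keyframes together with the caption and be trained to reconstruct the masked ones. The second stage described in the abstract drops the explicit frames entirely and uses learnable procedure queries instead, which this sketch does not cover.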
