2605.01720v2 May 03, 2026 cs.CV

SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages

Dimitris N. Metaxas

Sen Fang

Yanxin Zhang

Hongbin Zhong

Abstract

Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks. First, RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. Second, recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 55 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.
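
To make the notion of a 2D pose sequence "used directly for modeling" more concrete, the following is a minimal sketch of how one pose-native clip might be held in memory and made independent of video resolution. It is an illustration under stated assumptions, not the dataset's actual schema or the authors' pipeline: the `PoseClip` class, its field names, the `normalize` helper, and the `min_conf` threshold are all hypothetical, and the keypoints are assumed to arrive as `(x, y, confidence)` triples in pixel coordinates from a pose estimator such as DWPose.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One keypoint as (x, y, confidence). This layout is an assumption made for
# illustration; the abstract does not specify the stored format.
Keypoint = Tuple[float, float, float]


@dataclass
class PoseClip:
    """Hypothetical container for one clip (not the dataset's real schema)."""
    language: str                 # e.g. "ase", the ISO 639-3 code for American Sign Language
    width: int                    # source video width in pixels
    height: int                   # source video height in pixels
    frames: List[List[Keypoint]]  # frames[t][k] = (x, y, confidence)


def normalize(clip: PoseClip, min_conf: float = 0.3) -> List[List[Keypoint]]:
    """Scale pixel coordinates to [0, 1] and zero out low-confidence keypoints."""
    normalized_frames = []
    for frame in clip.frames:
        normalized_frame = []
        for x, y, conf in frame:
            if conf < min_conf:
                # Treat unreliable detections as missing.
                normalized_frame.append((0.0, 0.0, 0.0))
            else:
                normalized_frame.append((x / clip.width, y / clip.height, conf))
        normalized_frames.append(normalized_frame)
    return normalized_frames


# A one-frame, two-keypoint toy clip.
clip = PoseClip("ase", 640, 480, [[(320.0, 240.0, 0.9), (10.0, 10.0, 0.1)]])
normalized = normalize(clip)
```

Dividing by the frame size is one simple way to remove resolution differences between source videos; a real pipeline might instead normalize relative to the signer's body (for example, shoulder width), which this sketch does not attempt.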
