2606.11670v1 Jun 10, 2026 cs.CV

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

Chengzhuo Tong
Yuanxing Zhang
Zijie Meng
Xiaoqiang Liu
Jiwen Liu
Yulong Xu
Pengfei Wan
Yufei Liu

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3×3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.
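
The abstract names three SMII steps: arranging identity evidence into a 3×3 mosaic, synchronizing it with the current diffusion time, and placing it at negative time as read-only memory. The following is a minimal illustrative sketch of how those steps might look, not the authors' implementation. Everything in it is an assumption: the function names `build_mosaic`, `noise_to_time`, and `temporal_positions` are hypothetical, the mosaic is read here as a spatial tiling of nine equally sized crops, the noising uses a simple linear interpolation toward Gaussian noise, and "negative time" is read as assigning negative temporal position indices to the memory frames. The paper may define each of these differently.

```python
import numpy as np

def build_mosaic(crops, grid=3):
    """Tile grid*grid equally sized HxWxC crops into one (grid*H)x(grid*W)xC image.

    Assumption: the "stacked mosaic" is a row-major spatial tiling.
    """
    if len(crops) != grid * grid:
        raise ValueError(f"expected {grid * grid} crops, got {len(crops)}")
    rows = [
        np.concatenate(crops[r * grid:(r + 1) * grid], axis=1)  # join along width
        for r in range(grid)
    ]
    return np.concatenate(rows, axis=0)  # join rows along height

def noise_to_time(x0, t, rng):
    """Noise a clean array to time t in [0, 1].

    Assumption: linear interpolation toward Gaussian noise, so that
    t=0 returns the clean input and t=1 returns pure noise.
    """
    eps = rng.standard_normal(x0.shape)
    return (1.0 - t) * x0 + t * eps

def temporal_positions(num_video_frames, num_memory_frames=1):
    """Temporal indices with memory frames placed before frame 0.

    Assumption: "negative-time" means indices -num_memory_frames..-1.
    """
    memory = np.arange(-num_memory_frames, 0)
    video = np.arange(num_video_frames)
    return np.concatenate([memory, video])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Nine dummy 4x4 RGB crops, each filled with its own index.
    crops = [np.full((4, 4, 3), i, dtype=np.float32) for i in range(9)]
    mosaic = build_mosaic(crops)
    noisy = noise_to_time(mosaic, t=0.5, rng=rng)
    print(mosaic.shape, noisy.shape, temporal_positions(4))
```

In this reading, the mosaic is noised to the same level as the video latents at each denoising step, so the memory is never presented as a clean external reference, and its negative temporal index keeps it separate from the generated frames.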
