CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
Article 2025 en
Authors
Bojia Zi
Shihao Zhao
Xianbiao Qi
Abstract
Video inpainting is a crucial task with diverse applications, including fine-grained video editing, video recovery, and video dewatermarking. However, most existing video inpainting methods primarily focus on visual content completion while neglecting text information. There are only a limited number of text-guided video inpainting techniques, and these techniques struggle with maintaining visual quality and exhibit poor semantic representation capabilities. In this paper, we introduce CoCoCo, a text-guided video inpainting diffusion framework. To address the aforementioned challenges, we enhance both the training data and model structure. Specifically, we devise an instance-aware region selection strategy for masked area sampling and develop a novel motion block that incorporates efficient 3D full attention and textual cross attention. Additionally, our CoCoCo framework can be seamlessly integrated with various personalized text-to-image diffusion models through a delicate training-free transfer mechanism. Comprehensive experiments demonstrate that CoCoCo can create high-quality visual content with enhanced temporal consistency, improved text controllability, and better compatibility with personalized image models.
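To make the "3D full attention" and "textual cross attention" ideas in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' motion block: the function names, tensor shapes, and the omission of learned query/key/value projections are all hypothetical simplifications, and the naive full attention shown here does not include whatever efficiency measures the paper uses. The point is only that full 3D attention lets every spatio-temporal token attend to every other one, while cross attention lets video tokens read from the text prompt embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention.
    # q: (Nq, d), k and v: (Nk, d) -> output: (Nq, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def full_3d_self_attention(video_tokens):
    # video_tokens: (T, H, W, d). Flatten time and space into a single
    # sequence so each token attends across all frames and positions.
    # Assumption: no learned projections, for brevity.
    T, H, W, d = video_tokens.shape
    seq = video_tokens.reshape(T * H * W, d)
    return attention(seq, seq, seq).reshape(T, H, W, d)

def text_cross_attention(video_tokens, text_tokens):
    # Queries come from the video; keys and values come from the text
    # embeddings (text_tokens: (L, d)), so the prompt can steer the content.
    T, H, W, d = video_tokens.shape
    seq = video_tokens.reshape(T * H * W, d)
    return attention(seq, text_tokens, text_tokens).reshape(T, H, W, d)

# Toy example: 4 frames of 8x8 latent tokens, 6 text tokens, width 16.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8, 8, 16))
text = rng.normal(size=(6, 16))
out = text_cross_attention(full_3d_self_attention(video), text)
```

In this toy setup the flattened sequence has 4 × 8 × 8 = 256 tokens, so the self-attention score matrix is 256 × 256. That quadratic growth in the number of frames and spatial positions is why an efficient formulation matters in practice.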