A Review of Scene Dynamic Control in Text-Guided Video Prediction Large Models
Author: WU Fuxiang, CHENG Jun

CLC Number: TP391.7

Fund Project: This work is supported by the National Natural Science Foundation of China (U21A20487, 62372440).

    Abstract:

    In recent years, the rapid development of generative AI has made text-driven video prediction large models a hot topic in academia and industry. Video prediction and generation must address temporal dynamics and consistency, which requires precise control of scene structure, subject behavior, camera movement, and semantic expression. A major challenge is accurately controlling scene dynamics during video prediction so that outputs are both high quality and semantically consistent. To this end, researchers have proposed key control methods, including camera control enhancement, reference video control, semantic consistency enhancement, and subject feature control improvement. These methods aim to improve generation quality, ensuring that outputs remain consistent with the historical context while meeting user requirements. This paper systematically reviews the core concepts, advantages, limitations, and future directions of these four control approaches.

Get Citation

WU Fuxiang, CHENG Jun. A Review of Scene Dynamic Control in Text-Guided Video Prediction Large Models [J]. Journal of Integration Technology, 2025, 14(1): 9-24.

History
  • Received: December 01, 2024
  • Revised: December 08, 2024
  • Accepted: December 11, 2024
  • Online: December 11, 2024