Identity-Preserving Text-to-Video Generation by Frequency Decomposition

S Yuan, J Huang, X He, Y Ge, Y Shi, L Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with
consistent human identity. It is an important task in video generation but remains an open …

Non-uniform timestep sampling: Towards faster diffusion model training

T Zheng, C Geng, PT Jiang, B Wan, H Zhang… - Proceedings of the …, 2024 - dl.acm.org
Diffusion models have garnered significant success in generative tasks, emerging as the
predominant model in this domain. Despite their success, the substantial computational …

Multi-modal generative AI: Multi-modal LLM, diffusion and beyond

H Chen, X Wang, Y Zhou, B Huang, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal generative AI has received increasing attention in both academia and industry.
In particular, two dominant families of techniques are: i) the multi-modal large language …

Motion Prompting: Controlling Video Generation with Motion Trajectories

D Geng, C Herrmann, J Hur, F Cole, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Motion control is crucial for generating expressive and compelling video content; however,
most existing video generation models rely mainly on text prompts for control, which struggle …

CamI2V: Camera-controlled image-to-video diffusion model

G Zheng, T Li, R Jiang, Y Lu, T Wu, X Li - arXiv preprint arXiv:2410.15957, 2024 - arxiv.org
Recently, camera pose, as a user-friendly and physics-related condition, has been
introduced into text-to-video diffusion models for camera control. However, existing methods …

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

J Wu, C Tang, J Wang, Y Zeng, X Li, Y Tong - arXiv preprint arXiv …, 2024 - arxiv.org
Story visualization, the task of creating visual narratives from textual descriptions, has seen
progress with text-to-image generation models. However, these models often lack effective …

DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models

H Kim, S Beak, H Joo - arXiv preprint arXiv:2501.08333, 2025 - arxiv.org
Understanding the ability of humans to use objects is crucial for AI to improve daily life.
Existing studies on learning such an ability focus on human-object patterns (e.g., contact, spatial …

Trajectory Attention for Fine-grained Video Motion Control

Z Xiao, W Ouyang, Y Zhou, S Yang, L Yang, J Si… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in video generation have been greatly driven by video diffusion
models, with camera motion control emerging as a crucial challenge in creating view …

MIMAFace: Face Animation via Motion-Identity Modulated Appearance Feature Learning

Y Han, J Zhu, Y Feng, X Ji, K He, X Li, Y Liu - arXiv preprint arXiv …, 2024 - arxiv.org
Current diffusion-based face animation methods generally adopt a ReferenceNet (a copy of
U-Net) and a large amount of curated self-acquired data to learn appearance features, as …

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Y Wu, Z Zhang, Y Li, Y Xu, A Kag, Y Sui… - arXiv preprint arXiv …, 2024 - arxiv.org
We have witnessed the unprecedented success of diffusion-based video generation over
the past year. Recently proposed models from the community have wielded the power to …