Masked generative video-to-audio transformers with enhanced synchronicity

S Pascual, C Yeh, I Tsiamas, J Serrà - European Conference on Computer …, 2024 - Springer
Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …

Temporally aligned audio for video with autoregression

I Viertola, V Iashin, E Rahtu - arXiv preprint arXiv:2409.13689, 2024 - arxiv.org
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment
and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature …

Video-guided foley sound generation with multimodal controls

Z Chen, P Seetharaman, B Russell, O Nieto… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating sound effects for videos often requires creating artistic sound effects that diverge
significantly from real-life sources, as well as flexible control over the sound design. To address this …

Taming multimodal joint training for high-quality video-to-audio synthesis

HK Cheng, M Ishii, A Hayakawa, T Shibuya… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose to synthesize high-quality and synchronized audio, given video and optional
text conditions, using a novel multimodal joint training framework MMAudio. In contrast to …

From vision to audio and beyond: A unified model for audio-visual representation and generation

K Su, X Liu, E Shlizerman - arXiv preprint arXiv:2409.19132, 2024 - arxiv.org
Video encompasses both visual and auditory data, creating a perceptually rich experience
where these two modalities complement each other. As such, videos are a valuable type of …

LoVA: Long-form Video-to-Audio Generation

X Cheng, X Wang, Y Wu, Y Wang, R Song - arXiv preprint arXiv …, 2024 - arxiv.org
Video-to-audio (V2A) generation is important for video editing and post-processing,
enabling the creation of semantics-aligned audio for silent video. However, most existing …

Generative AI for Cel-Animation: A Survey

Y Tang, J Guo, P Liu, Z Wang, H Hua, JX Zhong… - arXiv preprint arXiv …, 2025 - arxiv.org
The traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential
steps, including storyboarding, layout design, keyframe animation, inbetweening, and …

Images that Sound: Composing Images and Sounds on a Single Canvas

Z Chen, D Geng, A Owens - arXiv preprint arXiv:2405.12221, 2024 - arxiv.org
Spectrograms are 2D representations of sound that look very different from the images found
in our visual world. And natural images, when played as spectrograms, make unnatural …

Towards Integrated Audio-Visual Learning: From Vision-to-Audio Generation to a Unified Audio-Visual Framework

K Su - 2024 - digital.lib.washington.edu
The interplay between audio and visual signals, rich in correlations across various scales,
significantly impacts human perception and drives a consistent demand for audio-visual …

Generative and parametric models for interactive neural synthesis in speech and audio

MJC Largo - 2024 - oa.upm.es
Speech synthesis is a multifaceted process that encompasses both acoustic signals and
articulatory dynamics. Traditional neural audio synthesis methods often rely exclusively on …