Masked generative video-to-audio transformers with enhanced synchronicity
Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …
Temporally aligned audio for video with autoregression
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment
and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature …
Video-guided foley sound generation with multimodal controls
Generating sound effects for videos often requires creating artistic sound effects that diverge
significantly from real-life sources and flexible control in the sound design. To address this …
Taming multimodal joint training for high-quality video-to-audio synthesis
We propose to synthesize high-quality and synchronized audio, given video and optional
text conditions, using a novel multimodal joint training framework MMAudio. In contrast to …
From vision to audio and beyond: A unified model for audio-visual representation and generation
Video encompasses both visual and auditory data, creating a perceptually rich experience
where these two modalities complement each other. As such, videos are a valuable type of …
LoVA: Long-form Video-to-Audio Generation
Video-to-audio (V2A) generation is important for video editing and post-processing,
enabling the creation of semantics-aligned audio for silent video. However, most existing …
Generative AI for Cel-Animation: A Survey
Traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential
steps, including storyboarding, layout design, keyframe animation, inbetweening, and …
Images that Sound: Composing Images and Sounds on a Single Canvas
Spectrograms are 2D representations of sound that look very different from the images found
in our visual world. And natural images, when played as spectrograms, make unnatural …
Towards Integrated Audio-Visual Learning: From Vision-to-Audio Generation to a Unified Audio-Visual Framework
K Su - 2024 - digital.lib.washington.edu
The interplay between audio and visual signals, rich in correlations across various scales,
significantly impacts human perception and drives a consistent demand for audio-visual …
Generative and parametric models for interactive neural synthesis in speech and audio
MJC Largo - 2024 - oa.upm.es
Speech synthesis is a multifaceted process that encompasses both acoustic signals and
articulatory dynamics. Traditional neural audio synthesis methods often rely exclusively on …