Masked generative video-to-audio transformers with enhanced synchronicity

S Pascual, C Yeh, I Tsiamas, J Serrà - European Conference on Computer …, 2024 - Springer
Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …

FoleyCrafter: Bring silent videos to life with lifelike and synchronized sounds

Y Zhang, Y Gu, Y Zeng, Z Xing, Y Wang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing
with videos, enabling an immersive audio-visual experience. Despite its wide range of …

Temporally aligned audio for video with autoregression

I Viertola, V Iashin, E Rahtu - arXiv preprint arXiv:2409.13689, 2024 - arxiv.org
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment
and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature …

FoleyGen: Visually-guided audio generation

X Mei, V Nagaraja, G Le Lan, Z Ni… - 2024 IEEE 34th …, 2024 - ieeexplore.ieee.org
Recent advancements in audio generation tasks, such as text-to-audio and text-to-music
generation, have been spurred by the evolution of deep learning models and large-scale …

Draw an audio: Leveraging multi-instruction for video-to-audio synthesis

Q Yang, B Mao, Z Wang, X Nie, P Gao, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects
to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a …

Video-guided foley sound generation with multimodal controls

Z Chen, P Seetharaman, B Russell, O Nieto… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating sound effects for videos often requires creating artistic sound effects that diverge
significantly from real-life sources and flexible control in the sound design. To address this …

Artificial Taste: Advances and Innovative Applications in Healthcare

L Wang, Y Li, Y Zhang, B Zheng - Applied Sciences, 2025 - mdpi.com
Background: Scientists have recently developed a technology that induces artificial taste
through electronic stimulation. However, scattered reports have made it difficult to …

Taming multimodal joint training for high-quality video-to-audio synthesis

HK Cheng, M Ishii, A Hayakawa, T Shibuya… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose to synthesize high-quality and synchronized audio, given video and optional
text conditions, using a novel multimodal joint training framework MMAudio. In contrast to …

Gotta hear them all: Sound source aware vision to audio generation

W Guo, H Wang, W Cai, J Ma - arXiv preprint arXiv:2411.15447, 2024 - arxiv.org
Vision-to-audio (V2A) synthesis has broad applications in multimedia. Recent
advancements of V2A methods have made it possible to generate relevant audios from …

VinTAGe: Joint video and text conditioning for holistic audio generation

SS Kushwaha, Y Tian - arXiv preprint arXiv:2412.10768, 2024 - arxiv.org
Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-
audio (V2A) tasks. However, T2A or V2A methods cannot generate holistic sounds …