VinTAGe: Joint video and text conditioning for holistic audio generation

SS Kushwaha, Y Tian - arXiv preprint arXiv:2412.10768, 2024 - arxiv.org
Recent advances in audio generation have focused on text-to-audio (T2A) and
video-to-audio (V2A) tasks. However, T2A or V2A methods cannot generate holistic sounds …