Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

N Majumder, CY Hung, D Ghosal, WN Hsu… - Proceedings of the …, 2024 - dl.acm.org
Generative multimodal content is increasingly prevalent in much of the content creation
arena, as it has the potential to allow artists and media personnel to create pre-production …

Mustango: Toward controllable text-to-music generation

J Melechovsky, Z Guo, D Ghosal, N Majumder… - arXiv preprint arXiv …, 2023 - arxiv.org
With recent advancements in text-to-audio and text-to-music based on latent diffusion
models, the quality of generated content has been reaching new heights. The controllability …

Loop Copilot: Conducting AI ensembles for music generation and iterative editing

Y Zhang, A Maezawa, G Xia, K Yamamoto… - arXiv preprint arXiv …, 2023 - arxiv.org
Creating music is iterative, requiring varied methods at each stage. However, existing AI
music systems fall short in orchestrating multiple subsystems for diverse needs. To address …

TiVA: Time-aligned video-to-audio generation

X Wang, Y Wang, Y Wu, R Song, X Tan… - Proceedings of the …, 2024 - dl.acm.org
Video-to-audio generation is crucial for autonomous video editing and post-processing,
which aims to generate high-quality audio for silent videos with semantic similarity and …

VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation

R Huang, Y Wang, R Hu, X Xu, Z Hong… - Proceedings of the …, 2024 - dl.acm.org
Voice large language models (LLMs) cast voice synthesis as a language modeling task in a
discrete space, and have demonstrated significant progress to date. Despite the recent …

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

H Liu, R Huang, Y Liu, H Cao, J Wang… - Proceedings of the …, 2024 - dl.acm.org
Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the
forefront of various generative tasks. However, their iterative sampling process poses a …

Tango 2: Aligning diffusion-based text-to-audio generative models through direct preference optimization

N Majumder, CY Hung, D Ghosal, WN Hsu… - ACM Multimedia …, 2024 - openreview.net
Generative multimodal content is increasingly prevalent in much of the content creation
arena, as it has the potential to allow artists and media personnel to create pre-production …

Dance-to-music generation with encoder-based textual inversion of diffusion models

S Li, W Dong, Y Zhang, F Tang, C Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
The harmonious integration of music with dance movements is pivotal in vividly conveying
the artistic essence of dance. This alignment also significantly elevates the immersive quality …

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

H Liu, J Wang, R Huang, Y Liu, H Lu, W Xue… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-
audio generation, yet their iterative sampling processes impose substantial computational …

Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

T Xu, J Li, X Chen, X Yao, S Liu - arXiv preprint arXiv:2405.02801, 2024 - arxiv.org
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements,
facilitating the generation of music, images, and other forms of artistic expression across …