Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

ControlLLM: Augment language models with tools by searching on graphs

Z Liu, Z Lai, Z Gao, E Cui, Z Li, X Zhu, L Lu… - … on Computer Vision, 2024 - Springer
We present ControlLLM, a novel framework that enables large language models (LLMs) to
utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable …

VidMuse: A simple video-to-music generation framework with long-short-term modeling

Z Tian, Z Liu, R Yuan, J Pan, Q Liu, X Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we systematically study music generation conditioned solely on the video. First,
we present a large-scale dataset comprising 360K video-music pairs, including various …

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

MagicQuill: An intelligent interactive image editing system

Z Liu, Y Yu, H Ouyang, Q Wang, KL Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Image editing involves a variety of complex tasks and requires efficient and precise
manipulation techniques. In this paper, we present MagicQuill, an integrated image editing …

From Efficient Multimodal Models to World Models: A Survey

X Mai, Z Tao, J Lin, H Wang, Y Chang, Y Kang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Models (MLMs) are becoming a significant research focus, combining
powerful large language models with multimodal learning to perform complex tasks across …

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

X Chi, Y Wang, A Cheng, P Fang, Z Tian, Y He… - arXiv preprint arXiv …, 2024 - arxiv.org
Massive multi-modality datasets play a significant role in facilitating the success of large
video-language models. However, current video-language datasets primarily provide text …

EVA: An Embodied World Model for Future Video Anticipation

X Chi, H Zhang, CK Fan, X Qi, R Zhang, A Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
World models integrate raw data from various modalities, such as images and language to
simulate comprehensive interactions in the world, thereby displaying crucial roles in fields …

TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization

KT Pham, J Chen, Q Chen - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
We present TALE, a novel training-free framework harnessing the generative capabilities of
text-to-image diffusion models to address the cross-domain image composition task that …

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Y Qi, H Li, Y Song, X Wu, J Luo - arXiv preprint arXiv:2412.08158, 2024 - arxiv.org
The exploration of various vision-language tasks, such as visual captioning, visual question
answering, and visual commonsense reasoning, is an important area in artificial intelligence …