Foundation models for music: A survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
Controlllm: Augment language models with tools by searching on graphs
We present ControlLLM, a novel framework that enables large language models (LLMs) to
utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable …
utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable …
Vidmuse: A simple video-to-music generation framework with long-short-term modeling
In this work, we systematically study music generation conditioned solely on the video. First,
we present a large-scale dataset comprising 360K video-music pairs, including various …
we present a large-scale dataset comprising 360K video-music pairs, including various …
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …
Magicquill: An intelligent interactive image editing system
Image editing involves a variety of complex tasks and requires efficient and precise
manipulation techniques. In this paper, we present MagicQuill, an integrated image editing …
manipulation techniques. In this paper, we present MagicQuill, an integrated image editing …
From Efficient Multimodal Models to World Models: A Survey
Multimodal Large Models (MLMs) are becoming a significant research focus, combining
powerful large language models with multimodal learning to perform complex tasks across …
powerful large language models with multimodal learning to perform complex tasks across …
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Massive multi-modality datasets play a significant role in facilitating the success of large
video-language models. However, current video-language datasets primarily provide text …
video-language models. However, current video-language datasets primarily provide text …
EVA: An Embodied World Model for Future Video Anticipation
World models integrate raw data from various modalities, such as images and language to
simulate comprehensive interactions in the world, thereby displaying crucial roles in fields …
simulate comprehensive interactions in the world, thereby displaying crucial roles in fields …
TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization
We present TALE, a novel training-free framework harnessing the generative capabilities of
text-to-image diffusion models to address the cross-domain image composition task that …
text-to-image diffusion models to address the cross-domain image composition task that …
How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey
Y Qi, H Li, Y Song, X Wu, J Luo - arxiv preprint arxiv:2412.08158, 2024 - arxiv.org
The exploration of various vision-language tasks, such as visual captioning, visual question
answering, and visual commonsense reasoning, is an important area in artificial intelligence …
answering, and visual commonsense reasoning, is an important area in artificial intelligence …