- Academic Search

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Cited by: 12

ControlLLM: Augment language models with tools by searching on graphs

Z Liu, Z Lai, Z Gao, E Cui, Z Li, X Zhu, L Lu… - … on Computer Vision, 2024 - Springer

We present ControlLLM, a novel framework that enables large language models (LLMs) to
utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable …

Cited by: 30

VidMuse: A simple video-to-music generation framework with long-short-term modeling

Z Tian, Z Liu, R Yuan, J Pan, Q Liu, X Tan… - arXiv preprint arXiv …, 2024 - arxiv.org

In this work, we systematically study music generation conditioned solely on the video. First,
we present a large-scale dataset comprising 360K video-music pairs, including various …

Cited by: 6

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org

The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

Cited by: 5

MagicQuill: An intelligent interactive image editing system

Z Liu, Y Yu, H Ouyang, Q Wang, KL Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org

Image editing involves a variety of complex tasks and requires efficient and precise
manipulation techniques. In this paper, we present MagicQuill, an integrated image editing …

Cited by: 2

From Efficient Multimodal Models to World Models: A Survey

X Mai, Z Tao, J Lin, H Wang, Y Chang, Y Kang… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal Large Models (MLMs) are becoming a significant research focus, combining
powerful large language models with multimodal learning to perform complex tasks across …

Cited by: 3

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

X Chi, Y Wang, A Cheng, P Fang, Z Tian, Y He… - arXiv preprint arXiv …, 2024 - arxiv.org

Massive multi-modality datasets play a significant role in facilitating the success of large
video-language models. However, current video-language datasets primarily provide text …

Cited by: 1

EVA: An Embodied World Model for Future Video Anticipation

X Chi, H Zhang, CK Fan, X Qi, R Zhang, A Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

World models integrate raw data from various modalities, such as images and language to
simulate comprehensive interactions in the world, thereby displaying crucial roles in fields …

Cited by: 1

TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization

KT Pham, J Chen, Q Chen - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org

We present TALE, a novel training-free framework harnessing the generative capabilities of
text-to-image diffusion models to address the cross-domain image composition task that …

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Y Qi, H Li, Y Song, X Wu, J Luo - arXiv preprint arXiv:2412.08158, 2024 - arxiv.org

The exploration of various vision-language tasks, such as visual captioning, visual question
answering, and visual commonsense reasoning, is an important area in artificial intelligence …

Saved in "My Library"

LLMs Meet Multimodal Generation and Editing: A Survey

Foundation models for music: A survey
