- Academic Search

A Survey of Multimodel Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org

With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Cited by 1208

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 88

NExT-GPT: Any-to-any multimodal LLM

S Wu, H Fei, L Qu, W Ji, TS Chua - Forty-first International …, 2024 - openreview.net

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Cited by 498

AudioLDM: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

Cited by 556

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press

Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Cited by 327

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 125

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

Cited by 530

Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Cited by 119

Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models

Y Chu, J Xu, X Zhou, Q Yang, S Zhang, Z Yan… - arXiv preprint arXiv …, 2023 - arxiv.org

Recently, instruction-following audio-language models have received broad attention for
audio interaction with humans. However, the absence of pre-trained audio models capable …

Cited by 232

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

Cited by 101
