Google Scholar

MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arxiv preprint arxiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Cited by 228

A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arxiv preprint arxiv …, 2024 - arxiv.org

In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

Cited by 27

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 130

Pengi: An audio language model for audio tasks

S Deshmukh, B Elizalde, R Singh… - Advances in Neural …, 2023 - proceedings.neurips.cc

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-
Supervised Learning and Zero-Shot Learning techniques. These approaches have led to …

Cited by 148

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc

Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Cited by 110

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

Cited by 162

Uni-MoE: Scaling unified multimodal LLMs with mixture of experts

Y Li, S Jiang, B Hu, L Wang, W Zhong… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Recent advancements in Multimodal Large Language Models (MLLMs) underscore the
significance of scalable models and data to boost performance, yet this often incurs …

Cited by 26

ChatBridge: Bridging modalities with large language model as a language catalyst

Z Zhao, L Guo, T Yue, S Chen, S Shao, X Zhu… - arxiv preprint arxiv …, 2023 - arxiv.org

Building general-purpose models that can perceive diverse real-world modalities and solve
various tasks is an appealing target in artificial intelligence. In this paper, we present …

Cited by 54

SemantiCodec: An ultra low bitrate semantic audio codec for general sound

H Liu, X Xu, Y Yuan, M Wu, W Wang… - IEEE Journal of …, 2024 - ieeexplore.ieee.org

Large language models (LLMs) have significantly advanced audio processing through
audio codecs that convert audio into discrete tokens, enabling the application of language …

Cited by 20

Diverse and aligned audio-to-video generation via text-to-video model adaptation

G Yariv, I Gat, S Benaim, L Wolf, I Schwartz… - Proceedings of the AAAI …, 2024 - ojs.aaai.org

We consider the task of generating diverse and realistic videos guided by natural audio
samples from a wide variety of semantic classes. For this task, the videos are required to be …

Cited by 32
