MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Pengi: An audio language model for audio tasks

S Deshmukh, B Elizalde, R Singh… - Advances in Neural …, 2023 - proceedings.neurips.cc
In the domain of audio processing, Transfer Learning has facilitated the rise of Self-
Supervised Learning and Zero-Shot Learning techniques. These approaches have led to …

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

Uni-MoE: Scaling unified multimodal LLMs with mixture of experts

Y Li, S Jiang, B Hu, L Wang, W Zhong… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Recent advancements in Multimodal Large Language Models (MLLMs) underscore the
significance of scalable models and data to boost performance, yet this often incurs …

ChatBridge: Bridging modalities with large language model as a language catalyst

Z Zhao, L Guo, T Yue, S Chen, S Shao, X Zhu… - arXiv preprint arXiv …, 2023 - arxiv.org
Building general-purpose models that can perceive diverse real-world modalities and solve
various tasks is an appealing target in artificial intelligence. In this paper, we present …

SemantiCodec: An ultra low bitrate semantic audio codec for general sound

H Liu, X Xu, Y Yuan, M Wu, W Wang… - IEEE Journal of …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have significantly advanced audio processing through
audio codecs that convert audio into discrete tokens, enabling the application of language …

Diverse and aligned audio-to-video generation via text-to-video model adaptation

G Yariv, I Gat, S Benaim, L Wolf, I Schwartz… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
We consider the task of generating diverse and realistic videos guided by natural audio
samples from a wide variety of semantic classes. For this task, the videos are required to be …