MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

UniIR: Training and benchmarking universal multimodal information retrievers

C Wei, Y Chen, H Chen, H Hu, G Zhang, J Fu… - … on Computer Vision, 2024 - Springer
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …

Kosmos-2.5: A multimodal literate model

T Lv, Y Huang, J Chen, Y Zhao, Y Jia, L Cui… - arXiv preprint arXiv …, 2023 - arxiv.org
The automatic reading of text-intensive images represents a significant advancement toward
achieving Artificial General Intelligence (AGI). In this paper we present KOSMOS-2.5, a …

AnyGPT: Unified multimodal LLM with discrete sequence modeling

J Zhan, J Dai, J Ye, Y Zhou, D Zhang, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete
representations for the unified processing of various modalities, including speech, text …

EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain

W Zhang, M Cai, T Zhang, Y Zhuang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Multimodal large language models (MLLMs) have demonstrated remarkable success in
vision and visual-language tasks within the natural image domain. Owing to the significant …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

UnifiedMLLM: Enabling unified representation for multi-modal multi-tasks with large language model

Z Li, W Wang, YQ Cai, X Qi, P Wang, D Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Significant advancements have recently been achieved in the field of multi-modal large
language models (MLLMs), demonstrating their remarkable capabilities in understanding …

Multi-modal generative AI: Multi-modal LLM, diffusion and beyond

H Chen, X Wang, Y Zhou, B Huang, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal generative AI has received increasing attention in both academia and industry.
Particularly, two dominant families of techniques are: i) The multi-modal large language …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …