OBELICS: An open web-scale filtered dataset of interleaved image-text documents

H Laurençon, L Saulnier, L Tronchon… - Advances in …, 2024 - proceedings.neurips.cc
Large multimodal models trained on natural documents, which interleave images and text,
outperform models trained on image-text pairs on various multimodal benchmarks …

DataComp: In search of the next generation of multimodal datasets

SY Gadre, G Ilharco, A Fang… - Advances in …, 2024 - proceedings.neurips.cc
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable
Diffusion and GPT-4, yet their design does not receive the same research attention as model …

DeepSeek-VL: Towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition

P Zhang, X Dong, B Wang, Y Cao, C Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose InternLM-XComposer, a vision-language large model that enables advanced
image-text comprehension and composition. The innovative nature of our model is …

Improving multimodal datasets with image captioning

T Nguyen, SY Gadre, G Ilharco… - Advances in Neural …, 2024 - proceedings.neurips.cc
Massive web datasets play a key role in the success of large vision-language models like
CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to …

BrainCLIP: Bridging brain and visual-linguistic representation via CLIP for generic natural visual stimulus decoding

Y Liu, Y Ma, W Zhou, G Zhu, N Zheng - arXiv preprint arXiv:2302.12971, 2023 - arxiv.org
Due to the lack of paired samples and the low signal-to-noise ratio of functional MRI (fMRI)
signals, reconstructing perceived natural images or decoding their semantic contents from …

Survey of different large language model architectures: Trends, benchmarks, and challenges

M Shao, A Basit, R Karri, M Shafique - IEEE Access, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) represent a class of deep learning models adept at
understanding natural language and generating coherent responses to various prompts or …

OmChat: A recipe to train multimodal language models with strong long context and video understanding

T Zhao, Q Zhang, K Lee, P Liu, L Zhang, C Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OmChat, a model designed to excel in handling long contexts and video
understanding tasks. OmChat's new architecture standardizes how different visual inputs are …

Large Remote Sensing Model: Progress and Prospects

L Zhang, L Zhang, Q Yuan - Geomatics and Information Science …, 2023 - ch.whu.edu.cn
In recent years, significant advancements in large language models and visual foundation
models in the field of artificial intelligence have attracted scholars' attention to the potential of …

CVLUE: A new benchmark dataset for Chinese vision-language understanding evaluation

Y Wang, Y Liu, F Yu, C Huang, K Li, Z Wan… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid development of Chinese vision-language models (VLMs), most existing
Chinese vision-language (VL) datasets are constructed on Western-centric images from …