Obelics: An open web-scale filtered dataset of interleaved image-text documents
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks …
Datacomp: In search of the next generation of multimodal datasets
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model …
Deepseek-vl: towards real-world vision-language understanding
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around …
Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition
We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is …
Improving multimodal datasets with image captioning
Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to …
BrainCLIP: Bridging brain and visual-linguistic representation via CLIP for generic natural visual stimulus decoding
Due to the lack of paired samples and the low signal-to-noise ratio of functional MRI (fMRI) signals, reconstructing perceived natural images or decoding their semantic contents from …
Survey of different large language model architectures: Trends, benchmarks, and challenges
Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or …
Omchat: A recipe to train multimodal language models with strong long context and video understanding
We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are …
Large Remote Sensing Model: Progress and Prospects
L Zhang, L Zhang, Q Yuan - Geomatics and Information Science …, 2023 - ch.whu.edu.cn
In recent years, significant advancements in large language models and visual foundation models in the field of artificial intelligence have attracted scholars' attention to the potential of …
Cvlue: A new benchmark dataset for Chinese vision-language understanding evaluation
Y Wang, Y Liu, F Yu, C Huang, K Li, Z Wan… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from …