From pixels to insights: A survey on automatic chart understanding in the era of large foundation models

KH Huang, HP Chan, YR Fung, H Qiu… - … on Knowledge and …, 2024 - ieeexplore.ieee.org
Data visualization in the form of charts plays a pivotal role in data analysis, offering critical
insights and aiding in informed decision-making. Automatic chart understanding has …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

CogAgent: A visual language model for GUI agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive
problem-solving skills in many tasks and domains, but their ability in mathematical …

WebArena: A realistic web environment for building autonomous agents

S Zhou, FF Xu, H Zhu, X Zhou, R Lo, A Sridhar… - arXiv preprint arXiv …, 2023 - arxiv.org
With advances in generative AI, there is now potential for autonomous agents to manage
daily tasks via natural language commands. However, current agents are primarily created …

GPT-4V(ision) is a generalist web agent, if grounded

B Zheng, B Gou, J Kil, H Sun, Y Su - arXiv preprint arXiv:2401.01614, 2024 - arxiv.org
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and
Gemini, has been quickly expanding the capability boundaries of multimodal models …

Vary: Scaling up the vision vocabulary for large vision-language models

H Wei, L Kong, J Chen, L Zhao, Z Ge, J Yang… - … on Computer Vision, 2024 - Springer
Most Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary, i.e., CLIP,
for common vision tasks. However, for some special tasks that need dense and fine …