From pixels to insights: A survey on automatic chart understanding in the era of large foundation models

KH Huang, HP Chan, YR Fung, H Qiu… - … on Knowledge and …, 2024 - ieeexplore.ieee.org
Data visualization in the form of charts plays a pivotal role in data analysis, offering critical
insights and aiding in informed decision-making. Automatic chart understanding has …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

CogAgent: A visual language model for GUI agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive
problem-solving skills in many tasks and domains, but their ability in mathematical …

WebArena: A realistic web environment for building autonomous agents

S Zhou, FF Xu, H Zhu, X Zhou, R Lo, A Sridhar… - arXiv preprint arXiv …, 2023 - arxiv.org
With advances in generative AI, there is now potential for autonomous agents to manage
daily tasks via natural language commands. However, current agents are primarily created …

GPT-4V(ision) is a generalist web agent, if grounded

B Zheng, B Gou, J Kil, H Sun, Y Su - arXiv preprint arXiv:2401.01614, 2024 - arxiv.org
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and
Gemini, has been quickly expanding the capability boundaries of multimodal models …

Vary: Scaling up the vision vocabulary for large vision-language models

H Wei, L Kong, J Chen, L Zhao, Z Ge, J Yang… - … on Computer Vision, 2024 - Springer
Most Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary, i.e., CLIP,
for common vision tasks. However, for some special tasks that need dense and fine …