A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

K Cheng, Q Sun, Y Chu, F Xu, Y Li, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Graphical User Interface (GUI) agents are designed to automate complex tasks on digital
devices, such as smartphones and desktops. Most existing GUI agents interact with the …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Empowering biomedical discovery with AI agents

S Gao, A Fang, Y Huang, V Giunchiglia, A Noori… - Cell, 2024 - cell.com
We envision "AI scientists" as systems capable of skeptical learning and reasoning that
empower biomedical research through collaborative agents that integrate AI models and …

VIEScore: Towards explainable metrics for conditional image synthesis evaluation

M Ku, D Jiang, C Wei, X Yue, W Chen - arXiv preprint arXiv:2312.14867, 2023 - arxiv.org
In the rapidly advancing field of conditional image generation research, challenges such as
limited explainability lie in effectively evaluating the performance and capabilities of various …

Building and better understanding vision-language models: insights and future directions

H Laurençon, A Marafioti, V Sanh… - … on Responsibly Building …, 2024 - openreview.net
The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …

From concept to manufacturing: Evaluating vision-language models for engineering design

C Picard, KM Edwards, AC Doris, B Man… - arXiv preprint arXiv …, 2023 - arxiv.org
Engineering design is undergoing a transformative shift with the advent of AI, marking a new
era in how we approach product, system, and service planning. Large language models …

TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation

J Wen, Y Zhu, J Li, M Zhu, K Wu, Z Xu, N Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor
control and instruction comprehension through end-to-end learning processes. However …