A Survey of Multimodel Large Language Models
Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …
Real-world robot applications of foundation models: A review
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …
On evaluating adversarial robustness of large vision-language models
Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented
performance in response generation, especially with visual inputs, enabling more creative …
BRAVE: Broadening the visual encoding of vision-language models
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
Autonomous evaluation and refinement of digital agents
We show that domain-general automatic evaluators can significantly improve the
performance of agents for web navigation and device control. We experiment with multiple …
Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data
Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …
Learning to model the world with language
To interact with humans in the world, agents need to understand the diverse types of
language that people use, relate them to the visual world, and act based on them. While …
Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models
Large Vision-Language Models (LVLMs) rely on vision encoders and Large
Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in …
Language models are free boosters for biomedical imaging tasks
In this study, we uncover the unexpected efficacy of residual-based large language models
(LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of …
Musechat: A conversational music recommendation system for videos
Music recommendation for videos attracts growing interest in multi-modal research.
However, existing systems focus primarily on content compatibility, often ignoring the users' …