A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

On evaluating adversarial robustness of large vision-language models

Y Zhao, T Pang, C Du, X Yang, C Li… - Advances in …, 2024 - proceedings.neurips.cc
Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented
performance in response generation, especially with visual inputs, enabling more creative …

BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - … on Computer Vision, 2024 - Springer
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

Autonomous evaluation and refinement of digital agents

J Pan, Y Zhang, N Tomlin, Y Zhou, S Levine… - arXiv preprint arXiv …, 2024 - arxiv.org
We show that domain-general automatic evaluators can significantly improve the
performance of agents for web navigation and device control. We experiment with multiple …

Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data

Q Yu, J Li, L Wei, L Pang, W Ye, B Qin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …

Learning to model the world with language

J Lin, Y Du, O Watkins, D Hafner, P Abbeel… - arXiv preprint arXiv …, 2023 - arxiv.org
To interact with humans in the world, agents need to understand the diverse types of
language that people use, relate them to the visual world, and act based on them. While …

Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models

H Cheng, E **ao, J Gu, L Yang, J Duan… - … on Computer Vision, 2024 - Springer
Large Vision-Language Models (LVLMs) rely on vision encoders and Large
Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in …

Language models are free boosters for biomedical imaging tasks

Z Lai, J Wu, S Chen, Y Zhou, A Hovakimyan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this study, we uncover the unexpected efficacy of residual-based large language models
(LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of …

Musechat: A conversational music recommendation system for videos

Z Dong, X Liu, B Chen, P Polak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Music recommendation for videos attracts growing interest in multi-modal research.
However, existing systems focus primarily on content compatibility, often ignoring the users' …