A Survey of Multimodel Large Language Models
Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …
Real-world robot applications of foundation models: A review
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …
On evaluating adversarial robustness of large vision-language models
Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented
performance in response generation, especially with visual inputs, enabling more creative …
BRAVE: Broadening the visual encoding of vision-language models
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
Autonomous evaluation and refinement of digital agents
We show that domain-general automatic evaluators can significantly improve the
performance of agents for web navigation and device control. We experiment with multiple …
Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data
Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …
Learning to model the world with language
To interact with humans in the world, agents need to understand the diverse types of
language that people use, relate them to the visual world, and act based on them. While …
Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models
Large Vision-Language Models (LVLMs) rely on vision encoders and Large
Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in …
Language models are free boosters for biomedical imaging tasks
In this study, we uncover the unexpected efficacy of residual-based large language models
(LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of …
Musechat: A conversational music recommendation system for videos
Music recommendation for videos attracts growing interest in multi-modal research.
However, existing systems focus primarily on content compatibility, often ignoring the users' …