A survey on hallucination in large vision-language models
Recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for its practical implementation potential. However, …
Generative multimodal models are in-context learners
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions, which current multimodal systems largely struggle to imitate. In this work …
Gpt4roi: Instruction tuning large language model on region-of-interest
Instruction tuning large language models (LLMs) on image-text pairs has achieved
unprecedented vision-language multimodal abilities. However, their vision-language …
Ferret: Refer and ground anything anywhere at any granularity
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …
Capsfusion: Rethinking image-text data at scale
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …
Textdiffuser-2: Unleashing the power of language models for text rendering
The diffusion model has proven to be a powerful generative model in recent years, yet it
still struggles to generate visual text. Although existing work has endeavored to …
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Recent progress in Large Multimodal Models (LMMs) has opened up great
possibilities for various applications in the field of human-machine interactions. However, …
Osprey: Pixel understanding with visual instruction tuning
Multimodal large language models (MLLMs) have recently achieved impressive general-
purpose vision-language capabilities through visual instruction tuning. However, current …
See Say and Segment: Teaching LMMs to Overcome False Premises
Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-
vocabulary language grounding and segmentation but can suffer under false premises …
The (r)evolution of multimodal large language models: A survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …