A survey on hallucination in large vision-language models

H Liu, W Xue, Y Chen, D Chen, X Zhao, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for its practical implementation potential. However …

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions, which current multimodal systems largely struggle to imitate. In this work …

GPT4RoI: Instruction tuning large language model on region-of-interest

S Zhang, P Sun, S Chen, M Xiao, W Shao… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction tuning large language models (LLMs) on image-text pairs has achieved
unprecedented vision-language multimodal abilities. However, their vision-language …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …

Capsfusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

Textdiffuser-2: Unleashing the power of language models for text rendering

J Chen, Y Huang, T Lv, L Cui, Q Chen, F Wei - European Conference on …, 2024 - Springer
The diffusion model has been proven a powerful generative model in recent years, yet
generating visual text remains a challenge. Although existing work has endeavored to …

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

S Chen, X Chen, C Zhang, M Li, G Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent progress in Large Multimodal Models (LMM) has opened up great
possibilities for various applications in the field of human-machine interactions. However …

Osprey: Pixel understanding with visual instruction tuning

Y Yuan, W Li, J Liu, D Tang, X Luo… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have recently achieved impressive
general-purpose vision-language capabilities through visual instruction tuning. However, current …

See, Say, and Segment: Teaching LMMs to Overcome False Premises

TH Wu, G Biamby, D Chan, L Dunlap… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current open-source Large Multimodal Models (LMMs) excel at tasks such as
open-vocabulary language grounding and segmentation but can suffer under false premises …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …