Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

F Cocchi, N Moratelli, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle
multimodal inputs, combining text and image data. They have recently garnered attention …