Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle
multimodal inputs, combining text and image data. They have recently garnered attention …
multimodal inputs, combining text and image data. They have recently garnered attention …