Cross-modal retrieval: a systematic review of methods and future directions
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …
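Most methods in this line of work embed both modalities into a shared vector space and rank by similarity; a minimal NumPy sketch of that retrieval step follows (the random vectors are placeholders for the output of any real dual-encoder such as CLIP):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize rows to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_modal_search(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 5):
    """Rank gallery items (e.g. images) for a query from another modality (e.g. text).

    Assumes both embeddings already live in a shared space learned by a
    dual-encoder model; here they are random stand-ins.
    """
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q                      # cosine similarity of each gallery item to the query
    topk = np.argsort(-sims)[:k]      # indices of the k most similar items
    return topk, sims[topk]

# Placeholder embeddings: one text query vs. 1000 image vectors of dimension 512.
rng = np.random.default_rng(0)
text_query = rng.normal(size=512)
image_gallery = rng.normal(size=(1000, 512))
indices, scores = cross_modal_search(text_query, image_gallery)
print(indices, scores)
```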
mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities
in executing instructions for a variety of single-image tasks. Despite this progress, significant …
Grounding language models for visual entity recognition
We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our
model extends an autoregressive Multimodal Large Language Model by employing retrieval …
EchoSight: Advancing Visual-Language Models with Wiki Knowledge
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions
about images using extensive background knowledge. Despite significant advancements …
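A rough sketch of the retrieve-then-answer pattern these KVQA systems share is shown below; `embed`, `retrieve`, and `answer` are hypothetical stubs for illustration, not EchoSight's actual API:

```python
import numpy as np

def embed(x) -> np.ndarray:
    """Hypothetical encoder stub; a real system would use a visual or text encoder."""
    rng = np.random.default_rng(abs(hash(str(x))) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

def retrieve(query, passages, k=3):
    """Score knowledge-base passages against the query and keep the top-k."""
    q = embed(query)
    return sorted(passages, key=lambda p: -float(embed(p) @ q))[:k]

def answer(image, question, knowledge_base):
    """Retrieve background passages, then hand them to a reader model (stubbed)."""
    context = retrieve((image, question), knowledge_base)
    prompt = f"Context: {' '.join(context)}\nQuestion: {question}\nAnswer:"
    return prompt  # a real system would pass this prompt to an LVLM / reader model

kb = ["The Eiffel Tower opened in 1889.",
      "Mount Fuji is 3776 m tall.",
      "CLIP was released in 2021."]
print(answer("photo.jpg", "When did this tower open?", kb))
```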
VACoDe: Visual Augmented Contrastive Decoding
Despite the astonishing performance of recent Large Vision-Language Models (LVLMs),
these models often generate inaccurate responses. To address this issue, previous studies …
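Visual contrastive decoding methods in this vein typically re-weight next-token logits by subtracting a prediction computed from a distorted or augmented view of the input. A minimal sketch of that logit combination, following the common (1+α)/α formulation rather than necessarily VACoDe's exact variant:

```python
import numpy as np

def contrastive_logits(logits_clean: np.ndarray,
                       logits_distorted: np.ndarray,
                       alpha: float = 1.0) -> np.ndarray:
    """Combine logits from the original input and a distorted/augmented view.

    Tokens the model prefers even when the visual evidence is degraded get
    down-weighted, which is the intuition behind visual contrastive decoding.
    """
    return (1.0 + alpha) * logits_clean - alpha * logits_distorted

def next_token(logits: np.ndarray) -> int:
    """Greedy pick over the adjusted logits (sampling would also work)."""
    return int(np.argmax(logits))

clean = np.array([0.1, 2.0, 0.3, 1.8, -1.0, 0.0, 0.5, 0.2])
distorted = np.array([0.1, 2.1, 0.3, 0.2, -1.0, 0.0, 0.5, 0.2])  # hallucination-prone view
print(next_token(contrastive_logits(clean, distorted)))  # picks token 3, favored only by the clean view
```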
Taming CLIP for Fine-Grained and Structured Visual Understanding of Museum Exhibits
CLIP is a powerful and widely used tool for understanding images in the context of natural
language descriptions to perform nuanced tasks. However, it does not offer application …
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
While large pre-trained visual-language models have shown promising results on traditional
visual question answering benchmarks, it is still challenging for them to answer complex …
SPLATE: Sparse late interaction retrieval
The late interaction paradigm introduced with ColBERT stands out in the neural Information
Retrieval space, offering a compelling effectiveness-efficiency trade-off across many …
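For context, ColBERT's late interaction scores a query against a document by summing, over query tokens, each token's maximum similarity to any document token (MaxSim). The sketch below shows only that standard dense scoring, not SPLATE's sparse approximation of it:

```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: sum over query tokens of the best-matching doc token.

    query_tokens: (n_q, d) unit-normalized token embeddings
    doc_tokens:   (n_d, d) unit-normalized token embeddings
    """
    sim = query_tokens @ doc_tokens.T     # (n_q, n_d) cosine similarities
    return float(sim.max(axis=1).sum())   # best doc match per query token, then sum

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))
d = rng.normal(size=(30, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```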
Large language models know what is key visual entity: An LLM-assisted multimodal retrieval for VQA
Visual question answering (VQA) tasks, often performed by visual language models (VLMs),
face challenges with long-tail knowledge. Recent retrieval-augmented VQA (RA-VQA) …
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
With the breakthrough of multi-modal large language models (MLLMs), answering complex
visual questions that demand advanced reasoning abilities and world knowledge has …