Zero-shot composed image retrieval with textual inversion
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query
composed of a reference image and a relative caption that describes the difference between …
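For context, the core move in textual-inversion-based zero-shot CIR is to map the reference image's embedding into a pseudo-word token, splice that token into the relative caption, and retrieve with the resulting text-side query. The toy sketch below illustrates that flow under stated assumptions: the `phi` mapping matrix, the mean-pooling stand-in for a text encoder, and the random features are all illustrative, not the paper's components.

```python
# Toy sketch of textual-inversion-style query composition for zero-shot CIR.
# The 'phi' mapping, the mean-pooling "text encoder", and the random
# features are illustrative stand-ins, not the paper's actual models.
import numpy as np

rng = np.random.default_rng(0)
d = 512                                   # shared embedding dimension (assumed)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Pretend CLIP-like features: one reference image, a gallery, caption tokens.
ref_image_feat = l2norm(rng.normal(size=d))
gallery_feats  = l2norm(rng.normal(size=(1000, d)))
caption_tokens = l2norm(rng.normal(size=(6, d)))    # relative caption tokens

# Textual inversion: a small learned network maps the image feature to a
# pseudo-word token embedding S*. Here 'phi' is just a random linear map.
phi = rng.normal(size=(d, d)) / np.sqrt(d)
pseudo_token = l2norm(ref_image_feat @ phi)

# Compose "a photo of S* that <relative caption>" and encode it. A real
# system would run a frozen text encoder; mean pooling stands in for it here.
query_tokens = np.vstack([pseudo_token, caption_tokens])
query_feat = l2norm(query_tokens.mean(axis=0))

# Zero-shot retrieval: rank the gallery by cosine similarity to the query.
scores = gallery_feats @ query_feat
top10 = np.argsort(-scores)[:10]
print("top-10 gallery indices:", top10)
```

In methods of this family, the vision-and-language backbone is typically kept frozen and only the small mapping network is trained, which is what makes the retrieval zero-shot with respect to CIR supervision.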
Image retrieval on real-life images with pre-trained vision-and-language models
We extend the task of composed image retrieval, where an input query consists of an image
and a short textual description of how to modify the image. Existing methods have only been …
Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …
Fashion IQ: A new dataset towards retrieving images by natural language feedback
Conversational interfaces for the detail-oriented retail fashion domain are more natural,
expressive, and user friendly than classical keyword-based search interfaces. In this paper …
Can language models encode perceptual structure without grounding? A case study in color
Pretrained language models have been shown to encode relational information, such as the
relations between entities or concepts in knowledge bases: (Paris, Capital, France) …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …
Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering
To contribute to automating the medical vision-language model, we propose a novel Chest-
Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference …
Image retrieval from contextual descriptions
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role
in grounding the meaning of a linguistic utterance. In order to measure to what extent current …
Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval
Composed Image Retrieval (CIR) is a complex task that retrieves images using a
query, which is configured with an image and a caption that describes desired modifications …
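As a point of reference, spherical linear interpolation (slerp) between two unit vectors a and b follows the standard formula sin((1-t)θ)/sin(θ)·a + sin(tθ)/sin(θ)·b, with θ the angle between them. The minimal sketch below applies it to combine an image embedding and a caption embedding into a single training-free query vector; the weight t=0.5 and the random stand-in features are assumptions for illustration, and the paper's text-anchoring step is not shown.

```python
# Minimal sketch: slerp between an image embedding and a caption embedding to
# form a composed query. The weight t and the feature source are assumptions.
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between unit vectors a and b along the great circle."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a, b), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:                      # nearly parallel: fall back to lerp
        return a * (1.0 - t) + b * t
    return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

rng = np.random.default_rng(0)
image_feat   = rng.normal(size=512)       # stand-in for the reference image embedding
caption_feat = rng.normal(size=512)       # stand-in for the relative caption embedding

query = slerp(image_feat, caption_feat, t=0.5)   # t=0.5 is an arbitrary choice here
query = query / np.linalg.norm(query)            # normalized composed query vector
```

Because slerp stays on the unit hypersphere where contrastively trained embeddings live, it is a common training-free baseline for blending modalities before nearest-neighbor search.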
Modality-agnostic attention fusion for visual search with text feedback
Image retrieval with natural language feedback offers the promise of catalog search based
on fine-grained visual features that go beyond objects and binary attributes, facilitating real …