Zero-shot composed image retrieval with textual inversion

A Baldrati, L Agnolucci, M Bertini… - Proceedings of the …, 2023 - openaccess.thecvf.com
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query
composed of a reference image and a relative caption that describes the difference between …

Image retrieval on real-life images with pre-trained vision-and-language models

Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com
We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …

Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions

J Li, K Pan, Z Ge, M Gao, W Ji, W Zhang… - The Twelfth …, 2023 - openreview.net
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …

Fashion IQ: A new dataset towards retrieving images by natural language feedback

H Wu, Y Gao, X Guo, Z Al-Halah… - Proceedings of the …, 2021 - openaccess.thecvf.com
Conversational interfaces for the detail-oriented retail fashion domain are more natural,
expressive, and user friendly than classical keyword-based search interfaces. In this paper …

Can language models encode perceptual structure without grounding? a case study in color

M Abdou, A Kulmizev, D Hershcovich, S Frank… - arXiv preprint arXiv …, 2021 - arxiv.org
Pretrained language models have been shown to encode relational information, such as the
relations between entities or concepts in knowledge-bases--(Paris, Capital, France) …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering

X Hu, L Gu, Q An, M Zhang, L Liu, K Kobayashi… - Proceedings of the 29th …, 2023 - dl.acm.org
To contribute to automating the medical vision-language model, we propose a novel
Chest-Xray Different Visual Question Answering (VQA) task. Given a pair of main and reference …

Image retrieval from contextual descriptions

B Krojer, V Adlakha, V Vineet, Y Goyal, E Ponti… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role
in grounding the meaning of a linguistic utterance. In order to measure to what extent current …

Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval

YK Jang, D Huynh, A Shah, WK Chen… - European Conference on …, 2024 - Springer
Composed Image Retrieval (CIR) is a complex task that retrieves images using a
query, which is configured with an image and a caption that describes desired modifications …

Modality-agnostic attention fusion for visual search with text feedback

E Dodds, J Culpepper, S Herdade, Y Zhang… - arXiv preprint arXiv …, 2020 - arxiv.org
Image retrieval with natural language feedback offers the promise of catalog search based
on fine-grained visual features that go beyond objects and binary attributes, facilitating real …