- Academic Search

Zero-shot composed image retrieval with textual inversion

A Baldrati, L Agnolucci, M Bertini… - Proceedings of the …, 2023 - openaccess.thecvf.com

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query
composed of a reference image and a relative caption that describes the difference between …

Cited by 93, 7 versions

Image retrieval on real-life images with pre-trained vision-and-language models

Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com

We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …

Cited by 191, 7 versions

Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions

J Li, K Pan, Z Ge, M Gao, W Ji, W Zhang… - The Twelfth …, 2023 - openreview.net

Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …

Cited by 69, 2 versions

Fashion IQ: A new dataset towards retrieving images by natural language feedback

H Wu, Y Gao, X Guo, Z Al-Halah… - Proceedings of the …, 2021 - openaccess.thecvf.com

Conversational interfaces for the detail-oriented retail fashion domain are more natural,
expressive, and user friendly than classical keyword-based search interfaces. In this paper …

Cited by 259, 7 versions

Can language models encode perceptual structure without grounding? a case study in color

M Abdou, A Kulmizev, D Hershcovich, S Frank… - arXiv preprint arXiv …, 2021 - arxiv.org

Pretrained language models have been shown to encode relational information, such as the
relations between entities or concepts in knowledge-bases--(Paris, Capital, France) …

Cited by 122, 5 versions

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org

We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

Cited by 15, 3 versions

Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering

X Hu, L Gu, Q An, M Zhang, L Liu, K Kobayashi… - Proceedings of the 29th …, 2023 - dl.acm.org

To contribute to automating the medical vision-language model, we propose a novel
Chest-Xray Different Visual Question Answering (VQA) task. Given a pair of main and reference …

Cited by 27, 5 versions

Image retrieval from contextual descriptions

B Krojer, V Adlakha, V Vineet, Y Goyal, E Ponti… - arXiv preprint arXiv …, 2022 - arxiv.org

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role
in grounding the meaning of a linguistic utterance. In order to measure to what extent current …

Cited by 39, 8 versions

Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval

YK Jang, D Huynh, A Shah, WK Chen… - European Conference on …, 2024 - Springer

Composed Image Retrieval (CIR) is a complex task that retrieves images using a
query, which is configured with an image and a caption that describes desired modifications …

Cited by 6, 2 versions

Modality-agnostic attention fusion for visual search with text feedback

E Dodds, J Culpepper, S Herdade, Y Zhang… - arXiv preprint arXiv …, 2020 - arxiv.org

Image retrieval with natural language feedback offers the promise of catalog search based
on fine-grained visual features that go beyond objects and binary attributes, facilitating real …

Cited by 67, 2 versions

Saved to "My library"

Neural naturalist: Generating fine-grained image comparisons

Zero-shot composed image retrieval with textual inversion
