Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
AI assisted fashion design: A review
This review explores the integration of enhanced personalization and seamless multimodal
interfaces in the field of fashion design and recommendation. We examine the increasing …
Pic2word: Mapping pictures to words for zero-shot composed image retrieval
In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of CIR models …
VLT: Vision-language transformer and query generation for referring segmentation
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …
Covr: Learning composed video retrieval from web video captions
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers
both text and image queries together, to search for relevant images in a database. Most …
Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including
cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image …
You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Two primary input modalities prevail in image retrieval: sketch and text. While text is widely
used for inter-category retrieval tasks, sketches have been established as the sole preferred …
Multilateral semantic relations modeling for image text retrieval
Z Wang, Z Gao, K Guo, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text retrieval is a fundamental task to bridge vision and language by exploiting
various strategies to fine-grained alignment between regions and words. This is still tough …
Chatting makes perfect: Chat-based image retrieval
Chats emerge as an effective user-friendly approach for information retrieval, and are
successfully employed in many domains, such as customer service, healthcare, and finance …
Target-guided composed image retrieval
Composed image retrieval (CIR) is a new and flexible image retrieval paradigm, which can
retrieve the target image for a multimodal query, including a reference image and its …