Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Image-text retrieval: A survey on recent research and development

M Cao, S Li, J Li, L Nie, M Zhang - arXiv preprint arXiv:2203.14713, 2022 - arxiv.org
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …

Unified contrastive learning in image-text-label space

J Yang, C Li, P Zhang, B Xiao, C Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual recognition is recently learned via either supervised learning on human-annotated
image-label data or language-image contrastive learning with webly-crawled image-text …

Towards language-free training for text-to-image generation

Y Zhou, R Zhang, C Chen, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
One of the major challenges in training text-to-image generation models is the need for a
large number of high-quality text-image pairs. While image samples are often easily …

TransVG: End-to-end visual grounding with transformers

J Deng, Z Yang, T Chen, W Zhou… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …

SeqTR: A simple yet universal network for visual grounding

C Zhu, Y Zhou, Y Shen, G Luo, X Pan, M Lin… - … on Computer Vision, 2022 - Springer
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation …

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding is a task to locate the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …

Self-supervised multimodal versatile networks

JB Alayrac, A Recasens, R Schneider… - Advances in neural …, 2020 - proceedings.neurips.cc
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …

Multi-modality cross attention network for image and sentence matching

X Wei, T Zhang, Y Li, Y Zhang… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
The key to image and sentence matching is to accurately measure the visual-semantic
similarity between an image and a sentence. However, most existing methods make use of …