Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have motivated the
vision community to study their application to computer vision problems. Among their salient …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated
instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

X Li, X Yin, C Li, P Zhang, X Hu, L Zhang… - Computer Vision–ECCV …, 2020 - Springer
Large-scale pre-training methods for learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …

Diffusion-based generation, optimization, and planning in 3d scenes

S Huang, Z Wang, P Li, B Jia, T Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce SceneDiffuser, a conditional generative model for 3D scene understanding.
SceneDiffuser provides a unified model for solving scene-conditioned generation …

How much can clip benefit vision-and-language tasks?

S Shen, LH Li, H Tan, M Bansal, A Rohrbach… - arXiv preprint arXiv …, 2021 - arxiv.org
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models have been
proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT) …

Large-scale adversarial training for vision-and-language representation learning

Z Gan, YC Chen, L Li, C Zhu… - Advances in Neural …, 2020 - proceedings.neurips.cc
We present VILLA, the first known effort on large-scale adversarial training for vision-and-
language (V+L) representation learning. VILLA consists of two training stages: (i) task …

History aware multimodal transformer for vision-and-language navigation

S Chen, PL Guhur, C Schmid… - Advances in neural …, 2021 - proceedings.neurips.cc
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …