Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Winoground: Probing vision and language models for visio-linguistic compositionality

T Thrush, R Jiang, M Bartolo, A Singh… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …

Scaling up vision-language pre-training for image captioning

X Hu, Z Gan, J Wang, Z Yang, Z Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
In recent years, we have witnessed a significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …

Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

S Changpinyo, P Sharma, N Ding… - Proceedings of the …, 2021 - openaccess.thecvf.com
The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …

CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

History aware multimodal transformer for vision-and-language navigation

S Chen, PL Guhur, C Schmid… - Advances in neural …, 2021 - proceedings.neurips.cc
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …

Soft tissue feature tracking based on deep matching network

S Lu, S Liu, P Hou, B Yang, M Liu, L Yin… - … . Model. Eng. Sci, 2023 - cdn.techscience.cn
Research in the field of medical imaging is an important part of enabling medical robots to
operate on human organs. A medical robot sits at the intersection of multiple research fields, in …
