Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A metaverse: Taxonomy, components, applications, and open challenges

SM Park, YG Kim - IEEE Access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …

[PDF] Large-scale domain-specific pretraining for biomedical vision-language processing

S Zhang, Y Xu, N Usuyama, J Bagga… - arXiv preprint arXiv …, 2023 - researchgate.net
Contrastive pretraining on parallel image-text data has attained great success in vision-
language processing (VLP), as exemplified by CLIP and related methods. However, prior …

Focal self-attention for local-global interactions in vision transformers

J Yang, C Li, P Zhang, X Dai, B Xiao, L Yuan… - arXiv preprint arXiv …, 2021 - arxiv.org
Recently, Vision Transformer and its variants have shown great promise on various
computer vision tasks. The ability of capturing short- and long-range visual dependencies …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

ViTAE: Vision transformer advanced by exploring intrinsic inductive bias

Y Xu, Q Zhang, J Zhang, D Tao - Advances in neural …, 2021 - proceedings.neurips.cc
Transformers have shown great potential in various computer vision tasks owing to their
strong capability in modeling long-range dependency using the self-attention mechanism …

Mental health analysis in social media posts: a survey

M Garg - Archives of Computational Methods in Engineering, 2023 - Springer
The surge in internet use to express personal thoughts and beliefs makes it increasingly
feasible for the social NLP research community to find and validate associations between …