Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

Semantic communications for future internet: Fundamentals, applications, and challenges

W Yang, H Du, ZQ Liew, WYB Lim… - … Surveys & Tutorials, 2022 - ieeexplore.ieee.org
With the increasing demand for intelligent services, the sixth-generation (6G) wireless
networks will shift from a traditional architecture that focuses solely on a high transmission …

CrossPoint: Self-supervised cross-modal contrastive learning for 3D point cloud understanding

M Afham, I Dissanayake… - Proceedings of the …, 2022 - openaccess.thecvf.com
Manual annotation of large-scale point cloud dataset for varying tasks such as 3D object
classification, segmentation and detection is often laborious owing to the irregular structure …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

MDETR - Modulated detection for end-to-end multi-modal understanding

A Kamath, M Singh, Y LeCun… - Proceedings of the …, 2021 - openaccess.thecvf.com
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of
interest from the image. However, this crucial module is typically used as a black box …

AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Large-scale adversarial training for vision-and-language representation learning

Z Gan, YC Chen, L Li, C Zhu… - Advances in Neural …, 2020 - proceedings.neurips.cc
We present VILLA, the first known effort on large-scale adversarial training for
vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task …