Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Subject-driven text-to-image generation via apprenticeship learning

W Chen, H Hu, Y Li, N Ruiz, X Jia… - Advances in …, 2023 - proceedings.neurips.cc
Recent text-to-image generation models like DreamBooth have made remarkable progress
in generating highly customized images of a target subject, by fine-tuning an "expert …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Artificial intelligence for science in quantum, atomistic, and continuum systems

X Zhang, L Wang, J Helwig, Y Luo, C Fu, Y Xie… - arXiv preprint arXiv …, 2023 - arxiv.org
Advances in artificial intelligence (AI) are fueling a new paradigm of discoveries in natural
sciences. Today, AI has started to advance natural sciences by improving, accelerating, and …

Seeing what you said: Talking face generation guided by a lip reading expert

J Wang, X Qian, M Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions
concerning lips given coherent speech input. The previous studies revealed the importance …

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …

Effective conditioned and composed image retrieval combining CLIP-based features

A Baldrati, M Bertini, T Uricchio… - Proceedings of the …, 2022 - openaccess.thecvf.com
Conditioned and composed image retrieval extend CBIR systems by combining a query
image with an additional text that expresses the intent of the user, describing additional …

Foundations and trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

ActionCLIP: Adapting language-image pretrained models for video action recognition

M Wang, J Xing, J Mei, Y Liu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The canonical approach to video action recognition dictates a neural network model to do a
classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of …