Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Graph neural networks for natural language processing: A survey

L Wu, Y Chen, K Shen, X Guo, H Gao… - … and Trends® in …, 2023 - nowpublishers.com
Deep learning has become the dominant approach in addressing various tasks in Natural
Language Processing (NLP). Although text inputs are typically represented as a sequence …

Foundations and trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Multi-modal sarcasm detection via cross-modal graph convolutional network

B Liang, C Lou, X Li, M Yang, L Gui, Y He… - Proceedings of the …, 2022 - aclanthology.org
With the increasing popularity of posting multimodal messages online, many recent studies
have been carried out utilizing both textual and visual information for multi-modal sarcasm …

Multi-modal graph fusion for named entity recognition with targeted visual guidance

D Zhang, S Wei, S Li, H Wu, Q Zhu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Multi-modal named entity recognition (MNER) aims to discover named entities in free text
and classify them into pre-defined types with images. However, dominant MNER models do …

SMART: Syntax-calibrated multi-aspect relation transformer for change captioning

Y Tu, L Li, L Su, ZJ Zha, Q Huang - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Change captioning aims to describe the semantic change between two similar images. In
this process, as the most typical distractor, viewpoint change leads to the pseudo changes …

On vision features in multimodal machine translation

B Li, C Lv, Z Zhou, T Zhou, T Xiao, A Ma… - arXiv preprint arXiv …, 2022 - arxiv.org
Previous work on multimodal machine translation (MMT) has focused on the way of
incorporating vision features into translation but little attention is on the quality of vision …

TSVFN: Two-stage visual fusion network for multimodal relation extraction

Q Zhao, T Gao, N Guo - Information Processing & Management, 2023 - Elsevier
Multimodal relation extraction is a critical task in information extraction, aiming to predict the
class of relations between head and tail entities from linguistic sequences and related …

Graph-based multimodal sequential embedding for sign language translation

S Tang, D Guo, R Hong, M Wang - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Sign language translation (SLT) is a challenging weakly supervised task without word-level
annotations. An effective method of SLT is to leverage multimodal complementarity and to …