Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Causal reasoning meets visual representation learning: A prospective study

Y Liu, YS Wei, H Yan, GB Li, L Lin - Machine Intelligence Research, 2022 - Springer
Visual representation learning is ubiquitous in various real-world applications, including
visual comprehension, video understanding, multi-modal analysis, human-computer …

Invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …

Counterfactual VQA: A cause-effect look at language bias

Y Niu, K Tang, H Zhang, Z Lu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent VQA models may tend to rely on language bias as a shortcut and thus fail to
sufficiently learn the multi-modal knowledge from both vision and language. In this paper …

Cross-modal causal relational reasoning for event-level visual question answering

Y Liu, G Li, L Lin - IEEE Transactions on Pattern Analysis and …, 2023 - ieeexplore.ieee.org
Existing visual question answering methods often suffer from cross-modal spurious
correlations and oversimplified event-level reasoning processes that fail to capture event …

Interventional video grounding with dual contrastive learning

G Nan, R Qiao, Y Xiao, J Liu, S Leng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Video grounding aims to localize a moment from an untrimmed video for a given textual
query. Existing approaches focus more on the alignment of visual and language stimuli with …

Alleviating structural distribution shift in graph anomaly detection

Y Gao, X Wang, X He, Z Liu, H Feng… - Proceedings of the …, 2023 - dl.acm.org
Graph anomaly detection (GAD) is a challenging binary classification problem due to its
different structural distribution between anomalies and normal nodes---abnormal nodes are …

Exposing and mitigating spurious correlations for cross-modal retrieval

JM Kim, A Koepke, C Schmid… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Cross-modal retrieval methods are the preferred tool to search databases for the text that
best matches a query image and vice versa. However, image-text retrieval models commonly …

Counterfactual contrastive learning for weakly-supervised vision-language grounding

Z Zhang, Z Zhao, Z Lin, X He - Advances in Neural …, 2020 - proceedings.neurips.cc
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …

Learning to contrast the counterfactual samples for robust visual question answering

Z Liang, W Jiang, H Hu, J Zhu - Proceedings of the 2020 …, 2020 - aclanthology.org
In the task of Visual Question Answering (VQA), most state-of-the-art models tend to learn
spurious correlations in the training set and achieve poor performance in out-of-distribution …