Multimodal research in vision and language: A review of current and emerging trends
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
Causal reasoning meets visual representation learning: A prospective study
Visual representation learning is ubiquitous in various real-world applications, including
visual comprehension, video understanding, multi-modal analysis, human-computer …
Invariant grounding for video question answering
Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …
Counterfactual VQA: A cause-effect look at language bias
Recent VQA models may tend to rely on language bias as a shortcut and thus fail to
sufficiently learn the multi-modal knowledge from both vision and language. In this paper …
Cross-modal causal relational reasoning for event-level visual question answering
Existing visual question answering methods often suffer from cross-modal spurious
correlations and oversimplified event-level reasoning processes that fail to capture event …
Interventional video grounding with dual contrastive learning
Video grounding aims to localize a moment from an untrimmed video for a given textual
query. Existing approaches focus more on the alignment of visual and language stimuli with …
Alleviating structural distribution shift in graph anomaly detection
Graph anomaly detection (GAD) is a challenging binary classification problem due to its
different structural distribution between anomalies and normal nodes---abnormal nodes are …
Exposing and mitigating spurious correlations for cross-modal retrieval
Cross-modal retrieval methods are the preferred tool to search databases for the text that
best matches a query image and vice versa. However, image-text retrieval models commonly …
Counterfactual contrastive learning for weakly-supervised vision-language grounding
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …
Learning to contrast the counterfactual samples for robust visual question answering
In the task of Visual Question Answering (VQA), most state-of-the-art models tend to learn
spurious correlations in the training set and achieve poor performance in out-of-distribution …