Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective

C Chen, Y Wu, Q Dai, HY Zhou, M Xu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Graph Neural Networks (GNNs) have gained momentum in graph representation learning
and boosted the state of the art in a variety of areas, such as data mining (e.g., social network …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

Large language models as commonsense knowledge for large-scale task planning

Z Zhao, WS Lee, D Hsu - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Large-scale task planning is a major challenge. Recent work exploits large language
models (LLMs) directly as a policy and shows surprisingly interesting results. This paper …

Explainability in deep reinforcement learning

A Heuillet, F Couthouis, N Díaz-Rodríguez - Knowledge-Based Systems, 2021 - Elsevier
A large set of the explainable Artificial Intelligence (XAI) literature is emerging on feature
relevance techniques to explain a deep neural network (DNN) output or explaining models …

Fine-grained video-text retrieval with hierarchical graph reasoning

S Chen, Y Zhao, Q Jin, Q Wu - Proceedings of the IEEE/CVF …, 2020 - openaccess.thecvf.com
Cross-modal retrieval between videos and texts has attracted growing attentions due to the
rapid emergence of videos on the web. The current dominant approach is to learn a joint …

Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs

P Jin, Y Wu, Y Fan, Z Sun, W Yang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Most text-driven human motion generation methods employ sequential modeling
approaches, e.g., transformer, to extract sentence-level text representations automatically and …

Video as conditional graph hierarchy for multi-granular question answering

J Xiao, A Yao, Z Liu, Y Li, W Ji, TS Chua - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Video question answering requires the models to understand and reason about both the
complex video and language data to correctly derive the answers. Existing efforts have been …

Region-aware image captioning via interaction learning

AA Liu, Y Zhai, N Xu, W Nie, W Li… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Image captioning is one of the primary goals in computer vision which aims to automatically
generate natural descriptions for images. Intuitively, human visual system can notice some …

When radiology report generation meets knowledge graph

Y Zhang, X Wang, Z Xu, Q Yu, A Yuille, D Xu - Proceedings of the AAAI …, 2020 - aaai.org
Automatic radiology report generation has been an attracting research problem towards
computer-aided diagnosis to alleviate the workload of doctors in recent years. Deep learning …