Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective
Graph Neural Networks (GNNs) have gained momentum in graph representation learning
and boosted the state of the art in a variety of areas, such as data mining (e.g., social network …
ViperGPT: Visual inference via Python execution for reasoning
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
Large language models as commonsense knowledge for large-scale task planning
Large-scale task planning is a major challenge. Recent work exploits large language
models (LLMs) directly as a policy and shows surprisingly interesting results. This paper …
Explainability in deep reinforcement learning
A large set of the explainable Artificial Intelligence (XAI) literature is emerging on feature
relevance techniques to explain a deep neural network (DNN) output or explaining models …
Fine-grained video-text retrieval with hierarchical graph reasoning
Cross-modal retrieval between videos and texts has attracted growing attention due to the
rapid emergence of videos on the web. The current dominant approach is to learn a joint …
Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs
Most text-driven human motion generation methods employ sequential modeling
approaches, e.g., transformer, to extract sentence-level text representations automatically and …
Video as conditional graph hierarchy for multi-granular question answering
Video question answering requires the models to understand and reason about both the
complex video and language data to correctly derive the answers. Existing efforts have been …
Region-aware image captioning via interaction learning
Image captioning is one of the primary goals in computer vision, which aims to automatically
generate natural descriptions for images. Intuitively, the human visual system can notice some …
When radiology report generation meets knowledge graph
Automatic radiology report generation has been an attractive research problem towards
computer-aided diagnosis to alleviate the workload of doctors in recent years. Deep learning …