Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
A review of deep learning for video captioning
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …
Multi-modal knowledge graph construction and application: A survey
Recent years have witnessed the resurgence of knowledge engineering which is featured
by the fast growth of knowledge graphs. However, most existing knowledge graphs are …
Text with knowledge graph augmented transformer for video captioning
Video captioning aims to describe the content of videos using natural language. Although
significant progress has been made, there is still much room to improve the performance for …
Hierarchical representation network with auxiliary tasks for video captioning and video question answering
Recently, integrating vision and language for in-depth video understanding, e.g., video
captioning and video question answering, has become a promising direction for artificial …
Multi-modal relational graph for cross-modal video moment retrieval
Given an untrimmed video and a query sentence, cross-modal video moment retrieval aims
to rank a video moment from pre-segmented video moment candidates that best matches …
Memory-based augmentation network for video captioning
Video captioning focuses on generating natural language descriptions according to the
video content. Existing works mainly explore this multimodal learning with the paired source …
Classification by attention: Scene graph classification with prior knowledge
A major challenge in scene graph classification is that the appearance of objects and
relations can be significantly different from one image to another. Previous works have …
Vision-enhanced and consensus-aware transformer for image captioning
Image captioning generates descriptions in a natural language for a given image. Due to its
great potential for a wide range of applications, many deep learning-based methods have …
Magic: Multimodal relational graph adversarial inference for diverse and unpaired text-based image captioning
Text-based image captioning (TextCap) requires simultaneous comprehension of visual
content and reading the text of images to generate a natural language description. Although …