Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion
In this work, we investigate the task of text-to-image (T2I) synthesis under the
abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts …
Relational context learning for human-object interaction detection
Recent state-of-the-art methods for HOI detection typically build on transformer architectures
with two decoder branches, one for human-object pair detection and the other for interaction …
Empowering dynamics-aware text-to-video diffusion with large language models
Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the
recently emerged diffusion models (DMs) have promisingly shown stronger performance …
Graph neural networks in vision-language image understanding: A survey
2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
Being able to understand visual scenes is a precursor for many downstream tasks including
autonomous driving, robotics, and other vision-based approaches. A common approach …
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
Text-to-video (T2V) synthesis has gained increasing attention in the community, in
which the recently emerged diffusion models (DMs) have promisingly shown stronger …
Generalized unbiased scene graph generation
Existing Unbiased Scene Graph Generation (USGG) methods only focus on addressing the
predicate-level imbalance that high-frequency classes dominate predictions of rare ones …
Hi-SIGIR: Hierachical Semantic-Guided Image-to-image Retrieval via Scene Graph
Image-to-image retrieval, a fundamental task, aims at matching similar images based on a
query image. Existing methods with convolutional neural networks are usually sensitive to …
Learning multimodal relationship interaction for visual relationship detection
Z Liu, WS Zheng - Pattern Recognition, 2022 - Elsevier
Visual relationship detection aims to recognize visual relationships in scenes as triplets
<subject-predicate-object>. Previous works have shown remarkable progress by introducing …