Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion

S Wu, H Fei, H Zhang, TS Chua - Advances in Neural …, 2024 - proceedings.neurips.cc
In this work, we investigate the task of text-to-image (T2I) synthesis under the
abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts …

Relational context learning for human-object interaction detection

S Kim, D Jung, M Cho - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Recent state-of-the-art methods for HOI detection typically build on transformer architectures
with two decoder branches, one for human-object pair detection and the other for interaction …

Empowering dynamics-aware text-to-video diffusion with large language models

H Fei, S Wu, W Ji, H Zhang, TS Chua - arXiv preprint arXiv:2308.13812, 2023 - arxiv.org
Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the
recently emerged diffusion models (DMs) have promisingly shown stronger performance …

Graph neural networks in vision-language image understanding: A survey

H Senior, G Slabaugh, S Yuan, L Rossi - The Visual Computer, 2024 - Springer
2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …

HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

C Zhang, S Stepputtis, J Campbell… - Proceedings of the …, 2024 - openaccess.thecvf.com
Being able to understand visual scenes is a precursor for many downstream tasks including
autonomous driving, robotics, and other vision-based approaches. A common approach …

Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

H Fei, S Wu, W Ji, H Zhang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Text-to-video (T2V) synthesis has gained increasing attention in the community, in
which the recently emerged diffusion models (DMs) have promisingly shown stronger …

Generalized unbiased scene graph generation

X Lyu, L Gao, J Xie, P Zeng, Y Tian, J Shao… - arXiv preprint arXiv …, 2023 - arxiv.org
Existing Unbiased Scene Graph Generation (USGG) methods only focus on addressing the
predicate-level imbalance that high-frequency classes dominate predictions of rare ones …

Hi-SIGIR: Hierachical Semantic-Guided Image-to-image Retrieval via Scene Graph

Y Wang, P Dai, X Jia, Z Zeng, R Li, X Cao - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Image-to-image retrieval, a fundamental task, aims at matching similar images based on a
query image. Existing methods with convolutional neural networks are usually sensitive to …

Learning multimodal relationship interaction for visual relationship detection

Z Liu, WS Zheng - Pattern Recognition, 2022 - Elsevier
Visual relationship detection aims to recognize visual relationships in scenes as triplets
<subject-predicate-object>. Previous works have shown remarkable progress by introducing …