A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

A comprehensive survey of scene graphs: Generation and application

X Chang, P Ren, P Xu, Z Li, X Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
A scene graph is a structured representation of a scene that can clearly express the objects,
attributes, and relationships between objects in the scene. As computer vision technology …

DiffusionDet: Diffusion model for object detection

S Chen, P Sun, Y Song, P Luo - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We propose DiffusionDet, a new framework that formulates object detection as a denoising
diffusion process from noisy boxes to object boxes. During the training stage, object boxes …

Beyond transmitting bits: Context, semantics, and task-oriented communications

D Gündüz, Z Qin, IE Aguerri, HS Dhillon… - IEEE Journal on …, 2022 - ieeexplore.ieee.org
Communication systems to date primarily aim at reliably communicating bit sequences.
Such an approach provides efficient engineering designs that are agnostic to the meanings …

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - IEEE transactions on …, 2024 - ieeexplore.ieee.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Semantic communications: Principles and challenges

Z Qin, X Tao, J Lu, W Tong, GY Li - arXiv preprint arXiv:2201.01389, 2021 - arxiv.org
Semantic communication, regarded as the breakthrough beyond the Shannon paradigm,
aims at the successful transmission of semantic information conveyed by the source rather …

NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario

T Qian, J Chen, L Zhuo, Y Jiao, YG Jiang - Proceedings of the AAAI …, 2024 - ojs.aaai.org
We introduce a novel visual question answering (VQA) task in the context of autonomous
driving, aiming to answer natural language questions based on street-view clues. Compared …

Enhancing video-language representations with structural spatio-temporal alignment

H Fei, S Wu, M Zhang, M Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
While pre-training large-scale video-language models (VLMs) has shown remarkable
potential for various downstream video-language tasks, existing VLMs can still suffer from …

CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …