Google Scholar

A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org

Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Cited by 1008

[PDF] arxiv.org

A comprehensive survey of scene graphs: Generation and application

X Chang, P Ren, P Xu, Z Li, X Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Scene graph is a structured representation of a scene that can clearly express the objects,
attributes, and relationships between objects in the scene. As computer vision technology …

Cited by 323

[PDF] thecvf.com

DiffusionDet: Diffusion model for object detection

S Chen, P Sun, Y Song, P Luo - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

We propose DiffusionDet, a new framework that formulates object detection as a denoising
diffusion process from noisy boxes to object boxes. During the training stage, object boxes …

Cited by 495

[PDF] ieee.org

Beyond transmitting bits: Context, semantics, and task-oriented communications

D Gündüz, Z Qin, IE Aguerri, HS Dhillon… - IEEE Journal on …, 2022 - ieeexplore.ieee.org

Communication systems to date primarily aim at reliably communicating bit sequences.
Such an approach provides efficient engineering designs that are agnostic to the meanings …

Cited by 452

[PDF] thecvf.com

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Cited by 72

[PDF] ieee.org

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - IEEE transactions on …, 2024 - ieeexplore.ieee.org

Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Cited by 135

[PDF] arxiv.org

Semantic communications: Principles and challenges

Z Qin, X Tao, J Lu, W Tong, GY Li - arXiv preprint arXiv:2201.01389, 2021 - arxiv.org

Semantic communication, regarded as the breakthrough beyond the Shannon paradigm,
aims at the successful transmission of semantic information conveyed by the source rather …

Cited by 400

[PDF] aaai.org

NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario

T Qian, J Chen, L Zhuo, Y Jiao, YG Jiang - Proceedings of the AAAI …, 2024 - ojs.aaai.org

We introduce a novel visual question answering (VQA) task in the context of autonomous
driving, aiming to answer natural language questions based on street-view clues. Compared …

Cited by 105

[PDF] arxiv.org

Enhancing video-language representations with structural spatio-temporal alignment

H Fei, S Wu, M Zhang, M Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

While pre-training large-scale video-language models (VLMs) has shown remarkable
potential for various downstream video-language tasks, existing VLMs can still suffer from …

Cited by 60

[HTML] sciencedirect.com

[HTML] Cpt: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier

Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

Cited by 273

Image retrieval using scene graphs

A comprehensive survey of deep learning for image captioning
