Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models have been
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

VideoChat: Chat-centric video understanding

KC Li, Y He, Y Wang, Y Li, W Wang, P Luo… - arxiv preprint arxiv …, 2023 - arxiv.org
In this paper, we initiate an attempt at developing an end-to-end chat-centric video
understanding system, coined as VideoChat. It integrates video foundation models and …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

Prompting large language models with answer heuristics for knowledge-based visual question answering

Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Abstract Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …

InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arxiv preprint arxiv …, 2022 - arxiv.org
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …

InternVid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arxiv preprint arxiv …, 2023 - arxiv.org
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

LAION-5B: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in neural …, 2022 - proceedings.neurips.cc
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

GPT-4V(ision) is a generalist web agent, if grounded

B Zheng, B Gou, J Kil, H Sun, Y Su - arxiv preprint arxiv:2401.01614, 2024 - arxiv.org
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and
Gemini, has been quickly expanding the capability boundaries of multimodal models …

GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arxiv preprint arxiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …