Instruction tuning for large language models: A survey

S Zhang, L Dong, X Li, S Zhang, X Sun, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper surveys research works in the quickly advancing field of instruction tuning (IT),
which can also be referred to as supervised fine-tuning (SFT) …

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

Eva-clip: Improved training techniques for clip at scale

Q Sun, Y Fang, L Wu, X Wang, Y Cao - arXiv preprint arXiv:2303.15389, 2023 - arxiv.org
Contrastive language-image pre-training, CLIP for short, has gained increasing attention for
its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models …

Glamm: Pixel grounding large multimodal model

H Rasheed, M Maaz, S Shaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual …

Exploring object-centric temporal modeling for efficient multi-view 3d object detection

S Wang, Y Liu, T Wang, Y Li… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
In this paper, we propose a long-sequence modeling framework, named StreamPETR, for
multi-view 3D object detection. Built upon the sparse query design in the PETR series, we …

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

Detrs with collaborative hybrid assignments training

Z Zong, G Song, Y Liu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
In this paper, we provide the observation that too few queries assigned as positive samples
in DETR with one-to-one set matching lead to sparse supervision on the encoder's output …

Autoregressive model beats diffusion: Llama for scalable image generation

P Sun, Y Jiang, S Chen, S Zhang, B Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LlamaGen, a new family of image generation models that apply the original "next-
token prediction" paradigm of large language models to the visual generation domain. It is an …

Sam-clip: Merging vision foundation models towards semantic and spatial understanding

H Wang, PKA Vasu, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
The landscape of publicly available vision foundation models (VFMs) such as CLIP and
SAM is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their …