Foundation Models Defining a New Era in Vision: A Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi… - Advances in …, 2025 - proceedings.neurips.cc
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the …
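The snippet above contrasts CogVLM's design with the common shallow-alignment approach, in which image features are simply projected into the language model's input space. The sketch below illustrates that baseline idea (the approach being contrasted against, not CogVLM itself): pre-extracted visual features are passed through a single learned projection and prepended to the text embeddings. All module names, dimensions, and shapes here are illustrative assumptions, not actual CogVLM code.

```python
import torch
import torch.nn as nn

class ShallowAlignment(nn.Module):
    """Minimal sketch of the 'shallow alignment' baseline: image features are
    mapped into the LLM embedding space by one learned projection.
    Names and sizes are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer bridging the vision encoder and the LLM.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds: (batch, seq_len, llm_dim) from the LLM's token embedding table
        visual_tokens = self.proj(image_feats)  # (batch, num_patches, llm_dim)
        # Prepend projected visual tokens to the text sequence; the LLM then
        # attends over both with its ordinary, unmodified weights.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
bridge = ShallowAlignment()
img = torch.randn(2, 256, 1024)
txt = torch.randn(2, 32, 4096)
print(bridge(img, txt).shape)  # torch.Size([2, 288, 4096])
```

Per its title, CogVLM instead adds a trainable visual expert inside the pretrained language model, going beyond this kind of input-level projection.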

ShareGPT4V: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - … on Computer Vision, 2024 - Springer
Modality alignment serves as the cornerstone for large multi-modal models (LMMs).
However, the impact of different attributes (e.g., data type, quality, and scale) of training data …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants …

YOLO-World: Real-time open-vocabulary object detection

T Cheng, L Song, Y Ge, W Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The You Only Look Once (YOLO) series of detectors have established themselves
as efficient and practical tools. However, their reliance on predefined and trained object …

Grounded SAM: Assembling open-world models for diverse visual tasks

T Ren, S Liu, A Zeng, J Lin, K Li, H Cao, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector in
combination with the Segment Anything Model (SAM). This integration enables the detection and …
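The entry above describes a two-stage composition: an open-set detector (Grounding DINO) proposes boxes from a free-form text prompt, and a promptable segmenter (SAM) turns those boxes into masks. Below is a minimal sketch of that composition with stubbed-out detector and segmenter functions; the function names, signatures, and dummy outputs are placeholder assumptions, not the real Grounded SAM API.

```python
import numpy as np

def detect_boxes(image: np.ndarray, text_prompt: str) -> np.ndarray:
    """Stub standing in for an open-set detector such as Grounding DINO:
    given an image and a text prompt, return candidate boxes as (N, 4) xyxy.
    Here it returns a fixed dummy box; a real system would run the detector."""
    h, w = image.shape[:2]
    return np.array([[0.25 * w, 0.25 * h, 0.75 * w, 0.75 * h]])

def segment_boxes(image: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Stub standing in for a promptable segmenter such as SAM: given boxes,
    return one binary mask per box at the image's spatial resolution."""
    masks = np.zeros((len(boxes), *image.shape[:2]), dtype=bool)
    for i, (x0, y0, x1, y1) in enumerate(boxes.astype(int)):
        masks[i, y0:y1, x0:x1] = True
    return masks

def grounded_segmentation(image: np.ndarray, text_prompt: str) -> np.ndarray:
    """The composition the abstract describes: text-prompted detection followed
    by box-prompted segmentation, yielding open-vocabulary instance masks."""
    boxes = detect_boxes(image, text_prompt)
    return segment_boxes(image, boxes)

# Toy usage on a blank image.
image = np.zeros((480, 640, 3), dtype=np.uint8)
masks = grounded_segmentation(image, "a running dog")
print(masks.shape)  # (1, 480, 640)
```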

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

EVA: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B **e, Q Sun, L Wu… - Proceedings of the …, 2023‏ - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

InternImage: Exploring large-scale vision foundation models with deformable convolutions

W Wang, J Dai, Z Chen, Z Huang, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Compared to the great progress of large-scale vision transformers (ViTs) in recent years,
large-scale models based on convolutional neural networks (CNNs) are still in an early …