Time does tell: Self-supervised time-tuning of dense image representations

M Salehi, E Gavves, CGM Snoek… - Proceedings of the …, 2023 - openaccess.thecvf.com
Spatially dense self-supervised learning is a rapidly growing problem domain with
promising applications for unsupervised segmentation and pretraining for dense …

Grounding language models for visual entity recognition

Z Xiao, M Gong, P Cascante-Bonilla, X Zhang… - … on Computer Vision, 2024 - Springer
We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our
model extends an autoregressive Multimodal Large Language Model by employing retrieval …

Self-supervised visual learning from interactions with objects

A Aubret, C Teulière, J Triesch - European Conference on Computer …, 2024 - Springer
Self-supervised learning (SSL) has revolutionized visual representation learning, but has
not achieved the robustness of human vision. A reason for this could be that SSL does not …

Representation learning and identity adversarial training for facial behavior understanding

M Ning, AA Salah, IO Ertugrul - arXiv preprint arXiv:2407.11243, 2024 - arxiv.org
Facial Action Unit (AU) detection has gained significant research attention as AUs contain
complex expression information. In this paper, we unpack two fundamental factors in AU …

Foundation models for video understanding: A survey

N Madan, A Møgelmose, R Modi, YS Rawat… - Authorea …, 2024 - techrxiv.org
Video Foundation Models (ViFMs) aim to develop general-purpose representations for
various video understanding tasks by leveraging large-scale datasets and powerful models …

[BOOK][B] Computer Vision-ECCV 2024: 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XXIV.

A Leonardis, E Ricci, S Roth, O Russakovsky, T Sattler… - 2024 - books.google.com
The multi-volume set of LNCS books with volume numbers 15059 up to 15147 constitutes
the refereed proceedings of the 18th European Conference on Computer Vision, ECCV …

FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning

G Wang, F Lin, T Wu, Z Liu, Z Ba, K Ren - arXiv preprint arXiv:2412.12032, 2024 - arxiv.org
This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable
facial representation that boosts various face security tasks with respect to generalization …

CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

SA Ahamed, M Gunawardhana, L David… - arXiv preprint arXiv …, 2025 - arxiv.org
Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective
spatiotemporal representations from a visual perspective, which may lead the model to …

Self-supervised Pretraining of Vision Transformers for Earth Observation

A Fuller - 2023 - repository.library.carleton.ca
Remote sensing offers vast yet sparsely labeled multimodal data but lacks foundation
models that can be leveraged across societally impactful applications. In this thesis, I …