CrossPoint: Self-supervised cross-modal contrastive learning for 3D point cloud understanding

M Afham, I Dissanayake… - Proceedings of the …, 2022 - openaccess.thecvf.com
Manual annotation of large-scale point cloud dataset for varying tasks such as 3D object
classification, segmentation and detection is often laborious owing to the irregular structure …

Temporal query networks for fine-grained video understanding

C Zhang, A Gupta, A Zisserman - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Our objective in this work is fine-grained classification of actions in untrimmed videos, where
the actions may be temporally extended or may span only a few frames of the video. We cast …

Self-supervised video representation learning by context and motion decoupling

L Huang, Y Liu, B Wang, P Pan… - Proceedings of the …, 2021 - openaccess.thecvf.com
A key challenge in self-supervised video representation learning is how to effectively
capture motion information besides context bias. While most existing works implicitly …

Self-supervised motion perception for spatiotemporal representation learning

C Liu, Y Yao, D Luo, Y Zhou… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
In this study, we propose a novel pretext task and a self-supervised motion perception (SMP)
method for spatiotemporal representation learning. The pretext task is defined as video …

Repeat and learn: Self-supervised visual representations learning by Repeated Scene Localization

H Altabrawee, MHM Noor - Pattern Recognition, 2024 - Elsevier
Large labeled datasets are crucial for video understanding progress. However, the labeling
process is time-consuming, expensive, and tiresome. To overcome this impediment, various …

STCLR: Sparse Temporal Contrastive Learning for Video Representation

H Altabrawee, MHM Noor - Neurocomputing, 2025 - Elsevier
Temporal Contrastive Learning for Video Representation (TCLR) is the first
contrastive framework that uses temporal losses to enforce the temporal distinctiveness of …

Boosting video representation learning with multi-faceted integration

Z Qiu, T Yao, CW Ngo, XP Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Video content is multifaceted, consisting of objects, scenes, interactions or actions. The
existing datasets mostly label only one of the facets for model training, resulting in the video …

Temporal transformer networks with self-supervision for action recognition

Y Zhang, J Li, N Jiang, G Wu, H Zhang… - IEEE Internet of …, 2023 - ieeexplore.ieee.org
In recent years, Internet of Things (IoT) has made rapid development, and IoT devices are
developing toward intelligence. IoT terminal devices represented by surveillance cameras …

Collaboratively Self-supervised Video Representation Learning for Action Recognition

J Zhang, Z Wan, L Hu, S Lin, S Wu… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
Considering the close connection between action recognition and human pose estimation,
we design a Collaboratively Self-supervised Video Representation (CSVR) learning …

Benchmarking self-supervised video representation learning

A Kumar, A Kumar, ZH Sia, V Vineet, YS Rawat - 2024 - openreview.net
Self-supervised learning is an effective way for label-free model pre-training, especially in
the video domain where labeling is expensive. Existing self-supervised works in the video …