Backbones-review: Feature extraction networks for deep learning and deep reinforcement learning approaches

O Elharrouss, Y Akbari, N Almaadeed… - arXiv preprint arXiv …, 2022 - arxiv.org
To understand the real world using various types of data, Artificial Intelligence (AI) is the
most used technique nowadays. While finding the pattern within the analyzed data …

Dynamic neural networks: A survey

Y Han, G Huang, S Song, L Yang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Dynamic neural network is an emerging research topic in deep learning. Compared to static
models which have fixed computational graphs and parameters at the inference stage …

X3D: Expanding architectures for efficient video recognition

C Feichtenhofer - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com
This paper presents X3D, a family of efficient video networks that progressively expand a
tiny 2D image classification architecture along multiple network axes, in space, time, width …

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Beyond robustness: A taxonomy of approaches towards resilient multi-robot systems

A Prorok, M Malencia, L Carlone, GS Sukhatme… - arXiv preprint arXiv …, 2021 - arxiv.org
Robustness is key to engineering, automation, and science as a whole. However, the
property of robustness is often underpinned by costly requirements such as over …

Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models

W Wu, X Wang, H Luo, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have
demonstrated impressive transferability on various visual tasks. Transferring knowledge …

UniFormerV2: Unlocking the potential of image ViTs for video understanding

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
The prolific performances of Vision Transformers (ViTs) in image tasks have prompted
research into adapting the image ViTs for video tasks. However, the substantial gap …

Listen to look: Action recognition by previewing audio

R Gao, TH Oh, K Grauman… - Proceedings of the …, 2020 - openaccess.thecvf.com
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly
impractical. We propose a framework for efficient action recognition in untrimmed video that …

Revisiting classifier: Transferring vision-language models for video recognition

W Wu, Z Sun, W Ouyang - Proceedings of the AAAI conference on …, 2023 - ojs.aaai.org
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is
an important topic in computer vision research. Along with the growth of computational …

Smart frame selection for action recognition

SN Gowda, M Rohrbach, L Sevilla-Lara - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Video classification is computationally expensive. In this paper, we address the problem of
frame selection to reduce the computational cost of video classification. Recent work has …