A review of deep learning for video captioning
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …
Beyond supervised learning for pervasive healthcare
The integration of machine/deep learning and sensing technologies is transforming
healthcare and medical practice. However, inherent limitations in healthcare data, namely …
TCTrack: Temporal contexts for aerial tracking
Temporal contexts among consecutive frames are far from being fully utilized in existing
visual trackers. In this work, we present TCTrack, a comprehensive framework to fully exploit …
Disentangling spatial and temporal learning for efficient image-to-video transfer learning
Recently, large-scale pre-trained language-image models like CLIP have shown
extraordinary capabilities for understanding spatial contents, but naively transferring such …
Inherent redundancy in spiking neural networks
Spiking Neural Networks (SNNs) are well known as a promising energy-efficient
alternative to conventional artificial neural networks. Subject to the preconceived impression …
Towards real-world visual tracking with temporal contexts
Visual tracking has made significant improvements in the past few decades. Most existing
state-of-the-art trackers 1) merely aim for performance in ideal conditions while overlooking …
MAR: Masked autoencoders for efficient action recognition
Standard approaches for video action recognition usually operate on full input videos, which
is inefficient due to the widespread spatio-temporal redundancy in videos. The recent …
Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip
By mimicking the neurons and synapses of the human brain and employing spiking neural
networks on neuromorphic chips, neuromorphic computing offers a promising energy …
CDC-YOLOFusion: Leveraging cross-scale dynamic convolution fusion for visible-infrared object detection
Feature-level fusion methods have demonstrated superior performance for visible-infrared
object detection due to the deep exploration of visible and infrared features. However, most …
Transformer meets remote sensing video detection and tracking: A comprehensive survey
Transformer has shown excellent performance in remote sensing field with long-range
modeling capabilities. Remote sensing video (RSV) moving object detection and tracking …