Multimodal transformer with multi-view visual representation for image captioning
Image captioning aims to automatically generate a natural language description of a given
image, and most state-of-the-art models have adopted an encoder-decoder framework. The …
image, and most state-of-the-art models have adopted an encoder-decoder framework. The …
X-net: a dual encoding–decoding method in medical image segmentation
Medical image segmentation has the priori guiding significance for clinical diagnosis and
treatment. In the past ten years, a large number of experimental facts have proved the great …
treatment. In the past ten years, a large number of experimental facts have proved the great …
M-FFN: multi-scale feature fusion network for image captioning
In this work, we present a novel multi-scale feature fusion network (M-FFN) for image
captioning task to incorporate discriminative features and scene contextual information of an …
captioning task to incorporate discriminative features and scene contextual information of an …
Bi-box regression for pedestrian detection and occlusion estimation
Occlusions present a great challenge for pedestrian detection in practical applications. In
this paper, we propose a novel approach to simultaneous pedestrian detection and …
this paper, we propose a novel approach to simultaneous pedestrian detection and …
Learning transferable human-object interaction detector with natural language supervision
It is difficult to construct a data collection including all possible combinations of human
actions and interacting objects due to the combinatorial nature of human-object interactions …
actions and interacting objects due to the combinatorial nature of human-object interactions …
Spatiotemporal multimodal learning with 3D CNNs for video action recognition
H Wu, X Ma, Y Li - IEEE Transactions on Circuits and Systems …, 2021 - ieeexplore.ieee.org
Extracting effective spatial-temporal information is significantly important for video-based
action recognition. Recently 3D convolutional neural networks (3D CNNs) that could …
action recognition. Recently 3D convolutional neural networks (3D CNNs) that could …
Action-stage emphasized spatiotemporal VLAD for video action recognition
Despite outstanding performance in image recognition, convolutional neural networks
(CNNs) do not yet achieve the same impressive results on action recognition in videos. This …
(CNNs) do not yet achieve the same impressive results on action recognition in videos. This …
From artifact removal to super-resolution
Deep-learning-based super-resolution (SR) methods have been extensively studied and
have achieved significant performance with deep convolutional neural networks. However …
have achieved significant performance with deep convolutional neural networks. However …
Motion-driven visual tempo learning for video-based action recognition
Action visual tempo characterizes the dynamics and the temporal scale of an action, which is
helpful to distinguish human actions that share high similarities in visual dynamics and …
helpful to distinguish human actions that share high similarities in visual dynamics and …
Remote sensing image defogging networks based on dual self-attention boost residual octave convolution
Remote sensing images have been widely used in military, national defense, disaster
emergency response, ecological environment monitoring, among other applications …
emergency response, ecological environment monitoring, among other applications …