Multimodal transformer with multi-view visual representation for image captioning

J Yu, J Li, Z Yu, Q Huang - … on circuits and systems for video …, 2019 - ieeexplore.ieee.org
Image captioning aims to automatically generate a natural language description of a given
image, and most state-of-the-art models have adopted an encoder-decoder framework. The …

X-net: a dual encoding–decoding method in medical image segmentation

Y Li, Z Wang, L Yin, Z Zhu, G Qi, Y Liu - The Visual Computer, 2023 - Springer
Medical image segmentation has a priori guiding significance for clinical diagnosis and
treatment. In the past ten years, a large number of experimental facts have proved the great …

M-FFN: multi-scale feature fusion network for image captioning

J Prudviraj, C Vishnu, CK Mohan - Applied Intelligence, 2022 - Springer
In this work, we present a novel multi-scale feature fusion network (M-FFN) for the image
captioning task to incorporate discriminative features and scene contextual information of an …

Bi-box regression for pedestrian detection and occlusion estimation

C Zhou, J Yuan - … of the European Conference on Computer …, 2018 - openaccess.thecvf.com
Occlusions present a great challenge for pedestrian detection in practical applications. In
this paper, we propose a novel approach to simultaneous pedestrian detection and …

Learning transferable human-object interaction detector with natural language supervision

S Wang, Y Duan, H Ding, YP Tan… - Proceedings of the …, 2022 - openaccess.thecvf.com
It is difficult to construct a data collection including all possible combinations of human
actions and interacting objects due to the combinatorial nature of human-object interactions …

Spatiotemporal multimodal learning with 3D CNNs for video action recognition

H Wu, X Ma, Y Li - IEEE Transactions on Circuits and Systems …, 2021 - ieeexplore.ieee.org
Extracting effective spatial-temporal information is significantly important for video-based
action recognition. Recently 3D convolutional neural networks (3D CNNs) that could …

Action-stage emphasized spatiotemporal VLAD for video action recognition

Z Tu, H Li, D Zhang, J Dauwels, B Li… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Despite outstanding performance in image recognition, convolutional neural networks
(CNNs) do not yet achieve the same impressive results on action recognition in videos. This …

From artifact removal to super-resolution

J Wang, Z Shao, X Huang, T Lu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Deep-learning-based super-resolution (SR) methods have been extensively studied and
have achieved significant performance with deep convolutional neural networks. However …

Motion-driven visual tempo learning for video-based action recognition

Y Liu, J Yuan, Z Tu - IEEE Transactions on Image Processing, 2022 - ieeexplore.ieee.org
Action visual tempo characterizes the dynamics and the temporal scale of an action, which is
helpful to distinguish human actions that share high similarities in visual dynamics and …

Remote sensing image defogging networks based on dual self-attention boost residual octave convolution

Z Zhu, Y Luo, G Qi, J Meng, Y Li, N Mazur - Remote Sensing, 2021 - mdpi.com
Remote sensing images have been widely used in military, national defense, disaster
emergency response, ecological environment monitoring, among other applications …