The revolution of multimodal large language models: a survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

VMamba: Visual state space model

Y Liu, Y Tian, Y Zhao, H Yu, L Xie… - Advances in neural …, 2025 - proceedings.neurips.cc
Designing computationally efficient network architectures remains an ongoing necessity in
computer vision. In this paper, we adapt Mamba, a state-space language model, into …

Mamba-ND: Selective state space modeling for multi-dimensional data

S Li, H Singh, A Grover - European Conference on Computer Vision, 2024 - Springer
In recent years, Transformers have become the de-facto architecture for sequence modeling
on text and multi-dimensional data, such as images and video. However, the use of self …

Spatial transform decoupling for oriented object detection

H Yu, Y Tian, Q Ye, Y Liu - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks.
However, their potential in rotation-sensitive scenarios has not been fully explored, and this …

Improving pixel-based MIM by reducing wasted modeling capability

Y Liu, S Zhang, J Chen, Z Yu… - Proceedings of the …, 2023 - openaccess.thecvf.com
There has been significant progress in Masked Image Modeling (MIM). Existing MIM
methods can be broadly categorized into two groups based on the reconstruction target …

Structured adversarial self-supervised learning for robust object detection in remote sensing images

C Zhang, KM Lam, T Liu, YL Chan… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Object detection plays a crucial role in scene understanding and has extensive practical
applications. In the field of remote sensing object detection, both detection accuracy and …

VideoMAC: Video masked autoencoders meet ConvNets

G Pei, T Chen, X Jiang, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently the advancement of self-supervised learning techniques like masked
autoencoders (MAE) has greatly influenced visual representation learning for images and …

vHeat: Building vision models upon heat conduction

Z Wang, Y Liu, Y Liu, H Yu, Y Wang, Q Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
A fundamental problem in learning robust and expressive visual representations lies in
efficiently estimating the spatial relationships of visual semantics throughout the entire …

Efficient analysis of deep neural networks for vision via biologically-inspired receptive field angles: An in-depth survey

Y Ma, M Yu, H Lin, C Liu, M Hu, Q Song - Information Fusion, 2024 - Elsevier
Efficient feature extraction is a pivotal requirement for Deep Neural Network (DNN) models,
particularly in the realm of visual tasks where effective feature extraction relies on well …

Visual detection algorithm for enhanced environmental perception of unmanned surface vehicles in complex marine environments

K Dong, T Liu, Y Zheng, Z Shi, H Du… - Journal of Intelligent & …, 2024 - Springer
Unmanned surface vehicles (USVs) are distinguished by their intelligence, compactness,
and absence of human casualties, making them a vital component of the maritime industry …