Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

A deep cross-modal neural cognitive diagnosis framework for modeling student performance

L Song, M He, X Shang, C Yang, J Liu, M Yu… - Expert Systems with …, 2023 - Elsevier
In intelligent education systems, one fundamental task is to predict student performance on
new exercises and estimate the knowledge proficiency of students on knowledge concepts …

Cross-modal contrastive learning for domain adaptation in 3D semantic segmentation

B Xing, X Ying, R Wang, J Yang, T Chen - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Abstract Domain adaptation for 3D point cloud has attracted a lot of interest since it can
avoid the time-consuming labeling process of 3D data to some extent. A recent work named …

Ta-Adapter: Enhancing few-shot CLIP with task-aware encoders

W Zhang, Y Zhang, Y Deng, W Zhang, J Lin, B Huang… - Pattern Recognition, 2024 - Elsevier
Abstract Contrastive Language-Image Pre-training (CLIP) has shown impressive zero-shot
transfer capabilities, but its potential for specific downstream tasks is not fully utilized. To …

Evaluating out-of-distribution performance on document image classifiers

S Larson, YYG Lim, Y Ai, D Kuang… - Advances in Neural …, 2022 - proceedings.neurips.cc
The ability of a document classifier to handle inputs that are drawn from a distribution
different from the training distribution is crucial for robust deployment and generalizability …

F-SCP: An automatic prompt generation method for specific classes based on visual language pre-training models

B Han, X Jiang, Z Fang, H Fujita, Y Gao - Pattern Recognition, 2024 - Elsevier
The zero-shot classification performance of large-scale vision-language pre-training models
(e.g., CLIP, BLIP and ALIGN) can be enhanced by incorporating a prompt (e.g., “a photo of a …

Visually-Rich Document Understanding: Concepts, Taxonomy and Challenges

A Sassioui, R Benouini, Y El Ouargui… - … Networks and Mobile …, 2023 - ieeexplore.ieee.org
The increasing prevalence of Visually-rich Documents (VRDs) in diverse domains has led to
a growing interest in Visually-rich Document Understanding (VrDU). Researchers have …

Multi-schema prompting powered token-feature woven attention network for short text classification

Z Cai, H Zhang, P Zhan, X Jia, Y Yan, X Song, B Xie - Pattern Recognition, 2024 - Elsevier
Short text classification task poses challenges in natural language processing due to
insufficient contextual information. This task is typically approached by extracting rich …

Enhancing automatic placenta analysis through distributional feature recomposition in vision-language contrastive learning

Y Pan, T Cai, M Mehta, AD Gernand… - … Conference on Medical …, 2023 - Springer
The placenta is a valuable organ that can aid in understanding adverse events during
pregnancy and predicting issues post-birth. Manual pathological examination and report …

Beyond Document Page Classification: Design, Datasets, and Challenges

J Van Landeghem, S Biswas… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper highlights the need to bring document classification benchmarking closer to
real-world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi …