Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks

L Wang, KJ Yoon - IEEE Transactions on Pattern Analysis and …, 2021 - ieeexplore.ieee.org
In recent years, deep neural models have been successful in almost every field, even
on the most complex problems. However, these models are huge in size, with …
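
This survey (and Gou et al. below) revolves around the student-teacher objective of Hinton et al.; a minimal PyTorch sketch, with temperature T and mixing weight alpha as illustrative hyperparameters rather than values from the paper:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soften both distributions with temperature T, then penalize the
    # KL divergence from teacher to student (the "dark knowledge" term).
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 keeps the soft term's gradient scale comparable to the hard term.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
    # Blend with the ordinary hard-label cross-entropy on ground truth.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce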

A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective

C Chen, Y Wu, Q Dai, HY Zhou, M Xu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Graph Neural Networks (GNNs) have gained momentum in graph representation learning
and boosted the state of the art in a variety of areas, such as data mining (e.g., social network …
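
The message-passing step this survey is organized around can be sketched in a few lines; the GCN-style layer below, with mean aggregation, is a generic illustration rather than any specific architecture from the survey:

import torch

class GraphConvLayer(torch.nn.Module):
    # One message-passing step: mean-aggregate neighbor features through
    # the adjacency matrix, then apply a shared linear map and nonlinearity.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) adjacency with self-loops.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = (adj @ x) / deg  # average over each node's neighborhood
        return torch.relu(self.linear(h))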

GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …
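
The unifying design the abstract alludes to can be sketched as follows: visual tokens from an image encoder are prepended to caption embeddings, and a decoder is trained with a language-modeling loss. The submodules here are caller-supplied placeholders, and attention-masking details are elided:

import torch
import torch.nn.functional as F

class ImageToTextSketch(torch.nn.Module):
    def __init__(self, image_encoder, text_embed, decoder, lm_head):
        super().__init__()
        self.image_encoder = image_encoder  # images -> (B, V, D) visual tokens
        self.text_embed = text_embed        # token ids -> (B, L, D) embeddings
        self.decoder = decoder              # transformer over (B, V+L, D)
        self.lm_head = lm_head              # (B, L, D) -> (B, L, vocab) logits

    def forward(self, images, caption_ids):
        vis = self.image_encoder(images)
        txt = self.text_embed(caption_ids)
        # Visual tokens act as a prefix; the decoder should let text attend
        # to all image tokens but only causally to earlier text (mask elided).
        h = self.decoder(torch.cat([vis, txt], dim=1))
        logits = self.lm_head(h[:, vis.size(1):])
        # Standard LM objective: position t predicts caption token t+1.
        return F.cross_entropy(logits[:, :-1].flatten(0, 1),
                               caption_ids[:, 1:].flatten())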

Knowledge distillation: A survey

J Gou, B Yu, SJ Maybank, D Tao - International Journal of Computer Vision, 2021 - Springer
In recent years, deep neural networks have been successful in both industry and academia,
especially for computer vision tasks. The great success of deep learning is mainly due to its …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
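
A distinctive ingredient of Vid2Seq is serializing event boundaries as discretized time tokens inside the output text; a rough sketch of that formatting (the bin count and token names are illustrative, not the paper's):

def dense_caption_target(events, duration, n_bins=100):
    # Map a timestamp in [0, duration] to one of n_bins special time tokens.
    def time_token(t):
        b = min(int(n_bins * t / duration), n_bins - 1)
        return f"<time_{b}>"
    # Each event contributes <start> <end> caption, concatenated in time order,
    # so boundaries and text are predicted as a single token sequence.
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start), time_token(end), caption]
    return " ".join(parts)

# e.g. dense_caption_target([(2.0, 5.5, "a dog runs"), (6.0, 9.0, "it jumps")], 10.0)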

NExT-GPT: Any-to-any multimodal LLM

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides,
they are mostly limited to input-side multimodal understanding, without the …
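
The output-side generation the abstract points at is typically achieved with lightweight projection layers around a frozen LLM; a heavily simplified sketch with placeholder dimensions, not NExT-GPT's actual sizes or API:

import torch

class ModalityBridge(torch.nn.Module):
    # Input side: project a frozen modality encoder's features into the LLM
    # embedding space. Output side: project designated LLM hidden states into
    # the conditioning space of a generation decoder (e.g. a diffusion model).
    def __init__(self, enc_dim=1024, llm_dim=4096, dec_cond_dim=768):
        super().__init__()
        self.to_llm = torch.nn.Linear(enc_dim, llm_dim)
        self.to_decoder = torch.nn.Linear(llm_dim, dec_cond_dim)

    def encode(self, modality_feats):  # (B, N, enc_dim) -> (B, N, llm_dim)
        return self.to_llm(modality_feats)

    def decode(self, llm_states):      # (B, M, llm_dim) -> (B, M, dec_cond_dim)
        return self.to_decoder(llm_states)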

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2023 - proceedings.neurips.cc
We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning

H Luo, L Ji, M Zhong, Y Chen, W Lei, N Duan, T Li - Neurocomputing, 2022 - Elsevier
Video clip retrieval and captioning play an essential role in multimodal research and
are fundamental research problems for multimodal understanding and generation. The …
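
The simplest similarity calculator studied in CLIP4Clip is parameter-free: mean-pool per-frame CLIP embeddings into one video vector and score it against the text embedding by cosine similarity. A sketch, assuming the frame and text features have already been extracted:

import torch

def video_text_similarity(frame_emb, text_emb):
    # frame_emb: (B, F, D) per-frame CLIP features; text_emb: (B, D).
    video = frame_emb.mean(dim=1)  # mean pooling over frames
    video = video / video.norm(dim=-1, keepdim=True)
    text = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return video @ text.t()  # (B, B) cosine similarities for retrieval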

SwinBERT: End-to-end transformers with sparse attention for video captioning

K Lin, L Li, CC Lin, F Ahmed, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
The canonical approach to video captioning dictates that a caption generation model learn
from offline-extracted dense video features. These feature extractors usually operate on …
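
The sparse attention in the title can be approximated by a learnable mask over video tokens with a sparsity penalty; this is a generic sketch in that spirit, not SwinBERT's exact formulation:

import torch

class LearnableSparseAttention(torch.nn.Module):
    # A sigmoid-gated mask over the (N, N) token grid scales attention in
    # logit space; a penalty on the mask's mean pushes it toward sparsity.
    def __init__(self, n_tokens):
        super().__init__()
        self.mask_logits = torch.nn.Parameter(torch.zeros(n_tokens, n_tokens))

    def forward(self, attn_scores):
        # attn_scores: (B, heads, N, N) pre-softmax attention logits.
        mask = torch.sigmoid(self.mask_logits)
        return (attn_scores + torch.log(mask + 1e-6)).softmax(dim=-1)

    def sparsity_loss(self):
        # L1-style penalty; mask values are non-negative, so the mean suffices.
        return torch.sigmoid(self.mask_logits).mean()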

End-to-end generative pretraining for multimodal video captioning

PH Seo, A Nagrani, A Arnab… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
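
MV-GPT's forward objective trains the model to generate a future utterance from the current frames and utterance; a sketch of how such pretraining pairs could be assembled from time-aligned narration (the names here are illustrative, not the paper's API):

def future_utterance_pairs(segments):
    # segments: time-ordered list of (frame_clip, utterance) tuples from
    # ASR-aligned narration. Each example pairs the frames and utterance at
    # step i with the utterance at step i+1 as the generation target.
    return [
        {"frames": segments[i][0],
         "context": segments[i][1],
         "target": segments[i + 1][1]}
        for i in range(len(segments) - 1)
    ]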