- Academic Search

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 88

[PDF] springer.com

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer

With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Cited by 197 (all 8 versions)

[PDF] neurips.cc

Subject-driven text-to-image generation via apprenticeship learning

W Chen, H Hu, Y Li, N Ruiz, X Jia… - Advances in …, 2023 - proceedings.neurips.cc

Recent text-to-image generation models like DreamBooth have made remarkable progress
in generating highly customized images of a target subject, by fine-tuning an "expert …

Cited by 165 (all 6 versions)

[PDF] ieee.org

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 640 (all 11 versions)

[PDF] arxiv.org

Artificial intelligence for science in quantum, atomistic, and continuum systems

X Zhang, L Wang, J Helwig, Y Luo, C Fu, Y Xie… - arXiv preprint arXiv …, 2023 - arxiv.org

Advances in artificial intelligence (AI) are fueling a new paradigm of discoveries in natural
sciences. Today, AI has started to advance natural sciences by improving, accelerating, and …

Cited by 125 (all 2 versions)

[PDF] thecvf.com

Seeing what you said: Talking face generation guided by a lip reading expert

J Wang, X Qian, M Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Talking face generation, also known as speech-to-lip generation, reconstructs facial motions
concerning lips given coherent speech input. The previous studies revealed the importance …

Cited by 85 (all 6 versions)

[PDF] thecvf.com

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com

Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …

Cited by 74 (all 7 versions)

[PDF] thecvf.com

Effective conditioned and composed image retrieval combining clip-based features

A Baldrati, M Bertini, T Uricchio… - Proceedings of the …, 2022 - openaccess.thecvf.com

Conditioned and composed image retrieval extend CBIR systems by combining a query
image with an additional text that expresses the intent of the user, describing additional …

Cited by 154 (all 5 versions)

[PDF] arxiv.org

Foundations and trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 152 (all 2 versions)

Actionclip: Adapting language-image pretrained models for video action recognition

M Wang, J Xing, J Mei, Y Liu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

The canonical approach to video action recognition dictates a neural network model to do a
classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of …

Cited by 40 (all 2 versions)

