Proteingym: Large-scale benchmarks for protein fitness prediction and design

P Notin, A Kollasch, D Ritter… - Advances in …, 2023 - proceedings.neurips.cc
Predicting the effects of mutations in proteins is critical to many applications, from
understanding genetic disease to designing novel proteins to address our most pressing …

Evaluating generalizability of artificial intelligence models for molecular datasets

Y Ektefaie, A Shen, D Bykova, MG Marin… - Nature Machine …, 2024 - nature.com
Deep learning has made rapid advances in modelling molecular sequencing data. Despite
achieving high performance on benchmarks, it remains unclear to what extent deep learning …

Peer: a comprehensive and multi-task benchmark for protein sequence understanding

M Xu, Z Zhang, J Lu, Z Zhu, Y Zhang… - Advances in …, 2022 - proceedings.neurips.cc
We are now witnessing significant progress of deep learning methods in a variety of tasks
(or datasets) of proteins. However, there is a lack of a standard benchmark to evaluate the …

[HTML][HTML] Proteingym: Large-scale benchmarks for protein design and fitness prediction

P Notin, AW Kollasch, D Ritter, L van Niekerk, S Paul… - bioRxiv, 2023 - ncbi.nlm.nih.gov
Predicting the effects of mutations in proteins is critical to many applications, from
understanding genetic disease to designing novel proteins that can address our most …

Evaluating representation learning on the protein structure universe

AR Jamasb, A Morehead, CK Joshi, Z Zhang, K Didi… - Ar**v, 2024 - pmc.ncbi.nlm.nih.gov
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation
learning on protein structures with Geometric Graph Neural Networks. We consider large …

Ten quick tips for sequence-based prediction of protein properties using machine learning

Q Hou, K Waury, D Gogishvili… - PLOS Computational …, 2022 - journals.plos.org
The ubiquitous availability of genome sequencing data explains the popularity of machine
learning-based methods for the prediction of protein properties from their amino acid …

PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications

Y Tan, M Li, Z Zhou, P Tan, H Yu, G Fan… - Journal of …, 2024 - Springer
Protein language models (PLMs) play a dominant role in protein representation learning.
Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem …

Protein engineering in the deep learning era

B Zhou, Y Tan, Y Hu, L Zheng, B Zhong, L Hong - mLife, 2024 - Wiley Online Library
Advances in deep learning have significantly aided protein engineering in addressing
challenges in industrial production, healthcare, and environmental sustainability. This …

Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs

L Robinson, T Atkinson, L Copoiu, P Bordes, T Pierrot… - bioRxiv, 2023 - biorxiv.org
Understanding protein function is vital for drug discovery, disease diagnosis, and protein
engineering. While Protein Language Models (PLMs) pre-trained on vast protein sequence …

Collectively encoding protein properties enriches protein language models

J An, X Weng - BMC bioinformatics, 2022 - Springer
Pre-trained natural language processing models on a large natural language corpus can
naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain …