Data-driven materials research enabled by natural language processing and information extraction

EA Olivetti, JM Cole, E Kim, O Kononova… - Applied Physics …, 2020 - pubs.aip.org
Given the emergence of data science and machine learning throughout all aspects of
society, but particularly in the scientific domain, there is increased importance placed on …

Don't stop pretraining: Adapt language models to domains and tasks

S Gururangan, A Marasović, S Swayamdipta… - arXiv preprint arXiv …, 2020 - arxiv.org
Language models pretrained on text from a wide variety of sources form the foundation of
today's NLP. In light of the success of these broad-coverage models, we investigate whether …

A discipline-wide investigation of the replicability of Psychology papers over the past two decades

W Youyou, Y Yang, B Uzzi - Proceedings of the National …, 2023 - National Acad Sciences
Conjecture about the weak replicability in social sciences has made scholars eager to
quantify the scale and scope of replication failure for a discipline. Yet small-scale manual …

Better with less: A data-active perspective on pre-training graph neural networks

J Xu, R Huang, X Jiang, Y Cao… - Advances in …, 2023 - proceedings.neurips.cc
Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for
downstream tasks with unlabeled data, and it has recently become an active research area …

Unsupervised domain adaptation of contextualized embeddings for sequence labeling

X Han, J Eisenstein - arXiv preprint arXiv:1904.02817, 2019 - arxiv.org
Contextualized word embeddings such as ELMo and BERT provide a foundation for strong
performance across a wide range of natural language processing tasks by pretraining on …

Code and named entity recognition in StackOverflow

J Tabassum, M Maddela, W Xu, A Ritter - arXiv preprint arXiv:2005.01634, 2020 - arxiv.org
There is an increasing interest in studying natural language and computer code together, as
large corpora of programming texts become readily available on the Internet. For example …

A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

L Campillos-Llanos, A Valverde-Mateos… - BMC medical informatics …, 2021 - Springer
Background: The large volume of medical literature makes it difficult for healthcare
professionals to keep abreast of the latest studies that support Evidence-Based Medicine …

Domain adaptation for deep entity resolution

J Tu, J Fan, N Tang, P Wang, C Chai, G Li… - Proceedings of the …, 2022 - dl.acm.org
Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA)
results on ER are achieved by deep learning (DL) based methods, trained with a lot of …

An effective transition-based model for discontinuous NER

X Dai, S Karimi, B Hachey, C Paris - arXiv preprint arXiv:2004.13454, 2020 - arxiv.org
Unlike widely used Named Entity Recognition (NER) data sets in generic domains,
biomedical NER data sets often contain mentions consisting of discontinuous spans …

Scaling laws for downstream task performance of large language models

B Isik, N Ponomareva, H Hazimeh, D Paparas… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws provide important insights that can guide the design of large language models
(LLMs). Existing work has primarily focused on studying scaling laws for pretraining …