Data-driven materials research enabled by natural language processing and information extraction
Given the emergence of data science and machine learning throughout all aspects of
society, but particularly in the scientific domain, there is increased importance placed on …
Don't stop pretraining: Adapt language models to domains and tasks
Language models pretrained on text from a wide variety of sources form the foundation of
today's NLP. In light of the success of these broad-coverage models, we investigate whether …
A discipline-wide investigation of the replicability of Psychology papers over the past two decades
Conjecture about the weak replicability in social sciences has made scholars eager to
quantify the scale and scope of replication failure for a discipline. Yet small-scale manual …
Better with less: A data-active perspective on pre-training graph neural networks
Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for
downstream tasks with unlabeled data, and it has recently become an active research area …
Unsupervised domain adaptation of contextualized embeddings for sequence labeling
Contextualized word embeddings such as ELMo and BERT provide a foundation for strong
performance across a wide range of natural language processing tasks by pretraining on …
Code and named entity recognition in StackOverflow
There is an increasing interest in studying natural language and computer code together, as
large corpora of programming texts become readily available on the Internet. For example …
A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine
Background The large volume of medical literature makes it difficult for healthcare
professionals to keep abreast of the latest studies that support Evidence-Based Medicine …
Domain adaptation for deep entity resolution
Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA)
results on ER are achieved by deep learning (DL) based methods, trained with large amounts of …
An effective transition-based model for discontinuous NER
Unlike widely used Named Entity Recognition (NER) data sets in generic domains,
biomedical NER data sets often contain mentions consisting of discontinuous spans …
Scaling laws for downstream task performance of large language models
Scaling laws provide important insights that can guide the design of large language models
(LLMs). Existing work has primarily focused on studying scaling laws for pretraining …