Towards precision medicine

EA Ashley - Nature Reviews Genetics, 2016 - nature.com
There is great potential for genome sequencing to enhance patient care through improved
diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential …

From matching to generation: A survey on generative information retrieval

X Li, J Jin, Y Zhou, Y Zhang, P Zhang, Y Zhu… - arXiv preprint arXiv…, 2024 - arxiv.org
Information Retrieval (IR) systems are crucial tools for users to access information, widely
applied in scenarios like search engines, question answering, and recommendation …

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

L Huang, W Yu, W Ma, W Zhong, Z Feng… - ACM Transactions on …, 2025 - dl.acm.org
The emergence of large language models (LLMs) has marked a significant breakthrough in
natural language processing (NLP), fueling a paradigm shift in information acquisition …

The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only

G Penedo, Q Malartic, D Hesslow, R Cojocaru… - arXiv preprint arXiv…, 2023 - arxiv.org
Large language models are commonly trained on a mixture of filtered web data and curated
high-quality corpora, such as social media conversations, books, or technical papers. This …

The BigScience ROOTS corpus: A 1.6 TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …

Deduplicating training data makes language models better

K Lee, D Ippolito, A Nystrom, C Zhang, D Eck… - arXiv preprint arXiv…, 2021 - arxiv.org
We find that existing language modeling datasets contain many near-duplicate examples
and long repetitive substrings. As a result, over 1% of the unprompted output of language …

REST: Retrieval-based speculative decoding

Z He, Z Zhong, T Cai, JD Lee, D He - arXiv preprint arXiv:2311.08252, 2023 - arxiv.org
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to
speed up language model generation. The key insight driving the development of REST is …

OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more

AM Altenhoff, CM Train, KJ Gilbert… - Nucleic Acids …, 2021 - academic.oup.com
OMA is an established resource to elucidate evolutionary relationships among genes from
currently 2326 genomes covering all domains of life. OMA provides pairwise and groupwise …

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv…, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …