Towards precision medicine
EA Ashley - Nature Reviews Genetics, 2016 - nature.com
There is great potential for genome sequencing to enhance patient care through improved
diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential …
From matching to generation: A survey on generative information retrieval
Information Retrieval (IR) systems are crucial tools for users to access information, widely
applied in scenarios like search engines, question answering, and recommendation …
A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions
The emergence of large language models (LLMs) has marked a significant breakthrough in
natural language processing (NLP), fueling a paradigm shift in information acquisition …
The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only
G Penedo, Q Malartic, D Hesslow, R Cojocaru… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models are commonly trained on a mixture of filtered web data and curated
high-quality corpora, such as social media conversations, books, or technical papers. This …
The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only
G Penedo, Q Malartic, D Hesslow… - Advances in …, 2023 - proceedings.neurips.cc
Large language models are commonly trained on a mixture of filtered web data and
curated "high-quality" corpora, such as social media conversations, books, or technical …
The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset
H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …
Deduplicating training data makes language models better
K Lee, D Ippolito, A Nystrom, C Zhang, D Eck… - arXiv preprint arXiv …, 2021 - arxiv.org
We find that existing language modeling datasets contain many near-duplicate examples
and long repetitive substrings. As a result, over 1% of the unprompted output of language …
REST: Retrieval-based speculative decoding
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to
speed up language model generation. The key insight driving the development of REST is …
OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more
AM Altenhoff, CM Train, KJ Gilbert… - Nucleic acids …, 2021 - academic.oup.com
OMA is an established resource to elucidate evolutionary relationships among genes from
currently 2326 genomes covering all domains of life. OMA provides pairwise and groupwise …
A survey of multimodal large language model from a data-centric perspective
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …