A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv…, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

C-Pack: Packed resources for general Chinese embeddings

S Xiao, Z Liu, P Zhang, N Muennighoff, D Lian… - Proceedings of the 47th …, 2024 - dl.acm.org
We introduce C-Pack, a package of resources that significantly advances the field of general
text embeddings for Chinese. C-Pack includes three critical resources. 1) C-MTP is a …

BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

J Chen, S Xiao, P Zhang, K Luo, D Lian… - arXiv preprint arXiv…, 2024 - arxiv.org
In this paper, we present a new embedding model, called M3-Embedding, which is
distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It …

LongBench: A bilingual, multitask benchmark for long context understanding

Y Bai, X Lv, J Zhang, H Lyu, J Tang, Z Huang… - arXiv preprint arXiv…, 2023 - arxiv.org
Although large language models (LLMs) demonstrate impressive performance for many
language tasks, most of them can only handle texts a few thousand tokens long, limiting their …

Hallucination detection: Robustly discerning reliable answers in large language models

Y Chen, Q Fu, Y Yuan, Z Wen, G Fan, D Liu… - Proceedings of the …, 2023 - dl.acm.org
Large language models (LLMs) have gained widespread adoption in various natural
language processing tasks, including question answering and dialogue systems. However …

The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …

ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation

Y Sun, S Wang, S Feng, S Ding, C Pang… - arXiv preprint arXiv…, 2021 - arxiv.org
Pre-trained models have achieved state-of-the-art results in various Natural Language
Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown that scaling up …

QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

JH Clark, E Choi, M Collins, D Garrette… - Transactions of the …, 2020 - direct.mit.edu
Confidently making progress on multilingual modeling requires challenging, trustworthy
evaluations. We present TyDi QA—a question answering dataset covering 11 typologically …