The Pile: An 800GB dataset of diverse text for language modeling
L Gao, S Biderman, S Black, L Golding… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent work has demonstrated that increased training dataset diversity improves general
cross-domain knowledge and downstream generalization capability for large-scale …
Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and
instruction-tuned open generative large language models (LLMs). The models are based on …
Datasheet for the Pile
This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by
EleutherAI for use in large-scale language modeling. The Pile is comprised of 22 different …
On the features of translationese
V Volansky, N Ordan, S Wintner - Digital Scholarship in the …, 2015 - academic.oup.com
Much research in translation studies indicates that translated texts are ontologically different
from original non-translated ones. Translated texts, in any language, can be considered a …
Translationese and its dialects
M Koppel, N Ordan - Proceedings of the 49th annual meeting of …, 2011 - aclanthology.org
While it has often been observed that the product of translation is somehow different than
non-translated text, scholars have emphasized two distinct bases for such differences. Some …
The Routledge handbook of translation and cognition
F Alves, AL Jakobsen - 2021 - api.taylorfrancis.com
With a strong focus on interdisciplinarity, the handbook surveys concepts and methods in
neighbouring disciplines that are concerned with cognition and how they relate to …
Language models for machine translation: Original vs. translated texts
We investigate the differences between language models compiled from original target-
language texts and those compiled from texts manually translated to the target language …
Automatic detection of translated text and its impact on machine translation
We investigate the possibility of automatically detecting whether a piece of text is an original
or a translation. On a large parallel English-French corpus where reference information is …
Contrastive analysis and native language identification
Attempts to profile authors based on their characteristics, including native language, have
drawn attention in recent years, via several approaches using machine learning with simple …
Not…Until across European Languages: A Parallel Corpus Study
We present a parallel corpus study on the expression of the temporal construction 'not…
until' in a sample of European languages. We use data from the Europarl corpus and create …