Findings of the 2021 Conference on Machine Translation (WMT21)

F Akhbardeh, A Arkhangorodsky, M Biesialska… - Proceedings of the sixth …, 2021 - cris.fbk.eu
This paper presents the results of the news translation task, the multilingual low-resource
translation for Indo-European languages, the triangular translation task, and the automatic …

The ParlaMint corpora of parliamentary proceedings

T Erjavec, M Ogrodniczuk, P Osenova… - Language resources …, 2023 - Springer
This paper presents the ParlaMint corpora containing transcriptions of the sessions of the 17
European national parliaments with half a billion words. The corpora are uniformly encoded …

A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language Models

V Snæbjarnarson, HB Símonarson… - arXiv preprint arXiv …, 2022 - arxiv.org
We train several language models for Icelandic, including IceBERT, that achieve state-of-the-
art performance in a variety of downstream tasks, including part-of-speech tagging, named …

Language technology programme for Icelandic 2019-2023

AB Nikulásdóttir, J Guðnason, AK Ingason… - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we describe a new national language technology programme for Icelandic.
The programme, which spans a period of five years, aims at making Icelandic usable in …

Nefnir: A high accuracy lemmatizer for Icelandic

SL Ingólfsdóttir, H Loftsson, JF Daðason… - arXiv preprint arXiv …, 2019 - arxiv.org
Lemmatization, finding the basic morphological form of a word in a corpus, is an important
step in many natural language processing tasks when working with morphologically rich …

Augmenting a BiLSTM tagger with a morphological lexicon and a lexical category identification step

S Steingrímsson, Ö Kárason, H Loftsson - arXiv preprint arXiv:1907.09038, 2019 - arxiv.org
Previous work on using BiLSTM models for PoS tagging has primarily focused on small
tagsets. We evaluate BiLSTM models for tagging Icelandic, a morphologically rich language …

Pre-training and Evaluating Transformer-based Language Models for Icelandic

JF Daðason, H Loftsson - Proceedings of the Thirteenth Language …, 2022 - aclanthology.org
In this paper, we evaluate several Transformer-based language models for Icelandic on four
downstream tasks: Part-of-Speech tagging, Named Entity Recognition, Dependency …

DIM: The Database of Icelandic Morphology

K Bjarnadóttir, KI Hlynsdóttir… - Proceedings of the 22nd …, 2019 - aclanthology.org
The topic of this paper is The Database of Icelandic Morphology (DIM), a multipurpose
linguistic resource, created for use in language technology, as a reference for the general …

The Danish Gigaword Corpus

L Derczynski, MR Ciosici, R Baglini… - Proceedings of the …, 2021 - aclanthology.org
Danish language technology has been hindered by a lack of broad-coverage corpora at the
scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of …

Byte-level grammatical error correction using synthetic and curated corpora

SL Ingólfsdóttir, PO Ragnarsson, HP Jónsson… - arXiv preprint arXiv …, 2023 - arxiv.org
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and
grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we …