Geographic Adaptation of Pretrained Language Models

V Hofmann, G Glavaš, N Ljubešić… - Transactions of the …, 2024 - direct.mit.edu
While pretrained language models (PLMs) have been shown to possess a plethora of
linguistic knowledge, the existing body of research has largely neglected extralinguistic …

CLASSLA-Stanza: The next step for linguistic processing of South Slavic Languages

L Terčon, N Ljubešić - ar…

Mapping the languages of Twitter in Finland

T Hiippala, T Väisänen, T Toivonen, O Järv - Neuphilologische Mitteilungen, 2020 - JSTOR
Twitter is a popular social media platform for scholarly research, because the user-
generated content on the platform can also include geographic and temporal information …

Using social-media data to investigate morphosyntactic variation and dialect syntax in a lesser-used language: Two case studies from Welsh

D Willis - Glossa, 2020 - ora.ox.ac.uk
Data gathered from social media have been used extensively to examine lexical dialect
variation in widely used languages such as English and Spanish, but their use to date in …

Together we are stronger: Bootstrapping language technology infrastructure for South Slavic languages with CLARIN.SI

N Ljubešić, T Erjavec, M Miličević Petrović… - … . The Infrastructure for …, 2022 - degruyter.com
In this chapter we describe the recent developments in language technology infrastructure
building for three South Slavic languages–Slovenian, Croatian, and Serbian. These …

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

N Ljubešić, T Kuzman - arXiv preprint arXiv:2403.12721, 2024 - arxiv.org
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian,
Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole …

How to optimize your Twitter collection: Dutch keywords for better coverage

T Kreutz, W Daelemans - Computational Linguistics in the …, 2019 - clinjournal.org
Twitter allows API calls to retrieve one percent of all tweets at any time using a search word
list. Since some languages, including Dutch, make up less than one percent of all tweets on …

6 Data Collection and Representation for Similar Languages, Varieties and Dialects

T Samardžić, N Ljubešić - Similar Languages, Varieties, and …, 2021 - books.google.com
Collections of digital text intended for research–known as language corpora–have been
used as linguistic data since the pioneering work on the Brown corpus by Francis and …

The Russian invasion of Ukraine through the lens of ex-Yugoslavian Twitter

B Evkoski, I Mozetič, PK Novak, N Ljubešić - 2022 - researchgate.net
The Russian invasion of Ukraine marks a dramatic change in international
relations globally, as well as at specific, already unstable, regions. The geographical area of …