Tools for automated analysis of cybercriminal markets

RS Portnoff, S Afroz, G Durrett, JK Kummerfeld… - Proceedings of the 26th …, 2017 - dl.acm.org
Underground forums are widely used by criminals to buy and sell a host of stolen items,
datasets, resources, and criminal services. These forums contain important resources for …

Substructure substitution: Structured data augmentation for NLP

H Shi, K Livescu, K Gimpel - arXiv preprint arXiv:2101.00411, 2021 - arxiv.org
We study a family of data augmentation methods, substructure substitution (SUB2), for
natural language processing (NLP) tasks. SUB2 generates new examples by substituting …

Improving pre-trained multilingual models with vocabulary expansion

H Wang, D Yu, K Sun, J Chen, D Yu - arXiv preprint arXiv:1909.12440, 2019 - arxiv.org
Recently, pre-trained language models have achieved remarkable success in a broad range
of natural language processing tasks. However, in multilingual setting, it is extremely …

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

M Sanguinetti, C Bosco, L Cassidy, Ö Çetinoğlu… - Language Resources …, 2023 - Springer
This article presents a discussion on the main linguistic phenomena which cause difficulties
in the analysis of user-generated texts found on the web and in social media, and proposes …

You are your photographs: Detecting multiple identities of vendors in the darknet marketplaces

X Wang, P Peng, C Wang, G Wang - Proceedings of the 2018 on Asia …, 2018 - dl.acm.org
Darknet markets are online services behind Tor where cybercriminals trade illegal goods
and stolen datasets. In recent years, security analysts and law enforcement start to …

Identifying products in online cybercrime marketplaces: A dataset for fine-grained domain adaptation

G Durrett, JK Kummerfeld, T Berg-Kirkpatrick… - arXiv preprint arXiv …, 2017 - arxiv.org
One weakness of machine-learned NLP models is that they typically perform poorly on out-
of-domain data. In this work, we study the task of identifying products being bought and sold …

Treebanking user-generated content: A proposal for a unified representation in Universal Dependencies

M Sanguinetti, C Bosco, L Cassidy, Ö Çetinoğlu… - Proceedings of the 12th …, 2020 - iris.unica.it
The paper presents a discussion on the main linguistic phenomena of user-generated texts
found in web and social media, and proposes a set of annotation guidelines for their …

A taxonomy for in-depth evaluation of normalization for user generated content

R van der Goot, R van Noord… - … Conference on Language …, 2018 - research.rug.nl
In this work we present a taxonomy of error categories for lexical normalization, which is the
task of translating user generated content to canonical language. We annotate a recent …

Discovery of stylistic patterns in business process textual descriptions: IT ticket case

N Rizun, V Meister, A Revina - Innovation Management and …, 2020 - papers.ssrn.com
Growing IT complexity and related problems, which are reflected in IT tickets, create a need
for new qualitative approaches. The goal is to automate the extraction of main topics …

From noisy questions to Minecraft texts: Annotation challenges in extreme syntax scenario

HM Alonso, D Seddah, B Sagot - … of the 2nd Workshop on Noisy …, 2016 - aclanthology.org
User-generated content presents many challenges for its automatic processing. While many
of them do come from out-of-vocabulary effects, others spawn from different linguistic …