Don’t Start Your Data Labeling from Scratch: OpSaLa - Optimized Data Sampling Before Labeling

Andraž Pelicon*, Syrielle Montariol, Petra Kralj Novak

*Corresponding author for this work

Research output: Contribution to Book/Report types > Conference contribution > peer-review

Abstract (may include machine translation)

Many text classification tasks face a severe class imbalance problem that limits the ability to train high-performance models. This is partly due to the small number of instances in the minority class, so that the minority class patterns are not well-represented. A common approach in such cases is to resort to data augmentation techniques; however, these have shown mixed results on text data. Our proposed solution is to Optimize the data Sampling prior to Labeling (OpSaLa) to obtain overrepresented minority class(es) in the training dataset. We evaluate our approach on three real-world hate speech datasets and compare it to four commonly used approaches: training on the “natural” class distribution, a class weighting approach, and two oversampling approaches: minority oversampling and backtranslation. Our results confirm that the OpSaLa approach yields better models while the labeling budget stays the same.
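Two of the baselines named in the abstract, class weighting and minority oversampling, can be illustrated with a minimal sketch. This is not the OpSaLa procedure, which the abstract does not specify, and it is not the authors' code. The function names, the "balanced" weighting formula, and the toy data are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Assumed 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def oversample_minority(texts, labels, seed=0):
    """Randomly duplicate examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out = []
    for label, items in by_class.items():
        # Draw duplicates with replacement from the existing class members.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((text, label) for text in items + extra)
    rng.shuffle(out)
    return out


# Hypothetical toy dataset with a 2:8 class imbalance.
texts = [f"post {i}" for i in range(10)]
labels = ["hate"] * 2 + ["ok"] * 8

weights = balanced_class_weights(labels)
resampled = oversample_minority(texts, labels)
```

Both baselines act after labeling: weighting rescales the loss per class, while oversampling repeats already-labeled minority examples. By contrast, the approach described in the abstract changes which instances are sampled before the labeling budget is spent.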

Original language: English
Title of host publication: Advances in Intelligent Data Analysis XXI - 21st International Symposium on Intelligent Data Analysis, IDA 2023, Proceedings
Editors: Bruno Crémilleux, Sibylle Hess, Siegfried Nijssen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 353-365
Number of pages: 13
ISBN (Print): 9783031300462
State: Published - 1 Apr 2023
Event: 21st International Symposium on Intelligent Data Analysis, IDA 2023 - Louvain-la-Neuve, Belgium
Duration: 12 Apr 2023 to 14 Apr 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13876 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Symposium on Intelligent Data Analysis, IDA 2023
Country/Territory: Belgium
City: Louvain-la-Neuve
Period: 12/04/23 to 14/04/23
