TY - GEN
T1 - Don’t Start Your Data Labeling from Scratch
T2 - 21st International Symposium on Intelligent Data Analysis, IDA 2023
AU - Pelicon, Andraž
AU - Montariol, Syrielle
AU - Kralj Novak, Petra
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023/4/1
Y1 - 2023/4/1
AB - Many text classification tasks face a severe class imbalance problem that limits the ability to train high-performance models. This is partly due to the small number of instances in the minority class, so that the minority class patterns are not well-represented. A common approach in such cases is to resort to data augmentation techniques; however, these have shown mixed results on text data. Our proposed solution is to Optimize the data Sampling prior to Labeling (OpSaLa) to obtain overrepresented minority class(es) in the training dataset. We evaluate our approach on three real-world hate speech datasets and compare it to four commonly used approaches: training on the “natural” class distribution, a class weighting approach, and two oversampling approaches: minority oversampling and backtranslation. Our results confirm that the OpSaLa approach yields better models while the labeling budget stays the same.
UR - http://www.scopus.com/inward/record.url?scp=85152543895&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-30047-9_28
DO - 10.1007/978-3-031-30047-9_28
M3 - Conference contribution
AN - SCOPUS:85152543895
SN - 9783031300462
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 353
EP - 365
BT - Advances in Intelligent Data Analysis XXI - 21st International Symposium on Intelligent Data Analysis, IDA 2023, Proceedings
A2 - Crémilleux, Bruno
A2 - Hess, Sibylle
A2 - Nijssen, Siegfried
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 12 April 2023 through 14 April 2023
ER -