Enhancing Machine Learning in Abusive Language Detection with Dataset Integration

Samaneh Hosseini Moghaddam, Kelly Lyons, Cheryl Regehr, Frank Rudzicz, V. Goel, Kaitlyn Regehr

Published 2025 in Conference of the Centre for Advanced Studies on Collaborative Research

ABSTRACT

Abusive language detection models are widely reported to suffer from poor generalization, limiting their real-world effectiveness. This is largely due to sampling and lexical biases in datasets. In response to these issues, we aim to enhance the generalizability of abusive language detection models by leveraging and unifying existing datasets. We harmonize ten publicly available datasets under a consistent definition of abusive language and integrate them into a single dataset. Our core hypothesis is that while individual datasets exhibit sampling bias, their complementary characteristics can be harnessed to create a broader and more representative training distribution. To evaluate this hypothesis, we first empirically demonstrate the extent of sampling bias across datasets. We then systematically integrate multiple datasets into an aggregated corpus and compare the classification performance of models trained on each individual dataset against that of a model trained on the aggregated corpus, using a held-out, uniformly sampled benchmark comprising data from all datasets. The integrated model improves macro-F1 from 0.60 (average across single datasets) to 0.84. Furthermore, we quantify the contribution of each dataset to the integrated model's performance gains and its lexical dissimilarity relative to the others, and find a strong correlation between the two, with a magnitude of 0.71. These findings suggest that integrating lexically diverse datasets exposes models to a broader spectrum of abuse-related language, mitigating dataset-specific sampling biases and enhancing model generalizability in real-world scenarios.
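
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state how lexical dissimilarity or dataset contribution is measured, so the sketch assumes Jaccard distance between whitespace-tokenized vocabularies and a Pearson coefficient; all function names are hypothetical.

```python
from math import sqrt


def macro_f1(y_true, y_pred):
    """Macro-F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def vocabulary(texts):
    """Assumption: lowercase whitespace tokens stand in for a real tokenizer."""
    return {token for text in texts for token in text.lower().split()}


def jaccard_distance(vocab_a, vocab_b):
    """Assumption: Jaccard distance as the lexical dissimilarity measure."""
    union = vocab_a | vocab_b
    return 1.0 - len(vocab_a & vocab_b) / len(union) if union else 0.0


def mean_dissimilarity(name, vocabularies):
    """Average dissimilarity of one dataset's vocabulary to all the others."""
    others = [v for n, v in vocabularies.items() if n != name]
    return sum(jaccard_distance(vocabularies[name], v) for v in others) / len(others)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

Under these assumptions, one would compute a contribution score per dataset (for example, the drop in benchmark macro-F1 when that dataset is left out of the aggregated corpus, which is itself an assumed definition), pair it with `mean_dissimilarity` for the same dataset, and pass the two lists to `pearson`.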
