| dc.description.abstract |
The proliferation of social media has significantly amplified user interactions, but it also poses serious threats to
communities through the spread of harmful content such as hate speech. The emotionally charged and nuanced language
found in user-generated content presents unique challenges for effective detection and analysis. This study investigates
YouTube comments related to child abuse and introduces a comprehensive machine learning framework for the automatic
identification of hate speech. A dataset of 2,500 comments was collected via web scraping with Selenium, balanced
equally between hate and non-hate speech to ensure fair evaluation. To extract textual features, various natural language
processing (NLP) techniques were employed, including CountVectorizer, TF-IDF, Word2Vec, and FastText. Several
machine learning models were evaluated on this dataset. The Gradient Boosting model combined with CountVectorizer
achieved the highest accuracy, at 78%. Ensemble approaches, such as soft voting and stacking classifiers, also performed
strongly, reaching up to 75% accuracy. Performance was further assessed using metrics such as precision and recall. The results
demonstrate the effectiveness of the Gradient Boosting model in enhancing hate speech detection systems, particularly in
sensitive contexts such as child abuse discussions. By advancing methods for identifying harmful content, this research
supports the creation of safer and more respectful digital environments. |
en_US |