A comparative study for classical machine learning models for Swahili social media sentiment analysis

Authors

  • Mahadia Tunga, Department of Computer Science and Engineering, College of Information and Communication Technologies, University of Dar es Salaam, Dar es Salaam, Tanzania
  • Davis David, Tanzania Data Lab Organization (dLab), Dar es Salaam, Tanzania

Abstract

Although sentiment analysis is one of the most popular applications of Natural Language Processing (NLP), most studies focus on languages with rich corpora (large language databases), and less emphasis has been placed on low-resource languages such as Swahili. Swahili is an official language of the African Union and of four countries in East Africa, and is spoken by many people across the African continent. This study performed sentiment analysis on 3,000 tweets hosted on the Zindi Africa platform. The data were processed using term frequency-inverse document frequency (TF-IDF) vectorization, and five classical machine learning algorithms (RandomForest, XGBoost, CatBoost, HistogramGradientBoost, and LightGradientBoost) were trained and evaluated on the collected tweets. CatBoost produced the highest overall performance among the five models, with 0.610 accuracy, 0.470 F1 score, 0.522 precision, and 0.462 recall. The F1 score of 0.470 indicates modest performance and reflects the challenges posed by the small dataset and the complexity of Swahili sentiment analysis. This study offers a comprehensive overview of the relative performance of classical machine learning models applied to Swahili social media sentiment data. These insights can help researchers make informed choices when selecting classical machine learning algorithms for sentiment analysis in similar contexts.
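
The gap between the reported accuracy (0.610) and F1 score (0.470) is the kind of gap that can arise when classes are imbalanced. The sketch below shows how the four reported metrics can be computed for a multi-class task. It assumes macro averaging, which the abstract does not specify, and the label encoding (1 = positive, 0 = neutral, -1 = negative) and the toy predictions are hypothetical and not taken from the study's data.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption here: each class contributes equally,
    regardless of how many examples it has.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        prec = tp[c] / predicted if predicted else 0.0
        rec = tp[c] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical, imbalanced toy labels: most tweets are neutral (0).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, -1, -1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

scores = macro_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case a classifier that mostly predicts the majority class reaches 0.700 accuracy but only about 0.489 macro F1, because the negative class is never predicted correctly. Note also that a macro F1 score is the mean of per-class F1 values, so it need not equal the harmonic mean of the macro precision and macro recall.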

Published

2025-11-15

Section

Mathematics and Computational Sciences