Low-Resource Language Text Classification
Abstract
This paper describes our approach to SemEval-2023 Task 12 on sentiment analysis for African languages. We fine-tune multilingual pretrained models (XLM-RoBERTa, AfroXLMR) for low-resource text classification.
Background
Natural language processing has made remarkable progress in recent years, but this progress has been overwhelmingly concentrated on a handful of high-resource languages -- primarily English, Chinese, and a few European languages. The vast majority of the world's roughly 7,000 languages have little to no annotated data, pretrained models, or evaluation benchmarks. African languages are particularly underserved: despite hundreds of millions of speakers across languages like Hausa, Yoruba, Igbo, Amharic, and Swahili, NLP resources for these languages remain scarce.
Sentiment analysis -- determining whether a piece of text expresses positive, negative, or neutral sentiment -- is a foundational NLP task with direct applications in market research, public opinion monitoring, and social media analysis. For African languages, building sentiment analysis systems is challenging not only because of limited labeled data but also because of the linguistic diversity involved. These languages span multiple language families with vastly different morphological, syntactic, and tonal properties. A method that works well for Hausa (Afroasiatic, predominantly written in Latin script) may not transfer well to Amharic (Afroasiatic, Ge'ez script) or Yoruba (Niger-Congo, tonal with diacritics).
SemEval-2023 Task 12 (AfriSenti) was designed to push the community toward building NLP systems that work for these underrepresented languages. The shared task provided sentiment-annotated datasets for multiple African languages and defined a standardized evaluation setting, enabling systematic comparison of approaches to low-resource text classification.
Methodology
Our approach centered on fine-tuning multilingual pretrained language models, which have shown strong cross-lingual transfer capabilities. We primarily experimented with XLM-RoBERTa and AfroXLMR. XLM-RoBERTa is a massively multilingual model pretrained on CommonCrawl data spanning 100 languages, while AfroXLMR is specifically adapted for African languages, having been further pretrained on African language corpora. The hypothesis was that AfroXLMR, with its language-adaptive pretraining, would provide better representations for the target languages than the more general XLM-RoBERTa.
For fine-tuning, we adopted standard practices for text classification: appending a classification head on top of the pretrained encoder and training end-to-end on the task-specific labeled data. Given the small size of the training sets for some languages, we carefully tuned hyperparameters -- learning rate, batch size, number of epochs, and weight decay -- to avoid overfitting. We also explored data augmentation strategies and the effect of combining training data from related languages to increase the effective training set size.
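The classification-head pattern above can be sketched in a few lines. This is a deliberately simplified illustration, not the system's actual training code: it assumes the encoder is frozen, so each text is represented by a fixed feature vector (in the real system the features come from XLM-RoBERTa or AfroXLMR and the encoder is updated end-to-end), and it trains a linear softmax head with weight decay as the regularizer. All names and the toy data are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(features, labels, num_classes, lr=0.1, weight_decay=1e-4, epochs=200):
    """Train a linear softmax classification head with L2 regularization."""
    rng = np.random.default_rng(0)
    n, d = features.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ W + b)
        grad = probs - onehot  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad / n + weight_decay * W)
        b -= lr * grad.mean(axis=0)
    return W, b

# Toy stand-in for encoder output: 3-class "sentiment" on 8-dim features.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 8))
labels = rng.integers(0, 3, size=120)
features = centers[labels] + 0.1 * rng.normal(size=(120, 8))

W, b = train_head(features, labels, num_classes=3)
preds = softmax(features @ W + b).argmax(axis=1)
accuracy = (preds == labels).mean()
```

In the full system the encoder parameters are also updated, which is where the careful choice of learning rate and number of epochs matters most on small training sets.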
The evaluation covered multiple African languages including Hausa, Yoruba, Igbo, and several others. Performance was measured using weighted F1 score, which accounts for class imbalance in the sentiment labels. We conducted ablation studies to understand the contribution of each component: the choice of base model, the fine-tuning regime, and the effect of cross-lingual data pooling.
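To make the metric concrete, the following sketch computes weighted F1 in plain Python: per-class F1 scores are averaged with each class weighted by its support (number of true examples), so majority and minority sentiment classes contribute proportionally. The labels below are illustrative; in practice this matches scikit-learn's `f1_score(..., average="weighted")`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged with weights proportional to support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += (n_true / len(y_true)) * f1
    return total

y_true = ["pos", "pos", "pos", "neg", "neu", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "neu", "pos"]
score = weighted_f1(y_true, y_pred)
```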
Approach
- Multilingual pretraining: Leveraging cross-lingual transfer from models pretrained on diverse multilingual corpora, enabling knowledge sharing between high-resource and low-resource languages through shared subword representations
- Fine-tuning strategies: Adapting to limited data scenarios through careful hyperparameter selection, early stopping, and regularization techniques designed to prevent overfitting on small training sets
- Language coverage: Hausa, Yoruba, Igbo, and more -- spanning multiple language families and writing systems, testing the generality of multilingual transfer across linguistically diverse targets
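The early-stopping component of the fine-tuning strategy above can be sketched as follows: stop training once validation F1 has not improved for a fixed number of consecutive epochs (the patience), and keep the best checkpoint seen so far. The per-epoch scores here are illustrative stand-ins for real validation runs, not results from the task.

```python
def early_stop_epoch(val_scores, patience=2):
    """Return the index of the epoch whose checkpoint would be kept."""
    best_epoch, best_score, bad_epochs = 0, float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score, bad_epochs = epoch, score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # stop training; later epochs are never run
    return best_epoch

# Toy validation weighted-F1 per epoch: improves, then starts to overfit.
scores = [0.61, 0.68, 0.71, 0.70, 0.69, 0.73]
best = early_stop_epoch(scores, patience=2)
```

Note that training halts after the second non-improving epoch, so a hypothetical later rebound (the 0.73 above) is never observed; on small training sets this trade-off is usually worth it to avoid overfitting.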
Results
The experiments confirmed that multilingual pretrained models provide a strong foundation for low-resource sentiment analysis. AfroXLMR consistently outperformed standard XLM-RoBERTa across most of the target languages, validating the value of language-adaptive pretraining. The performance gap was most pronounced for languages with the least representation in XLM-RoBERTa's pretraining corpus, where AfroXLMR's additional African language data provided a clear advantage.
Cross-lingual data pooling -- training on combined data from related languages -- produced mixed results. For closely related languages within the same family, pooling improved performance, likely because the shared linguistic features allowed positive transfer. For more distant language pairs, pooling sometimes hurt performance, suggesting that the model conflated distinct linguistic patterns when forced to share parameters across dissimilar languages.
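The pooling setup can be sketched as simple concatenation of training sets, grouped so that a shared model sees data only from languages in the same family. The family grouping and placeholder examples below are illustrative assumptions, not the task's actual splits.

```python
# Illustrative family assignment for a few of the target languages.
FAMILY = {
    "hausa": "afroasiatic",
    "amharic": "afroasiatic",
    "yoruba": "niger-congo",
    "igbo": "niger-congo",
}

def pool_by_family(datasets):
    """Map {language: [(text, label), ...]} to {family: pooled examples}."""
    pooled = {}
    for lang, examples in datasets.items():
        pooled.setdefault(FAMILY[lang], []).extend(examples)
    return pooled

# Placeholder training sets; real data would be the task's annotated tweets.
datasets = {
    "hausa":  [("example text 1", "positive")],
    "yoruba": [("example text 2", "negative")],
    "igbo":   [("example text 3", "neutral")],
}
pooled = pool_by_family(datasets)
```

The mixed results reported above suggest the grouping itself is the critical design choice: pooling within a family tended to help, while pooling across families could hurt.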
Overall, our system achieved solid results on the shared task leaderboard. The findings highlight both the promise and the limitations of current multilingual models for truly low-resource scenarios: while transfer learning dramatically reduces the data requirements compared to training from scratch, there remains a meaningful performance gap between high-resource and low-resource languages even with the best available pretrained models.
Discussion
This work underscores an important direction for the NLP community: building and evaluating systems that work beyond the small set of languages that currently dominate research. The success of AfroXLMR relative to general-purpose multilingual models suggests that targeted pretraining on underrepresented language families is a productive investment. As more African language text becomes available online and through dedicated data collection efforts, the foundation for better models will continue to grow.
From a practical standpoint, even modest sentiment analysis capabilities in these languages can unlock applications that were previously infeasible -- from monitoring public health discourse in local languages to analyzing customer feedback in regional markets. The shared task format proved valuable for catalyzing research attention toward these languages, and we hope that continued community efforts will further close the resource gap.