Multilingual BERT-Based Question Answering for Code-Mixed Hausa-English Text

Authors

  • Y. M. Malgwi, Department of Computer Science, Modibbo Adama University, Yola, Adamawa State, Nigeria
  • B. J. Muhammed, Department of Computer Science, Federal University of Kashere, Nigeria
  • A. L. Mohammed, Department of Computer Science, Nigerian Army University Biu, Borno State, Nigeria
  • H. Mikailu, Department of Computer Science, Nigerian Army University Biu, Borno State, Nigeria

Keywords

code-mixed language, Hausa-English, question answering, multilingual BERT, low-resource NLP, transformer models

Abstract

Code-mixed language processing poses significant challenges due to limited linguistic resources and the complexity of handling multiple languages within a single context. This study addresses these challenges by developing a Hausa–English code-mixed question-answering (QA) dataset, derived from the Stanford Question Answering Dataset (SQuAD), and fine-tuning a multilingual BERT (mBERT) model for extractive QA. The dataset, named HECM-QA, contains over 10,000 samples comprising context passages, code-mixed questions, answer spans, and token-level language annotations, reflecting natural language use in Northern Nigeria. Text preprocessing involved cleaning, segmentation, WordPiece tokenization, and numerical encoding to preserve the structure of code-mixed sentences. Experimental results show that mBERT significantly outperforms LSTM and RNN baselines, achieving 79.03% accuracy, an F1 score of 77.06, and a ROUGE score of 51.79, with statistical significance confirmed through paired t-tests and bootstrap resampling. The study highlights the effectiveness of transformer-based multilingual models for code-mixed QA, underscores the importance of richly annotated datasets, and contributes a robust benchmark for future research in low-resource and multilingual NLP.
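As background for the extractive QA setup described in the abstract: BERT-style extractive models predict, for each token in the context passage, a start score and an end score, and the answer is the highest-scoring valid span. A minimal sketch of that span-selection step, in pure Python with hypothetical logits (not the authors' code or data):

```python
def best_answer_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) token pair that maximizes the combined
    start+end score, subject to start <= end and a maximum span length.
    This is the standard decoding step for BERT-style extractive QA."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy context tokens with made-up logits (illustrative only).
tokens = ["Kano", "is", "the", "largest", "city", "in", "northern", "Nigeria"]
start = [0.1, -1.0, -0.5, 0.2, -0.3, -1.2, -0.8, 2.1]
end   = [-0.4, -1.1, -0.9, -0.2, 0.3, -1.0, -0.6, 2.4]
s, e = best_answer_span(start, end)
print(" ".join(tokens[s:e + 1]))  # → Nigeria
```

In a real pipeline the logits would come from a fine-tuned mBERT question-answering head over WordPiece tokens, and the selected token span would be mapped back to character offsets in the original code-mixed passage.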


Published

2025-12-12