Multilingual BERT-Based Question Answering for Code-Mixed Hausa-English Text
Keywords:
code-mixed language, Hausa-English, question answering, multilingual BERT, low-resource NLP, transformer models

Abstract
Code-mixed language processing poses significant challenges due to limited linguistic resources and the complexity of handling multiple languages within a single context. This study addresses these challenges by developing a Hausa-English code-mixed question-answering (QA) dataset, derived from the Stanford Question Answering Dataset (SQuAD), and fine-tuning a multilingual BERT (mBERT) model for extractive QA. The dataset, named HECM-QA, contains over 10,000 samples with context passages, code-mixed questions, answer spans, and token-level language annotations, reflecting natural language use in Northern Nigeria. Text preprocessing involved WordPiece tokenization, cleaning, segmentation, and numerical encoding to preserve the structure of code-mixed sentences. Experimental results demonstrate that mBERT significantly outperforms LSTM and RNN baselines, achieving 79.03% accuracy, a 77.06 F1 score, and a 51.79 ROUGE score, with statistical significance confirmed through paired t-tests and bootstrap resampling. The study highlights the effectiveness of transformer-based multilingual models for code-mixed QA, emphasizes the importance of richly annotated datasets, and contributes a robust benchmark for future research in low-resource and multilingual NLP.
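To make the extractive QA setup described above concrete, the sketch below loads a pretrained mBERT checkpoint with the Hugging Face Transformers library and predicts an answer span for a code-mixed question over an English context. The checkpoint name and the example question and passage are illustrative assumptions, not the paper's fine-tuned model or data drawn from HECM-QA.

    # Minimal sketch of extractive QA with mBERT, assuming the Hugging Face
    # Transformers library; the checkpoint and example are illustrative, not
    # the authors' fine-tuned model or a sample from HECM-QA.
    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    model_name = "bert-base-multilingual-cased"  # assumed mBERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Hypothetical Hausa-English code-mixed question over an English passage.
    question = "Menene capital na Nigeria?"
    context = "Abuja has been the capital of Nigeria since 1991."

    # WordPiece-tokenize the question/context pair into one encoded input.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Take the most likely start and end token positions of the answer span.
    start = int(torch.argmax(outputs.start_logits))
    end = int(torch.argmax(outputs.end_logits))
    answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
    print(answer)  # only meaningful after fine-tuning the QA head on span data

The span-prediction head scores every token as a potential answer start or end, so fine-tuning on annotated answer spans, as done in the study, is what turns the pretrained encoder into a usable extractive QA model.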