Multilingual BERT-Based Question Answering for Code-Mixed Hausa-English Text
Keywords:
code-mixed language, Hausa-English, question answering, multilingual BERT, low-resource NLP, transformer models

Abstract
Code-mixed language processing poses significant challenges due to limited linguistic resources and the complexity of handling multiple languages within a single context. This study addresses these challenges by developing a Hausa-English code-mixed question-answering (QA) dataset, derived from the Stanford Question Answering Dataset (SQuAD), and fine-tuning a multilingual BERT (mBERT) model for extractive QA. The dataset, named HECM-QA, contains over 10,000 samples with context passages, code-mixed questions, answer spans, and token-level language annotations, reflecting natural language use in Northern Nigeria. Text preprocessing involved WordPiece tokenization, cleaning, segmentation, and numerical encoding to preserve the structure of code-mixed sentences. Experimental results demonstrate that mBERT significantly outperforms LSTM and RNN baselines, achieving 79.03% accuracy, a 77.06 F1 score, and a 51.79 ROUGE score, with statistical significance confirmed through paired t-tests and bootstrap resampling. The study highlights the effectiveness of transformer-based multilingual models for code-mixed QA, emphasizes the importance of richly annotated datasets, and contributes a robust benchmark for future research in low-resource and multilingual NLP.
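To make the extractive QA setup described above concrete, the sketch below loads a pretrained mBERT checkpoint with the Hugging Face Transformers library and predicts an answer span for a code-mixed question over an English context. The checkpoint name and the example question and passage are illustrative assumptions, not the paper's fine-tuned model or data drawn from HECM-QA.

    # Minimal sketch of extractive QA with mBERT, assuming the Hugging Face
    # Transformers library; the checkpoint and example are illustrative, not
    # the authors' fine-tuned model or a sample from HECM-QA.
    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    model_name = "bert-base-multilingual-cased"  # assumed mBERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Hypothetical Hausa-English code-mixed question over an English passage.
    question = "Menene capital na Nigeria?"
    context = "Abuja has been the capital of Nigeria since 1991."

    # WordPiece-tokenize the question/context pair into one encoded input.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Take the most likely start and end token positions of the answer span.
    start = int(torch.argmax(outputs.start_logits))
    end = int(torch.argmax(outputs.end_logits))
    answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
    print(answer)  # only meaningful after fine-tuning the QA head on span data

The span-prediction head scores every token as a potential answer start or end, so fine-tuning on annotated answer spans, as done in the study, is what turns the pretrained encoder into a usable extractive QA model.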