Multimodal emotion detection in low-resource languages using lightweight transformer architectures: A dual-level fusion framework integrating DistilBERT, CNN-BiGRU, and MobileViT for efficient real-time Urdu affective computing

Authors: Dr. AZHAR Muhammad; Amjad, Adeen; Arman, Muhammad; Dewi, Deshinta Arrova
Date: 2026-06-05
Year: 2026
Citation: Information, 2026, vol. 17(5), article no. 458.
ISSN: 2078-2489
URI: http://hdl.handle.net/20.500.11861/27320
Access: Open access

Abstract: This paper addresses emotion recognition in low-resource language settings for healthcare and human-computer interaction (HCI). Most existing multimodal systems rely on resource-intensive transformers or high-resource languages, limiting their applicability to low-resource languages like Urdu. We propose an efficiency-driven, lightweight multimodal framework for Urdu emotion detection integrating facial expressions, speech, and text. We utilize DistilBERT for text, CNN-BiGRU for audio, and MobileViT-XXS for visual processing with a dual-level fusion strategy. We evaluate on the publicly available UMED corpus, the only multimodal Urdu emotion dataset. Our system recognizes expressed emotional signals rather than internal affective states. Experimental results demonstrate competitive performance (83.72% accuracy) while requiring 76.5% fewer parameters and 4.4× faster inference than heavyweight baselines, enabling accessible, real-time emotion recognition in low-resource contexts.

Language: en
Keywords: Affective Computing; Multimodal Emotion Recognition; Urdu Language Processing; Lightweight Transformers; Edge Computing; Public Health; DistilBERT; MobileViT; CNN-BiGRU
Type: Peer Reviewed Journal Article
DOI: 10.3390/info17050458