Multimodal emotion detection in low-resource languages using lightweight transformer architectures: A dual-level fusion framework integrating DistilBERT, CNN-BiGRU, and MobileViT for efficient real-time Urdu affective computing
Date Issued
2026
Publisher
MDPI AG
Journal
Information
ISSN
2078-2489
Citation
Information, 2026, vol. 17(5), article no. 458.
Description
Open access
Type
Peer Reviewed Journal Article
Abstract
This paper addresses emotion recognition in low-resource language settings for healthcare and human-computer interaction (HCI). Most existing multimodal systems rely on resource-intensive transformers or high-resource languages, limiting their applicability to low-resource languages like Urdu. We propose an efficiency-driven, lightweight multimodal framework for Urdu emotion detection integrating facial expressions, speech, and text. We utilize DistilBERT for text, CNN-BiGRU for audio, and MobileViT-XXS for visual processing with a dual-level fusion strategy. We evaluate on the publicly available UMED corpus, the only multimodal Urdu emotion dataset. Our system recognizes expressed emotional signals rather than internal affective states. Experimental results demonstrate competitive performance (83.72% accuracy) while requiring 76.5% fewer parameters and 4.4× faster inference than heavyweight baselines, enabling accessible, real-time emotion recognition in low-resource contexts.
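The efficiency figures in the abstract can be restated as ratios. The short sketch below only rearranges the two reported numbers (76.5% fewer parameters, 4.4× faster inference); the function names are illustrative, and no absolute parameter counts or latencies are assumed because the abstract does not give them.

```python
def remaining_fraction(percent_fewer: float) -> float:
    """Fraction of the baseline that remains after a percentage reduction."""
    return 1.0 - percent_fewer / 100.0


def latency_fraction(speedup: float) -> float:
    """Fraction of the baseline inference time that remains after a speed-up."""
    return 1.0 / speedup


# Figures reported in the abstract.
PERCENT_FEWER_PARAMS = 76.5
INFERENCE_SPEEDUP = 4.4

params_left = remaining_fraction(PERCENT_FEWER_PARAMS)  # share of baseline parameters kept
size_ratio = 1.0 / params_left                           # how many times smaller the model is
latency_left = latency_fraction(INFERENCE_SPEEDUP)       # share of baseline latency kept

print(f"Parameters kept: {params_left:.1%} of the baseline")
print(f"Model size ratio: about {size_ratio:.2f}x smaller")
print(f"Inference time: {latency_left:.1%} of the baseline")
```

Read this way, the lightweight model keeps a little under a quarter of the baseline's parameters and a little under a quarter of its inference time, so the two reductions are of similar magnitude.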