Multimodal emotion detection in low-resource languages using lightweight transformer architectures: A dual-level fusion framework integrating DistilBERT, CNN-BiGRU, and MobileViT for efficient real-time Urdu affective computing

This paper addresses emotion recognition in low-resource language settings for healthcare and human-computer interaction (HCI). Most existing multimodal systems rely on resource-intensive transformers or target high-resource languages, which limits their applicability to low-resource languages such as Urdu. We propose an efficiency-driven, lightweight multimodal framework for Urdu emotion detection that integrates facial expressions, speech, and text. We use DistilBERT for text, CNN-BiGRU for audio, and MobileViT-XXS for visual processing, combined through a dual-level fusion strategy. We evaluate on the publicly available UMED corpus, the only multimodal Urdu emotion dataset. Our system recognizes expressed emotional signals rather than internal affective states. Experimental results show competitive performance (83.72% accuracy) with 76.5% fewer parameters and 4.4× faster inference than heavyweight baselines, enabling accessible, real-time emotion recognition in low-resource contexts.
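
The abstract does not spell out what the two fusion levels are. As a rough illustration only, the sketch below assumes one common reading: a feature-level branch that concatenates the three modality embeddings before classifying, and a decision-level branch that averages per-modality class probabilities, with the two blended by a weight. The function names, the `alpha` weight, the six-class label set, the embedding sizes, and the random weights are all hypothetical stand-ins, not the authors' design; in a real system the embeddings would come from DistilBERT, CNN-BiGRU, and MobileViT-XXS encoders.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_layer(rng, n_out, n_in):
    """Hypothetical untrained classifier head (random weights, zero bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dual_level_fusion(text_emb, audio_emb, visual_emb,
                      heads, joint_head, alpha=0.5):
    """Blend feature-level and decision-level predictions (assumed scheme)."""
    embeddings = (text_emb, audio_emb, visual_emb)

    # Level 1 (feature level): concatenate embeddings, classify jointly.
    joint = text_emb + audio_emb + visual_emb
    p_feature = softmax(linear(joint_head[0], joint_head[1], joint))

    # Level 2 (decision level): classify each modality, average the results.
    per_modality = [softmax(linear(w, b, e))
                    for (w, b), e in zip(heads, embeddings)]
    p_decision = [sum(p[k] for p in per_modality) / len(per_modality)
                  for k in range(len(p_feature))]

    # Convex combination of the two levels; alpha is an assumed weight.
    return [alpha * f + (1 - alpha) * d
            for f, d in zip(p_feature, p_decision)]


rng = random.Random(0)
NUM_CLASSES = 6        # assumption: the abstract does not state the label set
DIMS = (8, 8, 8)       # assumption: toy embedding sizes per modality

# Stand-ins for encoder outputs (text, audio, visual).
embs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for d in DIMS]
heads = [random_layer(rng, NUM_CLASSES, d) for d in DIMS]
joint_head = random_layer(rng, NUM_CLASSES, sum(DIMS))

probs = dual_level_fusion(embs[0], embs[1], embs[2], heads, joint_head)
predicted_class = max(range(NUM_CLASSES), key=lambda k: probs[k])
```

Because both branches output probability distributions and `alpha` lies in [0, 1], the blended vector is itself a valid distribution, so `predicted_class` can be read off with a plain argmax.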

Subjects

Affective Computing

Multimodal Emotion Re...

Urdu Language Process...

Lightweight Transform...

Edge Computing

Public Health

DistilBERT

MobileViT

CNN-BiGRU
