TY - JOUR
T1 - Gender-Based Bimodal Analysis of Speech and Text in Non-Contextual Emotion Recognition
AU - Angkasa, Abby Rasya
AU - Zahra, Amalia
AU - Fung, Wai-Keung
PY - 2024/12/19
Y1 - 2024/12/19
N2 - Emotion recognition can improve human-computer interaction by enabling systems to respond empathetically and adapt to users' emotional states. This capability improves the user experience and supports the development of more intuitive, emotionally responsive communication systems. This study analyzes a gender-based (male and female) bimodal approach to recognizing emotions without contextual information from the surrounding dialogue. Using the Multimodal EmoryNLP dataset, extracted from the TV series Friends with acted speech, we focus on four primary emotions: Angry, Neutral, Joy, and Scared. RoBERTa is used for text classification, and wav2vec 2.0 is used for audio feature extraction with a Bi-LSTM model for classification. The experimental results show that data augmentation improved the weighted F1-score on the original dataset from 0.46 to 0.52 and on the male dataset from 0.43 to 0.51, while the female dataset remained at 0.46. The weighted F1-score and Unweighted Average Recall (UAR) on the male dataset, 51% and 48% respectively, are higher than those on the female dataset, 46% and 47% respectively. The gender-based analysis indicates that the male and female datasets exhibit distinct performance patterns, highlighting variations in emotional expression and recognition between genders. These findings underscore the effectiveness of multimodal strategies in emotion recognition and suggest that gender-specific factors play a significant role in enhancing classification performance. While these results highlight performance trends, further validation through repeated trials and statistical analysis could provide stronger generalizations and insights into gender-based differences.
U2 - 10.1109/bts-i2c63534.2024.10942075
DO - 10.1109/bts-i2c63534.2024.10942075
M3 - Article
SP - 398
EP - 403
JO - 2024 Beyond Technology Summit on Informatics International Conference (BTS-I2C)
JF - 2024 Beyond Technology Summit on Informatics International Conference (BTS-I2C)
ER -