Recognizing emotions from multimodal data is a complex task, especially in fields such as Human-Computer Interaction (HCI). This study proposes a multimodal approach to emotion recognition using the MELD dataset, which provides audio, text, and facial features; only the audio and text features are used here. The audio is converted to Mel-frequency cepstral coefficients (MFCCs), which serve as input to a bidirectional LSTM that performs emotion classification. The text is tokenized with BERT and fed into a second bidirectional LSTM for emotion classification. The predictions from the two modalities are combined with a voting ensemble, and performance is evaluated using F1-score and confusion matrices. The unimodal audio model achieved an F1-score of 41.69% and the unimodal text model 47.29%, while the voting ensemble achieved 47.47%, a gain of 0.18 percentage points over the text model alone. The paper also discusses future work on improving the deep learning models and combining them with ensemble methods to enhance emotion recognition on multimodal datasets.
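The voting step that combines the two modalities can be illustrated with a minimal sketch. The abstract does not say which voting scheme is used, so this example assumes soft voting, that is, a weighted average of the class probabilities produced by the audio and text classifiers. The function name `soft_vote`, the `audio_weight` parameter, and the probability values are hypothetical and chosen only for illustration; the label list is MELD's seven emotion classes.

```python
from typing import List, Sequence

# MELD's seven emotion classes (the order here is an arbitrary choice).
EMOTIONS: List[str] = [
    "anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise",
]


def soft_vote(
    audio_probs: Sequence[float],
    text_probs: Sequence[float],
    audio_weight: float = 0.5,
) -> int:
    """Return the index of the class with the highest weighted-average probability.

    Assumption: soft voting. The paper may instead use hard (majority) voting
    or a different weighting; this is only one plausible reading.
    """
    if len(audio_probs) != len(text_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    text_weight = 1.0 - audio_weight
    combined = [
        audio_weight * a + text_weight * t
        for a, t in zip(audio_probs, text_probs)
    ]
    # Index of the largest combined score (first one wins on ties).
    return max(range(len(combined)), key=combined.__getitem__)


# Hypothetical per-utterance outputs from the two bidirectional LSTMs.
audio_probs = [0.10, 0.10, 0.10, 0.40, 0.20, 0.05, 0.05]
text_probs = [0.05, 0.05, 0.05, 0.20, 0.60, 0.025, 0.025]

predicted = EMOTIONS[soft_vote(audio_probs, text_probs)]
```

With equal weights, the text model's confident "neutral" outweighs the audio model's weaker "joy". Raising `audio_weight` shifts the decision toward the audio model, which is one way such an ensemble could be tuned on a validation set.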