Abstract
Speech Emotion Recognition (SER) is an essential component of human-computer interaction, enabling
systems to understand and respond to human emotions. Traditional emotion recognition methods often rely on
handcrafted features, which capture only a limited part of the complexity of emotional cues. In contrast, deep learning
approaches, particularly convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term
memory (LSTM) networks, offer more robust solutions by automatically learning hierarchical features from raw audio
data. This paper reviews recent advances in deep learning-based speech emotion recognition, discusses the principal
architectures in use, and evaluates the challenges that arise in real-world applications. We focus on how deep learning
models can improve the accuracy and robustness of SER, particularly in noisy environments. The study also outlines
future research directions, including multimodal emotion recognition and transfer learning, which address challenges
such as small datasets and cross-domain applications.