Can AI Show Cognitive Empathy via Acoustics?
A study shows AI can recognize fear, joy, anger, and sadness from audio clips.
KEY POINTS
A new peer-reviewed study shows how AI can detect emotions on par with human performance.
Scientists used three different AI deep learning models for classifying emotions from short audio clips.
An AI and data science approach to psychology shows the potential machines have for cognitive empathy tasks.
Understanding and correctly identifying human emotional states are important for mental health providers. Can artificial intelligence (AI) machine learning demonstrate the human ability of cognitive empathy? A new peer-reviewed study shows how AI can detect emotions on par with human performance from audio clips as short as 1.5 seconds.
“The human voice serves as a powerful channel for expressing emotional states, as it provides universally understandable cues about the sender’s situation and can transmit them over long distances,” wrote the study’s first author, Hannes Diemerling, of the Max Planck Institute for Human Development’s Center for Lifespan Psychology, in collaboration with Germany-based psychology researchers Leonie Stresemann, Tina Braun, and Timo von Oertzen.
In AI deep learning, the quality and quantity of training data are critical to the performance and accuracy of the algorithm. The audio data for this research consist of over 1,500 unique clips drawn from open-source emotion databases: English recordings from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and German recordings from the Berlin Database of Emotional Speech (Emo-DB).
“Emotional recognition from audio recordings is a rapidly advancing field, with significant implications for artificial intelligence and human-computer interaction,” the researchers wrote.
For the purposes of this study, the researchers narrowed the emotional states to six categories: joy, fear, neutral, anger, sadness, and disgust. The audio recordings were consolidated into 1.5-second segments, and various features were quantified from each segment. The quantified features include pitch tracking, pitch magnitudes, spectral bandwidth, magnitude, phase, MFCCs, chroma, Tonnetz, spectral contrast, spectral rolloff, fundamental frequency, spectral centroid, zero crossing rate, root mean square (RMS), HPSS, spectral flatness, and the unmodified audio signal.
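The paper's preprocessing code is not reproduced here, but a minimal sketch of how 1.5-second segments might be cut from a longer recording, assuming the commonly used librosa library for audio loading (not necessarily what the authors used), could look like this:

```python
import librosa

def split_into_segments(path, segment_seconds=1.5):
    """Load an audio file and cut it into consecutive 1.5-second segments."""
    y, sr = librosa.load(path, sr=None)            # keep the recording's native sampling rate
    samples_per_segment = int(segment_seconds * sr)
    n_segments = len(y) // samples_per_segment     # drop any trailing partial segment
    return [y[i * samples_per_segment:(i + 1) * samples_per_segment]
            for i in range(n_segments)], sr

# Hypothetical usage with a clip from one of the open-source databases:
# segments, sr = split_into_segments("example_recording.wav")
```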
Psychoacoustics is the psychology of sound and the science of human sound perception. Audio frequency (pitch) and amplitude (volume) greatly affect how people experience sound. In psychoacoustics, pitch corresponds to the frequency of the sound and is measured in hertz (Hz) and kilohertz (kHz): the higher the frequency, the higher the perceived pitch. Amplitude refers to the loudness of the sound and is measured in decibels (dB): the greater the amplitude, the greater the sound volume.
The spectral centroid is the center of mass of the audio signal's spectrum, indicating where its energy is concentrated. Spectral bandwidth (spectral spread) is the range between the upper and lower frequencies around the centroid and is derived from it. Spectral flatness measures how evenly energy is distributed across frequencies, distinguishing noise-like from tone-like signals. Spectral rolloff is the frequency below which a set percentage of the total spectral energy falls, identifying the most strongly represented frequency range in a signal.
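As an illustration only (the study's own implementation may differ), these spectral descriptors can be computed for a segment with librosa and averaged over its frames:

```python
import librosa
import numpy as np

def spectral_descriptors(y, sr):
    """Average per-frame spectral features over a 1.5-second segment."""
    features = {
        "centroid": librosa.feature.spectral_centroid(y=y, sr=sr),    # center of mass of the spectrum (Hz)
        "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr),  # spread of energy around the centroid (Hz)
        "flatness": librosa.feature.spectral_flatness(y=y),           # near 1.0 = noise-like, near 0 = tonal
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85),  # frequency below which 85% of the energy lies
    }
    return {name: float(np.mean(values)) for name, values in features.items()}
```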
MFCCs, or Mel-frequency cepstral coefficients, are widely used features in voice processing that summarize the spectral shape of a sound on a perceptually motivated frequency scale.
Chroma features, or pitch class profiles, describe how a signal's energy is distributed across the twelve semitones (pitch classes) of an octave and are commonly used to analyze a piece of music's key.
In music theory, the Tonnetz (German for "tone network") is a visual representation of relationships between chords in Neo-Riemannian theory, named after German musicologist Hugo Riemann (1849-1919), one of the founders of modern musicology.
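A hedged sketch of how these three feature families might be extracted with librosa (the coefficient count and the frame-wise averaging here are illustrative assumptions, not the study's settings):

```python
import librosa
import numpy as np

def timbre_and_pitch_class_features(y, sr):
    """Summarize MFCC, chroma, and Tonnetz features by their frame-wise means."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 Mel-frequency cepstral coefficients per frame
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # energy in each of the 12 pitch classes
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)        # 6-dimensional tonal centroid (Tonnetz) representation
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1), tonnetz.mean(axis=1)])
```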
A common acoustic feature for audio analysis is the zero crossing rate (ZCR). For a frame of an audio signal, the zero crossing rate measures how many times the signal amplitude changes sign, crossing the zero axis.
In audio production, root mean square (RMS) measures the average loudness or power of a sound waveform over time.
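Both of these time-domain measures are simple enough to write directly; here is a minimal NumPy sketch for illustration, not the study's code:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ, i.e., how often the waveform crosses zero."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def rms_level(frame):
    """Root mean square: square each sample, take the mean, then the square root."""
    return float(np.sqrt(np.mean(np.square(frame))))
```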
HPSS, harmonic-percussive source separation, is a method of breaking down an audio signal into harmonic and percussive components.
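A minimal sketch of HPSS using librosa's implementation, with the energy summaries chosen here purely for illustration:

```python
import librosa
import numpy as np

def hpss_energies(y):
    """Split a segment into harmonic and percussive components and summarize each component's energy."""
    harmonic, percussive = librosa.effects.hpss(y)   # harmonic-percussive source separation
    return {
        "harmonic_energy": float(np.sum(harmonic ** 2)),
        "percussive_energy": float(np.sum(percussive ** 2)),
    }
```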
The scientists implemented three different AI deep learning models for classifying emotions from short audio clips using a combination of Python, TensorFlow, and Bayesian optimization, and then benchmarked the results against human performance. The models evaluated were a deep neural network (DNN), a convolutional neural network (CNN), and a hybrid model that combines a DNN processing the extracted features with a CNN analyzing spectrograms. The goal was to see which model performed best.
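As a rough illustration of the three kinds of models described (not the authors' published code), minimal Keras sketches might look like the following; the layer sizes, input shapes, and six-class softmax output are assumptions for the sake of the example:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 180          # assumed length of the engineered acoustic feature vector
SPEC_SHAPE = (128, 64, 1)   # assumed spectrogram shape (frequency bins x frames x channels)
NUM_CLASSES = 6             # joy, fear, neutral, anger, sadness, disgust

# 1) DNN over the engineered acoustic features
dnn = tf.keras.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# 2) CNN over spectrogram "images"
cnn = tf.keras.Sequential([
    layers.Input(shape=SPEC_SHAPE),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# 3) Hybrid: a DNN branch for features and a CNN branch for spectrograms, merged before the output layer
feat_in = tf.keras.Input(shape=(NUM_FEATURES,))
spec_in = tf.keras.Input(shape=SPEC_SHAPE)
x1 = layers.Dense(128, activation="relu")(feat_in)
x2 = layers.Conv2D(32, 3, activation="relu")(spec_in)
x2 = layers.MaxPooling2D()(x2)
x2 = layers.Flatten()(x2)
merged = layers.concatenate([x1, x2])
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)
hybrid = tf.keras.Model(inputs=[feat_in, spec_in], outputs=out)

hybrid.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```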
The researchers discovered that, across the board, the AI models classified emotions with accuracy that exceeded chance and was on par with human performance. Among the three models, the deep neural network and the hybrid model outperformed the convolutional neural network.
The combination of artificial intelligence and data science applied to psychology and psychoacoustic features illustrates the potential for machines to perform voice-based cognitive empathy tasks at a level comparable to human performance.
“This interdisciplinary research, bridging psychology and computer science, highlights the potential for advancements in automatic emotion recognition and the broad range of applications,” concluded the researchers.
Cami Rosso writes about science, technology, innovation, and leadership.
Psychology Today