Audio Features Investigation for Singing Voice Deepfake Detection

Gohari, Mahyar; Salvi, Davide; Bestagini, Paolo; Adami, Nicola

doi:10.1109/icassp49660.2025.10888452

The audio forensics field has recently faced a new challenge: singing voice deepfake detection. Current approaches to this problem borrow methods originally developed for the more established task of speech deepfake detection, often simply retraining these systems on singing voice data. However, techniques that are effective for speech deepfake detection may not perform well on singing voice, and there has been limited research on the factors that improve detection specifically in the singing domain. This paper investigates the effectiveness of various audio representations and features for discriminating between real and synthetically generated singing voice signals. We evaluate two Convolutional Neural Network (CNN)-based detection systems using a wide range of audio representations, including handcrafted, learning-based, and pre-trained features. Through a systematic analysis, we aim to understand the key factors that can improve the performance of deepfake detection methods for singing voices. Additionally, we investigate the differences between singing voice and speech deepfake detection, highlighting the implications of the feature sets considered. Our results offer insights and guidance for developing more effective singing voice deepfake detection systems.