Audio Features Investigation for Singing Voice Deepfake Detection
Gohari, Mahyar (Member of the Collaboration Group); Adami, Nicola (Member of the Collaboration Group)
2025-01-01
Abstract
The audio forensics field has recently faced a new challenge: singing voice deepfake detection. Current approaches borrow methods initially developed for the more established task of speech deepfake detection, often simply retraining these systems on singing voice data. However, techniques that are effective for speech may not perform well on singing voice, and there has been limited research on identifying the factors that can improve detection specifically in the singing domain. This paper investigates the effectiveness of various audio representations and features for discriminating real and synthetically generated singing voice signals. We evaluate two Convolutional Neural Network (CNN)-based detection systems using a wide range of audio representations, including handcrafted, learning-based, and pre-trained features. Through a systematic analysis, we aim to understand the key factors that can improve the performance of deepfake detection methods for singing voices. Additionally, we investigate the differences between singing voice and speech detection, highlighting the implications of the feature sets considered. Our results offer guidance for developing more effective singing voice deepfake detection systems.

| File | Access | License | Size | Format |
|---|---|---|---|---|
| Audio_Features_Investigation_for_Singing_Voice_Deepfake_Detection.pdf | Archive managers | Not public (private/restricted access) | 446.89 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


