Audio Features Investigation for Singing Voice Deepfake Detection

Gohari, Mahyar
Member of the Collaboration Group
;
Adami, Nicola
Member of the Collaboration Group
2025-01-01

Abstract

The audio forensics field has recently faced a new challenge: singing voice deepfake detection. Current approaches borrow methods originally developed for the more established task of speech deepfake detection, often simply retraining those systems on singing voice data. However, techniques that are effective on speech do not necessarily perform well on singing voice, and little research has examined which factors improve detection specifically in the singing domain. This paper investigates the effectiveness of various audio representations and features for discriminating between real and synthetically generated singing voice signals. We evaluate two Convolutional Neural Network (CNN)-based detection systems using a wide range of audio representations, including handcrafted, learning-based, and pre-trained features. Through a systematic analysis, we aim to identify the key factors that improve the performance of deepfake detection methods for singing voices. We also investigate the differences between singing voice and speech detection, highlighting the implications of the feature sets considered. Our results offer guidance for developing more effective singing voice deepfake detection systems.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11379/633787