Method and system for semantically segmenting an audio sequence

Liqun, Xu; Benini, Sergio

An audio segmentation method and system are described which automatically segment an audio sequence into audio scenes of similar semantic content. The method and system initially split the audio sequence into segments of arbitrary length (step 101). Next, each segment is subjected to short-term spectral analysis (step 102) to generate feature vectors characterising the audio. A vector quantisation (VQ) technique is used to generate a signature codebook from the feature vectors of each audio segment (step 103). An Earth Mover's Distance (EMD) measure is then used to calculate distances between the signatures of consecutive audio segments (step 104). By statistically analysing the resulting EMD measures to identify peaks therein, changes in the dominant audio content, which are indicative of audio scene changes, can be detected (step 105). In this way, it is possible to automate the time-consuming and laborious process of organising and indexing increasingly large audio databases such that they can be easily browsed and searched using natural query structures.
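As an illustration only, steps 104 and 105 can be sketched in Python under strong simplifying assumptions that are not taken from the description above: each segment signature is reduced to a histogram over the same unit-spaced one-dimensional bins (so the EMD has a closed form as the accumulated difference of the cumulative distributions), and a "peak" is taken to be a local maximum exceeding the mean distance plus k standard deviations. The function names `emd_1d`, `consecutive_distances` and `scene_boundaries`, and the parameter `k`, are hypothetical. A full implementation would use multi-dimensional VQ codebooks and a general EMD solver.

```python
from statistics import mean, pstdev


def emd_1d(p, q):
    """EMD between two histograms over the same unit-spaced 1-D bins.

    Assumption: a 1-D stand-in for the multi-dimensional signatures.
    """
    total_p, total_q = sum(p), sum(q)
    carried = 0.0  # surplus mass that must be moved past the current bin
    work = 0.0
    for a, b in zip(p, q):
        carried += a / total_p - b / total_q
        work += abs(carried)
    return work


def consecutive_distances(signatures):
    """Step 104 (simplified): distance between each pair of neighbours."""
    return [emd_1d(a, b) for a, b in zip(signatures, signatures[1:])]


def scene_boundaries(distances, k=1.0):
    """Step 105 (assumed rule): local maxima above mean + k * std.

    Returns the indices of the segments that start a new scene.
    """
    threshold = mean(distances) + k * pstdev(distances)
    boundaries = []
    for i, d in enumerate(distances):
        left = distances[i - 1] if i > 0 else float("-inf")
        right = distances[i + 1] if i + 1 < len(distances) else float("-inf")
        if d > threshold and d >= left and d >= right:
            boundaries.append(i + 1)  # distance i separates segments i and i+1
    return boundaries


# Toy signatures: three segments dominated by low bins, then three by high bins.
signatures = [
    [4, 1, 0, 0], [3, 2, 0, 0], [4, 1, 0, 0],
    [0, 0, 1, 4], [0, 0, 2, 3], [0, 0, 1, 4],
]
boundaries = scene_boundaries(consecutive_distances(signatures))
print(boundaries)
```

In this toy input the only large distance lies between the third and fourth segments, so the fourth segment (index 3) is reported as the start of a new scene. The threshold factor `k` is a tuning choice of this sketch, not a value given in the description.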