
Indexing Audio-Visual databases through a joint audio and video processing.

LEONARDI, Riccardo
1998-01-01

Abstract

This work deals with the representation of audiovisual information, organizing its content for future tasks such as retrieval and information browsing. Evidence is provided that a cross-modal analysis of simple visual and audio information is sufficient to organize an audiovisual sequence into semantically meaningful segments. Each segment defines a scene that is coherent from some semantic point of view. Depending on the sophistication of the cross-modal analysis, the scene may represent either a generic story unit or a more complex situation such as a dialogue or an action. The results shown in this work indicate that audio classification is key in establishing relationships among consecutive shots, allowing us to reach a scene-level description. A higher abstraction level can be reached when a correlation exists among nonconsecutive shots, defining what are called “video idioms.” Accordingly, a generic audio model is proposed: a linear combination of four classes of audio signals. For semantic purposes, it is meaningful to select the classes so that they can serve any subsequent scene characterization. When several audio sources are combined simultaneously, it is assumed that only one is linked to the semantics of the scene, and that it corresponds to the dominant class of audio (in energy terms). The different classes that identify each type of audio are selected to facilitate any decision related to a semantic characterization of the audiovisual information. The problem therefore reduces to a source-separation task. The proposed scheme classifies the audio signal into the following four component types: speech, music, silence, and miscellaneous other sounds. Its performance is quite satisfactory (∼90% classification accuracy) and was tested extensively on various types of source material. Considering a generic audiovisual sequence, video shots are merged according to this audio classification.
Depending on the type of source material (broadcast news, commercials, documentaries, and movies), different types of scenes can be identified, e.g., a single advertisement in the case of commercials, or a dialogue situation in a movie. The article describes experimental simulations in these different domains.
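The two-step idea in the abstract — pick the dominant audio class of a segment by energy, then merge consecutive shots that share a dominant class into a scene — can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the per-class energy estimates, and the shot labels below are all hypothetical assumptions for illustration only.

```python
# Hypothetical sketch (not the paper's method): dominant-class selection by
# energy, followed by scene formation via merging of consecutive shots that
# share the same dominant audio class.

def dominant_class(class_energies):
    """Return the audio class with the highest energy estimate.

    class_energies: dict mapping class name -> energy (assumed precomputed).
    """
    return max(class_energies, key=class_energies.get)


def merge_shots(shot_labels):
    """Group consecutive shots that share one dominant audio class.

    shot_labels: list of (shot_id, audio_class) pairs in temporal order.
    Returns a list of (audio_class, [shot_ids]) scenes.
    """
    scenes = []
    for shot_id, label in shot_labels:
        if scenes and scenes[-1][0] == label:
            scenes[-1][1].append(shot_id)  # extend the current scene
        else:
            scenes.append((label, [shot_id]))  # start a new scene
    return scenes


# Example: speech dominates a segment that also contains faint music.
energies = {"speech": 0.7, "music": 0.2, "silence": 0.0, "other": 0.1}
print(dominant_class(energies))  # speech

# Five shots collapse into three scenes.
shots = [(0, "speech"), (1, "speech"), (2, "music"), (3, "music"), (4, "speech")]
print(merge_shots(shots))
# [('speech', [0, 1]), ('music', [2, 3]), ('speech', [4])]
```

A real system would derive the per-shot labels from the classifier described in the article (speech, music, silence, miscellaneous other sounds) rather than from hand-assigned values as here.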
Files in this record:
File: SL_IJIST-1998_full-text.pdf
Type: Full Text
License: NOT PUBLIC - Private/restricted access
Size: 459.54 kB
Format: Adobe PDF
View/Open · Request a copy

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11379/8146
Warning: the displayed data have not been validated by the university.

Citations
  • Scopus: 23
  • Web of Science (ISI): 17