Detecting Categories in News Video Using Acoustic, Speech, and Image Features.
Article 2006 en
Authors
Slav Petrov
Arlo Faria
Pascal Michaillat
Abstract
This work describes systems for detecting semantic categories present in news video. The multimedia data was processed in three ways: the audio signal was converted to a sequence of acoustic features, automatic speech recognition provided a word-level transcription, and image features were computed for selected frames of the video signal. Primary acoustic, speech, and vision systems were trained to discriminate instances of the categories. Higher-level systems exploited correlations among the categories, incorporated sequential context, and combined the joint evidence from the three information sources. We present experimental results from the TREC video retrieval evaluation.

1. OVERVIEW

We participated in the "High-Level Feature-Extraction" task and focused on the design of effective acoustic, speech, and image features. Our systems were:

• ucb 1best: For each category, choose the vision or speech system that performed best on a held-out set.
• ucb vision: SVM trained on image features only.
• ucb fusion: SVM trained on a weighted combination of image, speech, and acoustic features.
• ucb concat: SVM on top of SVMs that are trained on image, speech, and acoustic features (uses only the TRECVID-provided ASR and MT).
• ucb text: SVM trained on speech features from the SRI speech recognizer.
• ucb sound: SVM trained on the outputs of category-specific acoustic GMMs.

In our experiments, shape features extracted from images were more effective than speech or acoustic features extracted from the audio signal. Our system that used only image features (ucb vision) achieved a mean AP of 0.11 and was perhaps the best vision-only system. When using speech features, we found that performance can be greatly improved by using ASR from SRI rather than the TRECVID-provided ASR/MT.
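The per-category selection behind ucb 1best can be illustrated with a minimal sketch. It assumes that "performed best" means the highest non-interpolated average precision (AP) on the held-out set; the source does not state which criterion was used for the selection. The function names, the dictionary layout, and the toy system and category names are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated average precision of the ranking induced by scores.

    labels[i] is 1 if shot i contains the category and 0 otherwise.
    """
    # Rank shots from highest to lowest detector score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Precision at the rank of each relevant shot.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def pick_best_system(
    heldout_scores: Dict[str, Dict[str, List[float]]],
    heldout_labels: Dict[str, List[int]],
) -> Dict[str, str]:
    """For each category, return the name of the system with the highest held-out AP.

    heldout_scores maps system name -> category -> per-shot scores (assumed layout).
    heldout_labels maps category -> per-shot binary labels.
    """
    best = {}
    for category, labels in heldout_labels.items():
        ap_by_system = {
            system: average_precision(per_category[category], labels)
            for system, per_category in heldout_scores.items()
        }
        best[category] = max(ap_by_system, key=ap_by_system.get)
    return best
```

With toy held-out scores in which a "vision" system ranks the positive "sports" shots first and a "speech" system ranks the positive "weather" shot first, pick_best_system returns "vision" for "sports" and "speech" for "weather". Averaging the per-category AP values over all categories gives a mean AP of the kind quoted above for ucb vision, although the official evaluation measure may differ in its details.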