Detecting Categories in News Video Using Acoustic, Speech, and Image Features.

Slav Petrov; Arlo Faria; Pascal Michaillat; Alexander C. Berg; Andreas Stolcke; Dan Klein; Jitendra Malik

Abstract

This work describes systems for detecting semantic categories present in news video. The multimedia data was processed in three ways: the audio signal was converted to a sequence of acoustic features, automatic speech recognition provided a word-level transcription, and image features were computed for selected frames of the video signal. Primary acoustic, speech, and vision systems were trained to discriminate instances of the categories. Higher-level systems exploited correlations among the categories, incorporated sequential context, and combined the joint evidence from the three information sources. We present experimental results from the TREC video retrieval evaluation.

1. OVERVIEW

We participated in the "High-Level Feature-Extraction" task and focused on the design of effective acoustic, speech, and image features:

• ucb 1best: For each category, choose the vision or speech system that performed best on a held-out set.
• ucb vision: SVM trained on image features only.
• ucb fusion: SVM trained on a weighted combination of image, speech, and acoustic features.
• ucb concat: SVM on top of SVMs which are trained on image, speech, and acoustic features (uses only TRECVID-provided ASR and MT).
• ucb text: SVM trained on speech features from the SRI speech recognizer.
• ucb sound: SVM trained on the outputs of category-specific acoustic GMMs.

In our experiments, shape features extracted from images were more effective than speech or acoustic features extracted from the audio signal. Our system that used only image features (ucb vision) achieved a mean AP of 0.11 and was perhaps the best vision-only system. When using speech features, we found that performance can be greatly improved by using ASR from SRI rather than the TRECVID-provided ASR/MT.
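The per-category selection behind the ucb 1best run can be illustrated with a minimal sketch. It assumes standard non-interpolated average precision as the held-out selection criterion; the text does not say which measure was used for selection, and the evaluation's official measure may differ. The function names, category names, and toy scores below are hypothetical and are not taken from the paper.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision at the rank of each positive."""
    # Rank shots from highest to lowest detector score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def pick_best_system(heldout_scores, heldout_labels):
    """For each category, return the system with the highest held-out AP."""
    best = {}
    for category, labels in heldout_labels.items():
        ap_by_system = {
            system: average_precision(per_category[category], labels)
            for system, per_category in heldout_scores.items()
        }
        best[category] = max(ap_by_system, key=ap_by_system.get)
    return best


# Toy held-out data, made up purely for illustration.
heldout_labels = {"sports": [1, 0, 1, 0], "weather": [0, 1, 0, 0]}
heldout_scores = {
    "vision": {"sports": [0.9, 0.2, 0.8, 0.1], "weather": [0.6, 0.3, 0.5, 0.1]},
    "speech": {"sports": [0.4, 0.7, 0.3, 0.6], "weather": [0.1, 0.9, 0.2, 0.3]},
}
best = pick_best_system(heldout_scores, heldout_labels)
print(best)
```

In this toy setup the vision system is selected for one category and the speech system for the other. Averaging the per-category AP values over all categories gives a mean AP figure of the kind quoted above.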
