Using Semantics for Speech Annotation of Images.

Appeared in IEEE ICDE Conference, March 2009.

Chaitanya Desai, Dmitri V. Kalashnikov, Sharad Mehrotra, and Nalini Venkatasubramanian

Computer Science Department
University of California, Irvine


Digital cameras and multimedia capture devices are increasingly popular for taking pictures. Annotating these pictures is important to support their browsing and retrieval. Fully automatic image annotation techniques typically rely entirely on visual properties of the image. State-of-the-art image annotation systems of this kind work well at detecting generic object classes: car, horse, motorcycle, airplane, etc. However, certain characteristics of the image are hard to capture using visual properties alone. These include location (Paris, California, San Francisco, etc.), event (birthday, wedding, graduation ceremony, etc.), people (John, Jane, brother, etc.), and abstract qualities referring to objects in the image (beautiful, funny, sweet, etc.), among others. The more conventional method of annotation, which relies completely on human input, has several limitations as well. First, typing tags using the keypads of such devices can be cumbersome and error-prone. Second, a delay in tagging may result in a loss of the context in which the picture was taken (e.g., the user may not remember the names of the people or structures in the image). This presents an opportunity for using speech as a modality to annotate images and other multimedia content. Most camera devices have a built-in microphone. In principle, some of the challenges associated with both fully automatic annotation and manual tagging can be alleviated if the user uses speech as the medium of annotation. Ideally, the user would take a picture and speak the desired tags into the device's microphone. A speech recognizer would transcribe the audio signal into text. The speech-to-text transcription can happen either on the device itself or on a remote machine. The transcribed text can then be used as tags for the image, exactly as the user intended. One of the biggest bottlenecks facing such systems is the accuracy of the underlying speech recognizer.
Even speaker-dependent recognition systems can make mistakes in noisy environments. If the recognizer's output is used as is for annotation, then poor recognition will lead to poor-quality tags. Our work addresses this issue by incorporating outside semantic knowledge to improve the interpretation of the recognizer's output, as opposed to blindly believing what the recognizer suggests. To improve the interpretation of speech output, we exploit the fact that most speech recognizers provide alternate hypotheses for each utterance. The main contribution of this paper is our approach for annotating images using speech as the input modality. The approach employs a probabilistic model for computing the joint probability of a given combination of tags using a Maximum Entropy solution. An extensive empirical evaluation demonstrates the advantage of the proposed solution, which leads to a significant improvement in the quality of speech annotation.
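
The idea of choosing among the recognizer's alternate hypotheses by scoring whole tag combinations can be illustrated with a minimal sketch. This is not the paper's model: the Maximum Entropy joint probability is replaced here by a simple product of pairwise co-occurrence scores, and the names `joint_score`, `best_annotation`, the smoothing constant, and all sample words and numbers are hypothetical, chosen only for illustration.

```python
from itertools import combinations, product
from math import prod


def joint_score(tags, cooccurrence, smoothing=0.01):
    """Stand-in for a joint probability of a tag combination.

    Assumption: multiplies pairwise co-occurrence scores and falls back to
    a small smoothing value for unseen pairs. The paper uses a Maximum
    Entropy model instead; this is only a placeholder.
    """
    return prod(
        cooccurrence.get(frozenset(pair), smoothing)
        for pair in combinations(tags, 2)
    )


def best_annotation(n_best_lists, cooccurrence):
    """Pick one hypothesis per utterance so the combined score is highest.

    n_best_lists holds, for each utterance, a list of
    (word, recognizer_confidence) alternatives.
    """
    best_tags, best_score = None, -1.0
    # Enumerate every way of choosing one alternative per utterance.
    for candidate in product(*n_best_lists):
        tags = [word for word, _ in candidate]
        acoustic = prod(conf for _, conf in candidate)
        score = acoustic * joint_score(tags, cooccurrence)
        if score > best_score:
            best_tags, best_score = tags, score
    return best_tags


# Hypothetical recognizer output: the top hypothesis for the first
# utterance is wrong, but the correct word appears as an alternative.
n_best_lists = [
    [("pairs", 0.5), ("paris", 0.4)],
    [("eiffel", 0.6), ("awful", 0.4)],
]
# Hypothetical semantic knowledge about which tags occur together.
cooccurrence = {frozenset({"paris", "eiffel"}): 0.9}

result = best_annotation(n_best_lists, cooccurrence)
print(result)
```

In this toy setup the semantic score outweighs the recognizer's slight preference for the misrecognized first word. Exhaustive enumeration grows exponentially with the number of utterances, so it is only practical for the handful of tags a user would typically speak for one image.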

Categories and Subject Descriptors:

H.2.m [Database Management]: Miscellaneous - Semantic Image Tagging;
H.2.8 [Database Management]: Database Applications - Data mining;
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval


Keywords:

Speech Annotation, Image Tagging, Disambiguation

Downloadable files:

Paper: ICDE09_dvk_Speech.pdf
Presentation: ICDE09_dvk_Speech.ppt

BibTeX entry:

   author    = {Chaitanya Desai and Dmitri V.\ Kalashnikov and Sharad Mehrotra and Nalini Venkatasubramanian},
   title     = {Using Semantics for Speech Annotation of Images},
   booktitle = {Proc.\ of the 25th IEEE Int'l Conference on Data Engineering (IEEE ICDE 2009)},
   note      = {short publication},
   year      = {2009},   
   month     = {March 29 - April 4},
   address   = {Shanghai, China}   
