Audio mining
Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term audio mining is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, mostly describes the pre-processing step of audio mining, in which the audio file is broken down into a searchable index of words.

== History ==
Academic research on audio mining began in the late 1970s at schools such as Carnegie Mellon University, Columbia University, the Georgia Institute of Technology, and the University of Texas. Audio data indexing and retrieval began to receive attention in the early 1990s, when multimedia content started to develop and the volume of audio content increased significantly. Before audio mining became the mainstream method, written transcripts of audio content were created and analyzed manually.

== Process ==
Audio mining is typically split into four components: audio indexing, speech processing and recognition systems, feature extraction, and audio classification. The audio is typically processed by a speech recognition system to identify word or phoneme units that are likely to occur in the spoken content. This information may either be used immediately in pre-defined searches for keywords or phrases (a real-time "word spotting" system), or the output of the speech recognizer may be stored in an index file. One or more audio mining index files can then be loaded at a later date to run searches for keywords or phrases. The results of a search are normally expressed as hits, which are regions within files that are good matches for the chosen keywords.
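The index-then-search flow can be illustrated with a minimal sketch. It assumes a hypothetical index format in which each file maps to a list of recognized words with start time, end time and recognizer confidence; the names `IndexEntry`, `Hit`, `search_phrase` and the sample data are illustrative only and do not come from any particular audio mining product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One recognized word in a (hypothetical) index file."""
    word: str
    start: float       # seconds from the start of the audio file
    end: float
    confidence: float  # recognizer confidence in [0, 1]


@dataclass(frozen=True)
class Hit:
    """A region of a file that matches the search phrase."""
    file: str
    start: float
    end: float
    score: float


def search_phrase(indexes, phrase, min_score=0.5):
    """Return hits for `phrase` across all loaded index files, best first."""
    terms = phrase.lower().split()
    hits = []
    if not terms:
        return hits
    for file, entries in indexes.items():
        # Slide a window of len(terms) words over the indexed transcript.
        for i in range(len(entries) - len(terms) + 1):
            window = entries[i:i + len(terms)]
            if all(e.word == t for e, t in zip(window, terms)):
                # Assumption: a hit is only as reliable as its weakest word.
                score = min(e.confidence for e in window)
                if score >= min_score:
                    hits.append(Hit(file, window[0].start, window[-1].end, score))
    return sorted(hits, key=lambda h: h.score, reverse=True)


# Made-up sample data standing in for two stored index files.
indexes = {
    "call_01.wav": [
        IndexEntry("please", 0.0, 0.4, 0.9),
        IndexEntry("cancel", 0.4, 0.9, 0.8),
        IndexEntry("my", 0.9, 1.0, 0.95),
        IndexEntry("account", 1.0, 1.6, 0.7),
    ],
    "call_02.wav": [
        IndexEntry("cancel", 2.0, 2.5, 0.4),
        IndexEntry("account", 2.5, 3.1, 0.9),
    ],
}

for hit in search_phrase(indexes, "cancel my account"):
    print(hit)
```

Each returned `Hit` carries the file name and the time region, which is what a playback tool would need in order to let a user audition the match.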
The user may then listen to the audio corresponding to these hits to verify whether a correct match was found.

=== Audio Indexing ===
With audio, the main problem is one of information retrieval: the documents that contain the search key need to be located. Unlike humans, a computer cannot readily distinguish between different characteristics and types of audio, such as speed, mood, noise, music or human speech, so an effective searching method is needed. Audio indexing therefore enables efficient search by analyzing an entire file using speech recognition. An index of the content is then produced, containing words and their locations; this is done through content-based audio retrieval, which focuses on extracted audio features. Indexing is mainly done through two methods: Large Vocabulary Continuous Speech Recognition (LVCSR) and phonetic-based indexing.

==== Large Vocabulary Continuous Speech Recognizers (LVCSR) ====
In text-based indexing, or large vocabulary continuous speech recognition (LVCSR), the audio file is first broken down into recognizable phonemes. It is then run through a dictionary that can contain several hundred thousand entries and matched with words and phrases to produce a full text transcript. A user can then simply search for a desired term, and the relevant portion of the audio content is returned. If a word cannot be found in the dictionary, the system chooses the most similar entry it can find. The system uses a language understanding model to assign a confidence level to its matches. If the confidence level is below 100 percent, the system provides all the matches it found as options.

===== Advantages and disadvantages =====
The main draw of LVCSR is its high accuracy and high searching speed. In LVCSR, statistical methods are used to predict the likelihood of different word sequences, so the accuracy is much higher than that of the single-word lookup of a phonetic search.
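As a rough illustration of how statistical methods can score word sequences, the sketch below trains a toy bigram model with add-one smoothing and uses it to choose between two similar-sounding candidate transcriptions. The corpus, the candidates and the function names `train_bigram` and `log_prob` are assumptions made for this example; production LVCSR systems use far larger models and combine them with acoustic scores.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left-hand contexts in a toy corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = sentence.lower().split()
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(words)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    # Vocabulary size counts "</s>" and one slot for unseen words.
    return contexts, bigrams, len(vocab) + 2


def log_prob(sentence, model):
    """Log-probability of a word sequence under the add-one-smoothed model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Made-up training text and two acoustically similar hypotheses.
corpus = [
    "it is hard to recognize speech",
    "systems recognize speech automatically",
    "we walked on a nice beach",
]
candidates = ["recognize speech", "wreck a nice beach"]

model = train_bigram(corpus)
best = max(candidates, key=lambda c: log_prob(c, model))
print(best)
```

The candidate whose word pairs were seen more often in training receives the higher score, which is the sense in which sequence statistics help a recognizer pick between similar-sounding alternatives.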
If the word can be found, the probability that it was actually spoken is very high. Although the initial processing of the audio takes a fair amount of time, searching is quick because only simple text-to-text matching is needed.

On the other hand, LVCSR is susceptible to the common issues of speech recognition. The inherently random nature of audio and the problem of external noise both affect the accuracy of text-based indexing.

Another problem with LVCSR is its overreliance on its dictionary database. LVCSR only recognizes words that are in its dictionary, and these dictionaries are unable to keep up with the constant evolution of new terminology, names and words. If the dictionary does not contain a word, the system has no way to identify or predict it. This reduces the accuracy and reliability of the system, and is known as the out-of-vocabulary (OOV) problem. Audio mining systems try to cope with OOV by continuously updating the dictionary and language model used, but the problem remains significant and has prompted a search for alternatives.

Additionally, the need to constantly update and maintain task-based knowledge and large training databases to cope with the OOV problem incurs high computational costs. This makes LVCSR an expensive approach to audio mining.

==== Phonetic-based Indexing ====
Phonetic-based indexing also breaks the audio file into recognizable phonemes, but instead of converting them to a text index, it keeps them as they are and analyzes them to create a phonetic-based index. The process can be split into two phases. The first phase is indexing. It begins by converting the input media into a standard audio representation format (PCM). Then, an acoustic model is applied to the speech.
This acoustic model represents the characteristics of both an acoustic channel (the environment in which the speech was uttered and the transducer through which it was recorded) and a natural language (the language in which the input speech was expressed). This produces a corresponding phonetic search track, or phonetic audio track (PAT), a highly compressed representation of the phonetic content of the input media. The second phase is searching. The user's search query term is parsed into a possible phoneme string using a phonetic dictionary. Multiple PAT files can then be scanned at high speed during a single search for likely phonetic sequences that closely match the corresponding strings of phonemes in the query term.

===== Advantages and disadvantages =====
Phonetic indexing is attractive chiefly because it is largely unaffected by linguistic issues such as unrecognized words and spelling errors. Phonetic preprocessing maintains an open vocabulary that does not require updating. This makes it particularly useful for searching for specialized terminology or foreign-language words that do not commonly appear in dictionaries. It is also more effective for searching audio files with disruptive background noise or unclear utterances, as it can compile results based on the sounds it can discern, and the user can, if they wish, search through the options until they find the desired item.

Furthermore, in contrast to LVCSR, it can process audio files very quickly, as there are very few unique phonemes across languages. However, phonemes cannot be indexed as effectively as entire words, so searching on a phonetic-based system is slow.

A further issue with phonetic indexing is its low accuracy. Phoneme-based searches result in more false matches than text-based indexing. This is especially prevalent for short search terms, which are more likely to sound similar to other words or to be part of longer words. Such searches can also return irrelevant results from other languages.
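The search phase, and the false-match problem for short terms, can be sketched as follows. This is a simplified illustration, not the PAT format or any vendor's algorithm: the phoneme track is a plain list of ARPAbet-like symbols, `PHONETIC_DICT` is a tiny hypothetical pronunciation dictionary, and matching tolerates only substitution errors by counting mismatched phonemes in a sliding window.

```python
# Hypothetical pronunciation dictionary (ARPAbet-like symbols).
PHONETIC_DICT = {
    "cat": ["K", "AE", "T"],
    "catalog": ["K", "AE", "T", "AH", "L", "AO", "G"],
}

# Made-up phoneme track for the utterance "the catalog was sent to the cat".
TRACK = [
    "DH", "AH",                          # the
    "K", "AE", "T", "AH", "L", "AO", "G",  # catalog
    "W", "AH", "Z",                      # was
    "S", "EH", "N", "T",                 # sent
    "T", "UW",                           # to
    "DH", "AH",                          # the
    "K", "AE", "T",                      # cat
]


def phonetic_search(track, term, max_errors=1):
    """Return (position, mismatches) for windows that resemble the query."""
    query = PHONETIC_DICT[term]  # parse the query term into a phoneme string
    hits = []
    for start in range(len(track) - len(query) + 1):
        window = track[start:start + len(query)]
        mismatches = sum(q != p for q, p in zip(query, window))
        if mismatches <= max_errors:
            hits.append((start, mismatches))
    return hits


print(phonetic_search(TRACK, "cat", max_errors=0))
print(phonetic_search(TRACK, "catalog", max_errors=0))
```

In this sample the short query "cat" matches at two positions, one of which lies inside "catalog", while the longer query matches only once.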
Unless the system recognizes the entire word exactly, or understands the phonetic sequences of languages, it is difficult for phonetic-based indexing to return accurate findings.

=== Speech processing and recognition system ===
Deemed the most critical and complex component of audio mining, speech recognition requires knowledge of the human speech production system and its modeling. To correspond to the human speech production system, the electrical speech production system is developed to consist of:
* Speech generation
* Speech perception
* Voiced and unvoiced speech
* Model of human speech

The electrical speech production system converts the acoustic signal into a corresponding representation of what was spoken through the acoustic models in its software, in which all phonemes are represented. A statistical language model aids in the process by identifying how likely words are to follow each other in a given language. Together with a complex probability analysis, the speech recognition system is capable of taking an unknown speech signal and transcribing it into words based on the program's dictionary.

An ASR (automatic speech recognition) system includes:
* Acoustic analysis: the input sound waveform is transformed into a feature
* Acoustic model: establishes the relationship between the speech signal and phonemes, pronunciation model and lang