Metoda automatske detekcije naglašenih riječi u zvučnom zapisu

Stojanović, Aleksandar

doctoral thesis

2019. urn:nbn:hr:131:248144

University of Zagreb
Faculty of Humanities and Social Sciences
Department of Information and Communication Sciences

Linked objects

Metoda automatske detekcije naglašenih riječi u zvučnom zapisu : program | Dataset

Cite this document

APA 6th Edition

Stojanović, A. (2019). Metoda automatske detekcije naglašenih riječi u zvučnom zapisu (Doctoral thesis). Zagreb: University of Zagreb, Faculty of Humanities and Social Sciences. Retrieved from https://urn.nsk.hr/urn:nbn:hr:131:248144

MLA 8th Edition

Stojanović, Aleksandar. "Metoda automatske detekcije naglašenih riječi u zvučnom zapisu." Doctoral thesis, University of Zagreb, Faculty of Humanities and Social Sciences, 2019. https://urn.nsk.hr/urn:nbn:hr:131:248144

Harvard

Stojanović, A. (2019). 'Metoda automatske detekcije naglašenih riječi u zvučnom zapisu', Doctoral thesis, University of Zagreb, Faculty of Humanities and Social Sciences, accessed 25 April 2024, https://urn.nsk.hr/urn:nbn:hr:131:248144

Vancouver

Stojanović A. Metoda automatske detekcije naglašenih riječi u zvučnom zapisu [Doctoral thesis]. Zagreb: University of Zagreb, Faculty of Humanities and Social Sciences; 2019 [cited 2024 April 25] Available at: https://urn.nsk.hr/urn:nbn:hr:131:248144

IEEE

A. Stojanović, "Metoda automatske detekcije naglašenih riječi u zvučnom zapisu", Doctoral thesis, University of Zagreb, Faculty of Humanities and Social Sciences, Zagreb, 2019. Available at: https://urn.nsk.hr/urn:nbn:hr:131:248144

Cite this item: https://urn.nsk.hr/urn:nbn:hr:131:248144

Metadata

Title	Metoda automatske detekcije naglašenih riječi u zvučnom zapisu
Title (english)	A method for automatic detection of emphasized words in recorded speech
Author	Aleksandar Stojanović
Mentor	Nikolaj Lazić (mentor)
Committee member	Nikolaj Lazić (committee chair)
Granter	University of Zagreb, Faculty of Humanities and Social Sciences (Department of Information and Communication Sciences), Zagreb
Defense date and country	2019-06-27, Croatia
Scientific / art field, discipline and subdiscipline	SOCIAL SCIENCES / Information and Communication Sciences / Information and Software Engineering
Scientific / art field, discipline and subdiscipline	HUMANITIES / Philology / Phonetics
Universal decimal classification (UDC)	80 - Philology
Abstract	Prosody is an important aspect of speech because it makes what is spoken more informative. One segment of prosody is word emphasis, which highlights the importance of a word in the context of what was said and can thereby also affect the semantics of that content. In text, however, this aspect is lost, and with it the additional informativeness of the written content. The aim of this work was to investigate the possibilities of automatically restoring information about emphasized words to text stored as subtitles or a transcript. The intention was to achieve this without using a fully automatic speech recognition system. Word emphasis is analyzed along three dimensions: increased intensity, raised pitch, and prolonged (slowed) pronunciation. Restoring this information to the text enriches its informativeness, while at the same time, from a technical standpoint, such text requires far less storage capacity than sound, so in this form it can be suitable wherever large amounts of data need to be stored, such as archiving or computer-based search. Text enriched in this way can also be useful for people with hearing impairments or deaf-mute persons, because it would make the content easier for them to understand by bringing them closer to the original form of what was said and how it was said.
Abstract (english)	Prosody is an important aspect of speech because it complements the meaning of spoken communication. One component of prosody is word emphasis, which marks the importance of a word in the context of what was said and can thereby affect the semantics of the spoken content. In written text this aspect is lost, and with it the additional information it carries. The goal of this work was to develop a method for restoring this prosodic component of speech to text that is available as subtitles or a transcript, and to do so without developing a full-scale speech recognition system. Word emphasis is examined along three dimensions: increased intensity, raised pitch, and extended duration of particular words. Restoring these aspects to the text enriches its informational content. At the same time, from a technical perspective, such text requires much less storage space than sound, so the format can be useful in applications that store large amounts of data, such as archiving or information retrieval. Enriched text of this kind can also be useful for people with hearing disabilities, because it helps them understand how something was uttered.

This dissertation is organized into several parts. Part 1 is the introduction.

Part 2 describes basic speech acoustics: the physical properties of sound, frequency, tone, intensity, F0, and formants, as well as the acoustic properties of some phonemes, with graphical representations of their spectra and other acoustic properties. It also describes some acoustic properties of emphasized words that set them apart from other, non-emphasized words.

Part 3 describes several sound analysis techniques (spectrum, spectrogram, oscillogram, spectral analysis, and LTAS, the long-term average spectrum) together with some methods of preparing sound for analysis, such as speech annotation. It also describes some capabilities of the program Praat, which was used in this research, together with some Python libraries.

Part 4 covers the basics of machine learning and neural networks, which are used in this research for phoneme classification. It gives a basic introduction to machine learning and neural networks and describes their advantages over some other computational models in relation to sound analysis. It then describes one way of using such a neural network in this work.

Part 5 gives a detailed description of a method of speech analysis whose goal is to detect emphasized words. Data preparation consists of the following steps:

Speech annotation, in which the sound segment of each phoneme is isolated by hand. This is necessary for neural network training. It is a tedious process, because a recording of just a few minutes contains hundreds of phonemes that need to be carefully annotated. Determining where a phoneme begins and ends is also not always simple, because a phoneme can be uttered only partially, and because phonemes follow one another in ways that can make their boundaries difficult to determine.

Creation of a Praat script that iterates over the segmented speech and performs spectral analysis of each phoneme. The result consists of the LTAS values for each phoneme together with the letter that categorizes the phoneme. These values are later used to train the neural network on the speech of several randomly chosen speakers.

After these data preparation steps, the next step is training the neural network. This process consists of several steps:

1. Elimination of variations in intensity. Only the spectral shape is needed for neural network training, and variations in intensity would create additional variation for the network to learn. To speed up training, variations in intensity need to be eliminated as far as possible. One way to do this, used in this research, is to increase or decrease all LTAS values by the amount necessary so that the largest value does not exceed a given intensity, while keeping all other values in the same ratio to each other as before.
2. Scaling. Since the LTAS values do not lie in the interval 0..1, they need to be scaled, because the neural network works only with values in this interval.
3. The values are then organized into a data structure that contains the LTAS values and the category of the phoneme they represent. After that, neural network training is performed. The goal of this training is to later use the neural network to classify phonemes from new recordings that were not used for training.
4. The result of the previous step is a neural network trained for phoneme classification.

The next phase is the detection of emphasized words. Before that, however, we extracted the transcripts from the media file to obtain the times at which they appear in the recording. This is important for later determining which words the classified phonemes belong to. For example, if the phonemes “d..ava” have been recognized in a speech segment and the text of the transcript for that segment contains the letters “država” (Croatian for “state”), then it is likely that these phonemes belong to this word. Analysis of pitch and intensity then determines whether the word was emphasized.

After neural network training, the detection of emphasized words consists of the following steps:

1. Classification of phonemes from a speaker not used for neural network training. Two passes are used. In the first, the recording is partitioned into segments of 10 ms and the LTAS is calculated for each segment. In the second, the recording is marked at the positions of glottal pulses (as calculated by Praat), and for each of those positions the LTAS is calculated over a segment extending 5 ms before and after it. The second pass helps avoid skipping over some important sounds.
2. The result of the previous step is a sequence of phonemes produced by classifying those 10 ms sound segments. Each element of the sequence contains the letter (phoneme) and the time at which it appears in the recording. Some of the phonemes will be classified correctly, but some will not. For example, instead of classifying a phoneme as m, the neural network might incorrectly classify it as v or some other phoneme. To determine which words were emphasized, it is necessary to determine word boundaries. The more phonemes are classified correctly, the easier it is to find the word to which they belong. However, since many phonemes will be classified incorrectly, the text of the transcript needs to be matched against the phonemes produced by the neural network. This is done with an alignment algorithm that tries to align the sequence of phonemes with the letters of the transcript text.
3. The result of the alignment provides approximate information about where each word of the transcript begins and ends in the recording. The sound segment of each word is then analysed in terms of F0, intensity, and duration, which determines whether the word was emphasized.

Most of the steps described above involve creating Praat and Python scripts that automate these processes, including modules for testing and for analysis of the results. Figures 1 and 2 show this process.

Part 6 presents results obtained from recordings of new speakers (those whose speech was not used for neural network training). These recordings include several speakers, thereby showing how the method functions in environments different from those used for testing (with regard to speech tempo, pitch, voice, speech patterns, etc.). It also shows the precision of the speech-to-text alignment.

Part 7 contains the conclusion. It discusses the advantages and disadvantages of this method compared with some others, describes some alternatives, and suggests possible improvements. It also underlines the importance of a larger corpus of annotated Croatian speech as a precondition for much useful future research in this area. Since automatic recognition of phonemes in Croatian is important for many research activities in this area (such as emotion detection, speaker identification, and prosody analysis), such a corpus would be an essential tool for this research.
Keywords
Keywords (english)
Language	Croatian
URN:NBN	urn:nbn:hr:131:248144
Study programme	Title: Postgraduate (Doctoral) Program in Information Science; Study programme type: university; Study level: postgraduate; Academic / professional title: Doctor of Science, area of social sciences, field of information and communication sciences
Type of resource	Text
Extent	229 pp.
File origin	Born digital
Access conditions	Open access
Terms of use
Public note	The dataset accompanying this thesis is available here: https://urn.nsk.hr/urn:nbn:hr:131:854249
Created on	2020-02-12 12:32:17
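
Illustrative sketch (not part of the repository record): the English abstract describes removing intensity variation from the LTAS values, scaling them to the 0..1 interval, and pairing them with a phoneme label for training. The Python sketch below is one possible reading of those three steps, not the thesis code. It assumes the LTAS values are in dB, so that adding one constant offset to every band preserves the spectral shape, and it pins the largest value exactly to the target level. The function names and the values `target_peak_db=60.0` and `floor_db=0.0` are hypothetical.

```python
from typing import List, Tuple


def normalize_ltas(ltas_db: List[float], target_peak_db: float = 60.0) -> List[float]:
    """Shift every band by the same offset so the largest value equals target_peak_db.

    Assumption: values are in dB, so a constant offset keeps the spectral shape.
    """
    offset = target_peak_db - max(ltas_db)
    return [v + offset for v in ltas_db]


def scale_unit(ltas_db: List[float], floor_db: float = 0.0,
               peak_db: float = 60.0) -> List[float]:
    """Map dB values linearly onto 0..1, clipping anything outside [floor_db, peak_db]."""
    span = peak_db - floor_db
    return [min(1.0, max(0.0, (v - floor_db) / span)) for v in ltas_db]


def make_example(ltas_db: List[float], phoneme: str,
                 target_peak_db: float = 60.0,
                 floor_db: float = 0.0) -> Tuple[List[float], str]:
    """Build one training example: (scaled LTAS features, phoneme label)."""
    shifted = normalize_ltas(ltas_db, target_peak_db)
    return scale_unit(shifted, floor_db, target_peak_db), phoneme


# The same spectral shape recorded 10 dB louder yields the same features.
quiet = make_example([30.0, 50.0, 40.0], "a")
loud = make_example([40.0, 60.0, 50.0], "a")
print(quiet)
print(quiet == loud)
```

A fixed floor and peak are used for scaling, rather than per-vector min-max scaling, so that the relative depth of spectral valleys is comparable across phonemes; this is a design choice of the sketch, not something the abstract specifies.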
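
Illustrative sketch (not part of the repository record): the English abstract says that the noisy phoneme sequence is aligned with the letters of the transcript to find approximate word boundaries, but it does not name the alignment algorithm. The Python sketch below uses a standard Needleman-Wunsch global alignment as a stand-in. The scoring values, the function names, and the simplifications (one letter per phoneme, ignoring Croatian digraphs such as lj, nj, and dž; repeated labels from consecutive frames already collapsed) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def align(pred: List[str], ref: List[str], match: int = 1,
          mismatch: int = -1, gap: int = -1) -> List[Tuple[int, int]]:
    """Needleman-Wunsch global alignment; returns paired (pred_index, ref_index)."""
    n, m = len(pred), len(ref)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if pred[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = match if pred[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # predicted phoneme with no transcript letter
        else:
            j -= 1  # transcript letter with no predicted phoneme
    pairs.reverse()
    return pairs


def word_spans(phonemes: List[Tuple[str, float]],
               transcript: str) -> List[Tuple[str, Optional[Tuple[float, float]]]]:
    """Estimate (start, end) times per transcript word from aligned phoneme times."""
    words = transcript.lower().split()
    letters, owner = [], []
    for w_idx, word in enumerate(words):
        for ch in word:
            letters.append(ch)
            owner.append(w_idx)
    spans: List[Optional[Tuple[float, float]]] = [None] * len(words)
    for p_idx, l_idx in align([p for p, _ in phonemes], letters):
        t = phonemes[p_idx][1]
        cur = spans[owner[l_idx]]
        spans[owner[l_idx]] = (t, t) if cur is None else (min(cur[0], t), max(cur[1], t))
    return list(zip(words, spans))


# Classifier output with "r" and "ž" missing, as in the abstract's "d..ava" example.
predicted = [("t", 0.10), ("a", 0.15), ("d", 0.30), ("a", 0.45), ("v", 0.50), ("a", 0.55)]
print(word_spans(predicted, "ta država"))
```

Words for which no phoneme is aligned get `None` as their span, which leaves it to the caller to decide how to interpolate missing boundaries.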
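
Illustrative sketch (not part of the repository record): the English abstract states that each word's sound segment is analysed for F0, intensity, and duration, without giving the decision rule. The Python sketch below shows one hypothetical way to flag which of the three cues is raised for a word relative to the utterance average. The `WordFeatures` structure, the use of duration per letter as a tempo measure, and all three thresholds are assumptions for illustration, not values from the thesis.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class WordFeatures:
    text: str
    mean_f0_hz: float         # average pitch over the word's segment
    mean_intensity_db: float  # average intensity over the word's segment
    duration_s: float         # word duration from the alignment step


def emphasis_cues(words: List[WordFeatures],
                  f0_ratio: float = 1.15,           # hypothetical threshold
                  intensity_delta_db: float = 3.0,  # hypothetical threshold
                  rate_ratio: float = 1.3           # hypothetical threshold
                  ) -> List[Tuple[str, List[str]]]:
    """For each word, list which cues exceed the utterance-level baseline."""
    base_f0 = mean(w.mean_f0_hz for w in words)
    base_int = mean(w.mean_intensity_db for w in words)
    base_rate = mean(w.duration_s / len(w.text) for w in words)
    result = []
    for w in words:
        cues = []
        if w.mean_f0_hz > base_f0 * f0_ratio:
            cues.append("pitch")
        if w.mean_intensity_db > base_int + intensity_delta_db:
            cues.append("intensity")
        if w.duration_s / len(w.text) > base_rate * rate_ratio:
            cues.append("duration")
        result.append((w.text, cues))
    return result


utterance = [
    WordFeatures("ovo", 120.0, 60.0, 0.20),
    WordFeatures("je", 118.0, 59.0, 0.12),
    WordFeatures("država", 160.0, 67.0, 0.70),
    WordFeatures("danas", 122.0, 60.0, 0.35),
]
for text, cues in emphasis_cues(utterance):
    print(text, cues)
```

Reporting the individual cues, rather than a single yes/no decision, keeps the sketch neutral about how the three dimensions are combined.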