idw - Informationsdienst Wissenschaft
With the “Graz corpus of read and spontaneous speech”, researchers at TU Graz have developed new methods for speech recognition of Austrian German using speech data from 38 people.
Second-language speakers who come to Austria with a good knowledge of German usually find it difficult to understand the local dialects. Similarly, speech recognition systems often fail to decode regionally accented word choice and pronunciation. Barbara Schuppler from the Signal Processing and Speech Communication Laboratory at Graz University of Technology (TU Graz), together with researchers from the Know Center and the University of Graz, has investigated the complexity of conversational speech, built up a database of conversations in Austrian German and gained new knowledge about how to improve speech recognition. The results were recently published in the paper “What’s so complex about conversational speech?” in the journal Computer Speech & Language. The project was funded by the Austrian Science Fund FWF.
Free-flowing conversations in the recording studio
One of the main aims of the project was to improve the accuracy of automatic speech recognition (ASR) systems in spontaneous conversations with speakers from Austria. The team focused on the challenges posed by spontaneity, short sentences, overlapping speakers and dialectal accent in everyday conversations. To obtain suitable data, the researchers built the GRASS corpus (Graz corpus of read and spontaneous speech). It contains recordings of 38 speakers and covers both read texts and spontaneous conversations, in which two people who knew each other well spoke freely for an hour in the recording studio without being given a topic. Since the same speakers were recorded in both speaking styles under the same conditions, the research team was able to rule out speaker identity and recording quality as explanations for differences in ASR performance between the two styles.
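The comparison of read and spontaneous speech can be illustrated with a small sketch. It assumes that accuracy is measured as word error rate (WER), the usual ASR metric, which this press release does not name explicitly. The function names and the example transcripts are invented for illustration and are not taken from the GRASS corpus.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the hypothesis into the reference (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_style(utterances):
    """Pooled WER per speaking style.

    `utterances` is an iterable of (style, reference, hypothesis) tuples.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for style, reference, hypothesis in utterances:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[style] += word_edit_distance(ref, hyp)
        words[style] += len(ref)
    return {style: errors[style] / words[style]
            for style in words if words[style] > 0}


# Invented example transcripts (not real corpus data).
utterances = [
    ("read", "heute ist das wetter schön", "heute ist das wetter schön"),
    ("spontaneous", "na eh", "naja"),
    ("spontaneous", "des passt scho", "das passt schon"),
]
result = wer_by_style(utterances)
print(result)
```

Because the same speakers appear in both styles, a gap between the two WER values in such a comparison can be attributed to the speaking style rather than to who was speaking.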
Based on the database, the team compared various ASR architectures, including long-established hidden Markov models (HMMs) and the relatively new transformer-based models. The comparison showed that transformer-based models, such as the Whisper speech recognition system, work very well for longer sentences with a lot of context, but have problems with the short, fragmentary sentences that frequently occur in conversations. Traditional HMM-based systems that were explicitly trained with pronunciation variations proved to be more robust for short sentences and dialectal language. The researchers therefore want to pursue a hybrid approach that combines the strengths of both architectures. They have already combined a transformer model with a knowledge-based lexicon and a statistical language model, thereby achieving significant improvements.
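How the transformer model, the lexicon and the language model were combined is not described in this press release. One common way to bring a statistical language model into a neural recogniser is to rescore its N-best hypotheses, and the following minimal sketch illustrates that idea only. The `BigramLM` class, the `rescore` function, the weight of 0.5, the training sentences and the acoustic scores are all assumptions made for illustration. The pronunciation lexicon is not modelled here.

```python
import math
from collections import Counter


class BigramLM:
    """Tiny bigram language model with add-one smoothing (illustrative)."""

    def __init__(self, sentences):
        self.history = Counter()  # how often each word occurs as a left context
        self.bigrams = Counter()
        self.vocab = {"<s>", "</s>"}
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.vocab.update(tokens)
            self.history.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def log_prob(self, sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        size = len(self.vocab)
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.history[a] + size))
            for a, b in zip(tokens, tokens[1:])
        )


def rescore(nbest, lm, lm_weight=0.5):
    """Return the hypothesis with the best combined score.

    `nbest` is a list of (text, asr_log_score) pairs from the recogniser.
    """
    best_text, _ = max(
        nbest, key=lambda item: item[1] + lm_weight * lm.log_prob(item[0])
    )
    return best_text


# Invented in-domain text and hypothetical recogniser scores.
lm = BigramLM(["des passt scho", "passt scho", "des is guat"])
nbest = [("das pass schon", -1.0), ("des passt scho", -1.2)]

print(rescore(nbest, lm, lm_weight=0.0))  # recogniser score alone
print(rescore(nbest, lm, lm_weight=0.5))  # recogniser score plus language model
```

In this toy setting the language model, trained on dialectal text, can overturn the recogniser's first choice for a short utterance, which is the kind of case where the article reports that transformer models struggle.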
Possible use in medical diagnostics
The team also analysed how characteristics such as speech rate, intonation and word choice influence the accuracy of speech recognition. These findings can contribute to the development of ASR systems that better understand human speech in all its nuances. The team plans to continue research in these areas and incorporate the findings into the development of new, more robust speech recognition systems. However, the results of the project also have interesting potential applications beyond this, particularly in the fields of medical diagnostics and human-computer interaction. In the future, ASR systems could be used to recognise dementia or epilepsy based on speech patterns in spontaneous conversations or to make interaction with social robots more natural.
“Spontaneous speech, especially in dialogue, has completely different characteristics compared to a recited or read speech,” says Barbara Schuppler. “By analysing human-human communication in particular, we have gained important findings in our project that also help us technically and open up new areas of application. Together with partners from the PMU Salzburg, Med Uni Graz and Med Uni Vienna, we are already working on follow-up projects to create socially relevant applications based on the foundations we have created in the Austrian Science Fund project.”
Barbara SCHUPPLER
Ass.Prof. Mag.rer.nat. Dr.
TU Graz | Signal Processing and Speech Communication Laboratory
Phone: +43 316 873 4366
b.schuppler@tugraz.at
What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures https://doi.org/10.1016/j.csl.2024.101738