idw - Informationsdienst
Wissenschaft
Are humans or machines better at recognizing speech? A new study shows that in noisy conditions, current automatic speech recognition (ASR) systems achieve remarkable accuracy and sometimes even surpass human performance. However, the systems need to be trained on an incredible amount of data, while humans acquire comparable skills in less time.
Automatic speech recognition (ASR) has made incredible advances in the past few years, especially for widely spoken languages such as English. Prior to 2020, it was typically assumed that human abilities for speech recognition far exceeded automatic systems, yet some current systems have started to match human performance. The goal in developing ASR systems has always been to lower the error rate, regardless of how people perform in the same environment. After all, not even people will recognize speech with 100% accuracy in a noisy environment.
In a new study, UZH computational linguistics specialist Eleanor Chodroff and a fellow researcher from Cambridge University, Chloe Patman, compared two popular ASR systems – Meta’s wav2vec 2.0 and Open AI’s Whisper – against native British English listeners. They tested how well the systems recognized speech in speech-shaped noise (a static noise) or pub noise, and produced with or without a cotton face mask.
Latest OpenAI system better – with one exception
The researchers found that humans still maintained the edge against both ASR systems. However, OpenAI’s most recent large ASR system, Whisper large-v3, significantly outperformed human listeners in all tested conditions except naturalistic pub noise, where it was merely on par with humans. Whisper large-v3 has thus demonstrated its ability to process the acoustic properties of speech and successfully map it to the intended message (i.e., the sentence). “This was impressive as the tested sentences were presented out of context, and it was difficult to predict any one word from the preceding words,” Eleanor Chodroff says.
Vast training data
A closer look at the ASR systems and how they’ve been trained shows that humans are nevertheless doing something remarkable. Both tested systems involve deep learning, but the most competitive system, Whisper, requires an incredible amount of training data. Meta’s wav2vec 2.0 was trained on 960 hours (or 40 days) of English audio data, while the default Whisper system was trained on over 75 years of speech data. The system that actually outperformed human ability was trained on over 500 years of nonstop speech. “Humans are capable of matching this performance in just a handful of years,” says Chodroff. “Considerable challenges also remain for automatic speech recognition in almost all other languages.”
Different types of errors
The paper also reveals that humans and ASR systems make different types of errors. English listeners almost always produced grammatical sentences, but were more likely to write sentence fragments, as opposed to trying to provide a written word for each part of the spoken sentence. In contrast, wav2vec 2.0 frequently produced gibberish in the most difficult conditions. Whisper also tended to produce full grammatical sentences, but was more likely to “fill in the gaps” with completely wrong information.
Prof. Dr. Eleanor Chodroff
Department of Computational Linguistics
University of Zurich
+41 76 426 27 07
eleanor.chodroff@uzh.ch
References
Chloe Patman, Eleanor Chodroff. Speech recognition in adverse conditions by humans and machines. JASA Express Lett. 4, 115204 (2024). DOI: https://doi.org/10.1121/10.0032473
https://www.news.uzh.ch/en/articles/media/2025/Spracherkennung.html
Criteria of this press release:
Journalists
Information technology, Language / literature, Media and communication sciences, Social studies, Teaching / education
transregional, national
Research results, Transfer of Science or Research
English
You can combine search terms with and, or and/or not, e.g. Philo not logy.
You can use brackets to separate combinations from each other, e.g. (Philo not logy) or (Psycho and logy).
Coherent groups of words will be located as complete phrases if you put them into quotation marks, e.g. “Federal Republic of Germany”.
You can also use the advanced search without entering search terms. It will then follow the criteria you have selected (e.g. country or subject area).
If you have not selected any criteria in a given category, the entire category will be searched (e.g. all subject areas or all countries).