Automatic Speech Recognition on Par with Humans in Noisy Conditions

01/15/2025 09:30

Melanie Nyfeler Kommunikation
Universität Zürich

Are humans or machines better at recognizing speech? A new study shows that in noisy conditions, current automatic speech recognition (ASR) systems achieve remarkable accuracy and sometimes even surpass human performance. However, the systems need to be trained on an incredible amount of data, while humans acquire comparable skills in less time.

Automatic speech recognition (ASR) has made incredible advances in the past few years, especially for widely spoken languages such as English. Prior to 2020, it was typically assumed that human abilities for speech recognition far exceeded automatic systems, yet some current systems have started to match human performance. The goal in developing ASR systems has always been to lower the error rate, regardless of how people perform in the same environment. After all, not even people will recognize speech with 100% accuracy in a noisy environment.
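To make "error rate" concrete, here is a minimal sketch of word error rate (WER), a metric commonly used to score ASR output. This press release does not say which metric the study used, so WER is an assumption chosen for illustration, and the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: the scoring procedure of the study is not described
    in this press release.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word in a six-word sentence.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

The same function can score a human transcript and a machine transcript against one reference sentence, which is the kind of side-by-side comparison the study describes.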

In a new study, UZH computational linguistics specialist Eleanor Chodroff and a fellow researcher from Cambridge University, Chloe Patman, compared two popular ASR systems – Meta’s wav2vec 2.0 and OpenAI’s Whisper – against native British English listeners. They tested how well the systems and the listeners recognized speech that was masked by either speech-shaped noise (a static noise) or pub noise, and that was spoken with or without a cotton face mask.

Latest OpenAI system better – with one exception

The researchers found that humans still maintained the edge against both ASR systems. However, OpenAI’s most recent large ASR system, Whisper large-v3, significantly outperformed human listeners in all tested conditions except naturalistic pub noise, where it was merely on par with humans. Whisper large-v3 has thus demonstrated its ability to process the acoustic properties of speech and successfully map them to the intended message (i.e., the sentence). “This was impressive as the tested sentences were presented out of context, and it was difficult to predict any one word from the preceding words,” Eleanor Chodroff says.

Vast training data

A closer look at the ASR systems and how they’ve been trained shows that humans are nevertheless doing something remarkable. Both tested systems involve deep learning, but the most competitive system, Whisper, requires an incredible amount of training data. Meta’s wav2vec 2.0 was trained on 960 hours (or 40 days) of English audio data, while the default Whisper system was trained on over 75 years of speech data. The system that actually outperformed human ability was trained on over 500 years of nonstop speech. “Humans are capable of matching this performance in just a handful of years,” says Chodroff. “Considerable challenges also remain for automatic speech recognition in almost all other languages.”
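The training-data figures above mix hours and years, so a small conversion sketch helps put them on one scale. It uses only the numbers quoted in this release; the year length of 365.25 days and all function names are assumptions made for illustration, and the results for Whisper are lower bounds because the release says "over" 75 and "over" 500 years.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years
HOURS_PER_YEAR = HOURS_PER_DAY * DAYS_PER_YEAR


def hours_to_days(hours: float) -> float:
    """Convert hours of continuous audio to days."""
    return hours / HOURS_PER_DAY


def years_to_hours(years: float) -> float:
    """Convert years of continuous audio to hours."""
    return years * HOURS_PER_YEAR


# Figures quoted in the press release.
wav2vec_hours = 960
whisper_default_years = 75
whisper_large_years = 500

print(hours_to_days(wav2vec_hours))           # 40.0 days, as stated in the text
print(years_to_hours(whisper_default_years))  # lower bound in hours
print(years_to_hours(whisper_large_years))    # lower bound in hours

# Rough ratio of training audio: largest Whisper figure vs. wav2vec 2.0.
print(years_to_hours(whisper_large_years) / wav2vec_hours)
```

Under these assumptions, the system that outperformed human listeners was trained on several thousand times more audio than wav2vec 2.0.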

Different types of errors

The paper also reveals that humans and ASR systems make different types of errors. English listeners almost always produced grammatical sentences, but were more likely to write sentence fragments, as opposed to trying to provide a written word for each part of the spoken sentence. In contrast, wav2vec 2.0 frequently produced gibberish in the most difficult conditions. Whisper also tended to produce full grammatical sentences, but was more likely to “fill in the gaps” with completely wrong information.

Contact for scientific information:

Prof. Dr. Eleanor Chodroff
Department of Computational Linguistics
University of Zurich
+41 76 426 27 07
eleanor.chodroff@uzh.ch

Original publication:

Chloe Patman, Eleanor Chodroff. Speech recognition in adverse conditions by humans and machines. JASA Express Lett. 4, 115204 (2024). DOI: https://doi.org/10.1121/10.0032473

More information:

https://www.news.uzh.ch/en/articles/media/2025/Spracherkennung.html

Criteria of this press release:
Journalists
Information technology, Language / literature, Media and communication sciences, Social studies, Teaching / education
transregional, national
Research results, Transfer of Science or Research
English
