Languages with more speakers tend to be harder for machines to learn

07.11.2023 11:12

Dr. Annette Trabold, Press and Public Relations
Leibniz-Institut für Deutsche Sprache

Just a few months ago, many people would have found it hard to imagine how well artificial intelligence-based "language models" could imitate human language. What ChatGPT writes is often indistinguishable from human-generated text. A research team at the Leibniz Institute for the German Language (IDS) in Mannheim, Germany, has now used text material in 1,293 different languages to investigate how quickly different computer language models learn to "write". The surprising result: languages spoken by a large number of people tend to be more difficult for algorithms to learn than languages with a smaller linguistic community.

Language models are computer algorithms that can process and generate human language. A language model recognizes patterns and regularities in large amounts of textual data and thus gradually learns to predict the text that comes next. One particular language model is the so-called "Transformer" model, on which the well-known chatbot service ChatGPT is built. As the algorithm is fed human-generated text, it develops an understanding of the probabilities with which word components, words and phrases occur in particular contexts. This learned knowledge is then used to make predictions, i.e. to generate new texts in new situations.
For example, when a model analyzes the sentence "In the dark night I heard a distant ...", it can predict that words like "howl" or "noise" would be appropriate continuations. This prediction is based on some "understanding" of the semantic relationships and probabilities of word combinations in the language.
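The idea of predicting a continuation from observed word combinations can be illustrated with a very small sketch. It is not the method used in the study and far simpler than a Transformer: it assumes a bigram model that only counts which word follows which. The toy corpus and the function names train_bigram_model and predict_next are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(model, word, k=2):
    """Return the k most frequent continuations of `word` with estimated probabilities."""
    counts = model.get(word.lower())
    if not counts:
        return []  # the word never occurred in the training text
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]


# Invented toy corpus; a real model is trained on vastly more text.
corpus = (
    "i heard a distant howl . "
    "she heard a distant noise . "
    "we heard a distant howl in the dark night ."
)

model = train_bigram_model(corpus)
predictions = predict_next(model, "distant")
print(predictions)
```

In this toy corpus, "howl" follows "distant" more often than "noise" does, so the sketch ranks it first. Modern language models replace the simple counts with learned parameters and take much longer contexts into account.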
In a new study, a team of linguists at the IDS has investigated how quickly computer language models learn to predict by training them on text material in 1,293 languages. The team used older and less complex language models as well as modern variants such as the Transformer model mentioned above. They looked at how long it took different algorithms to develop an understanding of patterns in the different languages. The study found that the amount of text an algorithm needs to process in order to learn a language – that is, to make predictions about what will follow – varies from language to language. It turns out that language algorithms tend to have a harder time learning languages with many native speakers than languages represented by a smaller number of speakers.
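What "how much text an algorithm needs" can mean is sketched below. This press release does not spell out the measure used in the study, so the sketch rests on assumptions: a toy character-bigram model with add-one smoothing, and prediction quality expressed as the average number of bits needed per held-out character (lower is better). The texts and the function names bits_per_char and learning_curve are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bits_per_char(train_text, test_text):
    """Average bits a smoothed character-bigram model needs per test character."""
    alphabet = set(train_text) | set(test_text)
    counts = defaultdict(Counter)
    for prev, cur in zip(train_text, train_text[1:]):
        counts[prev][cur] += 1

    total_bits = 0.0
    n = 0
    for prev, cur in zip(test_text, test_text[1:]):
        ctx = counts.get(prev, Counter())
        # Add-one smoothing keeps unseen character pairs from getting probability zero.
        p = (ctx[cur] + 1) / (sum(ctx.values()) + len(alphabet))
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n


def learning_curve(train_text, test_text, sizes):
    """Prediction quality after seeing only the first `size` training characters."""
    return [(size, bits_per_char(train_text[:size], test_text)) for size in sizes]


# Invented toy data standing in for the text material of one language.
training = "the cat sat on the mat and the dog sat on the log " * 20
held_out = "the dog sat on the mat and the cat sat on the log "

curve = learning_curve(training, held_out, sizes=[10, 50, 200, 1000])
for size, bits in curve:
    print(f"{size:5d} characters seen -> {bits:.2f} bits per character")
```

Under these assumptions, a language would count as harder to learn if its curve falls more slowly, that is, if the model needs more training text to reach the same prediction quality.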
The picture is not quite that simple, though. To validate the relationship between learning difficulty and speaker population size, it is essential to control for several factors. The challenge is that closely related languages (e.g., German and Swedish) are much more similar to each other than distantly related ones (e.g., German and Thai), so they cannot be treated as independent observations. And it is not only the degree of relatedness between languages that needs to be controlled for, but also other effects such as the geographical proximity between two languages or the quality of the text material used for training. "In our study, we used a variety of methods from applied statistics and machine learning to control for potential confounding factors as tightly as possible," explains Sascha Wolfer, one of the two authors of the study.
Regardless of the method and the type of input text used, a stable statistical correlation was found between machine learnability and the size of the speaker population. "The result really surprised us; based on the current state of research, we would have expected the opposite: that languages with a larger population of speakers tend to be easier for a machine to learn," says Alexander Koplenig, lead author of the study. So far, the reasons for this relationship are a matter of speculation. For example, an earlier study led by the same research team demonstrated that larger languages tend to be more complex overall. So perhaps the increased learning effort "pays off" for human language learners: once you have learned a complex language, you have more varied linguistic options available to you, which may allow you to express the same content in a shorter form. But more research is needed to test these (or other) explanations. "We're still relatively at the beginning here," Koplenig points out. "The next step is to find out whether and to what extent our machine learning results can be transferred to human language acquisition."

The Leibniz Institute for the German Language (IDS) in Mannheim is the central scientific institution for the documentation of and research into the contemporary usage and recent history of the German language. The IDS is one of over 90 research and service institutions of the Leibniz Association (“Leibniz-Gemeinschaft”). It is jointly financed by the federal government and all 16 federal states and is under the administrative supervision of the state of Baden-Württemberg. Find more information here: http://www.ids-mannheim.de, https://twitter.com/IDS_Mannheim, http://www.facebook.com/ids.mannheim, https://www.instagram.com/ids_mannheim/ and http://www.leibniz-gemeinschaft.de.

Scientific contact:

Dr. Sascha Wolfer
Leibniz-Institut für Deutsche Sprache
R 5, 6-13
D - 68161 Mannheim
Tel.: +49 621 / 1581 - 439
E-Mail: wolfer@ids-mannheim.de

Original publication:

Koplenig, Alexander & Wolfer, Sascha. 2023. Languages with more speakers tend to be harder to (machine-)learn. Scientific Reports 13(1). 18521. DOI: https://doi.org/10.1038/s41598-023-45373-z

idw – Informationsdienst Wissenschaft
