Researchers at the University of Passau have pitted human examiners against OpenAI's ChatGPT – and were themselves surprised by some of the results. The study has been published in Scientific Reports, a Nature Portfolio journal.
How does high population growth affect gross domestic product? Economics students are all too familiar with exam questions like this. As free-text questions, they require not only specialist knowledge but also the ability to think and argue economically. Marking the answers, however, is a time-consuming task for teaching assistants: each answer must be read and assessed individually.
Could artificial intelligence take over this work? Researchers from the University of Passau in economics and computer science have investigated this question. Their study was recently published in Scientific Reports, a Nature Portfolio journal. The central finding: OpenAI's GPT-4 language model performs similarly to human examiners when ranking open-text answers.
The results at a glance:
• When the AI model was asked to rank text responses by correctness and completeness – that is, to identify the best, second-best and worst answers – GPT's assessments were comparable to those of the human examiners.
• Students cannot impress GPT with AI-generated texts: GPT showed no significant preference for AI-generated or longer answers.
• When evaluating text responses on a points scale, the AI model performed somewhat worse. GPT tended to grade more generously than the humans, in some cases by almost a full grade.
The researchers conclude that AI cannot yet replace human markers. ‘Writing good sample solutions and re-checking must remain human tasks,’ explains Professor Johann Graf Lambsdorff, Chair of Economic Theory at the University of Passau, who designed the experiment together with Deborah Voß and Stephan Geschwind. Computer scientist Abdullah Al Zubaer was responsible for the technical implementation and evaluation, supervised by Professor Michael Granitzer (Data Science). The researchers argue that the marking of exams should remain closely supervised by humans. AI is, however, well suited as a critical second examiner.
New method for comparing AI and human assessment
There are already several studies assessing AI as an examinee. Studies on AI as an examiner, however, are rare, and the few that exist treat human assessment as the ground truth. The Passau team goes one step further: it investigates whether AI assessments can compete with those of human examiners – without assuming that the humans are always right.
For the experiment, the researchers used students' free-text answers to six questions from a macroeconomics course, selecting 50 answers per question. These 300 answers were marked by trained correction assistants; GPT was given the same marking task.
Since open-ended questions have no single ‘correct’ answer, it is unclear whether a deviation stems from the AI or from the humans. To make a comparison possible nonetheless, the research team used a trick: it treated the degree of agreement between the assessments as a measure of proximity to a presumed truth. The higher the agreement, the closer to the truth.
The starting point was the level of agreement among the human examiners. One examiner was then replaced by GPT. If this raised the level of agreement, it was taken as an indication that the AI's assessment was better than that of the replaced human examiner. In fact, GPT slightly increased the agreement on individual questions. ‘We were ourselves surprised at how well the AI performed in some of the assessments,’ says Deborah Voß. Abdullah Al Zubaer adds: ‘In our tests, the quality of GPT-4 remained largely stable even with imprecise or incorrect instructions.’ According to the team, this shows that the AI is robust and versatile, even though it still performs somewhat worse in point-based assessments.
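The press release does not spell out how agreement was measured. As a rough illustration of the replacement idea, here is a minimal Python sketch, assuming three examiners, a small set of answers ranked from best to worst, and Kendall's tau as the pairwise agreement measure; all of these choices are assumptions made for illustration, not details taken from the study.

from itertools import combinations
from scipy.stats import kendalltau

def mean_pairwise_agreement(rankings):
    # Mean Kendall's tau over all pairs of raters. Each entry in
    # `rankings` is one rater's rank list over the same set of answers.
    taus = [kendalltau(a, b)[0] for a, b in combinations(rankings, 2)]
    return sum(taus) / len(taus)

# Illustrative rankings of five answers (1 = best) by three human
# examiners and by GPT; the numbers are made up for this sketch.
human_1 = [1, 2, 3, 4, 5]
human_2 = [1, 3, 2, 4, 5]
human_3 = [2, 1, 3, 5, 4]
gpt     = [1, 2, 3, 5, 4]

baseline = mean_pairwise_agreement([human_1, human_2, human_3])

# Replace each human examiner in turn with GPT and check whether the
# panel's internal agreement rises - the indicator used above for the
# AI's assessment being closer to the presumed truth.
panels = ([gpt, human_2, human_3],
          [human_1, gpt, human_3],
          [human_1, human_2, gpt])
for i, panel in enumerate(panels, start=1):
    swapped = mean_pairwise_agreement(panel)
    verdict = 'higher' if swapped > baseline else 'not higher'
    print(f'GPT in place of examiner {i}: {swapped:.2f} '
          f'vs baseline {baseline:.2f} ({verdict} agreement)')

In this toy panel, a swap that raises the mean pairwise tau counts as evidence that GPT's ranking sits closer to the presumed truth than that of the replaced examiner.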
Study as part of the interdisciplinary research project DeepWrite
The study was conducted as part of the DeepWrite project funded by the Federal Ministry of Research, Technology and Space (BMFTR). In this project, scientists from the University of Passau in the fields of law, economics, computer science and education are investigating how artificial intelligence can be used effectively in university teaching. Among other things, the team has developed the AI tool ArgueNiser, which helps students train their argumentation skills so that they can better answer the free-text questions mentioned at the beginning. The application is already being used in teaching at the University of Passau.
Professor Urs Kramer from the Passau Institute for the Didactics of Law is responsible for the overall management of the project. Professor Graf Lambsdorff heads the research area of Economics, while Professor Granitzer is in charge of the research area of Data Science. Deborah Voß, Stephan Geschwind and Abdullah Al Zubaer are members of the interdisciplinary research team. Voß and Geschwind are pursuing their doctorates at the Chair of Economic Theory, while Zubaer is doing so at the Chair of Data Science.
Professor Johann Graf Lambsdorff
Chair of Economics with a focus on Economic Theory
Innstraße 27, 94032 Passau
Email: Johann.GrafLambsdorff@uni-passau.de
Zubaer, A. A. et al. GPT-4 shows comparable performance to human examiners in ranking open-text answers. Sci. Rep. 15, 35045 (2025). https://www.nature.com/articles/s41598-025-21572-8
https://www.uni-passau.de/deepwrite DeepWrite project website
https://www.digital.uni-passau.de/en/beitraege/2025/project-deepwrite Training argumentation and writing with AI – insights into the DeepWrite project
https://www.uni-passau.de/en/deepwrite/argueniser AI tool econArgueNiser
Image: Professor Johann Graf Lambsdorff and research assistant Deborah Voß. Source and copyright: University of Passau.
