idw – Informationsdienst Wissenschaft

24.10.2025 11:55

Passau study shows: AI passes as a second examiner in exams

Kathrin Haimerl, Communications Department
University of Passau

    Researchers at the University of Passau pitted human examiners against OpenAI's ChatGPT – and were themselves surprised by some of the results. The study has been published in Scientific Reports, a journal in the Nature Portfolio.

    How does high population growth affect gross domestic product? Economics students are all too familiar with exam questions like this. As free-text questions, they require not only specialist knowledge but also the ability to think and argue economically. However, marking these answers is a time-consuming task for university assistants: each answer must be checked and assessed individually.

    Could artificial intelligence do this work? Researchers from the University of Passau in the fields of economics and computer science have investigated this question; their study was recently published in the Nature Portfolio journal Scientific Reports. Their central finding: OpenAI's GPT-4 language model performs similarly to human examiners when ranking open-text answers.

    The results at a glance:
    • When the AI model was asked to rank text responses according to correctness and completeness – in the sense of best, second best or worst answer – GPT achieved an assessment comparable to that of human examiners.
    • Students cannot impress GPT with AI-generated texts: GPT showed no significant preference for AI-generated or longer answers.
    • When evaluating text responses according to a points system, the AI model performed slightly worse in terms of quality. GPT tended to be more generous in its evaluations than humans, in some cases by almost an entire grade.

    The researchers conclude that AI cannot yet replace human markers. ‘Writing good sample solutions and re-checking must remain human tasks,’ explains Professor Johann Graf Lambsdorff, Chair of Economic Theory at the University of Passau, who was responsible for the experimental design of the study together with Deborah Voß and Stephan Geschwind. Computer scientist Abdullah Al Zubaer programmed the technical implementation and evaluation under the supervision of Professor Michael Granitzer (Data Science). The researchers argue that exam marking should continue to be closely supervised by humans. However, AI is well suited as a critical second examiner.

    New method for comparing AI and human assessment

    There are already several studies that assess AI as an examinee. Studies of AI as an examiner, however, are rare, and the few that exist treat human assessment as ground truth. The Passau team goes one step further: it investigates whether AI assessments can compete with those of human examiners – without assuming that humans are always right.

    For the experiment, the researchers used students' free-text answers to six questions from a macroeconomics course. The team selected 50 answers per question; the resulting 300 answers were evaluated by trained marking assistants, and GPT was given the same evaluation task.

    Since there is no single ‘correct’ answer to open-ended questions, it is unclear whether a discrepancy reflects an error by the AI or by the humans. To make a comparison possible nonetheless, the research team used a trick: it took the degree of agreement between evaluations as a measure of proximity to a presumed truth – the higher the agreement, the closer to the truth.

    The starting point was the level of agreement among the human examiners. One examiner was then replaced by GPT. If this produced a higher level of agreement, it was taken as an indication that the AI's assessment was better than that of the replaced human examiner. And indeed, GPT was able to slightly increase the agreement on individual questions. ‘We were partly surprised ourselves at how well the AI performed in some of the assessments,’ says Deborah Voß. Abdullah Al Zubaer adds: ‘In our tests, the quality of GPT-4 remained largely stable even with imprecise or incorrect instructions.’ According to the team, this shows that AI is robust and versatile, even if it still performs slightly weaker in point-based assessments.
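    The swap procedure described above can be sketched in a few lines of code. This is a minimal illustration only: the rank data are invented, and a simple exact-match rate stands in for the study's actual agreement measure.

    ```python
    from itertools import combinations

    def pairwise_agreement(ratings):
        """Mean fraction of items on which two raters give the same rank."""
        pairs = list(combinations(ratings, 2))
        total = 0.0
        for a, b in pairs:
            matches = sum(x == y for x, y in zip(ratings[a], ratings[b]))
            total += matches / len(ratings[a])
        return total / len(pairs)

    # Hypothetical toy data: ranks (1 = best, 3 = worst) that three human
    # raters assigned to six answers, plus GPT's ranks for the same answers.
    human = {
        "rater_1": [1, 2, 3, 1, 2, 3],
        "rater_2": [1, 2, 3, 2, 1, 3],
        "rater_3": [1, 3, 2, 1, 2, 3],
    }
    gpt = [1, 2, 3, 1, 2, 3]

    baseline = pairwise_agreement(human)

    # Swap each human rater for GPT in turn; if agreement rises, GPT's
    # assessment sits closer to the panel consensus than that rater's did.
    for left_out in human:
        panel = {r: v for r, v in human.items() if r != left_out}
        panel["gpt"] = gpt
        print(left_out, round(baseline, 3), round(pairwise_agreement(panel), 3))
    ```

    Any agreement statistic (e.g. a rank correlation or a chance-corrected kappa) could be substituted for the exact-match rate without changing the logic of the comparison.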

    Study as part of the interdisciplinary research project DeepWrite

    The study was conducted as part of the DeepWrite project funded by the Federal Ministry of Research, Technology and Space (BMFTR). In this project, scientists from the University of Passau in the fields of law, economics, computer science and education are investigating how artificial intelligence can be used effectively in university teaching. Among other things, the team has developed the AI tool ArgueNiser, which helps students train their argumentation skills so that they can better answer the free-text questions mentioned at the beginning. The application is already being used in teaching at the University of Passau.

    Professor Urs Kramer from the Passau Institute for the Didactics of Law is responsible for the overall management of the project. Professor Graf Lambsdorff heads the research area of Economics, while Professor Granitzer is in charge of the research area of Data Science. Deborah Voß, Stephan Geschwind and Abdullah Al Zubaer are members of the interdisciplinary research team. Voß and Geschwind are pursuing their doctorates at the Chair of Economic Theory, while Zubaer is doing so at the Chair of Data Science.


    Scientific contact:

    Professor Johann Graf Lambsdorff
    Chair of Economics with a focus on Economic Theory
    Innstraße 27, 94032 Passau
    Email: Johann.GrafLambsdorff@uni-passau.de


    Original publication:

    Zubaer, A.A. et al. GPT-4 shows comparable performance to human examiners in ranking open-text answers. Sci Rep 15, 35045 (2025). https://www.nature.com/articles/s41598-025-21572-8


    Further information:

    https://www.uni-passau.de/deepwrite DeepWrite project website
    https://www.digital.uni-passau.de/en/beitraege/2025/project-deepwrite Training argumentation and writing with AI – insights into the DeepWrite project
    https://www.uni-passau.de/en/deepwrite/argueniser AI tool econArgueNiser


    Images

    Professor Johann Graf Lambsdorff and research assistant Deborah Voß.
    Source: University of Passau
    Copyright: University of Passau


    Attributes of this press release:
    Journalists, teachers/pupils, students, business representatives, scientists, general public
    Information technology, economics
    transregional
    Research results, research projects
    English


     
