idw – Informationsdienst Wissenschaft


27.05.2024 09:58

The next level of online search: Making complex website information accessible via AI

Jeremy Gob DFKI Kaiserslautern | Darmstadt
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, DFKI

    ‘Sovereign Cloud: Secure integration of business expert knowledge into large language models’ is the name of the ambitious project led by Sven Schmeier and his team. ‘The aim is to investigate the extent to which it is possible to make websites accessible and embed them in a RAG (Retrieval Augmented Generation) system in such a way that it is possible to ask complex questions about these websites,’ explains the DFKI expert in AI language technologies.

    With RAG, a language model is set up so that it can draw on information outside its own training data and incorporate it into an answer. In this project, the relevant websites serve as the sources of knowledge.
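
    The retrieve-then-answer idea can be illustrated with a minimal sketch. It is not the DFKI system: the page snippets, the function names and the simple word-overlap ranking are all assumptions made for illustration, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical page snippets standing in for indexed website content.
PAGES = {
    "/team/speech": "The speech recognition group builds acoustic models.",
    "/team/vision": "The vision group works on image segmentation.",
    "/contact": "Visit us in Kaiserslautern or send an email.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, pages, k=1):
    """Return the k page URLs whose text is most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        pages,
        key=lambda url: cosine(q, Counter(tokenize(pages[url]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, pages, k=1):
    """Put the retrieved passages in front of the question (the 'augmented' part)."""
    context = "\n".join(f"[{url}] {pages[url]}" for url in retrieve(question, pages, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

    The string returned by `build_prompt` is what would be passed to a language model, so the answer can rest on website content rather than on training data alone. A production system would typically replace the word-overlap ranking with learned embeddings.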

    AI determines page content and prepares information

    If the project succeeds as planned, answering questions such as ‘Which countries do the employees who have studied computational linguistics and are working on speech recognition come from?’ will be a routine exercise for the DFKI technology. Among other things, the website-specific RAGs make it possible to find and combine information that would otherwise be hard to see or to connect.

    Another advantage: ‘The websites automatically become accessible because they can be presented in many languages, as text, voice, image, etc. and in simplified language,’ says Schmeier. At the same time, website maintenance would become much less complicated.

    Real answers

    Conventional search engines return documents as results to the person searching. RAGs, by contrast, provide real answers. However, many of the problems that arise when building RAGs from websites have not yet been solved.

    The DFKI researchers' approach: ‘Through the type of indexing, i.e. the transformation of the website content into the content of the RAG, we can find general solutions for the RAGs that can also be applied to other sources,’ explains Schmeier. This would be made possible, for example, by links within documents to other documents.

    Difficulties within the project

    Making all information accessible for corresponding search queries appears to be a mammoth task that involves a number of hurdles. Even if everything runs smoothly on the part of the AI application, the difficulty lies in the individuality of the websites.

    ‘When parsing the websites to create a robust textual representation of the websites, there have been application-specific challenges to date,’ report the researchers. While working on the project, Sven Schmeier and his team keep encountering new exceptions in the design and layout of websites.
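
    To give a sense of what such a parsing step involves, here is a minimal sketch using only Python's standard library. The class name `TextAndLinks` and the helper `to_text` are hypothetical and not taken from the project; real websites need far more special-case handling, which is exactly the difficulty described above.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible text and link targets, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []       # visible text fragments in document order
        self.links = []       # href values of <a> tags
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def to_text(html):
    """Return (plain_text, links) for one HTML page."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts), parser.links
```

    Keeping the links alongside the text matters for the indexing idea mentioned earlier, since links between documents are what allow an answer to combine several pages.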

    On the way to a solution

    Research is currently being conducted on two fronts. First, the team is creating a benchmark data set for multi-hop information retrieval over web content, i.e. raw websites. Second, the reasoning capabilities of open-source LLMs for navigating web content are being tested using the team's own textual web representations.
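
    ‘Multi-hop’ means that the answer is not on the start page but only reachable by following several links. The sketch below illustrates this with a toy site and a plain breadth-first search. It is an assumption-laden stand-in: in the setup described here, a language model would choose which link to follow, whereas this example simply tries all links in order. The site data and the function name `find_path` are hypothetical.

```python
from collections import deque

# Hypothetical miniature website: each page has text and outgoing links.
SITE = {
    "/": {"text": "Welcome. See our team.", "links": ["/team", "/contact"]},
    "/team": {"text": "Team: Anna works on speech recognition.", "links": ["/team/anna"]},
    "/team/anna": {"text": "Anna studied computational linguistics in Norway.", "links": []},
    "/contact": {"text": "Email us.", "links": []},
}


def find_path(site, start, goal_term, max_hops=3):
    """Breadth-first search for a page containing goal_term.

    Returns the list of URLs from start to the first matching page,
    or None if no match is reachable within max_hops link follows.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        page = site[path[-1]]
        if goal_term.lower() in page["text"].lower():
            return path
        if len(path) - 1 >= max_hops:
            continue  # hop budget used up on this branch
        for link in page["links"]:
            if link in site and link not in seen:
                seen.add(link)
                queue.append(path + [link])
    return None
```

    Here `find_path(SITE, "/", "Norway")` needs two hops, via `/team` to `/team/anna`. A benchmark entry for multi-hop retrieval could pair a question with such a path, so that a model's chosen actions can be compared against it.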

    However, the current zero-shot tests show that the language models used do not select the optimal actions for a given question and web content. In addition, the researchers have already identified significant differences between the open-source LLM Llama 2 70B and the proprietary GPT-4.

    The search for a suitable language model therefore continues. In the next series of tests, Gemini Ultra 1.5 will be evaluated in the hope of achieving even better performance. The data set created by the researchers and the improved reasoning capabilities of the Gemini models are expected to contribute to this together.

    Scientific contact:

    Dr. Sven Schmeier, Researcher Department Speech and Language Technology (DFKI)

    Researcher at the desk working on a project


    Characteristics of this press release:
    Information technology, media and communication sciences, economics / business
    Research / knowledge transfer, cooperations
