idw – Informationsdienst Wissenschaft

05/27/2024 09:58

The next level of online search: Making complex website information accessible via AI

Jeremy Gob DFKI Kaiserslautern | Darmstadt
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, DFKI

    ‘Sovereign Cloud: Secure integration of business expert knowledge into large language models’ is the name of the ambitious project run by Sven Schmeier and his team. ‘The aim is to investigate the extent to which websites can be made accessible and embedded in a RAG (Retrieval-Augmented Generation) system so that complex questions can be asked about them,’ explains the DFKI expert in AI language technologies.

    With RAG, a language model is augmented so that it can draw on information outside its own training data and incorporate it into an answer. In this project, the relevant websites serve as the knowledge sources.
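    The retrieval step described above can be illustrated with a minimal sketch. It is not the DFKI system: the passages, the bag-of-words similarity and the function names (`retrieve`, `build_prompt`) are assumptions chosen for illustration, and the call to a language model is left out. The sketch only shows how passages from a website could be ranked against a question and placed into a prompt.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages, k=2):
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical website passages, invented for this example.
PASSAGES = [
    "The speech recognition group develops models for spoken language.",
    "The cafeteria is open from 11:30 to 14:00 on weekdays.",
    "Staff members of the speech recognition group studied computational linguistics.",
]

if __name__ == "__main__":
    print(build_prompt("Who works on speech recognition?", PASSAGES))
```

    A production system would typically replace the word-count similarity with learned embeddings; the overall flow of retrieving first and generating second stays the same.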

    AI determines page content and prepares information

    If the project succeeds as planned, answers to questions such as ‘Which countries do the MAs who have studied computational linguistics and are working on speech recognition come from?’ will be a routine exercise for the DFKI technology. Among other things, website-specific RAGs make it possible to uncover facts that would otherwise be hard to see or to combine.

    Another advantage: ‘The websites automatically become accessible because they can be presented in many languages, by text, voice, image, etc. and in simplified language,’ says Schmeier. At the same time, website maintenance would become much less complicated.

    Real answers

    Conventional search engines return documents as results to the person searching. RAGs, on the other hand, provide real answers. However, many problems that arise when RAGs are built from websites have not yet been solved.

    The DFKI researchers' approach: ‘Through the type of indexing, i.e. the transformation of the website content into the content of the RAG, we can find general solutions for the RAGs that can also be applied to other sources,’ explains Schmeier. This would be made possible, for example, by links within documents to other documents.
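    How links between documents might be used during indexing can be sketched as follows. This is an assumption about one possible design, not the project's actual index: the mini-site `SITE`, its URLs and the function `collect_linked` are hypothetical. Each page is stored together with its outgoing links, so that a retrieved page can pull in the pages it points to.

```python
from collections import deque

# Hypothetical mini-site: URL -> (page text, outgoing links).
SITE = {
    "/team": ("Team overview. See individual profiles.", ["/team/a", "/team/b"]),
    "/team/a": ("Person A works on speech recognition.", ["/projects/asr"]),
    "/team/b": ("Person B works on machine translation.", []),
    "/projects/asr": ("The ASR project builds speech recognition systems.", []),
}

def collect_linked(site, start, max_hops=1):
    """Breadth-first walk over links.

    Returns a dict mapping each page reachable from `start` within
    `max_hops` links to its distance (in hops) from `start`.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if seen[url] == max_hops:
            continue  # do not expand pages at the hop limit
        for target in site[url][1]:
            if target in site and target not in seen:
                seen[target] = seen[url] + 1
                queue.append(target)
    return seen

if __name__ == "__main__":
    for url, hops in collect_linked(SITE, "/team", max_hops=2).items():
        print(hops, url, SITE[url][0])
```

    With `max_hops=2`, a question about the team page can also draw on the project page that is only reachable through a profile page, which is the kind of combination a single-document search would miss.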

    Difficulties within the project

    Making all information accessible for corresponding search queries appears to be a mammoth task that involves a number of hurdles. Even if everything runs smoothly on the part of the AI application, the difficulty lies in the individuality of the websites.

    ‘When parsing the websites to create a robust textual representation of the websites, there have been application-specific challenges to date,’ report the researchers. While working on the project, Sven Schmeier and his team keep encountering new exceptions in the design and layout of websites.
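    A simple form of such a textual representation can be sketched with Python's standard `html.parser` module. The class `TextExtractor` and the helper `to_text` are illustrative names, not part of the project; the sketch keeps visible text and link targets and drops script and style content. Real websites need far more special handling, which is exactly the difficulty the researchers describe.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and link targets, skipping script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []       # visible text fragments in document order
        self.links = []       # href values of <a> tags
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Collapse runs of whitespace inside the fragment.
            self.parts.append(" ".join(data.split()))

def to_text(html):
    """Return (plain text, list of link targets) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts), parser.links

if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<h1>Team</h1><p>We work on <a href='/asr'>speech  recognition</a>.</p>"
        "<script>var x=1;</script></body></html>"
    )
    text, links = to_text(sample)
    print(text)
    print(links)
```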

    On the way to a solution

    Research is currently being conducted on two fronts. First, the team is creating a benchmark data set for multi-hop information retrieval over web content, i.e. raw websites. Second, it is testing the reasoning capabilities of open-source LLMs for navigating web content, using the team's own textual web representations.

    However, the current zero-shot tests show that the language models used do not select the optimal actions based on the question and the web content. In addition, the researchers have already identified significant differences between the open-source LLM Llama 2 70B and the proprietary GPT-4.
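    One way such a zero-shot test could be scored is sketched below. Everything in it is hypothetical: the example items, the action strings (`follow:<url>`, `answer`) and the baseline policy are invented for illustration and do not come from the researchers' benchmark. The idea is simply to compare the action a model chooses for a question and page against a reference action.

```python
# Hypothetical evaluation items: a question, a textual page representation,
# and the action considered optimal for that pair.
EXAMPLES = [
    {
        "question": "Who works on speech recognition?",
        "page": "Team overview. Links: /team/a, /team/b",
        "gold_action": "follow:/team/a",
    },
    {
        "question": "What does the ASR project build?",
        "page": "The ASR project builds speech recognition systems.",
        "gold_action": "answer",
    },
]

def action_accuracy(examples, choose_action):
    """Fraction of examples where the chosen action equals the gold action."""
    if not examples:
        return 0.0
    hits = sum(
        1
        for ex in examples
        if choose_action(ex["question"], ex["page"]) == ex["gold_action"]
    )
    return hits / len(examples)

def always_answer(question, page):
    """Trivial baseline policy: never navigate, always answer immediately."""
    return "answer"

if __name__ == "__main__":
    print(action_accuracy(EXAMPLES, always_answer))
```

    In a real test, `choose_action` would wrap a call to the language model under evaluation; the baseline here only serves to make the sketch runnable.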

    The search for a suitable language model therefore continues. In the next series of tests, Gemini ultra 1.5 will be evaluated in the hope of achieving better performance. The data set created by the researchers and the improved reasoning capabilities of the Gemini models should together contribute to this.

    Contact for scientific information:

    Dr. Sven Schmeier, Researcher Department Speech and Language Technology (DFKI)



    Criteria of this press release:
    Economics / business administration, Information technology, Media and communication sciences
    transregional, national
    Cooperation agreements, Transfer of Science or Research


