idw – Informationsdienst Wissenschaft

07/17/2025 10:02

A new method can detect whether copyright-protected images were used to train AI models

Felix Koltermann, Corporate Communications
CISPA Helmholtz Center for Information Security

    In just a few years, what began as a scientific project, using AI models to generate images, has become an everyday application. Along with this development, new problems have emerged. Increasingly, creators such as photographers and illustrators are asking whether their images have been used to train AI models. CISPA researcher Antoni Kowalczuk has now developed a technique that can prove whether specific images were employed in a model’s training. He published his findings in June 2025 in the paper “CDI: Copyrighted Data Identification in Diffusion Models” at the IEEE Conference on Computer Vision and Pattern Recognition.

    AI image generators have experienced explosive growth in recent years. Many of these systems, such as DALL·E, Midjourney, and Stable Diffusion, are based on so-called diffusion models. “A diffusion model is a deep neural network that learns to generate images step by step by gradually removing noise from an image,” explains Antoni Kowalczuk, a PhD student at CISPA. These systems were trained on millions of images from the internet, allegedly without the creators’ consent, which raises legal and ethical issues. “When the models were still used purely for scientific purposes, nobody really cared about the copyright question,” Kowalczuk recalls. “But once people started making money with these models, the issue suddenly became relevant. I thought my research could make a difference.”
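
    The step-by-step denoising idea can be sketched in a few lines of code. The following is a minimal, illustrative Python/PyTorch sketch of the reverse diffusion loop described above; the tiny network, the image size, and the update rule are toy stand-ins, not the architecture of any real image generator.

        import torch
        import torch.nn as nn

        class TinyDenoiser(nn.Module):
            """Toy stand-in for the deep network that predicts the noise in an image."""
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1),
                )

            def forward(self, x, t):
                # Real diffusion models also condition on the timestep t; omitted here.
                return self.net(x)

        denoiser = TinyDenoiser()
        x = torch.randn(1, 1, 32, 32)        # start from pure noise
        steps = 50
        for t in reversed(range(steps)):
            predicted_noise = denoiser(x, t)
            x = x - predicted_noise / steps  # crudely "remove a bit of noise" each step
        # After many such denoising steps, x approximates a sample from the
        # model's learned image distribution (here untrained, so just noise).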

    Why previous methods fail

    Existing techniques for determining whether AI models used particular images in training rely on a method called “Membership Inference Attacks” (MIA). These attacks try to assess whether a single image was used to train an AI model. However, research shows that as models and their datasets grow, and they tend only to grow, the efficacy of MIAs falls to almost zero. “For this reason, my colleagues and I developed a new method called ‘Copyrighted Data Identification’ (CDI),” says the CISPA researcher. “The key idea behind CDI is that we don’t examine individual images, but entire datasets, for example a collection of stock photos or a digital art portfolio.”
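
    To make the contrast with CDI concrete, here is a minimal Python sketch of one classical flavor of membership inference, a loss-threshold test. The function model_loss is hypothetical, a stand-in for however well the model reconstructs a given image; the numbers are synthetic.

        import numpy as np

        def membership_inference(image, model_loss, threshold):
            """Guess 'member' if the model fits this image unusually well."""
            return model_loss(image) < threshold

        # Calibrate the threshold on images known NOT to be in the training set.
        rng = np.random.default_rng(0)
        nonmember_losses = rng.normal(loc=1.0, scale=0.1, size=1000)  # stand-in values
        threshold = np.quantile(nonmember_losses, 0.05)  # allow a 5% false-positive rate
        # On large models, member and non-member losses overlap almost completely,
        # so this per-image test degenerates toward chance -- the motivation for CDI.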

    How CDI works

    To check whether copyright-protected material was used to train an AI model, Kowalczuk designed a four-stage process for CDI. First, two datasets must be assembled: “The first contains images that the data owner believes were used to train this specific model. The second is a so-called validation set, made up of images we are 100 percent certain were not used in training,” explains the researcher. Next, both datasets are run through the AI model to observe its responses. Based on those responses, a model is trained to predict whether the dataset in question was likely part of the training data. “At the end, a statistical test is performed to see whether the suspect dataset systematically scores higher than the validation set,” says Kowalczuk. If it does, that is strong evidence the AI was trained on those images; if not, the result remains inconclusive.
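
    The four stages can be illustrated with a simplified Python sketch. Everything below is synthetic: the feature vectors stand in for the model responses CDI actually extracts (which require access to the diffusion model itself), the scorer is a deliberately crude substitute for the trained predictor, and the one-sided Mann-Whitney U test is one possible choice of statistical test, not necessarily the paper's.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)

        # Stage 1: assemble the two datasets. Each row is a feature vector that a
        # real pipeline would compute from the model's responses to one image.
        suspect_feats = rng.normal(0.2, 1.0, size=(200, 8))     # slight shift: "was seen"
        validation_feats = rng.normal(0.0, 1.0, size=(200, 8))  # known to be unseen

        # Stages 2-3: observe the model's responses (already folded into the
        # features above) and fit a scorer. Here: offset from a reference computed
        # on half the validation set; the real method trains a proper predictor.
        reference = validation_feats[:100].mean(axis=0)

        def score(feats):
            return (feats - reference).sum(axis=1)

        suspect_scores = score(suspect_feats)
        validation_scores = score(validation_feats[100:])

        # Stage 4: one-sided test -- does the suspect set systematically score higher?
        result = stats.mannwhitneyu(suspect_scores, validation_scores,
                                    alternative="greater")
        print(f"p-value: {result.pvalue:.4g}")
        # A small p-value is evidence the suspect images were in the training data;
        # a large one leaves the question open, as described above.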

    The CISPA researcher tested CDI on a suite of existing AI models with available information about their training data, for example models trained on the ImageNet dataset. The results are promising, Kowalczuk reports: “CDI can detect with high accuracy whether a dataset was used in training, even on complex, large-scale models. Even when we are unable to pinpoint the exact images used in training, we can still successfully recognize if data in the set was used to train the model. CDI also yields reliable results when only a subset of the entire work was included in training.”

    Obstacles to practical deployment

    At present, CDI remains a method whose use, because of its complexity, is largely confined to researchers. “Some of the features we extract require full access to the model and its code,” notes Kowalczuk. “Moreover, there are very stringent criteria for the data samples we employ.” As a result, CDI currently offers mainly a theoretical proof of concept that it is possible to determine whether a particular set of images was used to train AI models. Developing a user-friendly application for creators without deep technical expertise would require further modifications and advances that, for now, appear technically out of reach. “CDI is still quite young and there is much work to be done. But one thing is clear: once we have better methods, we may someday bridge the gap between theory and practical implementation,” the CISPA researcher concludes.


    Images

    Visualization for the paper “CDI: Copyrighted Data Identification in Diffusion Models”

    Copyright: CISPA


    Criteria of this press release:
    Journalists, Scientists and scholars
    Information technology
    transregional, national
    Research results
    English


     
