In just a few years, what began as a scientific project, using AI models to generate images, has become an everyday application. Along with this development, new problems have emerged. Increasingly, creators such as photographers and illustrators are asking whether their images have been used to train AI models. CISPA researcher Antoni Kowalczuk has now developed a technique that can prove whether specific images were employed in a model's training. He published his findings in June 2025 in the paper "CDI: Copyrighted Data Identification in Diffusion Models" at the IEEE Conference on Computer Vision and Pattern Recognition.
AI image generators have experienced explosive growth in recent years. Many of these systems, such as DALL·E, Midjourney, and Stable Diffusion, are based on so-called diffusion models. "A diffusion model is a deep neural network that learns to generate images step by step by gradually removing noise from an image," explains Antoni Kowalczuk, a PhD student at CISPA. These systems were trained on millions of images from the internet, allegedly without the creators' consent, raising legal and ethical issues. "When the models were still used purely for scientific purposes, nobody really cared about the copyright question," Kowalczuk recalls. "But once people started making money with these models, the issue suddenly became relevant. I thought my research could make a difference."
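How such a model generates an image can be sketched in a few lines of Python. The loop below is a deliberately simplified illustration of reverse diffusion, not the sampler of any particular system; `model` stands in for a network that predicts the noise contained in its input at a given step.

```python
import torch

def denoise(model, steps=50, shape=(1, 3, 64, 64)):
    """Generate an image by gradually removing noise (reverse diffusion).

    Toy simplification: a real sampler uses a learned noise schedule
    and re-injects calibrated noise between steps.
    """
    x = torch.randn(shape)                  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        predicted_noise = model(x, t)       # network estimates the noise in x
        x = x - predicted_noise / steps     # peel off a little noise per step
    return x                                # progressively denoised image

# Usage with a dummy "model" that predicts zero noise (illustration only):
image = denoise(lambda x, t: torch.zeros_like(x))
```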
Why previous methods fail
Existing techniques for determining whether AI models used particular images for training rely on a method called "Membership Inference Attacks" (MIA). These attacks try to assess whether a single image was used to train an AI model. However, research shows that as models and their datasets grow, and they only tend to do so, the efficacy of MIAs drops to almost zero. "For this reason, my colleagues and I developed a new method called 'Copyrighted Data Identification' (CDI)," says the CISPA researcher. "The key idea behind CDI is that we don't examine individual images, but entire datasets, for example a collection of stock photos or a digital art portfolio."
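The intuition behind this shift, why a whole dataset can be identified even when single images cannot, can be illustrated numerically. In the toy Python sketch below, the membership scores are invented: each individual image's score is swamped by noise, yet averaged over thousands of images the member set separates cleanly from the non-member set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented membership scores: training-set members score only slightly
# higher on average than non-members, far less than the per-image noise.
member_scores = rng.normal(loc=0.10, scale=1.0, size=5000)
nonmember_scores = rng.normal(loc=0.00, scale=1.0, size=5000)

# Per-image decision (the MIA setting): barely better than a coin flip.
threshold = 0.05
per_image_acc = 0.5 * (np.mean(member_scores > threshold)
                       + np.mean(nonmember_scores <= threshold))
print(f"per-image accuracy: {per_image_acc:.2f}")           # ~0.52

# Dataset-level decision (the CDI setting): averaging over the whole
# collection shrinks the noise, so the two sets separate clearly.
print(f"member set mean:     {member_scores.mean():+.3f}")     # ~+0.10
print(f"non-member set mean: {nonmember_scores.mean():+.3f}")  # ~+0.00
```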
How CDI works
To check whether copyright-protected material was used to train an AI model, Kowalczuk designed a four-stage process for CDI. First, two datasets must be assembled: "The first contains images that the data owner believes were used to train this specific model. The second is a so-called validation set, made up of images we are 100% certain were not used in training," explains the researcher. Next, both datasets are run through the AI model to observe its responses. Based on those responses, a model is trained that can predict whether the dataset in question was likely part of the training data. "At the end, a statistical test is performed to see whether the suspect dataset systematically scores higher than the validation set," says Kowalczuk. If it does, that is strong evidence that the AI was trained on those images; if not, the result remains inconclusive.
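A compact Python sketch of that four-stage pipeline follows, with the model-dependent parts hidden behind a hypothetical `score_fn` helper (stages two and three: querying the model and scoring its responses) and a one-sided Mann-Whitney U test as the final statistical step. The paper's actual features and test statistic may differ; this only mirrors the structure described above.

```python
from scipy.stats import mannwhitneyu

def cdi_test(model, suspect_set, validation_set, score_fn, alpha=0.01):
    """Dataset-level membership test in the spirit of CDI.

    score_fn(model, image) is a hypothetical helper that turns the
    model's response to an image into a membership score, e.g. the
    output of a classifier trained on extracted features.
    """
    # Stages 2-3: query the model and score every image in both sets.
    suspect_scores = [score_fn(model, img) for img in suspect_set]
    validation_scores = [score_fn(model, img) for img in validation_set]

    # Stage 4: does the suspect set score systematically higher than the
    # validation set (images known NOT to be in the training data)?
    _, p_value = mannwhitneyu(suspect_scores, validation_scores,
                              alternative="greater")
    if p_value < alpha:
        return "evidence of training on the suspect set", p_value
    return "inconclusive", p_value
```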
The CISPA researcher tested CDI on a suite of existing AI models with available information about their training data, for example models trained on the ImageNet dataset. The results are promising, Kowalczuk reports: "CDI can detect with high accuracy whether a dataset was used in training, even on complex, large-scale models. Even when we are unable to pinpoint the exact images used in training, we can still successfully recognize whether data in the set was used to train the model. CDI also yields reliable results when only a subset of the entire work was included in training."
Obstacles to practical application and deployment
At present, CDI remains a method whose complexity largely confines its use to researchers. "Some of the features we extract require full access to the model and its code," notes Kowalczuk. "Moreover, there are very stringent criteria for the data samples we employ." As a result, CDI currently offers mainly a theoretical proof of concept that it is possible to determine whether a particular set of images was used to train AI models. Developing a user-friendly application for creators without deep technical expertise would require further modifications and advances that, for now, appear technically out of reach. "CDI is still quite young and there is much work to be done. But one thing is clear: once we have better methods, we may someday bridge the gap between theory and practical implementation," the CISPA researcher concludes.
Visualization for the paper "CDI: Copyrighted Data Identification in Diffusion Models" (Copyright: CISPA)
Criteria of this press release:
Journalists, Scientists and scholars
Information technology
transregional, national
Research results
English