idw - Informationsdienst Wissenschaft
Generative AI models can be prompted with just a few words to insert offensive or discriminatory text into images. Aditya Kumar from the SPRINT-ML Lab at the CISPA Helmholtz Center for Information Security is investigating how such outputs can be reliably prevented. To address this, he developed ToxicBench, a benchmark for evaluating how well image-generating AI systems handle offensive inputs. He also created a fine-tuning strategy to adapt the models accordingly. The results were presented in the paper “Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images” at the 40th AAAI Conference on Artificial Intelligence in Singapore.
AI image generators such as Stable Diffusion have revolutionized meme creation: countless new images can be generated within seconds. Originally a subcultural phenomenon, memes have now become an integral part of communication on social networks and in the digital public sphere. Their distinctive feature is the combination of images and text.
“Memes contain text captions that are embedded directly into images,” Kumar explains. Problems arise when these texts include insults or discriminatory content. Kumar and his team therefore wanted to find out how the generation of such problematic text in AI-generated images, whether memes or other image types, can be controlled.
Existing Safety Detectors Reach Their Limits
“We first looked at available image safety detectors. They were developed to detect so-called NSFW (‘not safe for work’) content,” the researcher explains. “While they work very well for offensive visual content in images, they reach their limits when it comes to unsafe text.” The reason is that visual safety detectors operate at the pixel level and are not designed to detect unsafe text embedded in images. “They can recognize visual features such as nudity, but they do not understand the semantic meaning of text embedded within images,” Kumar says. The study therefore identifies embedded text as a distinct safety risk area that has largely been overlooked by previous NSFW approaches.
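The gap can be illustrated with a minimal sketch of a purely visual check, using a CLIP-based zero-shot classifier as a stand-in for the NSFW detectors mentioned above (the model and the concept labels are illustrative assumptions, not the detectors evaluated in the study):

```python
from PIL import Image, ImageDraw
from transformers import pipeline

# An otherwise unremarkable image that carries an offensive word as rendered text.
img = Image.new("RGB", (512, 512), "white")
ImageDraw.Draw(img).text((40, 240), "<offensive word here>", fill="black")

# A pixel-level check scores the image against visual safety concepts only;
# it has no notion of what the rendered word means.
visual_check = pipeline("zero-shot-image-classification",
                        model="openai/clip-vit-base-patch32")
print(visual_check(img, candidate_labels=["nudity", "violence",
                                          "an image containing text"]))
```

The scores reflect visual concepts such as nudity or violence; whether the rendered word itself is offensive is never part of the question.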
A New Fine-Tuning Strategy Against Offensive Text
To address this problem, the researcher developed a novel fine-tuning strategy that specifically targets the models’ text-generation layers. “Normally, an unsafe prompt produces an unsafe image,” Kumar explains. “Our approach ensures that the same prompt generates a safe image.” In this process, the problematic word is replaced with a neutral one while preserving the overall image composition. “Instead of generating an offensive word, the model is optimized toward a specific harmless target image that is similar to the original word,” Kumar says. This additional training modifies the internal layers of the diffusion model itself, so the safeguard is built into the model weights and remains effective over time. Since the method only alters a small number of the models’ layers (out of up to 40 layers in total), most of the image generation process remains unchanged, ensuring that image quality is not affected.
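A minimal sketch of this idea, assuming a Stable Diffusion checkpoint loaded with the diffusers library; restricting updates to the cross-attention (text-conditioning) layers and using a standard denoising loss toward the harmless target image are illustrative stand-ins, not the exact recipe from the paper:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae = pipe.unet, pipe.vae

# Freeze everything, then unfreeze only the text-conditioning (cross-attention)
# layers: a small fraction of the network, so overall image quality is preserved.
for p in unet.parameters():
    p.requires_grad_(False)
trainable = [p for n, p in unet.named_parameters() if "attn2" in n]
for p in trainable:
    p.requires_grad_(True)

optimizer = torch.optim.AdamW(trainable, lr=1e-5)

def training_step(unsafe_prompt_embeds, safe_target_image):
    """One step: the unsafe prompt is pulled toward a harmless target image that
    keeps the composition but swaps the offensive word for a neutral one.
    safe_target_image is a preprocessed tensor (B, 3, H, W) scaled to [-1, 1]."""
    latents = vae.encode(safe_target_image).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, pipe.scheduler.config.num_train_timesteps,
                      (latents.shape[0],))
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=unsafe_prompt_embeds).sample
    loss = torch.nn.functional.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because only the unfrozen parameters receive gradient updates, the bulk of the network, and with it the general image generation behavior, stays as it was before fine-tuning.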
ToxicBench: Dataset and Evaluation Pipeline
To increase its value for the research community, Kumar has released ToxicBench, which includes both a benchmark dataset and an associated evaluation pipeline. The dataset contains 218 prompt templates, 437 unsafe words paired with harmless alternatives, more than 73,000 training image pairs and more than 21,000 test image pairs. “The evaluation pipeline works in two steps,” Kumar explains. “First, a diffusion model generates an image. Then the text contained in the image is extracted using optical character recognition (OCR) and evaluated by a toxicity classifier.” The study also introduces new metrics that specifically measure how much the generated text changes without degrading image quality. This makes it possible to check whether models produce unsafe text. If necessary, the fine-tuning strategy can then be applied to optimize the model. The work therefore not only provides a concrete safety method but also introduces, for the first time, a standardized measurement framework for toxic text in generated images.
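A minimal sketch of such a two-step evaluation, assuming pytesseract for the OCR step and a publicly available toxicity classifier; the prompt template and model choices are illustrative and not necessarily the components used in ToxicBench:

```python
import pytesseract
from diffusers import StableDiffusionPipeline
from transformers import pipeline as hf_pipeline

# Build a prompt in the benchmark's style: a template with a word slot.
template = 'a meme of a cat with the caption "{word}"'  # illustrative, not a ToxicBench template
prompt = template.format(word="<word under test>")

# Step 1: generate an image with the diffusion model under evaluation.
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = sd(prompt).images[0]

# Step 2: extract the embedded text with OCR and score it with a toxicity classifier.
extracted_text = pytesseract.image_to_string(image)
toxicity = hf_pipeline("text-classification", model="unitary/toxic-bert")
print(extracted_text, toxicity(extracted_text))
```

Filling the word slot once with an unsafe word and once with its harmless counterpart from the benchmark yields image pairs of the kind described above.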
Applications and Outlook
Open-source models such as Stable Diffusion are widely used by startups and developers. ToxicBench, which is freely available on GitHub, can be used directly for safety evaluation or for fine-tuning purposes. “This is particularly relevant for educational applications or publicly accessible systems,” Kumar emphasizes. The modified models themselves have not yet been released. Looking ahead, Kumar and his colleagues plan to remove unsafe content more comprehensively—not just unsafe text. “We are also working on improving scalability and applying our approach to newer diffusion models,” the CISPA researcher concludes.
https://publications.cispa.de/articles/conference_contribution/Beautiful_Images_...
Visualization to the paper "Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images" (Copyright: CISPA)
Characteristics of this press release:
Journalists, all interested persons
Information technology
transregional
Research results
English
