Mindgard's founder Peter Garraghan described the output as "very gruesome, sometimes sexualised, sometimes both together". Researcher Jim Nightingale, who led the testing, said he was left "shaken, and in tears" by what the system produced.
The exploit is a form of adversarial prompting. Mindgard took a widely shared, harmless prompt intended for comedy and made small alterations to the instruction text. The crucial detail: the modified prompt did not explicitly specify the disturbing subject matter. The AI generated the gory and sexualized content "of its own volition" from what appeared to be an innocuous instruction.
This built on Mindgard's earlier research, which showed that ChatGPT's image safeguards could also be bypassed through memory manipulation, in which custom user memory and system prompt context override safety filters without any backend access or model modification.
Mindgard alerted OpenAI to the vulnerability in May 2026. The company initially responded with only an automated reply. After the BBC inquired, OpenAI stated it had "introduced additional safeguards against this type of prompt". The company said it employs multiple layers of image safety protections combining automated systems with human review.
However, Mindgard found that further small changes to the prompt wording still produced concerning content through the same bypass, even after OpenAI's fixes.