Despite its standing as one of the more well-behaved generative AI models, Anthropic’s Claude 3.5 Sonnet can still be manipulated into producing racist hate speech and malware.
This can be done through persistent prompting that involves emotionally charged language.

A computer science student recently shared chat logs with reporters, showcasing how he managed to bypass the AI’s safeguards.
He reached out to reporters after seeing a report on a study by enterprise AI firm Chatterbox Labs, which concluded that Claude 3.5 Sonnet showed stronger resistance to generating harmful content compared to its competitors.
In their unrefined state, AI models will generate poor or harmful content if such material exists in their training data, which often includes web-scraped content. This issue is widely recognized.
As Anthropic noted in a post last year, “Currently, no one has figured out how to train highly powerful AI systems to consistently be helpful, truthful, and harmless.”
To minimize harmful output, developers of AI models, whether commercial or open-source, apply fine-tuning and reinforcement learning strategies.
These methods are intended to prevent AI from generating harmful text, images, or other content when prompted.
For instance, if you ask a commercial AI model to produce offensive material, it should respond with something akin to, “I’m sorry, I can’t do that.”
Anthropic has detailed Claude 3.5 Sonnet’s performance in its Model Card Addendum. According to the results, the model has undergone robust training, successfully rejecting 96.4% of harmful requests based on Wildchat Toxic test data, as well as in the Chatterbox Labs evaluation mentioned earlier.
Despite these safety measures, a computer science student informed reporters that he managed to circumvent Claude 3.5 Sonnet’s safeguards, making the model generate racist content and malicious code.
After a week of persistent testing, he expressed concern over the effectiveness of Anthropic’s safety protocols and hoped that reporters would publish his findings.
However, as reporters prepared to report the story, the student became worried about potential legal ramifications for “red teaming”—or conducting security research—on the Claude model, and ultimately chose to withdraw from the publication.
The student’s professor, when reached to confirm the claims, supported the decision to step back. The professor, who also preferred to remain anonymous, explained, “I believe the student acted hastily in contacting the media without fully considering the broader implications.
There could be legal or professional risks tied to this work, and publicizing it may bring unwanted attention and liability.”
Kang, after reviewing a harmful chat log linked to the jailbreak, remarked, “It’s common knowledge that all advanced AI models can be manipulated to get around safety mechanisms.”
He further pointed to an example of a Claude 3.5 Sonnet jailbreak that had surfaced on social media.
Kang acknowledged that although he hasn’t examined the specifics of the student’s method, “it’s well recognized in the jailbreaking community that emotional manipulation or role-playing is a typical tactic to circumvent safety protocols.”
He echoed Anthropic’s recognition of AI safety limitations, stating, “In the red-teaming community, it’s generally accepted that no lab has achieved foolproof safety measures for their large language models.”
Kang empathized with the student’s worries regarding the repercussions of highlighting security issues. He is a co-author of a paper released earlier this year titled “A Safe Harbor for AI Evaluation and Red Teaming.”
The paper emphasizes, “Independent evaluation and red teaming are essential for uncovering the risks associated with generative AI systems.”
It further explains that the terms of service and enforcement policies employed by leading AI companies to discourage misuse often discourage sincere safety evaluations.
This environment leads some researchers to worry that their investigations or the disclosure of their findings could trigger account suspensions or legal consequences.
The authors, several of whom have also published a related blog post summarizing the issue, urge major AI developers to commit to protecting those engaged in legitimate public interest security research on AI models. This call for protection extends to researchers examining the security of social media platforms as well.
The authors note, “For instance, OpenAI, Google, Anthropic, and Meta offer bug bounties and even safe harbors.”
However, they point out that companies like Meta and Anthropic maintain the right to determine “final and sole discretion for whether you are acting in good faith and in accordance with this Policy.”
They argue that this ability to make real-time judgments about acceptable conduct, rather than providing clear, assessable guidelines in advance, creates uncertainty and discourages research efforts.
Upon learning of the student’s change of perspective and inquiring whether Anthropic intended to take legal action for the alleged violation of terms of service, a spokesperson refrained from outright denying the possibility of litigation.
Instead, they highlighted the company’s Responsible Disclosure Policy, which “includes Safe Harbor protections for researchers.”
Furthermore, the support page titled “Reporting Harmful or Illegal Content” states, “[W]e welcome reports about safety issues, ‘jailbreaks,’ and similar concerns to enhance the safety and harmlessness of our models.”
Other News You May Like