Case Study

How Anthropic’s Jailbreak Challenge Put AI Safety Defenses to the Test



Anthropic partnered with HackerOne to run an AI red-teaming challenge against Claude 3.5 Sonnet, testing whether its Constitutional Classifiers could withstand attempts to elicit harmful content, particularly CBRN-related queries. From February 3–10, 2025, 339 researchers probed the system across more than 300,000 interactions using a wide range of jailbreak techniques. Four teams earned a combined $55,000 in bounties, including one that successfully applied a universal jailbreak. The exercise highlighted the value of external researchers in uncovering vulnerabilities and reinforcing AI safety, demonstrating that collaborative red teaming is vital for building secure, resilient AI systems.
