OpenAI puts cash bounties on AI safety failures

OpenAI is widening the bug bounty lens again, but this time it’s not just hunting for classic security flaws like XSS or account‑takeovers—it’s asking the internet to help break its AI in ways that could actually hurt people in the real world. The new Safety Bug Bounty program, announced on March 25, 2026, is explicitly about abuse and safety risks in AI behavior, not just bugs in code, and it sits alongside the company’s existing, more traditional security bounty on Bugcrowd.

If you’ve followed OpenAI for a while, this move feels like a logical next chapter rather than a surprise plot twist. The company has been running a standard security bug bounty since 2023, paying researchers to report vulnerabilities in ChatGPT, its APIs, and infrastructure—everything from access control issues to data exposure bugs. Over time, that effort has expanded, with reward ceilings going up and new AI‑focused security programs layered on top, including specialized “bio bug bounty” challenges for GPT-5 and the ChatGPT Agent designed to probe biological misuse risks. The new Safety Bug Bounty effectively takes that philosophy—crowdsourcing scrutiny—and points it straight at AI agents and safety failures that may not look like conventional infosec vulnerabilities at all.

At the heart of this new program is a simple question: what happens when AI systems stop being just text predictors and start acting like agents that browse the web, move data around, and take actions on your behalf? OpenAI’s answer is to explicitly pay people to find out how that can go wrong before attackers do. The scope reads like a checklist of emerging AI threat models—third‑party prompt injection, data exfiltration via agents, and scenarios where an OpenAI agent reliably does something it very clearly shouldn’t. The emphasis isn’t on minor policy bypasses or coaxing the model into saying rude things; OpenAI is asking for issues that could lead to “plausible and material harm,” which is an unusually blunt bar for a public bounty.

Prompt injection is one of the core worries here, and for good reason. As OpenAI’s products move into more agentic territory—think ChatGPT browsing the web, interacting with APIs, or running through a toolchain—those agents can easily come across untrusted text on websites, in emails, or in documents. If that text can reliably hijack the agent, override the user’s instructions, and, say, leak sensitive data or trigger a harmful workflow, that’s no longer a theoretical concern; it’s an abuse vector. OpenAI’s rules even quantify this: to count as an in‑scope issue, a prompt‑injection style attack has to be reproducible at least 50 percent of the time, which is a pretty practical way to separate flukes from reliable exploitation.

Beyond prompt injection, the program also calls out any case where an OpenAI agent performs a disallowed action on OpenAI’s own infrastructure “at scale.” That could range from triggering automated actions across many accounts to mass scraping or manipulating internal systems via the agent layer itself. There’s also a more open‑ended bucket: any other potentially harmful action by an agent that leads to real‑world risk, as long as the reporter can show a concrete path to harm and a clear remediation step. It’s a notable shift away from traditional security scope documents that tend to be asset‑driven; here, the unit of analysis is “could someone get hurt if this is abused?” rather than “is this endpoint vulnerable?”.

Another interesting inclusion is OpenAI’s own proprietary information. The program is explicitly interested in model generations that leak proprietary reasoning details or other internal data that shouldn’t be exposed. That could include internal reasoning traces, system prompts, or implementation details that give away how certain safety or alignment systems are wired. In practice, this blurs the line between model safety and corporate confidentiality: a model that can be coaxed into dumping its own guts is both a security problem and a safety issue, because that information can be weaponized to build better jailbreaks or mimic protected capabilities.

There’s also a slice of the program dedicated to “account and platform integrity.” Here, OpenAI is inviting reports around things like bypassing anti‑automation measures, gaming trust signals, or evading bans and restrictions—essentially all the mechanics that keep abusive users from scaling up their activity. If an issue lets someone access features or data beyond their permissions, that’s still routed to the classic Security Bug Bounty, but anything that erodes the integrity of the platform’s defenses is fair game under the safety umbrella. That split mirrors how a lot of big platforms now separate product‑abuse teams from pure infosec, but it’s rare to see it codified so clearly in a public bounty scope.

Notably, jailbreaks—the sport of tricking models into saying disallowed things—are officially out of scope for this particular program. That might sound counterintuitive until you look at how OpenAI has started carving out specialized campaigns for high‑stakes harm types instead. For biorisk, for example, the company has run invite‑only Bio Bug Bounties where researchers compete to find a “universal jailbreak” that can push GPT-5 or the ChatGPT Agent through a ten‑level biology and chemistry safety challenge, with rewards up to $25,000. The Safety Bug Bounty is more like the generalist front door, while those bio programs are precision tools aimed at a narrow, very sensitive slice of misuse.

If you’re hoping to get paid for every clever content‑policy bypass you discover, you’re probably going to be disappointed. OpenAI is pretty clear that generic content‑policy violations without demonstrable real‑world safety or abuse impact fall outside the rewardable scope. Examples they give include jailbreaks that just make the model rude or produce information easily available via a search engine—annoying, maybe, but not exactly catastrophic. However, they leave themselves a bit of wiggle room: if a researcher can show that a flaw directly facilitates user harm and propose actionable, discrete remediation steps, OpenAI may still treat it as in scope on a case‑by‑case basis. That’s a subtle but important signal that substance matters more than clever screenshots.

Under the hood, this whole thing runs on Bugcrowd, the same platform that’s been managing OpenAI’s core bug bounty since 2023. Researchers apply through a dedicated Safety Bug Bounty engagement page, where submissions are triaged by OpenAI’s Safety and Security Bug Bounty teams and routed to the right program depending on scope. That infrastructure has already been battle‑tested: Bugcrowd’s OpenAI program has handled everything from low‑severity nuisances to high‑impact findings and has a reputation for fast triage and clear rules about what’s in or out of bounds. For the safety‑focused program, that same machinery now gets pointed at the messier problem of AI abuse.

Zoom out, and the Safety Bug Bounty is part of a broader pattern: OpenAI steadily externalizing more of its safety work instead of treating it as a black box handled only by internal teams. The GPT-5 Bio Bug Bounty, for example, openly acknowledges that the company expects jailbreaking attempts and wants expert outsiders to try to defeat its safeguards before a full‑scale rollout. Similarly, the Agent Bio Bug Bounty around the ChatGPT Agent models is framed as an opportunity for red‑teamers and biosecurity specialists to stress‑test safety systems in a controlled way, again with clear prize money on the table. For the wider AI ecosystem, that mix of public bounties and targeted invite‑only challenges is likely to be watched—and copied—by other labs that are inching toward similarly powerful agentic systems.

Of course, a bug bounty is not a magic shield. Critics will point out that paying hackers to find issues doesn’t address deeper concerns around data use, corporate incentives, or the sheer speed at which increasingly capable models are being deployed. There’s also the question of how transparent OpenAI will be about the problems this program turns up; not every safety bug can be responsibly disclosed in public without risking copycat abuse. Still, in a landscape where many AI companies talk vaguely about safety but keep the details locked away, formalizing a dedicated, AI‑specific safety bounty—complete with clear scope, concrete harm thresholds, and integration into an existing security pipeline—is a tangible step rather than just another blog‑post promise.

For researchers, this is an invitation to think less like a penetration tester and more like a hybrid of security engineer, abuse analyst, and sociotechnical risk modeler. The reports OpenAI is asking for aren’t just stack traces and PoCs; they’re narratives about how a specific failure mode in an AI agent can be turned into scaled harm, plus practical ideas for fixing it. And for users and regulators watching from the sidelines, the Safety Bug Bounty is a small but telling indicator of where AI risk conversations are heading: away from “does this model ever say something wrong?” toward “what happens when this model, as an agent, is wired into everything else we do?”.