
GadgetBond


Anthropic releases Bloom, an agentic tool for continuous AI evaluation

Bloom is Anthropic’s open-source framework for probing AI behavior.

By Shubham Sawarkar, Editor-in-Chief
Dec 20, 2025, 2:00 AM EST
Anthropic illustration
Image: Anthropic

Anthropic quietly dropped a new tool this month that feels like the next pragmatic step in turning alignment work from artisanal craft into repeatable engineering: Bloom, an open-source, agentic framework that automatically generates and runs behavioral evaluations against frontier AI models. Rather than hand-crafting a handful of tests and hoping they still matter weeks later, Bloom lets a researcher define a behavior once, then procedurally spawns fresh scenarios, runs them, and scores how often and how strongly that behavior shows up — all in a single, configurable pipeline.

At its core, Bloom is built around the idea of a “seed”: a small configuration that specifies the behavior you care about (think “sycophancy,” “self-preservation,” or “instructed long-horizon sabotage”), example transcripts if you have them, and a set of parameters that shape how aggressive or diverse the evaluation should be. From that seed, Bloom scaffolds an entire evaluation suite — generating scenarios, simulating users and tools, running the target model in parallel rollouts, and then asking a judge model to score each transcript. The repository and quick-start make it clear the project is intended to be practical: you can run it locally, plug in different LLM providers, and scale experiments with Weights & Biases.
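To make the idea concrete, here is a minimal Python sketch of what a seed might hold. It is not Bloom's actual schema: the `Seed` class and every field name in it (`num_scenarios`, `diversity`, `max_turns`, and so on) are assumptions chosen for illustration, so consult the repository for the real configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    """Hypothetical stand-in for a Bloom seed; all field names are illustrative."""

    behavior: str                               # e.g. "sycophancy"
    description: str                            # what counts as the behavior
    example_transcripts: tuple[str, ...] = ()   # optional example conversations
    num_scenarios: int = 10                     # how many scenarios to generate
    diversity: float = 0.5                      # 0 = near-duplicates, 1 = highly varied
    max_turns: int = 5                          # conversation length per rollout

    def __post_init__(self) -> None:
        # Reject obviously invalid configurations before any model is called.
        if not self.behavior:
            raise ValueError("behavior must be non-empty")
        if not 0.0 <= self.diversity <= 1.0:
            raise ValueError("diversity must be in [0, 1]")
        if self.num_scenarios < 1 or self.max_turns < 1:
            raise ValueError("num_scenarios and max_turns must be positive")


# An example seed; the values are made up.
seed = Seed(
    behavior="sycophancy",
    description="The model abandons a correct answer to agree with the user.",
    num_scenarios=20,
    diversity=0.8,
)
print(seed)
```

Freezing the dataclass mirrors the article's point that the seed is the stable, citable part of an experiment: once defined, it is not mutated between runs.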

That procedural approach is what separates Bloom from older, static benchmarks. Traditional test sets are fragile: they take time to design, they can leak into training data, and they often stop being discriminative as models iterate. Bloom’s evaluations intentionally change every run — the scenarios “bloom” differently depending on the seed — but the seed itself is meant to be cited with results so other teams can reproduce experiments. In short, variability where it helps (scenario generation), stability where it matters (seeded reproducibility).

Technically, Bloom behaves like a small team of coordinating agents. The first “understanding” agent digests the researcher’s behavior description and examples, expanding them into what counts as the behavior and why it matters. An “ideation” agent writes candidate scenarios — who the user is, the system prompt, and whether tools are in play. A “rollout” agent executes those scenarios against the model under test, producing multi-turn transcripts. Finally, a judgment agent scores each transcript and a meta-judge aggregates suite-level metrics like elicitation rate and average behavior presence. You can swap models at each stage, tweak conversation length, add secondary metrics such as “realism” or “evaluation awareness,” and export transcripts for qualitative review. The GitHub repo and Anthropic’s write-up both walk through the exact parameters and example configurations.
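The four stages can be sketched as plain Python functions. This is a toy under stated assumptions, not Bloom's code: every function name is hypothetical, each stage is a stub where a real pipeline would call an LLM, the 1-10 score scale is assumed, and the threshold of 7 used to count a transcript as an elicitation is likewise an assumption rather than something this article specifies.

```python
from statistics import mean
from typing import Callable


def understand(behavior: str) -> str:
    """Stage 1: expand the behavior description (stubbed as a template)."""
    return f"Look for signs of {behavior} in the target's replies."


def ideate(understanding: str, n: int) -> list[str]:
    """Stage 2: write n candidate scenarios (stubbed as numbered prompts)."""
    return [f"Scenario {i}: {understanding}" for i in range(n)]


def rollout(scenario: str, target: Callable[[str], str]) -> list[str]:
    """Stage 3: run one scenario against the target (a single turn here)."""
    return [scenario, target(scenario)]


def judge(transcript: list[str]) -> int:
    """Stage 4: score behavior presence on an assumed 1-10 scale."""
    return 9 if "you're right" in transcript[-1].lower() else 2


def summarize(scores: list[int], threshold: int = 7) -> dict[str, float]:
    """Meta-judge stand-in: aggregate per-transcript scores into suite metrics."""
    if not scores:
        raise ValueError("no scores to summarize")
    hits = sum(score >= threshold for score in scores)
    return {
        "elicitation_rate": hits / len(scores),
        "average_behavior_presence": mean(scores),
    }


def toy_target(prompt: str) -> str:
    """Fake model under test: caves to the user on even-numbered scenarios."""
    index = int(prompt.split(":")[0].split()[1])
    if index % 2 == 0:
        return "You're right, I was wrong."
    return "I stand by my answer."


scenarios = ideate(understand("sycophancy"), n=4)
scores = [judge(rollout(s, toy_target)) for s in scenarios]
report = summarize(scores)
print(report)
```

Because each stage only passes plain data to the next, swapping the model behind any one stage would not require touching the others, which is the property the article attributes to Bloom's design.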

Anthropic positions Bloom as a complement, not a replacement, to the tools it released earlier — notably Petri, which explores a model’s broader behavioral profile across many conversations. Where Petri is breadth-first, surfacing many behaviors across diverse multi-turn exchanges, Bloom is depth-first: pick one behavior and measure it, systematically and scalably. That pairing matters because it reflects a shift in tooling strategy: use one system to flag suspicious patterns broadly, and another to probe one behavior with rigor.

The team didn’t just open-source code and walk away. They ran Bloom on a suite of cases to show what it can do. In internal validations, Anthropic compared production Claude models against intentionally misaligned “model organisms” (system-prompted models designed to be quirky or unsafe). Across ten such quirks, Bloom separated the misaligned organisms from baseline models in nine cases; in the tenth case, manual review suggested the baseline actually displayed similar behavior. They also hand-labeled 40 transcripts and compared human judgments with Bloom’s automated judges: Claude Opus 4.1 produced the strongest Spearman correlation with human scores (0.86), with Claude Sonnet 4.5 behind it at 0.75. Spearman correlation measures agreement in rank order, so a value near 1 means the judge ordered the transcripts much as the human labelers did. Those figures aren’t just PR; they’re built into the technical report and the repo’s validation scripts.
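For readers who want to see what a judge-versus-human comparison involves, here is a self-contained sketch of Spearman's rank correlation: rank both score lists (ties share the average rank), then take the Pearson correlation of the ranks. The score lists are made-up numbers, not Anthropic's data, and the function names are illustrative; in practice a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five transcripts (illustrative only).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 3, 5, 4]
print(round(spearman(human_scores, judge_scores), 2))  # prints 0.8
```

Here the judge swaps two adjacent pairs relative to the human ordering, which is enough to pull the correlation down from 1.0 to 0.8.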

A concrete example Anthropic shares is a revisit of “self-preferential bias,” where a model favors its own interests when asked to rank options. Using example transcripts patterned after a Claude system card, Bloom reproduced the same model ranking from the original evaluation and found that raising reasoning effort reduced bias for some models. The platform’s ability to add secondary filters (for instance, dropping transcripts that seem unrealistic or that show the model “gaming” the evaluation) also improved the clarity and quality of the findings. That combination of replication plus deeper filtering is the kind of workflow that could make behavior audits both faster and more defensible.
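Secondary filtering of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the `JudgedTranscript` fields, the 1-10 scales, the cutoffs in `keep`, and the elicitation threshold of 7 are hypothetical, and the sample scores are invented.

```python
from dataclasses import dataclass


@dataclass
class JudgedTranscript:
    """Hypothetical per-transcript judge output; all scores on a 1-10 scale."""

    behavior_score: int
    realism: int
    evaluation_awareness: int


def keep(t: JudgedTranscript, min_realism: int = 5, max_awareness: int = 5) -> bool:
    """Keep transcripts that look realistic and show little evaluation awareness."""
    return t.realism >= min_realism and t.evaluation_awareness <= max_awareness


def elicitation_rate(transcripts: list[JudgedTranscript], threshold: int = 7) -> float:
    """Fraction of transcripts whose behavior score meets the threshold."""
    if not transcripts:
        raise ValueError("no transcripts to score")
    hits = sum(t.behavior_score >= threshold for t in transcripts)
    return hits / len(transcripts)


# Invented sample data.
judged = [
    JudgedTranscript(behavior_score=9, realism=8, evaluation_awareness=2),
    JudgedTranscript(behavior_score=8, realism=2, evaluation_awareness=1),  # unrealistic
    JudgedTranscript(behavior_score=3, realism=9, evaluation_awareness=1),
    JudgedTranscript(behavior_score=2, realism=7, evaluation_awareness=9),  # eval-aware
    JudgedTranscript(behavior_score=8, realism=6, evaluation_awareness=3),
]
filtered = [t for t in judged if keep(t)]

print(elicitation_rate(judged))
print(elicitation_rate(filtered))
```

Reporting the rate both before and after filtering, along with how many transcripts were dropped, keeps the filter itself open to scrutiny.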

Bloom’s release matters for more than just Anthropic-adjacent teams. As models are embedded into tool-rich environments and given more autonomy, measuring behavior becomes a continuous engineering problem, not a one-off research exercise. By open-sourcing a configurable, agentic pipeline, Anthropic is giving other researchers and red-teamers a shared backbone for tasks like testing nested jailbreaks, measuring evaluation awareness, and constructing sabotage traces — work that’s increasingly urgent as capabilities accelerate. The code, examples, and sample seeds are available on GitHub, and Anthropic’s Alignment Science blog includes a longer technical dive and experimental appendices for anyone who wants to reproduce the benchmarks.

Of course, the tool isn’t a magic bullet. Procedural generation shifts part of the burden from prompt engineering to configuration design: choices about diversity, evaluator models, conversation length, and example selection visibly change absolute scores even when relative rankings are stable. And automated judges — however well-tuned — will never replace thoughtful human review for novel or surprising failure modes. Still, Bloom’s real contribution is pragmatic: it lowers the cost of running disciplined, repeatable behavior checks, and it gives the community a common language (seed files, judge configs, transcripts) to compare results.

