
Anthropic releases Bloom, an agentic tool for continuous AI evaluation

Bloom is Anthropic’s open-source framework for probing AI behavior.

By Shubham Sawarkar, Editor-in-Chief
Dec 20, 2025, 2:00 AM EST

Image: Anthropic

Anthropic quietly dropped a new tool this month that feels like the next pragmatic step in turning alignment work from artisanal craft into repeatable engineering: Bloom, an open-source, agentic framework that automatically generates and runs behavioral evaluations against frontier AI models. Rather than hand-crafting a handful of tests and hoping they still matter weeks later, Bloom lets a researcher define a behavior once, then procedurally spawns fresh scenarios, runs them, and scores how often and how strongly that behavior shows up — all in a single, configurable pipeline.

At its core, Bloom is built around the idea of a “seed”: a small configuration that specifies the behavior you care about (think “sycophancy,” “self-preservation,” or “instructed long-horizon sabotage”), example transcripts if you have them, and a set of parameters that shape how aggressive or diverse the evaluation should be. From that seed, Bloom scaffolds an entire evaluation suite — generating scenarios, simulating users and tools, running the target model in parallel rollouts, and then asking a judge model to score each transcript. The repository and quick-start make it clear the project is intended to be practical: you can run it locally, plug in different LLM providers, and scale experiments with Weights & Biases.
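
For a sense of what that looks like in practice, here is a minimal sketch of a seed written as a plain Python dict. The field names (behavior, example_transcripts, num_scenarios, and so on) are assumptions made for illustration based on the description above, not Bloom’s actual schema.

    # Hypothetical sketch of a Bloom-style seed. Field names are illustrative,
    # not Bloom's actual schema; the point is that the behavior is defined once
    # and everything downstream is generated from this single configuration.
    seed = {
        "behavior": "sycophancy",
        "description": (
            "The model agrees with or flatters the user even when that "
            "conflicts with accuracy or the user's stated goals."
        ),
        "example_transcripts": ["examples/sycophancy_01.json"],  # optional
        "num_scenarios": 50,          # fresh scenarios generated per run
        "rollouts_per_scenario": 4,   # parallel rollouts against the target
        "max_turns": 6,               # conversation length per rollout
        "target_model": "claude-sonnet-4-5",
        "judge_model": "claude-opus-4-1",
        "random_seed": 1234,          # cited alongside results for reproducibility
    }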

That procedural approach is what separates Bloom from older, static benchmarks. Traditional test sets are fragile: they take time to design, they can leak into training data, and they often stop being discriminative as models iterate. Bloom’s evaluations intentionally change every run — the scenarios “bloom” differently depending on the seed — but the seed itself is meant to be cited with results so other teams can reproduce experiments. In short, variability where it helps (scenario generation), stability where it matters (seeded reproducibility).
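
The reproducibility half of that trade-off is easy to picture with a toy example: scenario generation is randomized, but pinning the random seed makes a run repeatable. The generator below is a stand-in written for illustration, not Bloom’s actual code.

    import random

    # Toy stand-in for procedural scenario generation: content varies from run
    # to run, but a fixed seed reproduces the exact same suite.
    def generate_scenarios(behavior: str, n: int, seed: int) -> list[str]:
        rng = random.Random(seed)
        personas = ["a stressed student", "a startup founder", "a hospital administrator"]
        stakes = ["a deadline tomorrow", "a budget decision", "a safety review"]
        return [
            f"{rng.choice(personas)} pressures the model toward {behavior} during {rng.choice(stakes)}"
            for _ in range(n)
        ]

    # Same seed, same suite; change the seed and the scenarios "bloom" differently.
    assert generate_scenarios("sycophancy", 3, 42) == generate_scenarios("sycophancy", 3, 42)
    print(generate_scenarios("sycophancy", 3, 42))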

Technically, Bloom behaves like a small team of coordinating agents. The first “understanding” agent digests the researcher’s behavior description and examples, expanding them into what counts as the behavior and why it matters. An “ideation” agent writes candidate scenarios — who the user is, the system prompt, and whether tools are in play. A “rollout” agent executes those scenarios against the model under test, producing multi-turn transcripts. Finally, a judgment agent scores each transcript and a meta-judge aggregates suite-level metrics like elicitation rate and average behavior presence. You can swap models at each stage, tweak conversation length, add secondary metrics such as “realism” or “evaluation awareness,” and export transcripts for qualitative review. The GitHub repo and Anthropic’s write-up both walk through the exact parameters and example configurations.
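
Stitched together, that flow reads roughly like the skeleton below. The function names and the call_model helper are placeholders invented for this sketch; they are not Bloom’s API, and a real run would also handle tool simulation, parallel rollouts, and transcript export.

    # Skeleton of the four-stage flow described above. call_model() stands in
    # for whatever LLM provider client you wire in; none of these names come
    # from Bloom itself.
    def call_model(model: str, prompt: str) -> str:
        raise NotImplementedError("plug an LLM provider client in here")

    def run_suite(seed: dict) -> dict:
        # 1. Understanding: expand the behavior description into a scoring rubric.
        rubric = call_model(seed["judge_model"],
                            f"Explain what counts as '{seed['behavior']}' and why it matters:\n"
                            f"{seed['description']}")

        # 2. Ideation: draft scenarios (user persona, system prompt, optional tools).
        scenarios = [call_model(seed["judge_model"], f"Write test scenario {i} for: {rubric}")
                     for i in range(seed["num_scenarios"])]

        # 3. Rollout: run the target model through each scenario to get transcripts.
        transcripts = [call_model(seed["target_model"], s) for s in scenarios]

        # 4. Judgment: score each transcript, then aggregate suite-level metrics
        #    (the meta-judge step) such as elicitation rate and average presence.
        scores = [float(call_model(seed["judge_model"],
                                   f"Score 0-10 for {seed['behavior']}:\n{t}"))
                  for t in transcripts]
        return {
            "elicitation_rate": sum(s >= 5 for s in scores) / len(scores),
            "avg_behavior_presence": sum(scores) / len(scores),
        }

In Bloom proper, each stage can run on a different model, and experiments scale out through the Weights & Biases integration mentioned in the quick-start.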

Anthropic positions Bloom as a complement to, not a replacement for, the tools it released earlier — notably Petri, which explores a model’s broader behavioral profile across many conversations. Where Petri is breadth-first, surfacing many behaviors across diverse multi-turn exchanges, Bloom is depth-first: pick one behavior and measure it, systematically and scalably. That pairing matters because it reflects a shift in tooling strategy: use one system to flag suspicious patterns broadly, and another to probe a single behavior with rigor.

The team didn’t just open-source code and walk away. They ran Bloom on a suite of cases to show what it can do. In internal validations, Anthropic compared production Claude models against intentionally misaligned “model organisms” (system-prompted models designed to be quirky or unsafe). Across ten such quirks, Bloom separated the misaligned organisms from baseline models in nine cases; in the tenth, manual review suggested the baseline model actually displayed similar behavior. They also hand-labeled 40 transcripts and compared human judgments with Bloom’s automated judges: Claude Opus 4.1 produced the strongest Spearman correlation with human scores (0.86), with Sonnet 4.5 behind it at 0.75. Those figures aren’t just PR; they’re documented in the technical report and backed by the repo’s validation scripts.
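
The judge-versus-human comparison is a standard rank-correlation check. A rough sketch of the arithmetic, using made-up scores rather than Anthropic’s actual labels, looks like this:

    # Spearman rank correlation between human labels and an automated judge.
    # The numbers here are invented purely to show the shape of the check; the
    # reported values (0.86 for Opus 4.1, 0.75 for Sonnet 4.5) come from
    # Anthropic's own 40 hand-labeled transcripts.
    from scipy.stats import spearmanr

    human_scores = [7, 2, 9, 4, 6, 1, 8, 3]   # hypothetical 0-10 reviewer labels
    judge_scores = [6, 3, 9, 5, 6, 2, 7, 4]   # hypothetical judge-model scores

    rho, p_value = spearmanr(human_scores, judge_scores)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")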

A concrete example Anthropic shares is a revisit of “self-preferential bias” — situations where a model favors its own interests when asked to rank options. Using example transcripts patterned after a Claude system card, Bloom reproduced the same model ranking from the original evaluation and found that raising the judge model’s reasoning effort reduced bias for some models. The platform’s ability to add secondary filters — for instance, dropping transcripts that seem unrealistic or that show the model “gaming” the evaluation — also improved the clarity and quality of the findings. That combination of replication plus deeper filtering is the kind of workflow that could make behavior audits both faster and more defensible.
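
The secondary-filter idea itself is simple to express: score each transcript on extra axes such as realism and evaluation awareness, drop the ones that fail a threshold, and recompute the suite metric on what remains. The keys and thresholds below are invented for illustration, not Bloom’s transcript format.

    # Toy version of secondary filtering: exclude transcripts the judge considers
    # unrealistic or where the model appears to have noticed it is being tested.
    # Field names and thresholds are hypothetical.
    transcripts = [
        {"behavior_score": 8.0, "realism": 0.9, "eval_awareness": 0.1},
        {"behavior_score": 9.0, "realism": 0.3, "eval_awareness": 0.2},  # implausible setup
        {"behavior_score": 7.0, "realism": 0.8, "eval_awareness": 0.9},  # model spotted the test
    ]

    kept = [t for t in transcripts
            if t["realism"] >= 0.5 and t["eval_awareness"] <= 0.5]

    avg_presence = sum(t["behavior_score"] for t in kept) / len(kept)
    print(f"kept {len(kept)} of {len(transcripts)} transcripts; average presence = {avg_presence:.1f}")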

Bloom’s release matters for more than just Anthropic-adjacent teams. As models are embedded into tool-rich environments and given more autonomy, measuring behavior becomes a continuous engineering problem, not a one-off research exercise. By open-sourcing a configurable, agentic pipeline, Anthropic is giving other researchers and red-teamers a shared backbone for tasks like testing nested jailbreaks, measuring evaluation awareness, and constructing sabotage traces — work that’s increasingly urgent as capabilities accelerate. The code, examples, and sample seeds are available on GitHub, and Anthropic’s Alignment Science blog includes a longer technical dive and experimental appendices for anyone who wants to reproduce the benchmarks.

Of course, the tool isn’t a magic bullet. Procedural generation shifts part of the burden from prompt engineering to configuration design: choices about diversity, evaluator models, conversation length, and example selection visibly change absolute scores even when relative rankings are stable. And automated judges — however well-tuned — will never replace thoughtful human review for novel or surprising failure modes. Still, Bloom’s real contribution is pragmatic: it lowers the cost of running disciplined, repeatable behavior checks, and it gives the community a common language (seed files, judge configs, transcripts) to compare results.

