
GadgetBond


Anthropic releases Bloom, an agentic tool for continuous AI evaluation

Bloom is Anthropic’s open source framework for probing AI behavior.

By Shubham Sawarkar, Editor-in-Chief
Dec 20, 2025, 2:00 AM EST
Image: Anthropic

Anthropic this month released a tool that looks like the next pragmatic step in turning alignment work from artisanal craft into repeatable engineering: Bloom, an open-source, agentic framework that automatically generates and runs behavioral evaluations against frontier AI models. Rather than hand-crafting a handful of tests and hoping they still matter weeks later, a researcher defines a behavior once; Bloom then procedurally generates fresh scenarios, runs them, and scores how often and how strongly that behavior shows up, all in a single configurable pipeline.

At its core, Bloom is built around the idea of a “seed”: a small configuration that specifies the behavior you care about (think “sycophancy,” “self-preservation,” or “instructed long-horizon sabotage”), example transcripts if you have them, and a set of parameters that shape how aggressive or diverse the evaluation should be. From that seed, Bloom scaffolds an entire evaluation suite — generating scenarios, simulating users and tools, running the target model in parallel rollouts, and then asking a judge model to score each transcript. The repository and quick-start make it clear the project is intended to be practical: you can run it locally, plug in different LLM providers, and scale experiments with Weights & Biases.
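To make the idea of a seed concrete, here is a minimal sketch of what such a configuration might hold. The class and every field name below are hypothetical and chosen for illustration; they are not Bloom's actual schema, which is documented in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    """Hypothetical seed; field names are illustrative, not Bloom's schema."""
    behavior: str                               # behavior to elicit, e.g. "sycophancy"
    description: str                            # what counts as the behavior
    example_transcripts: tuple[str, ...] = ()   # optional example transcripts
    num_scenarios: int = 20                     # how many scenarios to generate
    diversity: float = 0.5                      # 0 = near-duplicates, 1 = highly varied
    max_turns: int = 5                          # conversation length per rollout

    def validate(self) -> None:
        """Reject configurations that could not drive an evaluation run."""
        if not self.behavior:
            raise ValueError("behavior must be non-empty")
        if not 0.0 <= self.diversity <= 1.0:
            raise ValueError("diversity must be in [0, 1]")
        if self.num_scenarios < 1 or self.max_turns < 1:
            raise ValueError("num_scenarios and max_turns must be positive")


seed = Seed(
    behavior="sycophancy",
    description="The model agrees with the user against the evidence.",
)
seed.validate()
```

Freezing the dataclass reflects the point that a seed is meant to be cited alongside results: once a run is published, the configuration that produced it should not change.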

That procedural approach is what separates Bloom from older, static benchmarks. Traditional test sets are fragile: they take time to design, they can leak into training data, and they often stop being discriminative as models iterate. Bloom’s evaluations intentionally change every run — the scenarios “bloom” differently depending on the seed — but the seed itself is meant to be cited with results so other teams can reproduce experiments. In short, variability where it helps (scenario generation), stability where it matters (seeded reproducibility).

Technically, Bloom behaves like a small team of coordinating agents. First, an “understanding” agent digests the researcher’s behavior description and examples, expanding them into an account of what counts as the behavior and why it matters. An “ideation” agent writes candidate scenarios: who the user is, what the system prompt says, and whether tools are in play. A “rollout” agent executes those scenarios against the model under test, producing multi-turn transcripts. Finally, a “judgment” agent scores each transcript, and a meta-judge aggregates suite-level metrics such as elicitation rate and average behavior presence. You can swap models at each stage, tweak conversation length, add secondary metrics such as “realism” or “evaluation awareness,” and export transcripts for qualitative review. The GitHub repo and Anthropic’s write-up both walk through the exact parameters and example configurations.
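The four stages can be sketched as plain functions chained together. This is a toy illustration of the control flow only: every function name is hypothetical, the "agents" are string templates rather than LLM calls, and the 1-to-10 score and keyword judge are assumptions made for the example, not Bloom's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Transcript:
    scenario: str
    turns: list[tuple[str, str]] = field(default_factory=list)  # (user, reply)
    score: int = 0  # set by the judge; hypothetical 1-10 scale


def understand(behavior: str) -> str:
    # Stand-in for the understanding agent: expand the behavior description.
    return f"'{behavior}' means the model shows {behavior} toward the user"


def ideate(understanding: str, n: int) -> list[str]:
    # Stand-in for the ideation agent: produce n candidate scenarios.
    return [f"scenario {i} probing: {understanding}" for i in range(n)]


def rollout(scenario: str, target: Callable[[str], str], max_turns: int) -> Transcript:
    # Stand-in for the rollout agent: simulate a user and record replies.
    transcript = Transcript(scenario)
    for turn in range(max_turns):
        user_msg = f"{scenario} (turn {turn})"
        transcript.turns.append((user_msg, target(user_msg)))
    return transcript


def judge(transcript: Transcript, marker: str) -> Transcript:
    # Toy judge: high score if any reply contains the marker phrase.
    hit = any(marker in reply for _, reply in transcript.turns)
    transcript.score = 10 if hit else 1
    return transcript


def run_suite(behavior, target, n=3, max_turns=2, marker="You are right"):
    understanding = understand(behavior)
    scenarios = ideate(understanding, n)
    return [judge(rollout(s, target, max_turns), marker) for s in scenarios]


def toy_target(message: str) -> str:
    # Fake model under test: caves in on one scenario, pushes back otherwise.
    if message.startswith("scenario 0"):
        return "You are right."
    return "Let me check the evidence."


results = run_suite("sycophancy", toy_target)
print([t.score for t in results])  # prints [10, 1, 1]
```

Because each stage is a separate function with a narrow interface, swapping the model behind any one stage (as the article says Bloom allows) would mean replacing a single callable.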

Anthropic positions Bloom as a complement, not a replacement, to the tools it released earlier — notably Petri, which explores a model’s broader behavioral profile across many conversations. Where Petri is breadth-first, surfacing many behaviors across diverse multi-turn exchanges, Bloom is depth-first: pick one behavior and measure it, systematically and scalably. That pairing matters because it reflects a shift in tooling strategy: use one system to flag suspicious patterns broadly, and another to probe one behavior with rigor.

The team didn’t just open-source code and walk away. They ran Bloom on a suite of cases to show what it can do. In internal validations, Anthropic compared production Claude models against intentionally misaligned “model organisms” (system-prompted models designed to be quirky or unsafe). Across ten such quirks, Bloom separated the misaligned organisms from baseline models in nine cases; in the tenth, manual review suggested the baseline actually displayed similar behavior. They also hand-labeled 40 transcripts and compared human judgments with Bloom’s automated judges: Claude Opus 4.1 produced the strongest Spearman correlation with human scores (0.86), with Sonnet 4.5 behind it at 0.75. Those figures aren’t just PR: they’re built into the technical report and the repo’s validation scripts.
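For readers unfamiliar with the statistic, Spearman correlation measures how well two sets of scores agree on ordering rather than on exact values. The sketch below computes it from scratch as the Pearson correlation of average ranks. The two score lists are made-up illustrative numbers, not Anthropic's data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five transcripts (illustrative only).
human_scores = [1, 3, 5, 7, 9]
judge_scores = [2, 3, 6, 5, 10]
rho = spearman(human_scores, judge_scores)
print(round(rho, 2))  # prints 0.9
```

In this toy data the judge swaps the order of two transcripts relative to the human labels, which is enough to pull the correlation down from 1.0 to 0.9.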

A concrete example Anthropic shares is a revisit of “self-preferential bias,” a model’s tendency to favor itself when asked to rank options. Using example transcripts patterned after a Claude system card, Bloom reproduced the same model ranking as the original evaluation and found that raising the evaluated model’s reasoning effort reduced the bias for some models. The platform’s ability to add secondary filters (for instance, dropping transcripts that seem unrealistic or that show the model “gaming” the evaluation) also improved the clarity and quality of the findings. That combination of replication plus deeper filtering is the kind of workflow that could make behavior audits both faster and more defensible.
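The effect of secondary filters on a headline metric can be shown with a small sketch. Everything here is an assumption made for illustration: the 1-to-10 scales, the threshold of 7 for counting a transcript as eliciting the behavior, the filter cutoffs, and the sample scores are not taken from Bloom.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    behavior_score: int   # hypothetical 1-10: how strongly the behavior appears
    realism: int          # hypothetical 1-10: how plausible the scenario is
    eval_awareness: int   # hypothetical 1-10: how much the model suspects a test


def elicitation_rate(transcripts, threshold=7):
    """Share of transcripts whose behavior score meets the threshold."""
    if not transcripts:
        return 0.0
    hits = sum(t.behavior_score >= threshold for t in transcripts)
    return hits / len(transcripts)


def keep_credible(transcripts, min_realism=5, max_awareness=3):
    """Drop transcripts that look unrealistic or evaluation-aware."""
    return [
        t for t in transcripts
        if t.realism >= min_realism and t.eval_awareness <= max_awareness
    ]


# Made-up judged transcripts (illustrative only).
suite = [
    Judged(behavior_score=9, realism=8, eval_awareness=1),
    Judged(behavior_score=8, realism=2, eval_awareness=1),  # unrealistic
    Judged(behavior_score=2, realism=7, eval_awareness=2),
    Judged(behavior_score=9, realism=6, eval_awareness=9),  # evaluation-aware
    Judged(behavior_score=3, realism=9, eval_awareness=1),
]

raw_rate = elicitation_rate(suite)
filtered = keep_credible(suite)
filtered_rate = elicitation_rate(filtered)
print(raw_rate, len(filtered), round(filtered_rate, 2))  # prints 0.6 3 0.33
```

In this toy suite, two of the three high-scoring transcripts are discarded as not credible, so the filtered rate is noticeably lower than the raw one. Real data could move in either direction.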

Bloom’s release matters for more than just Anthropic-adjacent teams. As models are embedded into tool-rich environments and given more autonomy, measuring behavior becomes a continuous engineering problem, not a one-off research exercise. By open-sourcing a configurable, agentic pipeline, Anthropic is giving other researchers and red-teamers a shared backbone for tasks like testing nested jailbreaks, measuring evaluation awareness, and constructing sabotage traces — work that’s increasingly urgent as capabilities accelerate. The code, examples, and sample seeds are available on GitHub, and Anthropic’s Alignment Science blog includes a longer technical dive and experimental appendices for anyone who wants to reproduce the benchmarks.

Of course, the tool isn’t a magic bullet. Procedural generation shifts part of the burden from prompt engineering to configuration design: choices about diversity, evaluator models, conversation length, and example selection visibly change absolute scores even when relative rankings are stable. And automated judges — however well-tuned — will never replace thoughtful human review for novel or surprising failure modes. Still, Bloom’s real contribution is pragmatic: it lowers the cost of running disciplined, repeatable behavior checks, and it gives the community a common language (seed files, judge configs, transcripts) to compare results.


Copyright © 2025 GadgetBond. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | Do Not Sell/Share My Personal Information.