Kaggle Benchmarks now run locally

Kaggle Benchmarks just became a lot more interesting – and a lot more local. With Google and Kaggle rolling out proper local development for Kaggle Benchmarks, you can now design, debug, and ship AI evaluations straight from your editor or coding agent, instead of wrestling with browser forms and glued-together scripts.

If that sounds like a small workflow tweak, it isn’t. It’s a pretty big signal about where AI evaluation is heading: away from static leaderboards and toward something that looks a lot more like software development.

For years, Kaggle has been the place you went to compete on machine learning problems or grab datasets for your next side project. Now it wants to be “the world’s AI proving ground,” and Kaggle Benchmarks is the product carrying that ambition. Benchmarks, in Kaggle’s vocabulary, are structured evaluations: you define tasks that probe a specific capability – say, mathematical reasoning, tool use, or multimodal understanding – then group those tasks into a benchmark and run a lineup of models against it.

Community Benchmarks, launched earlier this year, opened that process up to anyone. You could go to the Kaggle UI, create a task, add some models, and get a tidy leaderboard that shows who’s winning on your custom evaluation. Now, the new local development update essentially takes that same pipeline and plugs it directly into your local stack: VS Code, Cursor, Antigravity, Claude Code, or whatever editor and AI agent you live in.

Under the hood, the experience is powered by a few components: the Kaggle Benchmarks Python SDK, the Kaggle CLI, and a “write-kaggle-benchmarks” skill that AI agents can load to scaffold tasks and push them to Kaggle. The basic idea is simple: write Python code that defines a task, run it locally to validate that it behaves the way you expect, then push it to Kaggle where it joins the public benchmark ecosystem and can be run against a roster of frontier models.

One of the clearest walkthroughs of this workflow comes from a demo by Nick Kang, a product manager on Kaggle Benchmarks. In the video, he installs the “write-kaggle-benchmarks” skill, initializes his environment, and asks his coding agent to build a test that checks whether a model can correctly read a colorblindness test image and output the number hidden in the dots. The agent responds by generating a Python file that uses the Kaggle Benchmarks SDK: it pulls in the image URL, defines the expected behavior (“output only the number as a single digit”), and sets an assertion that the model’s response must contain the number 6.
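The shape of such a task can be sketched in plain Python. The snippet below is not the Kaggle Benchmarks SDK API: `VisionTask`, `run_task`, `fake_model`, and the image URL are hypothetical stand-ins that only illustrate the prompt, instruction, and assertion pattern described in the demo.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-in for a task definition; the real SDK's classes and
# decorators may look quite different.
@dataclass
class VisionTask:
    name: str
    image_url: str
    instruction: str
    expected: str

    def check(self, response: str) -> bool:
        """Pass if the expected digit appears anywhere in the response."""
        return self.expected in response


def run_task(task: VisionTask, model: Callable[[str, str], str]) -> bool:
    """Send the instruction and image URL to the model, then apply the check."""
    response = model(task.instruction, task.image_url)
    return task.check(response)


# Placeholder URL, not the image used in the demo.
task = VisionTask(
    name="colorblindness-plate",
    image_url="https://example.com/plate.png",
    instruction="Output only the number as a single digit.",
    expected="6",
)


def fake_model(instruction: str, image_url: str) -> str:
    """Stub that stands in for a real model call so the sketch runs offline."""
    return "6"


passed = run_task(task, fake_model)
print("assertion passed" if passed else "assertion failed")
```

Note that a containment check is lenient: a response such as "16" would also pass, so a stricter task might compare the stripped response for equality instead.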

From there, it’s classic software development muscle memory. He runs the Python file locally, which calls Kaggle’s model routing infrastructure via a local proxy, and gets immediate feedback: assertion passed, plus token counts, latency, and cost. That’s an important detail – you’re not just seeing correctness, you’re seeing performance characteristics baked into the developer loop. Once he’s happy with the behavior, he runs a CLI command to push the task to Kaggle, where it becomes a persistent entry in the Benchmarks platform.

This is where the local story connects back to the cloud. Once the task lives on Kaggle, you can run it across multiple models and get a consolidated view of pass/fail, latency, cost, and token usage. In the demo, the lineup is Gemini 3 Flash Preview, Gemini 3.5 Flash, GPT 5.5, and Anthropic’s Opus 4.7. The results show up both in your local CLI and in the Kaggle UI, so you can stay in the terminal or share a polished leaderboard with colleagues.
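To make the idea of a consolidated view concrete, here is a rough offline sketch of running one check against several models and collecting pass/fail, latency, and a token count. Everything in it is an assumption for illustration: the model names are generic stubs, `evaluate` and `RunResult` are hypothetical, and the whitespace word count is only a crude proxy for real token usage. Cost is left out because it depends on provider pricing that the platform reports for you.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    model: str
    passed: bool
    latency_s: float
    output_tokens: int  # crude proxy: whitespace-separated words


def evaluate(
    models: Dict[str, Callable[[str], str]], prompt: str, expected: str
) -> List[RunResult]:
    """Call each model once and record correctness plus basic run metrics."""
    results = []
    for name, call in models.items():
        start = time.perf_counter()
        response = call(prompt)
        latency = time.perf_counter() - start
        results.append(
            RunResult(name, expected in response, latency, len(response.split()))
        )
    return results


# Stub models standing in for real hosted models.
models = {
    "model-a": lambda prompt: "6",
    "model-b": lambda prompt: "I think it is 8",
}

results = evaluate(models, "Output only the number as a single digit.", "6")
for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(
        f"{r.model:<10} {status:<5} {r.latency_s * 1000:8.3f} ms "
        f"{r.output_tokens:>4} tokens"
    )
```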

From a developer experience perspective, the biggest change is who’s in the driver’s seat: you or your AI agent. Kaggle isn’t just assuming you’ll hand-code every evaluation; it’s explicitly designing for a world where you describe the benchmark in natural language and let an agent generate the code, wire up the SDK, validate the task, and push it. In a LinkedIn post announcing the feature, Kaggle highlighted this exact pattern: describe what you want, have your agent write the code and validate locally, then run the benchmark against “every SOTA model in one go” with pass/fail and cost metrics flowing back into the agent’s panel.

If you’ve spent the last year watching eval frameworks proliferate – from bespoke internal harnesses at big labs to open source projects focused on LLM evals – Kaggle’s move slots neatly into that landscape. The twist is that this isn’t yet another standalone framework. It’s a bridge between your local environment and a shared, discoverable ecosystem of benchmarks. Tasks you push don’t just live on your laptop; they can be made public, grouped into benchmarks, and used by other developers or even model labs as signals for model improvements.

Scale matters here. Since Kaggle Benchmarks launched in January 2026, the community has already created around 10,000 tasks to measure AI model capabilities. That volume is both a strength and a challenge. On one hand, it suggests a broad appetite for custom evals – everything from “can this model solve Olympiad-style math problems?” to “does this agent follow safety constraints in tool calls?” On the other, it raises the question of how to separate high-quality tasks from noisy or redundant ones. Kaggle’s bet seems to be that better tooling – including local development and a well-structured SDK – will nudge creators toward more rigorous, reproducible evaluations.

The SDK itself is designed with that rigor in mind. According to the library’s documentation, it gives you a structured way to define tasks, specify how models should be called, and assert correctness of outputs, rather than leaving you to hack together ad hoc scripts. Combine that with the download options in the Kaggle UI – where you can export leaderboard data as CSVs for your own analysis – and you start to get something that feels closer to an analytics pipeline than a leaderboard screenshot.
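As a sketch of what that kind of analysis might look like, the snippet below summarizes a small leaderboard export with Python's standard `csv` module. The column names and every row are made up for illustration; a real export from Kaggle may use a different schema.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; real column names and values from Kaggle may differ.
SAMPLE_CSV = """model,task,passed,latency_ms,cost_usd
model-a,task-1,true,820,0.0012
model-a,task-2,false,910,0.0015
model-b,task-1,true,1430,0.0040
model-b,task-2,true,1510,0.0044
"""


def summarize(csv_text: str) -> dict:
    """Compute per-model pass rate and total cost from leaderboard rows."""
    totals = defaultdict(lambda: {"runs": 0, "passes": 0, "cost": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["model"]]
        entry["runs"] += 1
        entry["passes"] += row["passed"].strip().lower() == "true"
        entry["cost"] += float(row["cost_usd"])
    return {
        model: {
            "pass_rate": e["passes"] / e["runs"],
            "total_cost_usd": round(e["cost"], 6),
        }
        for model, e in totals.items()
    }


summary = summarize(SAMPLE_CSV)
for model, stats in summary.items():
    print(model, stats)
```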

It’s also worth situating this in Google’s broader AI story. Kaggle is owned by Google, and the Benchmarks product sits in a world where Google’s Gemini models are competing head-to-head with OpenAI, Anthropic, and others. By making it trivial for developers to run their own evaluations across “frontier” models and share the results publicly, Kaggle is, in effect, encouraging a decentralized eval culture: instead of waiting for lab-authored benchmarks or marketing claims, the community can define the tests that matter to them and see how models actually behave.

The Community Benchmarks launch earlier in the year framed this explicitly. Kaggle pitched it as a way for the “global AI community” to design and share custom benchmarks that go beyond static accuracy scores and better reflect real-world model behavior. The workflow there is very similar: create tasks, group them into a benchmark, add models, and get a leaderboard. The new local dev tools are essentially the missing piece for developers who don’t want to live inside a web form for every iteration.

What does this look like for an everyday developer or data scientist in practice? Imagine you’re building a retrieval-augmented system for a finance app and you’re worried about hallucinations around regulations or dates. You could define a set of tasks – each one a prompt plus a ground truth and a validation function – and run them locally, or in your CI pipeline, using the Kaggle Benchmarks SDK. When you’re confident in the behavior, you push those tasks to Kaggle, run them across multiple models, and decide whether it’s time to switch providers or adjust your prompting strategy.
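The "prompt plus ground truth plus validation function" structure is easy to sketch without any SDK at all. In the example below, `EvalCase`, `dates_match`, `pass_rate`, the threshold, and the rules FIN-001 and FIN-002 are all fictional and exist only for illustration; the stub model returns canned answers so the snippet runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    ground_truth: str
    validate: Callable[[str, str], bool]


def dates_match(response: str, ground_truth: str) -> bool:
    """Pass only if the response contains at least one ISO date and every
    ISO date it contains equals the ground truth."""
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", response)
    return set(dates) == {ground_truth}


def pass_rate(cases: List[EvalCase], model: Callable[[str], str]) -> float:
    """Fraction of cases whose validation function accepts the model output."""
    passed = sum(case.validate(model(case.prompt), case.ground_truth) for case in cases)
    return passed / len(cases)


# Fictional rules and dates, used only to exercise the validator.
cases = [
    EvalCase("When does fictional rule FIN-001 take effect? Answer as YYYY-MM-DD.",
             "2025-01-15", dates_match),
    EvalCase("When was fictional rule FIN-002 repealed? Answer as YYYY-MM-DD.",
             "2023-06-30", dates_match),
]

CANNED = {
    cases[0].prompt: "It takes effect on 2025-01-15.",
    cases[1].prompt: "It was repealed on 2023-07-01.",  # deliberately wrong
}


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return CANNED[prompt]


THRESHOLD = 0.9  # hypothetical gate for a CI job
rate = pass_rate(cases, stub_model)
exit_code = 0 if rate >= THRESHOLD else 1  # a CI script would sys.exit(exit_code)
print(f"pass rate: {rate:.0%}, exit code: {exit_code}")
```

Keeping the validation function separate from the case data means the same cases can be reused with a stricter or looser check as the product's tolerance for errors changes.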

Or maybe you’re part of an AI safety team and want to stress-test models on risky instructions or jailbreak attempts. You could design tasks that encode red-team scenarios, assert safe behavior, and publish them as a benchmark that others can reuse or extend. Local development means you can iterate quickly – change one edge case, rerun locally, inspect logs, and only push when the task behaves exactly the way you intend.

From a tooling perspective, there’s also an interesting angle around AI agents themselves. Kaggle explicitly calls out “coding agents” and environments like Cursor and Antigravity as first-class citizens in this workflow. That suggests a near-future pattern where agents don’t just help you write application logic, but also maintain your evaluation suite: they generate new tasks when you describe a failure mode, refactor existing ones, and keep your benchmarks in sync with how your product is evolving.

There are, of course, open questions. Not every developer wants to expose their internal evals to the public, even if Kaggle offers private options. Some teams will prefer self-hosted frameworks for compliance reasons, or to keep sensitive test cases away from public view. And the quality of community benchmarks will likely be uneven – just as with any open platform, there will be gems and there will be rough drafts.

But the direction of travel is clear: evaluations are becoming more modular, more shareable, and more integrated into everyday developer workflows. Kaggle Benchmarks’ move to local development is one more step in that direction, and it’s a fairly pragmatic one. It doesn’t ask you to abandon your tools or adopt a brand new mental model. It just lets you treat benchmarks like code – versioned, testable, and portable between your laptop and the cloud.

For the AI ecosystem, that’s significant. Benchmarks shape incentives. They decide what “good” looks like, and in a space where model capabilities are advancing quickly, having a living, community-driven benchmark ecosystem – one that can be edited in a pull request or spun up from a prompt to an AI agent – might be one of the few ways to keep evaluation even remotely in step with the models themselves.

If you’re already running your own ad hoc eval scripts, the question is less “should you care?” and more “how much of this pipeline do you want to outsource to a shared platform?” With local Kaggle Benchmarks, you don’t have to answer that all at once. You can start by wrapping a single test in the SDK, run it locally, and see whether pushing it to a global leaderboard actually helps your team make better decisions about which models to trust – and where they still fall short.