
GadgetBond


OpenAI quietly built an AI data agent for its own employees

Asking data questions at OpenAI now feels like chatting with an analyst.

By Shubham Sawarkar, Editor-in-Chief
Jan 31, 2026, 7:42 AM EST

We may get a commission from retail offers.
(Image: OpenAI — illustration of a rounded search-style input bar reading “Ask a data question” on a blue-to-purple gradient)

OpenAI has quietly turned one of the most painful parts of modern tech work—digging through data—into a testbed for what an AI “coworker” can actually look like inside a large company. Instead of shipping a new external product, the company has built an in‑house AI data agent that sits on top of its internal data platform and behaves less like a BI dashboard and more like a full‑stack analyst that never logs off.

At a high level, this agent lets OpenAI employees ask messy, open‑ended questions in plain language—anything from “How did ChatGPT usage change after last November’s launch?” to “Which geographies are driving the biggest revenue swings?”—and get back not just a chart, but the entire analytical path: which tables were used, what filters were applied, what assumptions were made, and where the raw data lives. It is powered by GPT-5.2 and Codex, wrapped in a scaffolding of metadata, code analysis, institutional knowledge, and a memory system that helps it learn from past mistakes.

The backdrop here is scale. OpenAI’s internal data platform serves more than 3,500 internal users and spans roughly 600 petabytes across about 70,000 datasets, a size where “just finding the right table” can chew up hours before a single query even runs. Teams across engineering, data science, go‑to‑market, finance, and research all need to answer high‑stakes questions on product health and business performance. In that setting, the traditional workflow—ping a data expert, wait for a dashboard update, iterate over Slack—is both too slow and too fragile. When SQL queries can easily run over 180 lines and still silently misfire because of a bad join or a missing filter, OpenAI is effectively arguing that you can’t fix this with prettier charts; you need a smarter layer in front of the warehouse.

The in‑house data agent is that layer. It’s available wherever OpenAI employees already work: as a Slack agent, a web UI, inside IDEs, via the Codex CLI using the MCP protocol, and directly inside the internal ChatGPT app through an MCP connector. The experience is deliberately conversational. An engineer might ask about NYC taxi trip reliability on a test dataset—“Which pickup‑to‑dropoff ZIP pairs are the most unreliable, and when does that variability occur?”—and the agent will plan the analysis end‑to‑end: define what “unreliable” means (for instance, p50 vs p95 travel times), inspect schemas, select tables, write and run SQL, refine its own approach if intermediate results look wrong, and finally synthesize the findings in human language.
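To make the p50-vs-p95 idea concrete, here is a toy sketch of that reliability analysis in Python rather than the SQL the agent would actually generate; the trip data and the "spread" definition are invented for illustration.

```python
# Toy stand-in for the taxi example: rank ZIP pairs by the gap between
# typical (p50) and tail (p95) travel times. Data is made up.
from statistics import quantiles

trips = {
    ("10001", "10018"): [10, 11, 40],   # minutes per trip for a ZIP pair
    ("10002", "11201"): [20, 21, 22],
}

def spread(times):
    # One plausible definition of "unreliable": a large gap between the
    # median trip time and the 95th-percentile trip time.
    cuts = quantiles(times, n=100, method="inclusive")
    return cuts[94] - cuts[49]   # p95 minus p50

most_unreliable = max(trips, key=lambda pair: spread(trips[pair]))
```

The pair with one 40-minute outlier dominates, which is exactly the kind of variability signal the question is after.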

That self‑correcting loop is one of the more interesting aspects of the system. Instead of following a rigid script, the agent evaluates its progress at every step. If a join unexpectedly returns zero rows, or a filter blows away most of the dataset, it treats that as a signal that something went off the rails, investigates, and tries again—without the user having to manually debug. The idea is to push the tedious iteration that analysts normally do in a notebook or SQL editor into the agent itself, so humans spend more time thinking about whether the metric is the right one, not why the query timed out.
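The shape of that loop can be sketched in a few lines. The `run_sql` callable and the plausibility heuristics below are hypothetical; OpenAI hasn't published the agent's internals.

```python
# Minimal sketch of a self-correcting query loop. The red flags mirror the
# ones named in the article: an empty join result, or a filter that
# discards almost all of the input data.
def looks_wrong(rows, input_rows):
    return len(rows) == 0 or len(rows) < 0.01 * input_rows

def run_with_retries(candidate_queries, run_sql, input_rows):
    """Try candidate queries in order until one returns a plausible result."""
    for attempt, sql in enumerate(candidate_queries, start=1):
        rows = run_sql(sql)
        if not looks_wrong(rows, input_rows):
            return rows, attempt
    return None, len(candidate_queries)

def fake_run_sql(sql):
    # Simulates a first attempt that hits a bad join, then a fixed query.
    return [] if "bad_join" in sql else [("US", 100)]

rows, attempt = run_with_retries(
    ["SELECT ... bad_join ...", "SELECT ... fixed ..."], fake_run_sql, input_rows=100
)
```

The real agent would regenerate the query based on what it learned from the failure rather than walk a fixed candidate list, but the check-then-retry structure is the same.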

Under the hood, the agent’s accuracy depends on how well it is grounded in the messy reality of OpenAI’s data and organization. The company has built a six‑layer context stack that starts with basic metadata and ends with live runtime checks. At the base, the agent ingests table schemas, column types, lineage information, and historical queries, which tells it not just what each dataset looks like, but how humans have historically joined and filtered those tables. On top of that, domain experts annotate tables and columns with plain‑language descriptions, caveats, and business meaning—critical details that never show up in column names.
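The first two layers amount to a structured record per table. A minimal sketch of what that record might hold, with entirely illustrative field and table names:

```python
# Toy model of the base metadata layer plus expert annotations.
# Field names and the example table are invented, not OpenAI's schema.
from dataclasses import dataclass, field

@dataclass
class TableContext:
    name: str
    columns: dict                                  # column name -> type
    lineage: list = field(default_factory=list)    # upstream source tables
    historical_joins: list = field(default_factory=list)  # (table, key) pairs
    annotations: dict = field(default_factory=dict)       # expert notes

trips_table = TableContext(
    name="analytics.taxi_trips",
    columns={"pickup_zip": "string", "minutes": "float"},
    lineage=["raw.taxi_events"],
    historical_joins=[("dim.zip_codes", "pickup_zip")],
    annotations={"minutes": "Wall-clock trip time; excludes cancelled trips."},
)
```

The `historical_joins` field is the interesting part: it encodes how humans have actually combined these tables, which a raw schema alone can never tell you.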

The third layer is where Codex comes in. Instead of treating the warehouse as a black box, the agent crawls the code that actually generates each table, extracting definitions about granularity, primary keys, freshness guarantees, and what the table is supposed to represent. This is what lets it distinguish, for example, between a dataset that contains only first‑party ChatGPT traffic and one that aggregates multiple channels, even if their schemas look similar. That code‑level understanding is continuously refreshed, so the system doesn’t rely on humans to manually keep documentation in sync.
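In miniature, "extracting definitions from pipeline code" might look like the sketch below. The pipeline snippet and the constant names are invented; the real system presumably uses Codex itself rather than regexes, but the input and output are the same shape.

```python
# Hypothetical sketch: pull table-level facts out of the source code that
# builds a dataset. Snippet and constant names are illustrative only.
import ast
import re

pipeline_src = '''
# Builds analytics.chat_traffic: first-party ChatGPT traffic only.
PRIMARY_KEY = ("date", "user_id")
GRANULARITY = "daily"
'''

def extract_fact(src, name):
    match = re.search(rf"^{name}\s*=\s*(.+)$", src, flags=re.MULTILINE)
    return ast.literal_eval(match.group(1)) if match else None

granularity = extract_fact(pipeline_src, "GRANULARITY")
primary_key = extract_fact(pipeline_src, "PRIMARY_KEY")
```

Facts recovered this way (granularity, keys, what the table is supposed to contain) live in the code and nowhere else, which is the whole argument for this layer.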

Layer four is “institutional knowledge”: the sea of Slack threads, Google Docs, and Notion pages where real‑world nuance lives—launch timelines, incident postmortems, internal codenames, and canonical metric definitions. The agent indexes those documents as embeddings, permissioned and filtered at query time, so it can, for instance, explain a sudden dip in connector usage by linking it to a logging issue after a specific launch rather than misclassifying it as a business problem.
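The "permissioned and filtered at query time" part is worth spelling out: access control is applied before ranking, so a user can never retrieve a document they couldn't open directly. A sketch, with invented documents and ACL groups, and naive term overlap standing in for real embedding similarity:

```python
# Sketch of permission-aware retrieval: filter by ACL first, then rank.
# Documents, groups, and the scoring function are all stand-ins.
def retrieve(docs, user_groups, query_terms, k=2):
    visible = [d for d in docs if d["acl"] & user_groups]   # filter at query time
    return sorted(
        visible,
        key=lambda d: len(set(d["text"].lower().split()) & query_terms),
        reverse=True,
    )[:k]

docs = [
    {"text": "Postmortem: connector logging gap after launch", "acl": {"eng"}},
    {"text": "Finance close checklist for Q4", "acl": {"finance"}},
]
hits = retrieve(docs, user_groups={"eng"}, query_terms={"connector", "logging"})
```

An engineering user here retrieves only the postmortem; the finance document is filtered out before scoring ever happens.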

The fifth layer is memory, and this is where the system starts to feel like a coworker who learns. When a user corrects the agent or adds an important nuance—say, the exact way an experiment is gated or the subtle filter needed to get the “official” version of a metric—the agent can propose saving that as a memory for future runs. Those memories can be global or personal, edited manually, and are specifically aimed at storing non‑obvious constraints that are hard to infer from schemas or logs alone. The effect is that recurring questions get more accurate over time, rather than the agent rediscovering the same fixes again and again.
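A minimal sketch of that memory mechanism, assuming a simple two-scope store; class and method names here are invented:

```python
# Toy memory store: corrections can be saved globally (visible to everyone)
# or personally (visible only to the user who saved them).
class MemoryStore:
    def __init__(self):
        self.global_memories = []
        self.personal = {}   # user -> list of memories

    def propose(self, text, user, scope="personal"):
        memory = {"text": text, "proposed_by": user}
        if scope == "global":
            self.global_memories.append(memory)
        else:
            self.personal.setdefault(user, []).append(memory)
        return memory

    def context_for(self, user):
        # What gets injected into a future run for this user.
        return self.global_memories + self.personal.get(user, [])

store = MemoryStore()
store.propose("Official DAU excludes internal test accounts.", "ana", scope="global")
store.propose("My team's dashboards use UTC dates.", "ana")
```

The global memory above is exactly the kind of "official version of a metric" constraint the article describes: obvious to a human analyst, invisible in any schema.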

Finally, when everything else falls short—if a table has no prior usage or the existing context looks stale—the agent can still drop down to runtime context, issuing live queries to inspect schemas, sample data, and talk to surrounding systems like metadata services, Airflow, or Spark. Every day, OpenAI runs an offline pipeline to aggregate all those context layers into a normalized representation, convert it into embeddings using its own embeddings API, and store it for fast retrieval. At query time, the agent performs retrieval‑augmented generation, pulling only the most relevant slices of embedded context before writing SQL and hitting the warehouse, so it can stay responsive even when operating over tens of thousands of tables.
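The retrieval step itself is standard RAG machinery: rank embedded context slices by cosine similarity to the embedded question, keep the best few. A toy sketch, with hand-made three-dimensional vectors standing in for real embeddings:

```python
# Toy RAG retrieval: pick the context slice whose embedding is closest
# (by cosine similarity) to the question embedding. Vectors are invented.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

context_slices = {
    "taxi_trips schema + annotations": [1.0, 0.1, 0.0],
    "revenue model doc":               [0.0, 0.2, 1.0],
}
question_vec = [0.9, 0.2, 0.1]   # pretend embedding of a taxi question
best = max(context_slices, key=lambda k: cosine(context_slices[k], question_vec))
```

At OpenAI's scale the point of this step is budget: only the most relevant slices of a 70,000-table context store ever reach the model's prompt.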

If all that sounds heavyweight for “just answering data questions,” that’s the point. OpenAI positions the agent less as a one‑shot answer bot and more as a teammate you can reason with. It carries conversation context across turns, supports clarification and mid‑course corrections, and doesn’t freeze when instructions are vague. If you ask about business growth without a date range, it can assume sensible windows like the last week or month while still prompting you when that assumption might matter. When the company noticed that teams were repeatedly running the same analyses—weekly business reviews, standard table validations—it wrapped those into reusable workflows inside the agent, encoding best practices so that anyone can spin up those reports without rebuilding the mental model from scratch.
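The "sensible default window" behavior is a small but telling design choice: assume something reasonable, but record that you assumed it so the user can be warned. A sketch with invented function and field names:

```python
# Sketch: if no date range is given, default to the last 7 days and
# flag the assumption so it can be surfaced to the user.
from datetime import date, timedelta

def resolve_window(start=None, end=None, today=date(2026, 1, 31)):
    assumed = start is None and end is None
    end = end or today
    start = start or end - timedelta(days=7)
    return start, end, assumed

start, end, assumed = resolve_window()
```

Returning `assumed` alongside the window is what lets the agent "still prompt you when that assumption might matter" instead of silently baking it into the answer.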

All of this raises an obvious question: how do you keep such a system from quietly drifting out of alignment as the codebase, schemas, and business evolve? OpenAI’s answer is to treat evaluation like unit testing. The team uses the company’s Evals API to define curated sets of question–answer pairs, each representing an important metric or analytical pattern, and pairs them with a “golden” SQL query that expresses the expected behavior. For each eval, the natural‑language question is sent to the agent’s query generation endpoint, the resulting SQL is executed, and its output is compared with the golden result—not by naive string matching, but by comparing dataframes and feeding both SQL and result differences into a grader that scores correctness and acceptable variation. Those evals run continuously during development as canaries, catching regressions before they show up in day‑to‑day use.
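The dataframe-level comparison is the key detail: two queries can produce textually different SQL and differently ordered output yet be equally correct. A sketch of that check, assuming pandas is available; the example data is invented:

```python
# Sketch of value-level eval comparison: judge the agent's result against
# the golden result as normalized dataframes, not by string-matching SQL.
import pandas as pd

golden = pd.DataFrame({"region": ["US", "EU"], "revenue": [100.0, 80.0]})
agent  = pd.DataFrame({"revenue": [80.0, 100.0], "region": ["EU", "US"]})

def results_match(a, b):
    cols = sorted(a.columns)
    if sorted(b.columns) != cols:
        return False
    # Normalize column order and row order before comparing values.
    a = a[cols].sort_values(cols).reset_index(drop=True)
    b = b[cols].sort_values(cols).reset_index(drop=True)
    return a.equals(b)

match = results_match(golden, agent)
```

Here the agent's output has different column and row order than the golden result, but the values agree, so the eval passes; the article's grader then additionally scores how much variation is acceptable.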

Security is another non‑negotiable piece. OpenAI emphasizes that the agent does not circumvent existing permissions; it acts purely as an interface layer, inheriting the same access controls that govern the underlying data. If a user doesn’t have access to a table, the agent can’t see it either, and will either flag the lack of permissions or fall back to alternatives the user is allowed to query. At the same time, it is designed to be transparent about how it reached its conclusions: each answer is accompanied by a summary of its reasoning, the assumptions it made, and direct links to the query results so users can inspect and verify the raw data.
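That "interface layer" model reduces to a simple invariant: check the user's existing grants before touching a table, and fall back to what they can see. A sketch with invented table names and ACL groups:

```python
# Sketch of permission inheritance: the agent never widens access, it only
# checks it. Tables and groups below are hypothetical.
TABLE_ACLS = {
    "finance.revenue": {"finance"},
    "analytics.usage": {"eng", "finance"},
}

def visible_tables(user_groups):
    return sorted(t for t, acl in TABLE_ACLS.items() if acl & user_groups)

def query_or_fallback(table, user_groups):
    if TABLE_ACLS.get(table, set()) & user_groups:
        return f"querying {table}"
    # Flag the missing permission and offer tables the user may query.
    return f"no access to {table}; alternatives: {visible_tables(user_groups)}"
```

Because the check happens with the requesting user's groups, the agent can't be used as a side channel around warehouse permissions.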

OpenAI’s write‑up also pulls back the curtain on what the team learned building this kind of agent. One lesson: giving the model access to every possible tool sounded powerful, but turned out to be confusing; overlapping capabilities made it harder for the agent to pick the right path, so the team cut and consolidated tools instead of adding more. Another lesson: overly prescriptive prompts actually hurt performance. Encoding rigid step‑by‑step instructions for every analytical task forced the agent down wrong paths when the real‑world data shape didn’t match the template, whereas higher‑level goals plus GPT-5’s own reasoning produced better, more flexible behavior. And perhaps the most pragmatic takeaway: “meaning lives in code.” Schemas and historical queries tell you what a table looks like and how it has been used, but only the pipelines that create it reveal the assumptions, freshness guarantees, and business intent that make or break an analysis.

Although this particular agent is explicitly internal—OpenAI stresses that it is not an external product but an in‑house tool tailored to its own data, permissions, and workflows—it reads like a blueprint for what the next generation of analytics stacks might look like. Instead of a patchwork of dashboards, ad hoc notebooks, and tribal knowledge, you get an AI layer that understands the warehouse, the code, and the company’s language well enough to feel like another analyst on the team, just one that can calmly reason over 600 petabytes at 2 am. For everyone watching how AI will reshape day‑to‑day technical work, OpenAI’s data agent is less about a flashy new model and more about what you can build when you treat reasoning, context, and memory as first‑class infrastructure.

