In software engineering, there are normal bugs—a missed semicolon, a memory leak, a runaway loop—and then there are the phantoms. These are the bugs that defy the basic rules of computer science, where a system fails in ways that shouldn’t be physically possible.
Recently, the infrastructure team at OpenAI ran into a phantom. A few months ago, servers powering Rockset—a specialized C++ data system OpenAI acquired in 2024 to handle real-time search for ChatGPT—started crashing inexplicably.
When the engineers opened the core dumps (snapshots of the program’s memory at the exact moment of a crash), what they saw made no sense. Ordinary functions seemed to execute perfectly, but when they finished, they tried to return to bogus memory addresses, prompting the Linux kernel to kill the program. In some cases, the return address slot was empty, reading as NULL. In others, the stack pointer register (which tracks the top of the program’s call stack) was off by exactly eight bytes, as if it had shifted on its own mid-execution.
In a C++ application, crashes of this kind are usually caused by stray memory writes. But a stray write that lands perfectly on a saved return address and corrupts nothing else is phenomenally unlikely. A stack pointer misaligning itself without the involvement of highly specific, low-level code was even stranger. Every hypothesis the engineering team came up with was contradicted by the evidence.
At first, the team took what they called a “doctor” approach. They zeroed in on a handful of isolated crashes, put them under a microscope, and tried to diagnose a root cause based on the hyper-specific details of those few events. Because they assumed all of these bizarre crashes were symptoms of a single underlying illness, they kept finding evidence that contradicted their own theories.
Realizing they were stuck, the team zoomed out and shifted to an “epidemiologist” approach. Instead of looking at individual patients, they looked at the entire population. They had ChatGPT write a script to automatically download, filter, and categorize every single Rockset core dump from the past year.
Once they graphed the cleaned data, the mystery unraveled instantly. They weren’t looking at one impossible bug. They were looking at two completely unrelated bugs that happened to overlap perfectly, creating an investigative nightmare.
The first culprit was a hardware issue. By mapping out the crashes where the stack pointer shifted by eight bytes, the engineers noticed a glaring geographic pattern. The failures weren’t a software flaw at all; they traced back to a single, physically defective Azure server where the CPU simply wasn’t doing math correctly. No matter which virtual machine landed on that specific host, the hardware quietly corrupted the registers. OpenAI denylisted the bad host, and the misaligned-stack crashes vanished.
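The population-level view described above can be sketched in a few lines of C++. This is a minimal illustration, not the script the team used: the `CrashRecord` fields, the signature strings, and the `dominantHost` helper are all hypothetical names chosen for the example. The idea is simply to bucket every crash by its signature and then check whether one host accounts for nearly all crashes in a bucket.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical summary of one core dump. The field names are illustrative
// and are not taken from OpenAI's tooling.
struct CrashRecord {
    std::string host;       // physical host the virtual machine ran on
    std::string signature;  // e.g. "sp_off_by_8" or "return_to_null"
};

using CountTable = std::map<std::string, std::map<std::string, int>>;

// Count crashes per (signature, host) so that clusters stand out.
CountTable countBySignatureAndHost(const std::vector<CrashRecord>& crashes) {
    CountTable counts;
    for (const auto& c : crashes) {
        ++counts[c.signature][c.host];
    }
    return counts;
}

// Return a host that accounts for at least `threshold` (a fraction in
// (0, 1]) of the crashes with the given signature, or "" if none does.
// With a threshold of 0.5 or less, more than one host may qualify; the
// first one in map order is returned.
std::string dominantHost(const CountTable& counts,
                         const std::string& signature, double threshold) {
    assert(threshold > 0.0 && threshold <= 1.0);
    auto it = counts.find(signature);
    if (it == counts.end()) return "";
    int total = 0;
    for (const auto& kv : it->second) total += kv.second;
    for (const auto& kv : it->second) {
        if (kv.second >= threshold * total) return kv.first;
    }
    return "";
}
```

Calling `dominantHost(counts, "sp_off_by_8", 0.9)` on such a table would name a host only if it produced at least 90% of that crash type, which is the kind of concentration that points to hardware rather than software.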
But the return-to-NULL crashes remained, scattered across regions and clusters. With the hardware issue isolated, the team re-examined the remaining core dumps and noticed a new pattern: the crashes were all happening during C++ exception unwinding.
When a C++ program throws an exception, the runtime has to search the call stack for the right code block to handle it. It’s a complex transfer of control that requires restoring CPU registers to their previous states. To handle this, OpenAI’s binary was relying on GNU libunwind, a widely used open-source library.
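For readers less familiar with unwinding, the small C++ sketch below shows the behavior at the language level. It does not touch libunwind directly; the function names, the `FrameGuard` type, and the error message are invented for the example. An exception thrown in `inner` has no handler there or in `middle`, so the runtime unwinds both frames (running their destructors) before control lands in the `catch` block.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Records the order in which things happen during unwinding.
static std::vector<std::string> g_events;

// A guard whose destructor runs when its stack frame is unwound.
struct FrameGuard {
    std::string name;
    explicit FrameGuard(std::string n) : name(std::move(n)) {}
    ~FrameGuard() { g_events.push_back("unwound " + name); }
};

void inner() {
    FrameGuard guard("inner");
    throw std::runtime_error("ingest limit reached");  // hypothetical message
}

void middle() {
    FrameGuard guard("middle");
    inner();  // no handler here, so the search continues up the stack
}

// Runs the demo and returns the recorded events in order.
std::vector<std::string> runUnwindDemo() {
    g_events.clear();
    try {
        middle();
    } catch (const std::runtime_error& e) {
        g_events.push_back(std::string("caught: ") + e.what());
    }
    return g_events;
}
```

The recorded order is innermost frame first, then the outer frame, then the handler. Each of those hops is a point where saved registers, including the stack pointer, have to be restored.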
Digging into the library’s source code, the engineers found a tiny, highly specific race condition in an assembly language routine called _Ux86_64_setcontext. The bug was practically microscopic. The routine updates the stack pointer, and then, in the very next step, reads the return address.
Between those two steps is a vulnerable window exactly one instruction wide. If a system signal arrived at the exact picosecond between those two instructions, the Linux kernel would build a signal frame that overwrote the return address with NULL.
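The shape of that race can be illustrated with a toy model in C++. This is not libunwind’s assembly and makes no claim about its exact memory layout: the `ToyStack` type, the slot numbers, and the choice to have the simulated signal write zeros are all assumptions made for the sketch. The point it shows is that once the stack pointer has moved, any slot below it counts as free space, so a signal delivered before the return address is read can overwrite it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stack: index = slot number, and the stack grows toward lower indices.
// Slots below sp are treated as free space by the simulated kernel.
struct ToyStack {
    std::vector<std::uint64_t> slots;
    std::size_t sp;
};

// Builds a 64-slot stack with a made-up return address stored in slot 40.
ToyStack makeToyStack() {
    ToyStack s;
    s.slots.assign(64, 0xAAAAAAAAu);  // filler for unrelated stack contents
    s.slots[40] = 0x401234;           // hypothetical saved return address
    s.sp = 32;                        // slot 40 is at or above sp: protected
    return s;
}

// Models signal delivery: everything the signal frame and handler use is
// written just below sp. Writing zeros is a simplification.
void deliverSignal(ToyStack& s, std::size_t footprintSlots) {
    assert(footprintSlots <= s.sp);
    for (std::size_t i = s.sp - footprintSlots; i < s.sp; ++i) {
        s.slots[i] = 0;
    }
}

// Step 1 moves sp; step 2 reads the return address from returnSlot.
// If signalInWindow is true, a signal lands between the two steps.
std::uint64_t toySetcontext(ToyStack& s, std::size_t newSp,
                            std::size_t returnSlot, bool signalInWindow,
                            std::size_t footprintSlots) {
    s.sp = newSp;                                          // step 1
    if (signalInWindow) deliverSignal(s, footprintSlots);  // the race
    return s.slots[returnSlot];                            // step 2
}
```

In this model, moving `sp` from 32 to 48 leaves slot 40 unprotected. A signal with a footprint of 4 slots only touches slots 44 to 47 and the read still succeeds, while a footprint of 10 slots reaches slot 40 and the read returns zero. That second case is the toy equivalent of a return to NULL, and it also shows why the amount of stack a signal handler uses can matter.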
On a modern CPU, that one-instruction window stays open for roughly 100 picoseconds. So how was a fraction-of-a-fraction-of-a-second vulnerability crashing servers multiple times a day? And more importantly, the GNU libunwind bug is more than 18 years old. Why was it suddenly taking down OpenAI’s infrastructure in the summer of 2026?
It came down to a perfect storm of scale and timing. Rockset handles massive query workloads and uses C++ exceptions as a standard way to manage data ingest limits, intentionally throwing tens of thousands of exceptions per second. At the same time, OpenAI aggressively uses automated system signals to track query CPU time at the millisecond level.
Recently, the team had tweaked their signal handler, slightly increasing its memory footprint on the stack. That tiny bump in memory usage, combined with the massive volume of exceptions and rapid-fire system signals, pushed the 18-year-old race condition over the edge. At fleet scale, a 100-picosecond window happening millions of times a day practically guaranteed a collision every few hours.
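A back-of-the-envelope model makes the scale argument concrete. The sketch below is a rough estimate under stated assumptions, not a reconstruction of OpenAI’s numbers: it assumes signals arrive independently of exception timing, and the function names and any inputs you feed it are hypothetical. The chance that a signal hits one particular window is roughly the signal rate times the window length, and multiplying by the number of windows opened per second gives the expected overlap rate.

```cpp
#include <cassert>

// Expected number of times per second that a signal lands inside the
// vulnerable window, assuming signals arrive independently of unwinding.
double collisionsPerSecond(double windowSeconds, double signalsPerSecond,
                           double unwindsPerSecond) {
    assert(windowSeconds >= 0.0);
    assert(signalsPerSecond >= 0.0);
    assert(unwindsPerSecond >= 0.0);
    // P(a signal hits one given window) is about rate * window length,
    // which is a good approximation while the product is much less than 1.
    double hitProbability = signalsPerSecond * windowSeconds;
    return hitProbability * unwindsPerSecond;
}

// Average time between overlaps, in seconds.
double secondsBetweenCollisions(double perSecond) {
    assert(perSecond > 0.0);
    return 1.0 / perSecond;
}
```

As an illustration only, a 100-picosecond window (`100e-12`), 1,000 signals per second, and 10,000 unwinds per second give a per-window hit probability of one in ten million and an overlap about every 1,000 seconds for that one process. Not every overlap need end in a crash, so this does not predict the reported crash rate; it shows how a vanishingly small per-event probability turns into a routine occurrence once the event rate is high enough.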
To fix the issue, the team swapped out GNU libunwind for an alternative library that didn’t share the flaw, immediately stabilizing the system. They then upstreamed a patch for the 18-year-old open-source bug to save other developers from the same headache down the road.
Ultimately, the investigation proved that when systems reach a massive scale, traditional debugging isn’t enough. You can stare at assembly code all day, but sometimes you just need to step back, pull the data on the entire population, and let the patterns speak for themselves.
