GadgetBond


The epidemiology of a software crash

A routine infrastructure crash became a year-long investigation when OpenAI engineers realized their server issues were actually two separate, phantom bugs colliding at scale.

By Shubham Sawarkar, Editor-in-Chief
Jul 1, 2026, 6:38 AM EDT
Illustration showing two technology components feeding into a warning dialog, representing how combined software and hardware interactions can trigger system errors or security vulnerabilities.
Image: OpenAI

In software engineering, there are normal bugs—a missed semicolon, a memory leak, a runaway loop—and then there are the phantoms. These are the bugs that defy the basic rules of computer science, where a system fails in ways that shouldn’t be physically possible.

Recently, the infrastructure team at OpenAI ran into a phantom. A few months ago, servers powering Rockset—a specialized C++ data system OpenAI acquired in 2024 to handle real-time search for ChatGPT—started crashing inexplicably.

When the engineers cracked open the core dumps (snapshots of the program’s memory at the exact moment of a crash), what they saw made no sense. Normal functions seemed to execute perfectly, but when they finished, they tried to return to bogus memory addresses, prompting the Linux kernel to kill the program. In some cases, the return address slot was simply empty, reading as NULL. In others, the stack pointer register, which tracks the top of the current call stack in memory, was off by exactly eight bytes, as if it had shifted on its own mid-execution.

In a C++ application, crashes like these are usually caused by stray memory writes. But a stray write that lands perfectly on a saved return address and corrupts nothing else is phenomenally unlikely. A stack pointer misaligning itself without the help of highly specific, low-level code was even stranger. Every hypothesis the engineering team threw at the wall bounced right back.

At first, the team took what they called a “doctor” approach. They zeroed in on a handful of isolated crashes, put them under a microscope, and tried to diagnose a root cause based on the hyper-specific details of those few events. Because they assumed all of these bizarre crashes were symptoms of a single underlying illness, they kept finding evidence that contradicted their own theories.

Realizing they were stuck, the team zoomed out and shifted to an “epidemiologist” approach. Instead of looking at individual patients, they looked at the entire population. They had ChatGPT write a script to automatically download, filter, and categorize every single Rockset core dump from the past year.
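As a rough illustration of the population-level approach, the sketch below buckets crash summaries by signature and counts each bucket. The record fields (`return_address`, `sp_offset`), the host names, and the bucket rules are hypothetical; the article does not describe the format of OpenAI’s script or its data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CrashSummary:
    # Hypothetical fields, assumed to be distilled from a core dump by an earlier step.
    host: str
    return_address: int  # address the crashing function tried to return to
    sp_offset: int       # bytes between the observed and the expected stack pointer


def classify(crash: CrashSummary) -> str:
    """Assign a crash to a coarse bucket based on its signature."""
    if crash.sp_offset == 8:
        return "misaligned-stack"
    if crash.return_address == 0:
        return "return-to-null"
    return "other"


def tally(crashes):
    """Count crashes per bucket."""
    return Counter(classify(c) for c in crashes)


# Made-up sample data, for illustration only.
crashes = [
    CrashSummary("host-a", 0x0, 0),
    CrashSummary("host-b", 0x7F00DEAD, 8),
    CrashSummary("host-b", 0x7F00BEEF, 8),
    CrashSummary("host-c", 0x0, 0),
    CrashSummary("host-d", 0x41414141, 0),
]
print(tally(crashes))
```

Once every crash carries a bucket label, the counts can be broken down by host, region, or time to look for patterns that a handful of hand-picked dumps would hide.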

Once they graphed the cleaned data, the mystery unraveled instantly. They weren’t looking at one impossible bug. They were looking at two completely unrelated bugs that happened to overlap perfectly, creating an investigative nightmare.

The first culprit was a hardware issue. By mapping out the crashes where the stack pointer shifted by eight bytes, the engineers noticed a glaring geographic pattern. The failures weren’t a software flaw at all; they traced back to a single, physically defective Azure server where the CPU simply wasn’t doing math correctly. No matter which virtual machine landed on that specific host, the hardware quietly corrupted the registers. OpenAI denylisted the bad host, and the misaligned-stack crashes vanished.
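A pattern like this can be surfaced with a simple group-by. The sketch below assumes crash records shaped as `(host, category)` pairs, which is a hypothetical layout, and flags any host that accounts for nearly all crashes in one category. The 0.9 threshold and the host names are arbitrary choices for the example.

```python
from collections import Counter


def dominant_host(records, category, threshold=0.9):
    """Return the host behind at least `threshold` of a category's crashes, or None."""
    hosts = Counter(host for host, cat in records if cat == category)
    total = sum(hosts.values())
    if total == 0:
        return None
    host, count = hosts.most_common(1)[0]
    return host if count / total >= threshold else None


# Made-up sample data: one category piles up on one host, the other is spread out.
records = (
    [("host-17", "misaligned-stack")] * 12
    + [
        ("host-03", "return-to-null"),
        ("host-17", "return-to-null"),
        ("host-42", "return-to-null"),
        ("host-58", "return-to-null"),
    ]
)
print(dominant_host(records, "misaligned-stack"))  # host-17
print(dominant_host(records, "return-to-null"))    # None
```

A category that collapses onto a single machine points at hardware, while one that stays spread across the fleet points back at software.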

But the return-to-NULL crashes remained, scattered across regions and clusters. With the hardware issue isolated, the team re-examined the remaining core dumps and noticed a new pattern: the crashes were all happening during C++ exception unwinding.

When C++ code throws an exception, the runtime has to walk back up the program’s call stack, frame by frame, to find the code block that handles it. It’s a complex transfer of control that requires restoring CPU registers to their previous states. To handle this, OpenAI’s binary was relying on GNU libunwind, a widely used open-source library.
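The handler search can be pictured with a toy model. The sketch below is not how libunwind or any C++ runtime is implemented; it only mimics the idea of walking frames from the innermost call outward until one declares a matching handler. The `Frame` type, the function names, and the `IngestLimit` exception name are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    function: str
    handles: Optional[str] = None  # exception type this frame can catch, if any


def find_handler(stack, exception_type):
    """Walk from the innermost frame outward; return (handler frame, frames discarded)."""
    for discarded, frame in enumerate(reversed(stack)):
        if frame.handles == exception_type:
            return frame, discarded
    return None, len(stack)


# Outermost call first, innermost call last.
stack = [
    Frame("main"),
    Frame("run_query", handles="IngestLimit"),
    Frame("parse_batch"),
    Frame("write_row"),
]
frame, discarded = find_handler(stack, "IngestLimit")
print(frame.function, discarded)  # run_query 2
```

In a real unwinder, resuming in the handler’s frame also means restoring the registers that frame expects, including the stack pointer, which is where the rest of this story takes place.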

Digging into the library’s source code, the engineers found a tiny, highly specific race condition in an assembly language routine called _Ux86_64_setcontext. The bug was practically microscopic. The routine updates the stack pointer, and then, in the very next step, reads the return address.

Between those two steps is a vulnerable window exactly one instruction wide. If a system signal arrived at the exact picosecond between those two instructions, the Linux kernel would build a signal frame that overwrote the return address with NULL.
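The ordering hazard can be shown with a toy model. This is a sketch under loud assumptions, not libunwind’s code: memory is a dictionary of word-sized slots, the saved return address is assumed to sit below the new stack pointer, and a signal is modeled as writing NULL into every slot its frame covers. The `read_first` option shows one conventional way to close such a window; the article does not say what the upstreamed patch actually changes.

```python
NULL = 0


class ToyMachine:
    """Memory is a dict of word-sized slots; the stack grows toward lower addresses."""

    def __init__(self, sp):
        self.sp = sp
        self.memory = {}

    def deliver_signal(self, frame_words):
        # Model a signal frame built just below the stack pointer.
        # Real frames hold saved state; the toy writes NULL into every slot it covers.
        for addr in range(self.sp - frame_words, self.sp):
            self.memory[addr] = NULL


def restore(machine, ctx_addr, new_sp, frame_words=None, read_first=False):
    """Return the address to resume at, after moving the stack pointer to new_sp.

    frame_words: size of a signal frame delivered between the two steps, or None.
    read_first:  swap the two steps so the load happens before the pointer moves.
    """
    if read_first:
        ret = machine.memory[ctx_addr]       # load while the slot is still above sp
        machine.sp = new_sp
        if frame_words:
            machine.deliver_signal(frame_words)
        return ret
    machine.sp = new_sp                      # step 1: move the stack pointer
    if frame_words:
        machine.deliver_signal(frame_words)  # signal lands inside the window
    return machine.memory[ctx_addr]          # step 2: load the return address


def run(**kwargs):
    machine = ToyMachine(sp=100)
    machine.memory[150] = 0xABC              # saved return address, above sp for now
    return restore(machine, ctx_addr=150, new_sp=200, **kwargs)


print(hex(run()))                                 # 0xabc: no signal
print(hex(run(frame_words=64)))                   # 0x0:   signal in the window
print(hex(run(frame_words=32)))                   # 0xabc: frame too small to reach the slot
print(hex(run(frame_words=64, read_first=True)))  # 0xabc: load happens before the move
```

The third case hints at why the size of whatever runs on the stack during a signal matters: in this model a smaller frame never reaches the saved slot, so the race exists but stays harmless.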

On a modern CPU, that one-instruction window stays open for roughly 100 picoseconds. So how was a fraction-of-a-fraction-of-a-second vulnerability crashing servers multiple times a day? And more importantly, the GNU libunwind bug is more than 18 years old. Why was it suddenly taking down OpenAI’s infrastructure in the summer of 2026?

It came down to a perfect storm of scale and timing. Rockset handles massive query workloads and uses C++ exceptions as a standard way to manage data ingest limits, intentionally throwing tens of thousands of exceptions per second. At the same time, OpenAI aggressively uses automated system signals to track query CPU time at the millisecond level.

Recently, the team had tweaked their signal handler, slightly increasing its memory footprint on the stack. That tiny bump in memory usage, combined with the massive volume of exceptions and rapid-fire system signals, pushed the 18-year-old race condition over the edge. At fleet scale, a 100-picosecond window opening tens of thousands of times per second practically guaranteed a collision every few hours.
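A back-of-the-envelope model shows why a window this small still gets hit. The sketch assumes signals arrive independently of exception timing and that the window is far shorter than the gap between signals. The inputs are placeholders chosen to match the orders of magnitude in the article, not measurements reported by OpenAI.

```python
def collisions_per_second(exceptions_per_sec, signals_per_sec, window_sec):
    """Expected number of signals that land inside an open window, per second."""
    p_hit = signals_per_sec * window_sec  # chance that a single window contains a signal
    return exceptions_per_sec * p_hit


def mean_seconds_between(rate_per_sec):
    """Average gap between events for a given rate; infinite if the rate is zero."""
    return float("inf") if rate_per_sec == 0 else 1.0 / rate_per_sec


# Placeholder inputs (assumptions for illustration).
rate = collisions_per_second(
    exceptions_per_sec=20_000,  # "tens of thousands of exceptions per second"
    signals_per_sec=1_000,      # one signal per millisecond
    window_sec=100e-12,         # roughly 100 picoseconds
)
print(f"{mean_seconds_between(rate) / 60:.1f} minutes between window hits")
```

With these placeholders the model puts a signal inside the window roughly every eight minutes. Not every hit has to end in a crash: the signal presumably has to arrive on the thread that is unwinding, and whatever it writes has to reach the saved return address, so the observed crash rate can sit well below the raw hit rate.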

To fix the issue, the team swapped out GNU libunwind for an alternative library that didn’t share the flaw, immediately stabilizing the system. They then upstreamed a patch for the 18-year-old open-source bug to save other developers from the same headache down the road.

Ultimately, the investigation proved that when systems reach a massive scale, traditional debugging isn’t enough. You can stare at assembly code all day, but sometimes you just need to step back, pull the data on the entire population, and let the patterns speak for themselves.

