How One .unwrap() Call Took Down ChatGPT, X, Spotify, and 1 in 5 Websites on the Internet
At 11:20 UTC on November 18, 2025, the internet started breaking.
X went down. Then ChatGPT. Then Spotify. Then Discord, Figma, Canva, Zoom, Coinbase, Vercel, and Reddit. Downdetector — the website people visit to check whether the internet is broken — went down too. It also runs on Cloudflare.
By the time the dust settled, roughly 1 in 5 webpages on the internet was throwing errors. 2.4 billion aggregated monthly active users across 700+ services were affected. Cloudflare's CEO described it as the company's worst network outage since 2019.
The root cause: a single unhandled Rust panic.
The Six-Line Code Path That Broke the Internet
Here's the heart of the code path that triggered the cascade:
```rust
let (feature_values, _) = features
    .append_with_names(&self.config.feature_names)
    .unwrap();
```
That .unwrap() is the culprit. In Rust, .unwrap() on an Err result causes an immediate thread panic. The developers wrote explicit code that said: "if this fails, crash."
And it did.
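A minimal, self-contained sketch of that contract — not Cloudflare's code; `append_features` is an invented stand-in that fails the way `append_with_names` reportedly did when the feature file blew past its limit:

```rust
// Minimal sketch (invented names, not Cloudflare's code): .unwrap() on an
// Err value aborts the current thread with a panic.

fn append_features(count: usize, limit: usize) -> Result<usize, String> {
    if count > limit {
        return Err(format!("{count} features exceeds limit {limit}"));
    }
    Ok(count)
}

fn main() {
    // Silence the default panic message so the demo output stays clean.
    std::panic::set_hook(Box::new(|_| {}));

    // Within the limit, .unwrap() quietly returns the Ok value.
    assert_eq!(append_features(60, 200).unwrap(), 60);

    // Over the limit, .unwrap() panics and kills the thread; catch_unwind
    // exists here only to prove the panic happened.
    let outcome = std::panic::catch_unwind(|| append_features(240, 200).unwrap());
    assert!(outcome.is_err());
}
```

In production code, that panic propagates up and takes the worker thread with it — which is exactly what the FL2 logs showed.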
But to understand why it failed, you have to trace back through a surprisingly mundane chain of events that started four hours earlier.
The Full Chain: A SQL Query Without a Filter
11:05 UTC — A Cloudflare database engineer deploys a routine access control change to ClickHouse, their analytics database. The change migrates authentication from shared system accounts to individual user accounts.
A query that fetches bot detection feature names now runs without a database name filter:
```sql
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```
That missing filter means the query now returns matching columns from two databases — default and r0 — instead of one. The duplicate rows balloon the feature list from roughly 60 entries to well past the 200-feature cap the proxy preallocates for. (The fix is an explicit filter on the database column.)
The problem: This inflated feature file immediately gets picked up by Quicksilver, Cloudflare's internal configuration distribution system, which propagates changes to every data center globally within seconds.
Every five minutes, every edge server worldwide refreshes its Bot Management configuration file. Because ClickHouse nodes were being gradually updated, bad files were generated only on updated nodes at first — creating eerie cycling failure-and-recovery waves across the network until all nodes were generating the bad data simultaneously.
11:20 UTC — Full outage begins.
Why the New Rust Proxy Made It Catastrophic
Cloudflare runs two proxies in parallel: the old FL (built on NGINX and Lua) and the newer FL2 (written in Rust).
When FL encountered the oversized config file, it handled it quietly — it simply returned zero bot scores for all traffic and kept serving requests. Degraded, but not dead.
FL2 hit its hardcoded 200-feature limit and panicked. The .unwrap() call surfaced in logs as thread fl2_worker_thread panicked, and 5xx responses cascaded across the global infrastructure.
The bitter irony: Rust is widely celebrated for making failure explicit — the Result type forces you to acknowledge that a call can fail. The developers did acknowledge it; they just chose .unwrap(), which says "if this goes wrong, crash immediately." In a test environment, that's often the right call. On infrastructure serving 20% of the internet, it wasn't.
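FL's quiet degradation is the classic fail-open pattern. A hypothetical sketch of the difference (invented names; FL's real scoring code is not public):

```rust
// Hypothetical sketch of fail-open scoring (invented names): when the
// detector can't produce a score, emit a neutral zero score and keep
// serving instead of panicking.

fn effective_score(detector: Option<u8>) -> u8 {
    match detector {
        Some(score) => score,
        None => {
            // Degraded but alive: log it and let the request through.
            eprintln!("bot detection unavailable; defaulting to score 0");
            0
        }
    }
}

fn main() {
    assert_eq!(effective_score(Some(87)), 87);
    // The detector is down, yet traffic still flows with a neutral score.
    assert_eq!(effective_score(None), 0);
}
```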
The fix that engineers on Hacker News converged on:
```rust
let (feature_values, _) = match features.append_with_names(&self.config.feature_names) {
    Ok(values) => values,
    Err(e) => {
        log::error!(
            "Feature buffer overflow ({}): {} features received",
            e,
            self.config.feature_names.len()
        );
        return Err((ErrorFlags::FEATURE_OVERFLOW, -1));
    }
};
```
Log the anomaly. Serve cached features. Keep going. Don't crash.
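That strategy can be sketched end to end. Everything below is invented for illustration (`BotConfig`, `reload`; the real FL2 types are not public): validate the candidate file, and on failure log and keep the last validated configuration.

```rust
// Illustrative only: a config reload that falls back to the last-known-good
// state instead of crashing the worker thread.

struct BotConfig {
    feature_names: Vec<String>,
}

const FEATURE_LIMIT: usize = 200;

fn validate(candidate: Vec<String>) -> Result<BotConfig, String> {
    if candidate.len() > FEATURE_LIMIT {
        return Err(format!(
            "feature overflow: {} > {FEATURE_LIMIT}",
            candidate.len()
        ));
    }
    Ok(BotConfig { feature_names: candidate })
}

/// Swap in the new config if it validates; otherwise log and keep the old one.
fn reload(current: BotConfig, candidate: Vec<String>) -> BotConfig {
    match validate(candidate) {
        Ok(fresh) => fresh,
        Err(e) => {
            eprintln!("config rejected, keeping last-known-good: {e}");
            current
        }
    }
}

fn main() {
    let good = BotConfig { feature_names: vec!["f".to_string(); 60] };
    // A corrupted file with 240 entries is rejected; the old config survives.
    let after = reload(good, vec!["f".to_string(); 240]);
    assert_eq!(after.feature_names.len(), 60);
}
```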
The Circular Dependency That Made Everything Worse
When the outage hit, Cloudflare engineers couldn't log into their own dashboard to respond. Why?
Turnstile — Cloudflare's CAPTCHA system — was down. And Turnstile powered authentication on the Cloudflare dashboard. A failing service was blocking access to the tool needed to fix it.
Cloudflare's CEO Matthew Prince wrote the post-mortem himself in Lisbon that evening (over sushi and burritos, per his own account), and published it within 24 hours. The circular dependency problem was called out explicitly: it consumed critical response time in the middle of an outage whose total cost ran to hundreds of millions of dollars.
The estimated revenue loss across affected services: $180–360 million.
Three Weeks Later, They Did It Again
On December 5, 2025 — less than three weeks after the first outage — Cloudflare had another one.
The root cause was structurally identical: a configuration change deployed globally via Quicksilver with no staged rollout. This time it was a Lua runtime error in the FL1 proxy triggered by a killswitch that left a rule_result.execute object as nil — a code path that had never been exercised before in that exact combination.
Duration: 25 minutes. Impact: ~28% of all Cloudflare HTTP traffic.
Cloudflare's CTO Dane Knecht: "These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours."
The November post-mortem had already identified the staged rollout gap. The December outage exploited the exact same gap before the fix was implemented.
Code Orange: What Cloudflare Is Actually Doing About It
After two major outages in less than three weeks, Cloudflare declared "Code Orange" — their internal designation for highest-priority engineering work, superseding all other company priorities.
The initiative: Fail Small.
Three concrete workstreams:
1. Staged configuration rollouts (Health Mediated Deployment). Currently, Quicksilver pushes config changes to 90%+ of servers globally in seconds. Under HMD, every config change — DNS records, security rules, feature files — will go through the same staged deployment process used for software releases: employees first → free tier → paid customers → global. Automatic rollback if health checks fail. Full implementation target: end of Q1 2026.
2. Failure mode audit ("Fail Open"). Every critical service interface must be audited under the assumption that it will fail. The new requirement: systems log the error and fall back to a known-good state rather than crash or drop traffic. The November incident's corrupted config file should have defaulted to the last validated configuration. Failed bot detection should have allowed traffic through, not blocked it.
3. Emergency access overhaul. The Turnstile circular dependency that slowed incident response will be eliminated. Break-glass procedures for internal tools will be rebuilt from scratch. Security protocols that block engineers from responding to outages they're trying to fix are themselves a reliability failure.
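The staged pipeline in workstream 1 reduces to a loop with a health gate. A toy model — assumed mechanics, not Cloudflare's HMD implementation:

```rust
// Toy model of health-mediated staged deployment (assumption, not
// Cloudflare's code): push stage by stage, roll back on the first
// failing health check.

#[derive(Debug, PartialEq)]
enum Outcome {
    Deployed,
    RolledBackAt(usize), // index of the stage that failed its health check
}

fn staged_rollout<F: Fn(&str) -> bool>(stages: &[&str], healthy: F) -> Outcome {
    for (i, stage) in stages.iter().enumerate() {
        // Deploy to `stage` here, then check health before widening
        // the blast radius to the next stage.
        if !healthy(stage) {
            // Automatic rollback: earlier stages revert to the prior config.
            return Outcome::RolledBackAt(i);
        }
    }
    Outcome::Deployed
}

fn main() {
    let stages = ["employees", "free", "paid", "global"];
    // A change that breaks on the free tier never reaches paying customers.
    let out = staged_rollout(&stages, |s| s != "free");
    assert_eq!(out, Outcome::RolledBackAt(1));
    // A healthy change goes all the way out.
    assert_eq!(staged_rollout(&stages, |_| true), Outcome::Deployed);
}
```

The design choice worth copying is the gate between stages: a config push is never one atomic global event, so a bad file fails small.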
The CrowdStrike Pattern, Repeating
Engineers on Hacker News were quick to draw the parallel. The 2024 CrowdStrike incident — which bricked 8.5 million Windows machines globally — shared the same architectural fingerprint: a machine-generated data file distributed globally without staged rollout or automatic rollback, where the validation step that would have caught the problem was bypassed or insufficient.
The difference: CrowdStrike's bad update required manual human intervention to recover (physically rebooting machines in safe mode). Cloudflare's was self-healing once the rollback propagated.
The meta-lesson that keeps getting relearned: the most catastrophic infrastructure failures aren't caused by novel zero-days or sophisticated attacks. They're caused by routine configuration changes pushed globally without guardrails.
HN commenter abalone, an incident response veteran: "Reliability isn't just avoiding bugs — it's visibility and rollback capability."
What Every Engineering Team Should Take From This
Cloudflare runs at a scale most companies will never approach. But the failure modes are completely reproducible at any scale:
1. Global config deploys are zero-day attacks you launch on yourself. If your deployment pipeline for configuration changes doesn't have staged rollout, you're one bad file away from a full outage. This applies to feature flags, WAF rules, DNS changes, and anything else touching production systems simultaneously.
2. .unwrap() and equivalent "crash on any error" patterns need production review.
In hot paths — code that processes every request — panicking should be a conscious, documented decision, not a default. Linting rules like unwrap_used = "deny" in Rust are worth considering for infrastructure-critical code.
3. Circular dependencies in your incident response tooling are a reliability failure. If your monitoring, alerting, or access tools depend on the system they're monitoring, map those dependencies before they matter.
4. The post-mortem-to-repeat gap is your biggest risk. Cloudflare identified the staged rollout gap in November and got burned by the same gap in December. The window between "we know what's wrong" and "we've fixed it" is when you're most vulnerable.
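On point 2, the clippy lint mentioned above can be enforced crate-wide from Cargo.toml (the `[lints]` table is available since Rust 1.74):

```toml
# Cargo.toml: make every .unwrap() a hard error under clippy,
# and flag .expect() for review.
[lints.clippy]
unwrap_used = "deny"
expect_used = "warn"
```

Note that these lints fire under `cargo clippy`, not plain `cargo build`; existing call sites can be exempted one by one with `#[allow(clippy::unwrap_used)]` while they're reviewed.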
The Larger Picture
Cloudflare proxies roughly 20% of all internet traffic. It's infrastructure so foundational that when it breaks, Downdetector — the site people use to report outages — breaks with it.
That level of concentration is worth thinking about independently of any single incident. When your content, your platform, your SaaS product, and your monitoring tools all run through the same provider, your reliability is only as good as their worst day.
Diversifying infrastructure dependencies is expensive and operationally complex. But so is having your site return 5xx errors to every visitor, your analytics go dark, and your incident response tooling stop working — simultaneously — because of a SQL query that forgot a WHERE clause.
If reliability engineering, incident response design, or infrastructure resilience is something your team is actively working on, we'd be glad to talk. These problems aren't just Cloudflare's — they show up at every scale.
Sources: Cloudflare Nov 18 post-mortem · Cloudflare Dec 5 post-mortem · Code Orange: Fail Small · Gremlin analysis · Pragmatic Engineer coverage · InfoQ Code Orange · HN .unwrap() thread