"Every outage has a thesis. Every migration has a subplot. We read the infrastructure so you don't have to reverse-engineer the postmortem."
— The Deploy Editorial Desk
Published every Tuesday. Read by 4,200+ senior engineers, platform leads, and CTOs who build the systems that keep everything else running.
The migration that sparked three re-orgs and one public postmortem — we traced the architectural decisions back to a single Confluence page from 2021.
Why Monzo's Platform Team Chose to Rebuild Their Internal Developer Portal Instead of Buying Backstage
The build-versus-buy decision rarely lives in a spreadsheet. It lives in a Slack thread at 11 PM where someone finally says what everyone's been thinking. We reconstructed the 14-month deliberation using engineering blog posts, job postings, and three engineers who'd since moved on.
"Backstage is a framework, not a product. The moment you customize it, you own it — and nobody told us that in the sales call."
Full breakdown in Issue #47 — including the Confluence page that started it all.
Observed across 12 enterprise migrations, Q4 2025 – Q1 2026
The CI/CD Collapse: Jenkins → GitHub Actions → Dagger in 18 months
The pattern is consistent: teams migrate to GitHub Actions for the YAML familiarity, hit the 6-hour job limit on monorepos, and start asking about Dagger six months later. Nobody is publishing this sequence. We mapped it from job postings and conference talk abstracts.
Radar Readings
SDK maturity is the blocker. Go SDK is production-ready; Python is not.
Team consolidation post-Series B. Feature velocity has slowed noticeably.
Hybrid cloud architecture finally makes sense for regulated industries.
Kubernetes-native but the operational overhead is rarely justified at scale.
Full radar with 14 tools, sourced methodology, and dissent notes in the next dispatch.
Patterns that keep appearing in incident reviews, architecture docs, and engineering blog posts — catalogued before they become conference keynotes.
These aren't theoretical. Each pattern was extracted from real production systems — traced through public postmortems, engineering job postings, and the occasional leaked architecture diagram. We annotate what the original authors left implicit.
Six more patterns in the archive — event-driven failure modes, sidecar proliferation, and the control plane antipattern.
We read every public postmortem so you don't have to read the vendor-sanitized version. Here's what the original documents actually say.
The BGP Route Leak That Wasn't
A misconfigured RPKI validator created a split-brain condition between two route reflectors. The monitoring system saw divergence but classified it as transient. It wasn't.
"The alert fired at T+4 minutes. The on-call engineer acknowledged at T+6. The first meaningful diagnostic action happened at T+22. We need to talk about the 16 minutes in between."
Alert fatigue created a 16-minute gap between acknowledgment and action. This pattern appears in 8 of the last 12 major CDN incidents we've reviewed.
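The gap in the quote falls straight out of the incident timeline. A minimal sketch, assuming the timeline is recorded as minute offsets from incident start; the key names (`alert_fired`, `acknowledged`, `first_diagnostic_action`) are hypothetical labels, not fields from the original postmortem.

```python
from typing import Dict


def ack_to_action_gap(timeline: Dict[str, int]) -> int:
    """Minutes between acknowledgment and the first meaningful diagnostic action."""
    return timeline["first_diagnostic_action"] - timeline["acknowledged"]


# Offsets in minutes from incident start (T+0), as quoted above.
# The dictionary keys are illustrative names chosen for this sketch.
timeline = {"alert_fired": 4, "acknowledged": 6, "first_diagnostic_action": 22}

print(ack_to_action_gap(timeline))  # 16
```

Tracking this number separately from time-to-acknowledge matters because an acknowledgment only stops the pager. It says nothing about when investigation actually began.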
Database Connection Pool Exhaustion at 3× Expected Load
A viral Hacker News thread drove 3× normal signup traffic. The connection pool was sized for peak organic load, not viral load. The gap between those two numbers is a thesis statement about growth assumptions.
"We had headroom for a good day. We didn't have headroom for the day someone posts you on HN and you become the top comment."
Viral load events follow a different distribution than organic peaks. Sizing for P99 organic traffic leaves you exposed to social amplification events.
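The sizing gap can be made concrete with a little arithmetic. A rough sketch, assuming connection demand scales linearly with traffic (real systems often do not) and using made-up numbers: a hypothetical baseline of 40 connections and a pool sized for a 1.5× organic peak. Only the 3× viral multiplier comes from the incident.

```python
import math


def connections_needed(baseline_connections: int, traffic_multiplier: float) -> int:
    """Connections required at a given traffic multiple, assuming linear scaling."""
    return math.ceil(baseline_connections * traffic_multiplier)


def shortfall(pool_size: int, baseline_connections: int, traffic_multiplier: float) -> int:
    """How many connections short the pool is at the given traffic multiple."""
    needed = connections_needed(baseline_connections, traffic_multiplier)
    return max(0, needed - pool_size)


# Hypothetical numbers for illustration only.
baseline = 40                                   # connections in use on a normal day
pool_size = connections_needed(baseline, 1.5)   # pool sized for an assumed 1.5x organic peak

print(pool_size)                                # 60
print(shortfall(pool_size, baseline, 3.0))      # 60 connections short at 3x viral load
```

Under these assumptions the pool covers exactly half of what the viral day demands, and every request beyond that waits for a connection or times out.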
Every issue includes one dissected incident. The kind of analysis that usually lives in a private Notion doc.
From 340 incident reports, Q4 2025 – Q1 2026
Where production incidents actually originate — vs. where teams look first.
of incidents traced to configuration drift — not code changes, not dependency failures. A config file changed by a human, not a deploy pipeline.
longer MTTR when config drift is the root cause vs code regression
median time before teams pivot from code investigation to config
We pulled this from 340 incident retrospectives published between Oct 2025 and Jan 2026. The config drift signal has been consistent for three quarters. Nobody is writing about it at scale.
Incident Root Cause Distribution — Q1 2026
RPKI, feature flags, env vars, infra-as-code drift
Third-party APIs, managed services, DNS
Application bugs, memory leaks, race conditions
Autoscaler misconfiguration, connection pool exhaustion
Runbook deviations, manual interventions
n=340 · incidents from public postmortems and engineering blogs · percentages sum to more than 100% because a single incident can be classified under multiple causes
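Multi-cause classification is why the category shares do not add up to 100%. A small sketch with four invented incidents (not drawn from the 340-report dataset) shows the effect; the category names are hypothetical labels.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def category_shares(incidents: Iterable[Set[str]]) -> Dict[str, float]:
    """Percentage of incidents tagged with each cause; an incident may carry several tags."""
    incidents = list(incidents)
    counts = Counter(cause for causes in incidents for cause in causes)
    return {cause: 100 * count / len(incidents) for cause, count in counts.items()}


# Invented example data: each set holds the causes assigned to one incident.
incidents = [
    {"config_drift", "human_process"},
    {"config_drift"},
    {"code_regression"},
    {"dependency_failure", "config_drift"},
]

shares = category_shares(incidents)
print(shares["config_drift"])   # 75.0
print(sum(shares.values()))     # 150.0
```

Each share is measured against the total number of incidents, so two incidents with two causes each push the sum to 150%.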
Every issue includes one data brief. The methodology is always cited. The raw numbers are always linked.

