Weekly Dispatch · Infrastructure Intelligence

"Every outage has a thesis. Every migration has a subplot. We read the infrastructure so you don't have to reverse-engineer the postmortem."

— The Deploy Editorial Desk

Published every Tuesday. Read by 4,200+ senior engineers, platform leads, and CTOs who build the systems that keep everything else running.

Platform Engineering ◆ Kubernetes Internals ◆ Incident Postmortems ◆ Build vs Buy ◆ eBPF Observability ◆ GitOps Migrations ◆ DORA Metrics ◆ Service Mesh Reality ◆ Cost Attribution ◆ Toil Reduction

01 · This Week's Signal

Issue #47

[Image: Server racks in a modern data center with blue lighting and dense cable infrastructure]

The migration that sparked three re-orgs and one public postmortem — we traced the architectural decisions back to a single Confluence page from 2021.

Lead Signal · Feb 18, 2026

Why Monzo's Platform Team Chose to Rebuild Their Internal Developer Portal Instead of Buying Backstage

The build-versus-buy decision rarely lives in a spreadsheet. It lives in a Slack thread at 11 PM where someone finally says what everyone's been thinking. We reconstructed the 14-month deliberation from engineering blog posts, job postings, and conversations with three engineers who have since moved on.

"Backstage is a framework, not a product. The moment you customize it, you own it — and nobody told us that in the sales call."
— Senior Platform Engineer, Monzo (via internal retro doc)

14 mo · deliberation period

3× · team size at peak

62% · adoption in 90 days

Full breakdown in Issue #47 — including the Confluence page that started it all.

02 · Toolchain Radar

Migration Map

Observed across 12 enterprise migrations, Q4 2025 – Q1 2026

The CI/CD Collapse: Jenkins → GitHub Actions → Dagger in 18 months

Ed. note

The pattern is consistent: teams migrate to GitHub Actions for the YAML familiarity, hit the 6-hour job execution limit on monorepos, and start asking about Dagger six months later. Nobody is publishing this sequence; we mapped it from job postings and conference talk abstracts.

Radar Readings

Dagger · Watch

SDK maturity is the blocker. Go SDK is production-ready; Python is not.


Earthly · Hold

Team consolidation post-Series B. Feature velocity has slowed noticeably.


Buildkite · Adopt

Hybrid cloud architecture finally makes sense for regulated industries.


Tekton · Avoid

Kubernetes-native but the operational overhead is rarely justified at scale.


Full radar with 14 tools, sourced methodology, and dissent notes — in the next dispatch.

03 · Architecture Patterns

Recurring Structures

Patterns that keep appearing in incident reviews, architecture docs, and engineering blog posts — catalogued before they become conference keynotes.

These aren't theoretical. Each pattern was extracted from real production systems — traced through public postmortems, engineering job postings, and the occasional leaked architecture diagram. We annotate what the original authors left implicit.

[Image: Abstract network topology visualization with interconnected nodes and data flow patterns]

Pattern A · for blast-radius reduction

Cell-Based Architecture

First documented at Amazon in 2011. Now appearing in postmortems at orgs with 200+ services. The blast radius is the design constraint, not the afterthought.

94% of major outages are cross-cell in traditional designs

We traced 8 public postmortems. Cell-based orgs had 3× faster incident isolation.
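The isolation property is easier to see in code than in a diagram. What follows is a minimal sketch, assuming tenants are pinned to cells by a stable hash of their ID. The cell names, the `cell_for` and `affected_tenants` helpers, and the tenant IDs are all hypothetical; none of them come from the postmortems we traced.

```python
import hashlib

# Hypothetical cell names; a real deployment would load these from config.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]

def cell_for(tenant_id: str, cells: list[str] = CELLS) -> str:
    """Pin a tenant to exactly one cell using a stable hash of its ID."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]

def affected_tenants(tenants: list[str], failed_cell: str) -> list[str]:
    """Tenants whose traffic is routed to the failed cell."""
    return [t for t in tenants if cell_for(t) == failed_cell]

# Hypothetical tenant population, used only to illustrate the blast radius.
tenants = [f"tenant-{n}" for n in range(1000)]
impacted = affected_tenants(tenants, "cell-a")
print(f"{len(impacted)} of {len(tenants)} tenants sit in the failed cell")
```

With four cells, a single-cell failure should touch roughly a quarter of tenants rather than all of them. Plain modulo hashing reshuffles tenants whenever the cell count changes, which is acceptable for a sketch but would likely be replaced by an explicit assignment table in practice.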

[Image: Complex server infrastructure with layered rack systems and dense networking equipment]

Pattern B · the golden path problem

Platform Abstraction Layers

Every platform team eventually builds an abstraction layer. The question is whether it becomes a golden path or a maintenance tar pit. The difference is usually one architectural decision made in year one.

67% of platform teams rebuild their abstraction layer within 3 years

The rebuild trigger is almost always the same: the abstraction leaked at the wrong moment.

Six more patterns in the archive — event-driven failure modes, sidecar proliferation, and the control plane antipattern.

04 · Incident Dissections

Postmortem Analysis

We read every public postmortem so you don't have to read the vendor-sanitized version. Here's what the original documents actually say.

Cloudflare · Nov 2024

Duration: 37 min

Surface: Global DNS resolution

BGP · DNS · RPKI · On-call process

The BGP Route Leak That Wasn't

A misconfigured RPKI validator created a split-brain condition between two route reflectors. The monitoring system saw divergence but classified it as transient. It wasn't.

"The alert fired at T+4 minutes. The on-call engineer acknowledged at T+6. The first meaningful diagnostic action happened at T+22. We need to talk about the 16 minutes in between."
— Cloudflare Engineering Blog, post-incident review

Deeper Signal

Alert fatigue created a 16-minute gap between acknowledgment and action. This pattern appears in 8 of the last 12 major CDN incidents we've reviewed.
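The 16-minute figure falls straight out of the quoted timeline. Here is a small sketch of that arithmetic, using only the three timestamps from the quote; the event names and the `gap` helper are our own labels, not fields from the review.

```python
# Minutes after incident start (T+0), taken from the quoted timeline.
timeline = {
    "alert_fired": 4,
    "acknowledged": 6,
    "first_diagnostic_action": 22,
}

def gap(events: dict[str, int], start: str, end: str) -> int:
    """Minutes elapsed between two named events."""
    return events[end] - events[start]

ack_latency = gap(timeline, "alert_fired", "acknowledged")
ack_to_action = gap(timeline, "acknowledged", "first_diagnostic_action")

print(f"alert to acknowledgment: {ack_latency} min")
print(f"acknowledgment to first action: {ack_to_action} min")
```

Tracking acknowledgment-to-action separately from time-to-acknowledge is the point: a 2-minute acknowledgment looks healthy on a dashboard while the 16 minutes after it go unmeasured.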

Linear · Jan 2025

Duration: 22 min

Surface: Issue creation, sync engine

Connection pooling · Capacity planning · Postgres

Database Connection Pool Exhaustion at 3× Expected Load

A viral Hacker News thread drove 3× normal signup traffic. The connection pool was sized for peak organic load, not viral load. The gap between those two numbers is a thesis statement about growth assumptions.

"We had headroom for a good day. We didn't have headroom for the day someone posts you on HN and you become the top comment."
— Linear engineering retrospective

Deeper Signal

Viral load events follow a different distribution than organic peaks. Sizing for P99 organic traffic leaves you exposed to social amplification events.
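To make the sizing gap concrete, here is a back-of-the-envelope sketch using Little's law (connections in flight = arrival rate × mean hold time). Only the 3× multiplier comes from the retrospective, and applying it to peak rather than average traffic is our assumption. The request rate, hold time, and headroom factor are hypothetical numbers chosen for illustration, not Linear's.

```python
import math

def connections_needed(requests_per_sec: int, hold_ms: int) -> int:
    """Little's law: connections in flight = arrival rate x mean hold time."""
    return math.ceil(requests_per_sec * hold_ms / 1000)

# Hypothetical inputs, not taken from the incident.
organic_peak_rps = 200   # peak organic requests per second
hold_ms = 50             # mean time a request holds a connection
headroom = 1.5           # safety factor applied when the pool was sized

# From the retrospective: the viral event drove 3x traffic.
viral_multiplier = 3

pool_size = math.ceil(connections_needed(organic_peak_rps, hold_ms) * headroom)
viral_demand = connections_needed(organic_peak_rps * viral_multiplier, hold_ms)
shortfall = max(0, viral_demand - pool_size)

print(f"pool size: {pool_size}, viral demand: {viral_demand}, shortfall: {shortfall}")
```

Under these assumed numbers, a pool with 50% headroom over organic peak is still half the size the viral event demands. The sketch also holds mean hold time constant, which is optimistic: queries tend to slow down as the database saturates, widening the gap.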

Every issue includes one dissected incident. The kind of analysis that usually lives in a private Notion doc.

05 · The Data

Q1 2026 Sample

From 340 incident reports, Q4 2025 – Q1 2026

Where production incidents actually originate — vs. where teams look first.

73%

of incidents traced to configuration drift — not code changes, not dependency failures. A config file changed by a human, not a deploy pipeline.

4.2×

longer MTTR when config drift is the root cause vs code regression

18 min

median time before teams pivot from code investigation to config

We pulled this from 340 incident retrospectives published between Oct 2025 and Jan 2026. The config drift signal has been consistent for three quarters. Nobody is writing about it at scale.
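Detecting drift is conceptually simple, which is part of why its prevalence stands out. Below is a minimal sketch, assuming the declared configuration (for example, what lives in version control) and the live configuration can both be read as flat key-value maps. The `find_drift` helper and every key and value are hypothetical.

```python
def find_drift(declared: dict[str, str], live: dict[str, str]) -> dict:
    """Return keys whose live value differs from the declared value.

    Each entry maps key -> (declared_value, live_value); a key missing
    on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(declared.keys() | live.keys()):
        want, have = declared.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift

# Hypothetical settings: one flag flipped by hand, one variable added by hand.
declared = {"POOL_SIZE": "20", "FEATURE_NEW_SYNC": "off", "DNS_TTL": "300"}
live = {"POOL_SIZE": "20", "FEATURE_NEW_SYNC": "on", "DNS_TTL": "300", "DEBUG": "1"}

drift = find_drift(declared, live)
for key, (want, have) in drift.items():
    print(f"{key}: declared={want!r} live={have!r}")
```

Running a comparison like this on a schedule, rather than only at deploy time, is what catches changes made outside the pipeline.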

Incident Root Cause Distribution — Q1 2026

Configuration drift: 73%

RPKI, feature flags, env vars, infra-as-code drift

Dependency failure (external): 48%

Third-party APIs, managed services, DNS

Code regression: 31%

Application bugs, memory leaks, race conditions

Capacity / scaling failure: 22%

Autoscaler misconfiguration, connection pool exhaustion

Human error (non-config): 14%

Runbook deviations, manual interventions

n=340 · incidents from public postmortems and engineering blogs · percentages exceed 100% due to multi-cause classification
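A quick check on what multi-cause classification implies. The sketch below sums the published shares, assuming the five categories are exhaustive and every incident carries at least one of them; under that assumption the total also gives the average number of causes tagged per incident.

```python
# Shares as published in the distribution above (percent of incidents).
shares = {
    "Configuration drift": 73,
    "Dependency failure (external)": 48,
    "Code regression": 31,
    "Capacity / scaling failure": 22,
    "Human error (non-config)": 14,
}

total_percent = sum(shares.values())

# Assumption: categories are exhaustive, so total / 100 is the mean
# number of root-cause tags per incident.
avg_causes_per_incident = total_percent / 100

print(f"total: {total_percent}%")
print(f"average causes per incident: {avg_causes_per_incident:.2f}")
```

A total well above 100% means the typical incident in the sample was tagged with nearly two causes, so the categories describe contributing factors rather than mutually exclusive buckets.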

MTTR by Root Cause Type

Config drift: 94 min (↑ 12%)

Code regression: 22 min (↓ 8%)

Capacity failure: 38 min (→ flat)

External dependency: 61 min (↑ 3%)
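The MTTR multiplier can be recomputed from these four figures. A short sketch, using code regression as the baseline; the variable names are ours.

```python
# MTTR in minutes, as listed above.
mttr_minutes = {
    "Config drift": 94,
    "Code regression": 22,
    "Capacity failure": 38,
    "External dependency": 61,
}

baseline = mttr_minutes["Code regression"]

# Ratio of each cause's MTTR to the code-regression baseline.
ratios = {cause: round(minutes / baseline, 2) for cause, minutes in mttr_minutes.items()}

for cause, ratio in ratios.items():
    print(f"{cause}: {ratio}x code regression")
```

Config drift works out to about 4.27× the code-regression MTTR, in line with the roughly 4.2× headline figure.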

Every issue includes one data brief. The methodology is always cited. The raw numbers are always linked.

The infrastructure is always more interesting than the press release.

One dispatch, every Tuesday. No sponsored content. No conference-circuit rehash. Just the infrastructure signal that matters to engineers who build the systems that don't get to fail.

"The toolchain radar alone is worth the subscription. Saved me from a very expensive Earthly migration."

Priya Krishnamurthy · Principal SRE, Stripe

"I forward the incident dissections to my entire on-call rotation. Better than any postmortem training I've seen."

Marcus Oyelaran · Director of Platform Engineering, Shopify

"The data briefs are cited in our quarterly architecture reviews. Actual numbers, actual methodology."

Annika Lindström · CTO, Northstack