Deploy
Infrastructure Intelligence
Weekly Dispatch · Infrastructure Intelligence

"Every outage has a thesis. Every migration has a subplot. We read the infrastructure so you don't have to reverse-engineer the postmortem."

— The Deploy Editorial Desk

Published every Tuesday. Read by 4,200+ senior engineers, platform leads, and CTOs who build the systems that keep everything else running.

I manage infrastructure for 50+ services

4,247 engineers already on the waitlist

Platform Engineering · Kubernetes Internals · Incident Postmortems · Build vs Buy · eBPF Observability · GitOps Migrations · DORA Metrics · Service Mesh Reality · Cost Attribution · Toil Reduction
01 · This Week's Signal
Issue #47

The migration that sparked three re-orgs and one public postmortem — we traced the architectural decisions back to a single Confluence page from 2021.

Lead Signal · Feb 18, 2026

Why Monzo's Platform Team Chose to Rebuild Their Internal Developer Portal Instead of Buying Backstage

The build-versus-buy decision rarely lives in a spreadsheet. It lives in a Slack thread at 11 PM where someone finally says what everyone's been thinking. We reconstructed the 14-month deliberation from engineering blog posts, job postings, and conversations with three engineers who have since moved on.

"Backstage is a framework, not a product. The moment you customize it, you own it — and nobody told us that in the sales call."
— Senior Platform Engineer, Monzo (via internal retro doc)
14 mo deliberation period
team size at peak
62% adoption in 90 days

Full breakdown in Issue #47 — including the Confluence page that started it all.

02 · Toolchain Radar
Migration Map

Observed across 12 enterprise migrations, Q4 2025 – Q1 2026

The CI/CD Collapse: Jenkins → GitHub Actions → Dagger in 18 months

Jenkins, 2018–2023 (groovy hell, plugin rot) → Migration wave 1 → GH Actions, 2023–2025 (vendor lock-in surfaces here) → Wave 2 (emerging) → Dagger, 2025→ (portable pipelines)

n=12 orgs · 200–5,000 eng headcount · observed Q4 2025–Q1 2026
Ed. note

The pattern is consistent: teams migrate to GH Actions for the YAML familiarity, hit the 6-hour job limit on monorepos, and start asking about Dagger six months later. Nobody's publishing this sequence publicly, so we mapped it from job postings and conference talk abstracts.

Radar Readings

Dagger · Watch

SDK maturity is the blocker. Go SDK is production-ready; Python is not.

Earthly · Hold

Team consolidation post-Series B. Feature velocity has slowed noticeably.

Buildkite · Adopt

Hybrid cloud architecture finally makes sense for regulated industries.

Tekton · Avoid

Kubernetes-native but the operational overhead is rarely justified at scale.

Full radar with 14 tools, sourced methodology, and dissent notes — in the next dispatch.

03 · Architecture Patterns
Recurring Structures

Patterns that keep appearing in incident reviews, architecture docs, and engineering blog posts — catalogued before they become conference keynotes.

These aren't theoretical. Each pattern was extracted from real production systems — traced through public postmortems, engineering job postings, and the occasional leaked architecture diagram. We annotate what the original authors left implicit.

Pattern A · for blast-radius reduction

Cell-Based Architecture

First documented at Amazon in 2011. Now appearing in postmortems at orgs with 200+ services. The blast radius is the design constraint, not the afterthought.

94% of major outages are cross-cell in traditional designs

We traced 8 public postmortems. Cell-based orgs had 3× faster incident isolation.
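
To make the blast-radius idea concrete, here is a minimal sketch of deterministic cell assignment. It is an illustration under stated assumptions, not how any of the reviewed orgs implement it: the names `assign_cell` and `blast_radius`, the tenant IDs, and the choice of 8 cells are all hypothetical.

```python
import hashlib


def assign_cell(tenant_id: str, num_cells: int) -> int:
    """Pin a tenant to one cell using a stable hash of its ID (hypothetical scheme)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across runs.
    return int.from_bytes(digest[:8], "big") % num_cells


def blast_radius(tenants: list[str], failed_cell: int, num_cells: int) -> list[str]:
    """Return only the tenants pinned to the failed cell."""
    return [t for t in tenants if assign_cell(t, num_cells) == failed_cell]


# Hypothetical fleet: 1,000 tenants spread over 8 cells.
NUM_CELLS = 8
tenants = [f"tenant-{i}" for i in range(1000)]
affected = blast_radius(tenants, failed_cell=0, num_cells=NUM_CELLS)
print(f"{len(affected)} of {len(tenants)} tenants affected by losing cell 0")
```

A plain modulo over the hash reshuffles most tenants whenever the cell count changes, so a real system would more likely use a lookup table or consistent hashing. The sketch only shows why a single-cell failure touches a bounded slice of tenants.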

Pattern B · the golden path problem

Platform Abstraction Layers

Every platform team eventually builds an abstraction layer. The question is whether it becomes a golden path or a maintenance tar pit. The difference is usually one architectural decision made in year one.

67% of platform teams rebuild their abstraction layer within 3 years

The rebuild trigger is almost always the same: the abstraction leaked at the wrong moment.

Six more patterns in the archive — event-driven failure modes, sidecar proliferation, and the control plane antipattern.

04 · Incident Dissections
Postmortem Analysis

We read every public postmortem so you don't have to read the vendor-sanitized version. Here's what the original documents actually say.

Cloudflare · Nov 2024
Duration: 37 min
Surface: Global DNS resolution
BGP · DNS · RPKI · On-call process

The BGP Route Leak That Wasn't

A misconfigured RPKI validator created a split-brain condition between two route reflectors. The monitoring system saw divergence but classified it as transient. It wasn't.

"The alert fired at T+4 minutes. The on-call engineer acknowledged at T+6. The first meaningful diagnostic action happened at T+22. We need to talk about the 16 minutes in between."
Cloudflare Engineering Blog, post-incident review
Deeper Signal

Alert fatigue created a 16-minute gap between acknowledgment and action. This pattern appears in 8 of the last 12 major CDN incidents we've reviewed.
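
The 16-minute figure falls straight out of the quoted timeline. A small sketch of that arithmetic, using only the three timestamps in the quote; the event names and the `gap_minutes` helper are our own labels, not fields from the original review.

```python
from datetime import timedelta

# Offsets from incident start (T+0), taken from the quoted timeline.
timeline = {
    "alert_fired": timedelta(minutes=4),
    "acknowledged": timedelta(minutes=6),
    "first_diagnostic_action": timedelta(minutes=22),
}


def gap_minutes(events: dict, start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    return int((events[end] - events[start]).total_seconds() // 60)


time_to_ack = gap_minutes(timeline, "alert_fired", "acknowledged")
ack_to_action = gap_minutes(timeline, "acknowledged", "first_diagnostic_action")
print(f"time to acknowledge: {time_to_ack} min")    # 2 min
print(f"acknowledge to action: {ack_to_action} min")  # 16 min
```

Tracking acknowledge-to-action separately from time-to-acknowledge is the point: the first number looked healthy here, and the second is where the incident stretched.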

Linear · Jan 2025
Duration: 22 min
Surface: Issue creation, sync engine
Connection pooling · Capacity planning · Postgres

Database Connection Pool Exhaustion at 3× Expected Load

A viral Hacker News thread drove 3× normal signup traffic. The connection pool was sized for peak organic load, not viral load. The gap between those two numbers is a thesis statement about growth assumptions.

"We had headroom for a good day. We didn't have headroom for the day someone posts you on HN and you become the top comment."
Linear engineering retrospective
Deeper Signal

Viral load events follow a different distribution than organic peaks. Sizing for P99 organic traffic leaves you exposed to social amplification events.
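
A rough way to see the exposure is Little's law: average connections in use equals request rate times how long each request holds a connection. The sketch below applies it with made-up numbers. The 400 req/s peak, 50 ms hold time, and 1.5× headroom are assumptions for illustration, not figures from the Linear retrospective.

```python
import math


def required_connections(requests_per_sec: int, hold_time_ms: int) -> int:
    """Little's law: connections in use = arrival rate x time each one is held."""
    return math.ceil(requests_per_sec * hold_time_ms / 1000)


# Hypothetical sizing inputs (not from the retrospective).
ORGANIC_PEAK_RPS = 400   # assumed peak organic request rate
HOLD_TIME_MS = 50        # assumed time a request holds a connection
HEADROOM = 1.5           # assumed safety factor on top of organic peak

pool_size = math.ceil(required_connections(ORGANIC_PEAK_RPS, HOLD_TIME_MS) * HEADROOM)

# The incident's 3x multiplier applied to the assumed organic peak.
viral_rps = 3 * ORGANIC_PEAK_RPS
needed = required_connections(viral_rps, HOLD_TIME_MS)

print(f"pool size: {pool_size}, needed at 3x load: {needed}")
print(f"pool exhausted: {needed > pool_size}")
```

With these inputs the pool holds 30 connections while 3× load needs about 60, so a 1.5× cushion over organic peak does not cover a 3× event. Hold times also tend to grow under contention, which the sketch ignores.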

Every issue includes one dissected incident. The kind of analysis that usually lives in a private Notion doc.

05 · The Data
Q1 2026 Sample

From 340 incident reports, Q4 2025 – Q1 2026

Where production incidents actually originate — vs. where teams look first.

73%

of incidents traced to configuration drift — not code changes, not dependency failures. A config file changed by a human, not a deploy pipeline.

4.2×

longer MTTR when config drift is the root cause vs code regression

18 min

median time before teams pivot from code investigation to config

We pulled this from 340 incident retrospectives published between Oct 2025 and Jan 2026. The config drift signal has been consistent for three quarters. Nobody is writing about it at scale.
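
For readers who want a feel for what "config drift" means operationally, here is a minimal sketch of a drift check: compare what is declared in source control against what is actually live. The `find_drift` helper, the key names, and the values are all hypothetical, and a real check would read from your IaC state and the running environment rather than two literals.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def find_drift(declared: dict, live: dict) -> dict:
    """Return {key: (declared_value, live_value)} for every key that differs."""
    drift = {}
    for key in declared.keys() | live.keys():
        want = declared.get(key, MISSING)
        have = live.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: one flag flipped by hand, one variable added by hand.
declared = {"POOL_SIZE": "30", "FEATURE_X": "off", "DNS_TTL": "300"}
live = {"POOL_SIZE": "30", "FEATURE_X": "on", "DNS_TTL": "300", "DEBUG": "true"}

drift = find_drift(declared, live)
for key, (want, have) in sorted(drift.items()):
    print(f"{key}: declared={want} live={have}")
```

Running a check like this on a schedule, rather than only at deploy time, is what catches changes made outside the pipeline.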

Incident Root Cause Distribution — Q1 2026

Configuration drift · 73%
RPKI, feature flags, env vars, infra-as-code drift

Dependency failure (external) · 48%
Third-party APIs, managed services, DNS

Code regression · 31%
Application bugs, memory leaks, race conditions

Capacity / scaling failure · 22%
Autoscaler misconfiguration, connection pool exhaustion

Human error (non-config) · 14%
Runbook deviations, manual interventions

n=340 · incidents from public postmortems and engineering blogs · percentages exceed 100% due to multi-cause classification
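
The five shares above sum to 188%, which is expected when one incident can carry several root-cause tags. A small sketch of that tallying, using four invented incidents rather than the real 340; the tag names and the `cause_shares` helper are ours.

```python
from collections import Counter

# Hypothetical incidents, each tagged with one or more root causes.
incidents = [
    {"config_drift", "external_dependency"},
    {"config_drift"},
    {"code_regression", "capacity"},
    {"config_drift", "code_regression"},
]


def cause_shares(tagged: list[set[str]]) -> dict[str, float]:
    """Percent of incidents carrying each tag; an incident counts once per tag."""
    counts = Counter(cause for causes in tagged for cause in causes)
    return {cause: 100 * count / len(tagged) for cause, count in counts.items()}


shares = cause_shares(incidents)
total = sum(shares.values())
for cause, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{cause}: {pct:.0f}%")
print(f"sum of shares: {total:.0f}%")
```

Each share is still a fraction of all incidents, so no single tag can exceed 100%. Only the sum can.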

MTTR by Root Cause Type
Config drift · 94 min · ↑ 12%
Code regression · 22 min · ↓ 8%
Capacity failure · 38 min · → flat
External dependency · 61 min · ↑ 3%

Every issue includes one data brief. The methodology is always cited. The raw numbers are always linked.

The infrastructure is always more interesting than the press release.

One dispatch, every Tuesday. No sponsored content. No conference-circuit rehash. Just the infrastructure signal that matters to engineers who build the systems that don't get to fail.

"The toolchain radar alone is worth the subscription. Saved me from a very expensive Earthly migration."

Priya Krishnamurthy, Principal SRE, Stripe

"I forward the incident dissections to my entire on-call rotation. Better than any postmortem training I've seen."

Marcus Oyelaran, Director of Platform Engineering, Shopify

"The data briefs are cited in our quarterly architecture reviews. Actual numbers, actual methodology."

Annika Lindström, CTO, Northstack