Insights
Writing on resilience, organisations, and AI.
Essays and newsletter issues on why engineering organisations keep having the same incidents — and what the feedback loops, organisational patterns, and tensions actually look like in practice.
All issues
27 The Invoice That Arrives After the Incident Is Over Why learning always loses when the other side of the comparison is zero
26 The Severity Argument You Keep Having The argument no rubric will ever settle — and what would
25 When No One in the Room Has Carried the Pager Why I built the Resilience Companion — and why “bootstrap” is the right word for it.
24 The MTTR Argument You Keep Having A metric inherited from manufacturing, applied to systems that don't behave like production lines
23 The Interpretation Layer Why detection isn’t enough, and what the recent Lovable incident tells us about what is often the most neglected part of your organization.
22 When the pair programmer is confidently wrong Notes from migrating the Resilium Labs website and newsletter from Squarespace to Cloudflare, with Claude in the loop
21 When Guidance Becomes Compliance A short story about Iceland, Well-Architected Reviews, and what drift really tells you
20 The Resilience Myths List Things we keep telling ourselves about resilience that aren't true
19 When AI is a single point of failure you can’t audit And the eval frameworks won't save you
18 What 1,000 Executives Know But Can't Fix CockroachDB surveyed 1,000 senior technology executives and found 86 outages per year, 196-minute average resolution times, and a remarkably flat distribution of failure causes that tells a deeper story than the repor…
17 When Architecture Becomes Fluid AWS shipped agent plugins that architect, secure, and deploy serverless applications. When agents can rearchitect systems on the fly to maintain function, architecture becomes fluid -- a runtime variable, not a design…
16 We Mistake "Hasn't Failed Yet" for "Won't Fail" Multi-AZ, cloud neutrality, geopolitical stability. We treated them as physics. A look at why organizations stop questioning the foundations that hold them up.
15 AI doesn't solve your problems. It moves them somewhere you can't see yet. There's a seductive story about AI in operations: deploy it, metrics improve, problems solved. But improved metrics and solved problems are not the same thing. David Woods' Messy 9 framework explains where the problem…
14 Why We Still Suck at Resilience and Why I Wrote a Book About It I wrote a book about why organizations confuse performing resilience with actually being resilient. Three days later, I'm already questioning part of what I wrote.
13 The Prevention Paradox at Civilizational Scale Effective prevention creates doubt about its necessity. The pattern that hollows out engineering resilience is the same one that just broke the world order.
12 Why Your Chaos Experiments Give You False Confidence Your chaos experiment worked perfectly. Database failed over, circuit breaker tripped, traffic rerouted, recovery completed in 30 seconds. Three months later, the same scenario in production triggered a 23-minute deat…
11 What to do after the hypothesis conversation Most teams make the same mistake after discovering gaps in their system understanding: they either panic and try to fix everything, or they run experiments without investigating first. Here's how to decide what to inv…
10 Your best chaos engineering happens before you break anything Most chaos engineering starts with breaking things. Start here instead: the 45-minute conversation that reveals more than most experiments ever will.
09 When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy AI generates code faster than we can understand it. Chaos engineering reveals hidden failures, documents risks, and creates feedback loops to improve both code generation and operations.
08 Controls vs Guardrails: Why Organizations Struggle with Resilience Despite Having All the Right Pieces Why do organizations with all the right resilience practices still fail during crises? The answer lies in understanding the difference between controls and guardrails. Controls create friction during normal operations…
07 Why MTTR is a Misleading Metric (And What to Track Instead) Many engineering teams watch MTTR dashboards that tell misleading stories about their incident response. Here's the mathematical proof of why MTTR fails and practical alternatives your team can implement immediately -…
06 The Prevention Paradox: Why Successful Resilience Work Becomes Its Own Enemy The Prevention Paradox describes a destructive cycle where successful resilience work makes itself appear unnecessary, leading organizations to systematically disinvest in the very capabilities that prevent disasters.…
05 The Quiet Erosion: How Organizations Drift Into Failure Learn how small, reasonable decisions gradually push organizations toward failure. A detailed case study of TrendCart's drift from safety to crisis and recovery.
04 Beyond Root Cause: A Better Approach to Understanding Complex System Failures Discover why traditional root cause analysis and 5 Whys frameworks fall short in complex systems. Learn practical alternatives and the 'Trojan Horse' approach to implement meaningful change in your organization's inci…
03 Beyond Traditional Resilience Resilium Labs offers a paradigm shift in resilience engineering, moving beyond rigid frameworks to embrace complexity, champion uncertainty, prioritize recovery, and implement elegant simplicity. This approach transfo…
02 Transform Disruption into Competitive Advantage Let's be honest; disruption is the norm, not the exception. Headlines regularly feature outages affecting banks, e-commerce platforms, entertainment providers, and airlines. Failure has become an everyday reality. But…
01 What is Resilience Engineering? Resilience Engineering goes beyond traditional reliability by focusing not just on preventing failures, but on successfully adapting to them when they occur. With applications across software development, healthcare,…