Insights
Writing on resilience, organizations, and AI.
Essays and newsletter issues on why engineering organizations keep having the same incidents — and what the feedback loops, organizational patterns, and tensions actually look like in practice.
All issues
20 What 1,000 Executives Know But Can't Fix CockroachDB surveyed 1,000 senior technology executives and found 86 outages per year, 196-minute average resolution times, and a remarkably flat distribution of failure causes that tells a deeper story than the repor…
19 When Architecture Becomes Fluid AWS shipped agent plugins that architect, secure, and deploy serverless applications. When agents can rearchitect systems on the fly to maintain function, architecture becomes fluid -- a runtime variable, not a design…
18 We Mistake "Hasn't Failed Yet" for "Won't Fail" Multi-AZ, cloud neutrality, geopolitical stability. We treated them as physics. A look at why organizations stop questioning the foundations that hold them up.
17 AI doesn't solve your problems. It moves them somewhere you can't see yet. There's a seductive story about AI in operations: deploy it, metrics improve, problems solved. But improved metrics and solved problems are not the same thing. David Woods' Messy 9 framework explains where the problem…
16 Why We Still Suck at Resilience and Why I Wrote a Book About It I wrote a book about why organizations confuse performing resilience with actually being resilient. Three days later, I'm already questioning part of what I wrote.
15 The Prevention Paradox at Civilizational Scale Effective prevention creates doubt about its necessity. The pattern that hollows out engineering resilience is the same one that just broke the world order.
14 Why Your Chaos Experiments Give You False Confidence Your chaos experiment worked perfectly. Database failed over, circuit breaker tripped, traffic rerouted, recovery completed in 30 seconds. Three months later, the same scenario in production triggered a 23-minute deat…
13 What to do after the hypothesis conversation Most teams make the same mistake after discovering gaps in their system understanding: they either panic and try to fix everything, or they run experiments without investigating first. Here's how to decide what to inv…
12 Your best chaos engineering happens before you break anything Most chaos engineering starts with breaking things. Start here instead: the 45-minute conversation that reveals more than most experiments ever will.
11 When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy AI generates code faster than we can understand it. Chaos engineering reveals hidden failures, documents risks, and creates feedback loops to improve both code generation and operations.
10 Controls vs Guardrails: Why Organizations Struggle with Resilience Despite Having All the Right Pieces Why do organizations with all the right resilience practices still fail during crises? The answer lies in understanding the difference between controls and guardrails. Controls create friction during normal operations…
09 Why MTTR is a Misleading Metric (And What to Track Instead) Many engineering teams watch MTTR dashboards that tell misleading stories about their incident response. Here's the mathematical proof of why MTTR fails and practical alternatives your team can implement immediately -…
08 The Prevention Paradox: Why Successful Resilience Work Becomes Its Own Enemy The Prevention Paradox describes a destructive cycle where successful resilience work makes itself appear unnecessary, leading organizations to systematically disinvest in the very capabilities that prevent disasters.…
07 The Quiet Erosion: How Organizations Drift Into Failure Learn how small, reasonable decisions gradually push organizations toward failure. A detailed case study of TrendCart's drift from safety to crisis and recovery.
06 Beyond Root Cause: A Better Approach to Understanding Complex System Failures Discover why traditional root cause analysis and 5 Whys frameworks fall short in complex systems. Learn practical alternatives and the 'Trojan Horse' approach to implement meaningful change in your organization's inci…
05 Beyond Traditional Resilience Resilium Labs offers a paradigm shift in resilience engineering, moving beyond rigid frameworks to embrace complexity, champion uncertainty, prioritize recovery, and implement elegant simplicity. This approach transfo…
04 Transform Disruption into Competitive Advantage Let's be honest; disruption is the norm, not the exception. Headlines regularly feature outages affecting banks, e-commerce platforms, entertainment providers, and airlines. Failure has become an everyday reality. But…
03 Gamechangers in Resilience - Interview with Iluminr Adrian shares key insights: resilience comes from controlled stress exposure, like Finland's sauna-to-ice tradition. Architecture reviews often miss component interactions and degradation patterns. Removing complexity…
02 What is Resilience Engineering? Resilience Engineering goes beyond traditional reliability by focusing not just on preventing failures, but on successfully adapting to them when they occur. With applications across software development, healthcare,…
01 On failure Thoughts are like drops of water: with our thoughts we can drown in a sea of negativity, or we can float on the ocean of life. - Louise Hay