20
What 1,000 Executives Know But Can't Fix CockroachDB surveyed 1,000 senior technology executives and found 86 outages per year, 196-minute average resolution times, and a remarkably flat distribution of failure causes that tells a deeper story than the repor…
Mar 2026
19
When Architecture Becomes Fluid AWS shipped agent plugins that architect, secure, and deploy serverless applications. When agents can rearchitect systems on the fly to maintain function, architecture becomes fluid -- a runtime variable, not a design…
Mar 2026
18
We Mistake "Hasn't Failed Yet" for "Won't Fail" Multi-AZ, cloud neutrality, geopolitical stability. We treated them as physics. A look at why organizations stop questioning the foundations that hold them up.
Mar 2026
17
AI doesn't solve your problems. It moves them somewhere you can't see yet. There's a seductive story about AI in operations: deploy it, metrics improve, problems solved. But improved metrics and solved problems are not the same thing. David Woods' Messy 9 framework explains where the problem…
Mar 2026
16
Why We Still Suck at Resilience and Why I Wrote a Book About It I wrote a book about why organizations confuse performing resilience with actually being resilient. Three days later, I'm already questioning part of what I wrote.
Feb 2026
15
The Prevention Paradox at Civilizational Scale Effective prevention creates doubt about its necessity. The pattern that hollows out engineering resilience is the same one that just broke the world order.
Feb 2026
14
Why Your Chaos Experiments Give You False Confidence Your chaos experiment worked perfectly. Database failed over, circuit breaker tripped, traffic rerouted, recovery completed in 30 seconds. Three months later, the same scenario in production triggered a 23-minute deat…
Jan 2026
13
What to do after the hypothesis conversation Most teams make the same mistake after discovering gaps in their system understanding: they either panic and try to fix everything, or they run experiments without investigating first. Here's how to decide what to inv…
Dec 2025
12
Your best chaos engineering happens before you break anything Most chaos engineering starts with breaking things. Start here instead: the 45-minute conversation that reveals more than most experiments ever will.
Nov 2025
11
When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy AI generates code faster than we can understand it. Chaos engineering reveals hidden failures, documents risks, and creates feedback loops to improve both code generation and operations.
Oct 2025
10
Controls vs Guardrails: Why Organizations Struggle with Resilience Despite Having All the Right Pieces Why do organizations with all the right resilience practices still fail during crises? The answer lies in understanding the difference between controls and guardrails. Controls create friction during normal operations…
Aug 2025
09
Why MTTR is a Misleading Metric (And What to Track Instead) Many engineering teams watch MTTR dashboards that tell misleading stories about their incident response. Here's the mathematical proof of why MTTR fails and practical alternatives your team can implement immediately -…
Jul 2025
08
The Prevention Paradox: Why Successful Resilience Work Becomes Its Own Enemy The Prevention Paradox describes a destructive cycle where successful resilience work makes itself appear unnecessary, leading organizations to systematically disinvest in the very capabilities that prevent disasters.…
Jun 2025
07
The Quiet Erosion: How Organizations Drift Into Failure Learn how small, reasonable decisions gradually push organizations toward failure. A detailed case study of TrendCart's drift from safety to crisis and recovery.
May 2025
06
Beyond Root Cause: A Better Approach to Understanding Complex System Failures Discover why traditional root cause analysis and 5 Whys frameworks fall short in complex systems. Learn practical alternatives and the 'Trojan Horse' approach to implement meaningful change in your organization's inci…
May 2025
05
Beyond Traditional Resilience Resilium Labs offers a paradigm shift in resilience engineering, moving beyond rigid frameworks to embrace complexity, champion uncertainty, prioritize recovery, and implement elegant simplicity. This approach transfo…
May 2025
04
Transform Disruption into Competitive Advantage Let's be honest: disruption is the norm, not the exception. Headlines regularly feature outages affecting banks, e-commerce platforms, entertainment providers, and airlines. Failure has become an everyday reality. But…
May 2025
03
Gamechangers in Resilience - Interview with Iluminr Adrian shares key insights: resilience comes from controlled stress exposure, like Finland's sauna-to-ice tradition. Architecture reviews often miss component interactions and degradation patterns. Removing complexity…
May 2025
02
What is Resilience Engineering? Resilience Engineering goes beyond traditional reliability by focusing not just on preventing failures, but on successfully adapting to them when they occur. With applications across software development, healthcare,…
May 2025
01
On failure Thoughts are like drops of water: with our thoughts we can drown in a sea of negativity, or we can float on the ocean of life. - Louise Hay
Oct 2024