The #CloudFlare outage follows a pattern we have seen before. Was it Google last time?
1. Generate an exciting config file
2. Auto-deploy the file everywhere
3. Everything everywhere crashes
All these big systems want to be able to react quickly to certain types of events, so they probably all have this failure mode baked in. Because security! Or some such.
"Obviously" the files deployed this way "should" be validated carefully. And there "should" be canaries and staged roll-outs... Should.
There's a fun " #devops is hard" lesson here ( #CloudFlare).
1. Because Security, you want to be able to deploy global changes very quickly
2. Because Reliability, you want staged roll-outs that pause or even auto-revert if key metrics get worse
You can't have both 1 and 2 at the same time.
And the temptation to go fast sometimes WILL prove irresistible.
So if you're looking for ways to globally cripple a big cloud, this is the pattern to look for: what is too urgent for staged roll-outs?