#Tag · bonfire of thepocolips

The #CloudFlare outage follows a pattern we have seen before. Was it Google last time?

1. Generate an exciting config file
2. Auto-deploy the file everywhere
3. Everything everywhere crashes

All these big systems want to be able to react quickly to certain types of events, so they probably all have this failure mode baked in. Because security! Or some such.

"Obviously" the files deployed this way "should" be validated carefully. And there "should" be canaries and staged roll-outs... Should.

Bjarni 🙊 🇮🇸 🍏

@HerraBRE@mastodon.xyz replied · 2 days ago

There's a fun "#devops is hard" lesson here (#CloudFlare).

1. Because Security, you want to be able to deploy global changes very quickly
2. Because Reliability, you want staged roll-outs that pause or even auto-revert if key metrics get worse

You can't have both 1 and 2 at the same time.

And the temptation to go fast sometimes WILL prove irresistible.

So if you're looking for ways to globally cripple a big cloud, this is the pattern to look for: what is too urgent for staged roll-outs?
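For the curious, here is a minimal sketch of what option 2 could look like: validate first, push to a canary stage, check health, and auto-revert if things get worse. Everything here is hypothetical and for illustration only (`staged_rollout`, the `validate` and `is_healthy` callbacks, the host names); it is not how Cloudflare or anyone else actually does it.

```python
from typing import Callable, Dict, List


def staged_rollout(
    deployed: Dict[str, str],
    new_config: str,
    stages: List[List[str]],
    validate: Callable[[str], bool],
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Roll new_config out stage by stage.

    Returns True if every stage took the config and stayed healthy,
    False if validation failed or a stage triggered a revert.
    All names are illustrative assumptions, not a real deployment API.
    """
    # Step 0: refuse to ship a config that fails validation.
    if not validate(new_config):
        return False

    # Snapshot the current state so we can roll everything back.
    previous = dict(deployed)

    for stage in stages:
        # Deploy to this stage only (the first stage acts as the canary).
        for host in stage:
            deployed[host] = new_config

        # In real life this is where you wait and watch key metrics.
        if not all(is_healthy(host, new_config) for host in stage):
            # Auto-revert every host to the snapshot and stop.
            deployed.clear()
            deployed.update(previous)
            return False

    return True


if __name__ == "__main__":
    fleet = {"canary-1": "v1", "edge-1": "v1", "edge-2": "v1"}
    stages = [["canary-1"], ["edge-1", "edge-2"]]

    ok = staged_rollout(
        fleet,
        "v2-bad",
        stages,
        validate=lambda cfg: True,                    # validation misses the bug
        is_healthy=lambda host, cfg: cfg != "v2-bad",  # but the canary notices
    )
    print(ok, fleet)
```

The catch is the "wait and watch" step: every stage costs time, and that delay is exactly what gets skipped when a change is deemed too urgent.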
