Fibery 2.5 Hours Downtime Post-mortem. The Drama.

April 07, 2022 / 4 min read

Characters: I. (Developer), S. (Developer), A. (Developer), AWS clusters (Edge and Prod). Other minor characters.

👨‍🔧 I. was sleeping peacefully when a curious sunbeam poked his half-open eye. I. closed the eye and recalled that he had forgotten to close the curtains the night before. “Not good…” thought I., and woke up.

It was a relatively warm and sunny day in Warsaw. I. decided to set up a new cluster for a new client. We had a cluster in Europe, but this new cluster was meant to serve US customers. Cluster setups are never fun, so I. invited 🏃‍♂️ S. to share the boredom of configuration, weird problems, and the rare joy of unexpected solutions.

Nothing was smooth. Somehow the steps for the EU cluster did not lead to the expected results in the US cluster. They explored working configs for EU and tried to find the roots of the problems. No luck.

6:33 pm (Warsaw time). “We have to start from scratch,” said I. “I’ll delete the cluster and we will repeat the procedure.” “Well, OK, I’ll grab some tea,” replied S. “Maybe I need to take a break?” thought I. and looked through the window. The sun broke through the clouds and poked I. in the eye again. “Gosh.” I. navigated to Stacks, selected the stack, and clicked “Delete stack”.

6:33:31 pm Monitoring services turned red. Slack channels filled with alarms.

6:34:21 pm A. 🧔‍♂️ wrote to I.: “We have a problem. Down rush. Is it you?”

6:34:37 pm I.: “No”.

6:34:45 pm A.: “Then we have a serious alarm”.

6:34:51 pm I.: “Time to panic. We’ve accidentally deleted the whole production cluster” 💀

6:36:04 pm I. asked S. for help and they huddled together to discuss options. The first option was to set up a new cluster, but the US setup had not gone smoothly, so this might take several hours with unclear results. We also had another running cluster in the EU (we call it “Edge”) for testing and frequent internal releases. We decided to switch all accounts to this cluster and quickly add more resources to it. I. and S. got to work.

6:40 pm Customers started to ping us via Intercom, email, and Community. We gave an estimate of 2 hours until full recovery.

… [Some technical details you don’t want to know: switching services to the new cluster, switching main PostgreSQL DB to the new cluster, recovering external secrets, moving Redis and MongoDB].

9:06 pm. All services had been switched to the Edge cluster, and accounts were recovered and operational. Fibery was up. 😅

2:07 am. We received notifications from a few customers that their accounts had been moved to read-only mode. It turned out that the license service relied on our own Fibery instance (we use Fibery as a CRM), but since we had turned Edge into Prod, our instance was down.

3:00 am. We fixed this issue, but license service deployment was not possible (human factor 🛌).

8:05 am. The license service problem was fixed when the human factor woke up 👻.

X:xx am. We have to recover our own account, and we need to set up a new cluster for that. Hopefully, it will not cause any problems for the current Edge cluster (now Prod).

🦐 Lessons learned

🎰 Action items

Never delete a production cluster.
Never delete a production cluster.
Add protection against cluster deletion.
Improve disaster recovery procedure.
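
Here is a rough sketch of what “protection against cluster deletion” could look like. It assumes the “Stacks” screen from the story is AWS CloudFormation, which the post never says outright. The method names `update_termination_protection`, `describe_stacks`, and `delete_stack` are the boto3 CloudFormation client calls; everything else (`ProtectedStackError`, `guarded_delete_stack`, the retype-the-name rule) is hypothetical and only here for illustration. The client is passed in as a parameter, so the logic can be tried without touching a real AWS account.

```python
class ProtectedStackError(Exception):
    """Raised when the guard refuses to delete a stack (hypothetical helper)."""


def enable_termination_protection(cfn, stack_name):
    # Assumption: `cfn` behaves like boto3.client("cloudformation").
    # With protection on, CloudFormation itself rejects delete requests.
    cfn.update_termination_protection(
        StackName=stack_name,
        EnableTerminationProtection=True,
    )


def guarded_delete_stack(cfn, stack_name, typed_confirmation):
    # Extra human speed bump: the operator must retype the exact stack name.
    if typed_confirmation != stack_name:
        raise ProtectedStackError(
            f"confirmation {typed_confirmation!r} does not match {stack_name!r}"
        )

    # Look up the stack and refuse early if termination protection is enabled.
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    if stack.get("EnableTerminationProtection", False):
        raise ProtectedStackError(
            f"{stack_name!r} has termination protection enabled"
        )

    cfn.delete_stack(StackName=stack_name)
```

With real AWS access, `cfn` would be `boto3.client("cloudformation")`. The idea is two layers: termination protection on every production stack, plus a wrapper that makes a sleepy human type the stack name before anything is deleted.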

👨‍🔧 I.’s feelings

I feel terrible. It’s like you’ve dropped a baby. What wisdom have I gained? Be more careful? But you always test something and delete stuff, I do that constantly… We should have enabled stack deletion protection, true. Sadness and despair…

So no one told you life was gonna be this way
Your job’s a joke, you’re broke
Your love life’s DOA
It’s like you’re always stuck in second gear
When it hasn’t been your day, your week, your month
Or even your year, but
I’ll be there for you
