Fibery 2.5 Hours Downtime Post-mortem. The Drama.

April 07, 2022 / 4 min read

Characters: I. (Developer), S. (Developer), A. (Developer), AWS clusters (Edge and Prod). Other minor characters.

👨‍🔧 I. was sleeping peacefully when a curious sunbeam poked his half-open eye. I. closed the eye and recalled that he had forgotten to close the curtains the night before. “Not good…” thought I., and woke up.

It was a relatively warm and sunny day in Warsaw. I. decided to set up a new cluster for a new client. We had a cluster in Europe, but this new cluster was meant to serve US customers. Cluster setups are never fun, so I. invited 🏃‍♂️ S. to share the boredom of configurations, weird problems, and the rare joy of unexpected solutions.

Nothing was smooth. Somehow the steps for the EU cluster did not lead to the expected results in the US cluster. They explored working configs for EU and tried to find the roots of the problems. No luck.

6:33 pm (Warsaw time). “We have to start from scratch,” said I. “I’ll delete the cluster and we will repeat the procedure.” “Well, OK, I’ll grab some tea,” replied S. “Maybe I should take a break?” thought I. and looked through the window. The sun broke through the clouds and poked I. in the eye again. “Gosh.” I. navigated to Stacks, selected the stack, and clicked “Delete stack”.

6:33:31 pm Monitoring services turned red. Slack channels filled with alarms.

6:34:21 pm A. 🧔‍♂️ wrote to I.: “We have a problem. Down rush. Is it you?”

6:34:37 pm I.: “No”.

6:34:45 pm A.: “Then we have a serious alarm”.

6:34:51 pm I.: “Time to panic. We’ve accidentally deleted the whole production cluster” 💀

6:36:04 pm I. asked S. for help and they huddled together to discuss options. The first option was to set up a new cluster, but the US setup had not gone smoothly and a fresh one might take several hours with unclear results. We also had another running cluster in the EU (we call it “Edge”) for testing and frequent internal releases. We decided to switch all accounts to this cluster and quickly add more resources to it. I. and S. got to work.

6:40 pm Customers started to ping us via Intercom, email, and Community. We gave an estimate of 2 hours until full recovery.

🦐 We have serious problems on the cluster. ETA is 2 hours from now. We hope to fix it faster. Data is safe.
— Fibery (@fibery_io) April 6, 2022

… [Some technical details you don’t want to know: switching services to the new cluster, switching main PostgreSQL DB to the new cluster, recovering external secrets, moving Redis and MongoDB].

9:06 pm. All services were switched to the Edge cluster, and accounts were recovered and became operational. Fibery was up. 😅

2:07 am. We received notifications from a few customers that their accounts had moved to read-only mode. It turned out that the license service relied on our own Fibery instance (we use Fibery as a CRM), but since Edge now served as Prod, our instance was down.

3:00 am. We fixed this issue, but license service deployment was not possible (human factor 🛌).

8:05 am. The license service problem was fixed when the human factor woke up 👻.

X:xx am. We have to recover our own account and we need to set up a new cluster for that. Hopefully, it will not cause any problems for the current Edge cluster (now Prod).

🦐 Lessons learned

🎰 Action items

Never delete a production cluster.
Never delete a production cluster.
Add protection against cluster deletion.
Improve disaster recovery procedure.
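
The “Add protection against cluster deletion” item could look roughly like the sketch below. It assumes the cluster is managed as an AWS CloudFormation stack (the story only mentions “Stacks” and “Delete stack”, so that part is an assumption) and uses the boto3 CloudFormation client call `update_termination_protection`. The helper `protect_stacks`, the `CloudFormationLike` protocol, and the stack name `prod-cluster` are hypothetical names made up for illustration.

```python
from typing import Iterable, List, Protocol


class CloudFormationLike(Protocol):
    """The single boto3 CloudFormation client method this sketch relies on."""

    def update_termination_protection(
        self, *, StackName: str, EnableTerminationProtection: bool
    ) -> dict: ...


def protect_stacks(client: CloudFormationLike, stack_names: Iterable[str]) -> List[str]:
    """Turn on termination protection for each named stack.

    Returns the stack names that were processed, in order.
    """
    protected = []
    for name in stack_names:
        # Once protection is on, a "Delete stack" request is rejected
        # until someone explicitly switches protection off again.
        client.update_termination_protection(
            StackName=name,
            EnableTerminationProtection=True,
        )
        protected.append(name)
    return protected


# Real usage (assumes boto3 is installed and AWS credentials are configured;
# "prod-cluster" is a hypothetical stack name):
#   import boto3
#   protect_stacks(boto3.client("cloudformation"), ["prod-cluster"])
```

The client is passed in rather than created inside the function, so the same helper can run against a stub in tests and against real AWS in a one-off script.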

👨‍🔧 I.’s feelings

I feel terrible. It’s like you’ve dropped a baby. What wisdom have I gained? Be more careful? But you always test something and delete stuff, I do that constantly… We should have enabled stack deletion protection, true. Sadness and despair…

So no one told you life was gonna be this way
Your job’s a joke, you’re broke
Your love life’s DOA
It’s like you’re always stuck in second gear
When it hasn’t been your day, your week, your month
Or even your year, but
I’ll be there for you
