Radically Honest Blog

Fibery 2.5 Hours Downtime Post-mortem. The Drama.

Characters: I. (Developer), S. (Developer), A. (Developer), AWS clusters (Edge and Prod). Other minor characters.

👨‍🔧 I. was sleeping peacefully when a curious sunbeam poked his half-open eye. I. closed the eye and recalled that he had forgotten to close the curtains the night before. “Not good…”, thought I., and woke up.

It was a relatively warm and sunny day in Warsaw. I. decided to set up a new cluster for a new client. We had a cluster in Europe, but this new cluster was meant to serve US customers. Cluster setups are never fun, so I. invited 🏃‍♂️ S. to share the boredom of configurations, weird problems, and the rare joy of unexpected solutions.

Nothing was smooth. Somehow the steps for the EU cluster did not lead to the expected results in the US cluster. They explored working configs for EU and tried to find the roots of the problems. No luck.

6:33 pm (Warsaw time). “We have to start from scratch,” said I. “I’ll delete the cluster and we will repeat the procedure.” “Well, OK, I’ll grab some tea,” replied S. “Maybe I’ll have to take a break?” thought I. and looked through the window. The sun broke through the clouds and poked I. in the eye again. “Gosh.” I. navigated to Stacks, selected the stack, and clicked “Delete stack”.

6:33:31 pm Monitoring services turned red. Slack channels filled with alarms.

6:34:21 pm A. 🧔‍♂️ wrote to I.: “We have a problem. Down rush. Is it you?”

6:34:37 pm I.: “No”.

6:34:45 pm A.: “Then we have a serious alarm”.

6:34:51 pm I.: “Time to panic. We’ve accidentally deleted the whole production cluster” 💀

6:36:04 pm I. asked S. for help and they huddled to discuss options. The first option was to set up a new cluster, but the US setup had not gone smoothly, and a fresh one might take several hours with unclear results. We also had another running cluster in the EU (we call it “Edge”) for testing and frequent internal releases. We decided to switch all accounts to this cluster and quickly add more resources to it. I. and S. got to work.

6:40 pm Customers started to ping us via Intercom, email, and Community. We gave an estimate of 2 hours until full recovery.

[Some technical details you don’t want to know: switching services to the new cluster, switching main PostgreSQL DB to the new cluster, recovering external secrets, moving Redis and MongoDB].

9:06 pm. All services were switched to the Edge cluster, and accounts were recovered and operational again. Fibery was up. 😅

2:07 am. A few customers notified us that their accounts had switched to read-only mode. It turned out that the license service relied on our own Fibery instance (we use Fibery as a CRM), but since we had turned Edge into Prod, our instance was down.

3:00 am. We fixed this issue, but license service deployment was not possible (human factor 🛌).

8:05 am. The license service problem was fixed when the human factor woke up 👻.

X:xx am. We have to recover our own account, and we need to set up a new cluster for that. Hopefully, it will not cause any problems for the current Edge cluster (now Prod).

🦐 Lessons learned

🎰 Action items

  • Never delete a production cluster.
  • Never delete a production cluster.
  • Add protection against cluster deletion.
  • Improve disaster recovery procedure.
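
For the "protection against cluster deletion" item, here is a minimal sketch of what such a guard could look like. It assumes the "Delete stack" button above belongs to AWS CloudFormation (the post does not say so explicitly) and that `cfn` is a boto3 CloudFormation client. The `env` tag and the retype-the-name confirmation are hypothetical conventions invented for illustration, not something Fibery has described.

```python
from typing import Any


class ProtectedStackError(RuntimeError):
    """Raised when a delete is attempted on a stack that must not be deleted."""


def protect_stack(cfn: Any, stack_name: str) -> None:
    # With this flag on, CloudFormation itself rejects DeleteStack calls,
    # including the "Delete stack" button in the console.
    cfn.update_termination_protection(
        StackName=stack_name,
        EnableTerminationProtection=True,
    )


def guarded_delete(cfn: Any, stack_name: str, confirm_name: str) -> None:
    # Hypothetical extra check: the operator must retype the exact stack name.
    if confirm_name != stack_name:
        raise ProtectedStackError("confirmation does not match the stack name")

    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    tags = {t["Key"]: t["Value"] for t in stack.get("Tags", [])}

    # Refuse if the stack is protected or carries the (assumed) env=prod tag.
    if stack.get("EnableTerminationProtection") or tags.get("env") == "prod":
        raise ProtectedStackError(f"{stack_name} is protected, refusing to delete")

    cfn.delete_stack(StackName=stack_name)
```

With real credentials, `cfn` would come from `boto3.client("cloudformation")`. Calling `protect_stack(cfn, ...)` once on the production stack is the part that matters most; `guarded_delete` only adds a second, script-level speed bump for sleepy or sun-blinded operators.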

👨‍🔧 I. feelings

I feel terrible. It’s like you’ve dropped a baby. What wisdom have I gained? Be more careful? But you always test something and delete stuff, I do that constantly… We should have enabled stack deletion protection, true. Sadness and despair…

So no one told you life was gonna be this way
Your job’s a joke, you’re broke
Your love life’s DOA
It’s like you’re always stuck in second gear
When it hasn’t been your day, your week, your month
Or even your year, but
I’ll be there for you
