How To Write an Incident Postmortem: My Chat Incident Case Study
An application of engineering on-call best practices in real life
If you have the Substack App installed and follow me, you might be aware of what I’m about to talk about. If not, what I’m referring to is the incident that impacted many of my subscribers on April 30th: an influx of new threads being created in the chat, combined with Substack notification bugs, resulted in subscribers being bombarded with push notifications.
I’d like to apologize again for the inconvenience this issue might have caused some of you. I read all the messages in the chat and email and I’m truly sorry for your experience.
While mitigating the issue, I thought:
Why not use this incident as a teachable moment and explain how I handled it? What better way to explain how to write a postmortem than by showing an example?
In this article we’re covering:
The backstory of the incident
The postmortem including action items
Awards for the funniest, saddest, and most entrepreneurial chat posts
What on-call principles I used during the incident
How I escalated with Substack and their response
So if you want to see on-call skills applied IRL, keep reading.
Thank you to this week’s post sponsor!
AIQCON invites you to the first AI Quality conference to discuss:
How to formulate the correct functional requirements for AI-driven products
How to assess risk for AI-driven products and choose the correct testing methodology
How to choose a pass/no-pass threshold given your line of business
Which engineering and machine learning modifications to make to improve our performance on those metrics
On June 25th, in San Francisco, join industry leaders, government representatives, and builders to create gold standards of AI Quality Metrics so that AI does what it is intended to do and doesn't risk being a liability for organizations.
Backstory
On Tuesday, April 30th, I was in San Diego for a few days visiting a good friend of mine. Around 11, we decided to take his convertible out for a ride on the coast to get coffee and food.
It was supposed to be a beautiful day, perfect for enjoying the beautiful coastal views of the Pacific. Not even 5 minutes into our ride, I got a DM that was going to change our plans.
Borges Notes was incredibly kind to go out of his way to DM me on a different platform to let me know about the chat issues. I immediately got on it.
My first reaction to finding out about the issue was to panic. Quickly, I realized this was no different than being on call. Getting paged to deal with a mysterious issue isn’t new to me: I’ve been on call for many years and have had to deal with some really gnarly, sometimes out-of-my-control situations.
I debated returning home to get my laptop but quickly realized I didn’t have time for it.
Mitigation wasn’t straightforward, but I was able to get to a solution in less than 30 minutes while doing everything from my phone, in the car. I missed the scenic views, but at least I was able to make it on time for brunch 😅
To learn more about the exact steps I took, keep reading!
The Substack Chat Postmortem
Date: 4/30/2024
Authors: Me 😊
1. Incident Overview
Description:
New thread creations generated a storm of app push notifications that bombarded Substack chat subscribers of “The Caring Techie Newsletter”, while the newsletter author was unable to properly turn off the chat.
Timeline:
11:05 am PST: a series of threads started getting created on “The Caring Techie Newsletter” chat app saying hi and hello
11:07 am PST: people started to get visibly bothered by the storm of push notifications
11:07 am to 11:12 am PST: people continued to type short messages in the chat while others tried to make them stop; some people commented that they had muted the chat or unsubscribed, and others were unable to mute and got visibly more and more irritated
11:18 am PST: the on-call engineer (aka me) gets pinged on X by a wonderful subscriber that “something is going on with Substack! Try to deactivate the chat please!”
11:21 am PST: I disabled the chat altogether from the web UI and also checked why I hadn’t received the series of notifications despite my app telling me that notifications were enabled
11:25 am PST: double-checked with the subscriber whether it had worked; it hadn’t
11:26 am PST: went to the app settings and disabled creating threads; changing any settings was throwing 400 errors without any logs
11:28 am PST: re-enabled chat from web UI, then disabled thread creation in the app UI
11:29 am PST: confirmed the change stopped the notifications
11:29 am PST: incident mitigated
TTM (total time to mitigate): 24 minutes, measured from the first threads at 11:05 am
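Here is a minimal sketch of how the durations in the timeline above can be derived. The timestamps come straight from the timeline; the variable names and the split into "time to detect" and "notified to mitigated" are my own labels for illustration, not part of any standard tool.

```python
from datetime import datetime

# Timestamps copied from the timeline above (local time on 4/30/2024).
FMT = "%Y-%m-%d %H:%M"
incident_start = datetime.strptime("2024-04-30 11:05", FMT)   # first threads created
oncall_notified = datetime.strptime("2024-04-30 11:18", FMT)  # DM received on X
mitigated = datetime.strptime("2024-04-30 11:29", FMT)        # notifications stopped


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


time_to_detect = minutes_between(incident_start, oncall_notified)
time_to_mitigate = minutes_between(incident_start, mitigated)
notified_to_mitigated = minutes_between(oncall_notified, mitigated)

print(f"Time to detect:        {time_to_detect} min")         # 13 min
print(f"Time to mitigate:      {time_to_mitigate} min")       # 24 min
print(f"Notified to mitigated: {notified_to_mitigated} min")  # 11 min
```

Splitting the total this way shows that more than half of the 24 minutes passed before I even knew there was a problem, which is why detection shows up again in the lessons learned.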
2. Impact
Scope and severity:
To properly determine the scope, I’d need the percentage of newsletter subscribers who also have the app and have notifications enabled
The severity level would be a SEV-2 or L4 (depending on what scale you use) since it severely affected key features of the chat
Metrics affected:
The only metric I have access to is subscriber count, which decreased by about 300 subscribers
I would’ve loved to look at some push notification charts, if I had access to internal metrics
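If I did have those numbers, the scope estimate described above would be simple arithmetic. The sketch below is only an illustration: the function name, its parameters, and every input value are hypothetical placeholders, not real data from my newsletter or from Substack.

```python
def estimate_impacted(total_subscribers: int, app_rate: float, push_rate: float) -> int:
    """Estimate how many subscribers could have received the notification storm.

    app_rate:  fraction of subscribers who have the app installed (hypothetical input)
    push_rate: fraction of those app users with push notifications enabled (hypothetical input)
    """
    if not (0.0 <= app_rate <= 1.0 and 0.0 <= push_rate <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    return round(total_subscribers * app_rate * push_rate)


# Placeholder inputs for illustration only; these are NOT real numbers.
example = estimate_impacted(total_subscribers=10_000, app_rate=0.5, push_rate=0.5)
print(example)  # 2500
```

Without access to internal metrics, both rates stay unknown, so the scope in this postmortem remains an open question.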
3. Root Causes
If I worked for Substack, here is where I would apply the “5 whys” to understand the root cause of the incident
4. Resolution and Recovery
Quickly glance at the chat to assess the situation and determine the gravity
The first attempt to mitigate failed: disabling the chat from the UI
The second attempt to mitigate: re-enabling the chat from the UI and disabling thread creation from the app
5. Lessons Learned
What went well:
Lovely subscribers who reached out to me to let me know that there is an issue
Prioritizing mitigation to reduce customer impact
What could have gone better:
I didn’t get any of the notifications that bothered everybody because my system notifications for the Substack App were disabled during vacation
The inconsistency between the UI and the App led to confusion and wasted time
Subscribers reported they were still getting messages even after unsubscribing or trying to mute and disable notifications 😭
6. Preventative measures
Disabled subscriber thread creation, at least until the action items (AIs) have been addressed
7. Action Items
8. Appendices
Additional Documentation:
Here are some of the more notable messages from the chat:
and the most annoyed ones:
Sign-Off:
Irina Stanescu, newsletter writer / involuntary on-call 😅
🥇Award ceremony
The award for the funniest comment goes to Rohan Dehal, who posted this 😂😂:
The award for the saddest comment — literally the last thing I wanted 😭 — goes to:
And the award for the most entrepreneurial comment - always be hustling!
What on-call principles I used to handle the incident
Take a deep breath and stay calm
This step is essential in mitigating any crisis.
Focus on mitigation
While the incident was still active, there was no time to look at how many people had unsubscribed, or to read and engage with all the comments in the thread.
Escalating to Substack and waiting for them to fix the issue also wasn’t an option.
I had to do something that stopped the messages, and that’s where all my focus went.
Don’t close the incident without thorough verification
One common mistake when handling incidents is prematurely declaring the incident as mitigated without proper verification. In my case, I had to check with my subscriber whether the chat was really turned off, and it turns out it wasn’t! So my first solution hadn’t worked.
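The "apply a fix, then verify before closing" habit can be sketched as a small loop. This is a toy simulation under my own assumptions: the state dictionary, the function names, and the verification check are all hypothetical stand-ins, not Substack's API. The simulated behavior simply mirrors what I observed, where disabling the chat alone did not stop the notifications.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each step pairs a human-readable name with the action to perform.
MitigationStep = Tuple[str, Callable[[], None]]


def mitigate_with_verification(
    steps: List[MitigationStep],
    verify: Callable[[], bool],
) -> Optional[str]:
    """Apply steps in order and return the first one that verification confirms.

    Returns None if no step passes verification, meaning the incident is still open.
    """
    for name, action in steps:
        action()
        if verify():
            return name
    return None


# Toy stand-in for the chat settings (hypothetical, for illustration only).
chat_state: Dict[str, bool] = {"chat_enabled": True, "thread_creation_enabled": True}


def disable_chat_from_web_ui() -> None:
    chat_state["chat_enabled"] = False


def reenable_chat_and_disable_threads() -> None:
    chat_state["chat_enabled"] = True
    chat_state["thread_creation_enabled"] = False


def subscriber_confirms_quiet() -> bool:
    # In this simulation only disabling thread creation stops the notifications,
    # mirroring what the subscriber reported during the incident.
    return not chat_state["thread_creation_enabled"]


fixed_by = mitigate_with_verification(
    [
        ("disable chat from web UI", disable_chat_from_web_ui),
        ("re-enable chat, disable thread creation in app", reenable_chat_and_disable_threads),
    ],
    verify=subscriber_confirms_quiet,
)
print(fixed_by)  # re-enable chat, disable thread creation in app
```

The key design choice is that the verification check is independent of the action: the step is only counted as successful when someone on the receiving end confirms it, not when the settings page says so.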
Root cause only after mitigation
A few hours after mitigation, I sent an email to my POC at Substack asking them for help escalating the incident. This was the natural course of action since I have no visibility into Substack’s systems.
Comms
It’s important to show customers you’re actively trying to solve the issue they reported. In an ideal case I would’ve let people know I was looking into the issue, but I didn’t have time and decided to focus on mitigation. I was able to send out comms only once the issue was addressed.
Escalating to the Substack team
The same day the incident happened, I escalated the issue with my Substack POC, and shortly after I got an email from the head of Trust & Safety.
I won’t post the entire exchange, but I’ll tell you this:
The Substack team took full responsibility for the issues
They confirmed that they were able to reproduce the issues I reported and provided a root cause
“We believe the notification issues are likely caused by third-party delays with activity notifications being stuck in Apple queues. We're currently working to resolve this issue as well as disabled chats not reflecting in the app UI.”
Promised to follow up when the full list of issues is resolved
Committed to solving chat issues so chat is a safe place for open and interesting conversations
Conclusion
Hope you found the lessons in this article valuable, despite it being a very painful experience for me and “The Caring Techie” subscribers.
Failing in public sometimes is as valuable as building in public. The best approach to building software is to not shy away from looking at what didn’t go well and learning from it.
Don’t forget to be kind, even in situations like these. There usually isn’t any ill intent, just systems malfunctioning or software not designed properly. Don’t go attacking people.
On-call folks deserve awards. I forgot how stressful it is. Thank you to all those keeping systems up and running. Not all heroes wear capes.
I apologize once more if this incident impacted you in any way.
Until next time,
Your Caring Techie
Thanks again to today’s sponsor: AIQCON, the first AI Quality conference.
Mark your calendars: June 25th in San Francisco! Get your tickets using the code ‘testinprod’ for 20% off at www.aiqualityconference.com.
Great incident write up and response, Irina.
I love seeing you use your engineering skills to solve a “business problem” like this.
It shows you really are a caring techie and treat your subscribers with respect and run this newsletter like a business.
Keep at it friend! Mistakes happen. Acknowledging them and working to resolve is what matters. 👏🏼💙
I'm glad to see you enjoyed the meme and got a laugh out of the situation, Irina! It's a testament to staying positive even when things get a bit chaotic. 😄 🙌