Post-Mortem: 11/27/2023 (Outbound SMS Delivery Issue)
Written by Erol Toker

Impact:

  • Region: US1 data center.

  • The main issue reported during this time was difficulty in getting call disposition updates to appear in the app.

  • There was no degradation in calling or SMS delivery.

Timeline:

  • 17:05 EDT - first user reports received

  • 17:14 EDT - new read replica database added

  • 17:20 EDT - issue resolved

Root Cause Analysis:

  • The issue was caused by a spike in requests related to a query on a table that overloaded one of our read replica databases in the US1 region. This caused delays in the read replica replicating data from the master DB.

  • The result was that operations following a specific pattern (1) write to the DB, then 2) immediately refresh to update the rendered state in the app) saw a delay in getting the updated DB state for call dispositions. To end users it looked like disposition updates were failing, when in reality the previous state was being returned. When users applied the disposition on a subsequent attempt, the updated status was reported back to the front-end client (see the sketch after this list).

  • The query responsible for the spike was identified.

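To make the failure mode concrete, here is a minimal Python sketch (illustrative only, not our production code; the names and the lag value are made up) that simulates a write going to the primary while the app's refresh reads from a replica that has not yet caught up:

    import threading
    import time

    # Illustrative stand-ins for the primary DB and a lagging read replica.
    primary = {"call_123": "no_disposition"}
    replica = {"call_123": "no_disposition"}
    REPLICATION_LAG_SECONDS = 2.0  # exaggerated lag for the demo

    def write_disposition(call_id, disposition):
        """Writes go to the primary; the replica catches up after a delay."""
        primary[call_id] = disposition

        def replicate():
            time.sleep(REPLICATION_LAG_SECONDS)
            replica[call_id] = primary[call_id]

        threading.Thread(target=replicate, daemon=True).start()

    def refresh_rendered_state(call_id):
        """The app's refresh reads from the replica, not the primary."""
        return replica[call_id]

    write_disposition("call_123", "connected")

    # Immediate refresh: the replica has not caught up yet, so the previous
    # state comes back and the update looks like it failed.
    print("immediately after write:", refresh_rendered_state("call_123"))

    # Once replication catches up, the same read returns the new value.
    time.sleep(REPLICATION_LAG_SECONDS + 0.5)
    print("after replication lag:", refresh_rendered_state("call_123"))

Because the lag was transient, a second attempt at applying the disposition read back the already-replicated value, which is why retries appeared to succeed.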

Short Term Mitigation:

  • The table in question had recently begun experiencing performance problems due to its size, as it contained years of data. We reduced the size of the table to ensure short-term spikes would not cause a similar problem (a rough sketch follows below).
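As a rough illustration of that kind of cleanup (using SQLite so the example is self-contained; the table, columns, and retention cutoff are hypothetical and do not reflect our actual schema), deleting historical rows in small batches keeps each transaction short:

    import sqlite3

    # Self-contained demo; the production system uses a different engine
    # and a much larger table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE call_dispositions ("
        " id INTEGER PRIMARY KEY, call_id TEXT, disposition TEXT, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO call_dispositions (call_id, disposition, created_at)"
        " VALUES (?, ?, ?)",
        [
            ("call_1", "connected", "2019-03-01"),
            ("call_2", "voicemail", "2021-07-15"),
            ("call_3", "connected", "2023-11-20"),
        ],
    )

    CUTOFF = "2023-01-01"  # hypothetical retention boundary
    BATCH_SIZE = 1         # tiny batch size, only to show the loop

    # Delete historical rows in small batches so no single statement
    # holds locks or generates sustained load on the table.
    while True:
        cur = conn.execute(
            "DELETE FROM call_dispositions WHERE id IN ("
            " SELECT id FROM call_dispositions WHERE created_at < ? LIMIT ?)",
            (CUTOFF, BATCH_SIZE),
        )
        conn.commit()
        if cur.rowcount == 0:
            break

    # Only the recent row remains.
    print(conn.execute("SELECT COUNT(*) FROM call_dispositions").fetchone()[0])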

Long Term Mitigation:

  • Our team will add an index to the table in question during the next maintenance window (illustrated in the sketch below).
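The sketch below illustrates the effect of such an index (again with SQLite and hypothetical names rather than our actual schema or tooling): before the index the hot query is planned as a full table scan, and afterwards it becomes an index lookup.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE call_dispositions ("
        " id INTEGER PRIMARY KEY, call_id TEXT, disposition TEXT)"
    )

    query = "SELECT disposition FROM call_dispositions WHERE call_id = ?"

    # Before the index: the planner falls back to scanning the whole table.
    print(conn.execute("EXPLAIN QUERY PLAN " + query, ("call_123",)).fetchall())

    # The long-term mitigation: index the column the hot query filters on.
    # (On an engine such as PostgreSQL this would typically be done with
    # CREATE INDEX CONCURRENTLY during a maintenance window.)
    conn.execute(
        "CREATE INDEX idx_call_dispositions_call_id"
        " ON call_dispositions (call_id)"
    )

    # After the index: the same query is answered via an index search.
    print(conn.execute("EXPLAIN QUERY PLAN " + query, ("call_123",)).fetchall())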
