Spruce Inbox failing to load
Incident Report for Spruce Health
Postmortem

Summary + customer impact

The Spruce system experienced an outage from 6:45am PT to 9:45am PT on January 25, 2023. During this time:

  • The Spruce Inbox failed to load for patients and providers
  • Providers could not place voice or video calls to patients given that the inbox would not load
  • Providers could not send SMS or secure messages to patients
  • Patients could not send secure messages to practices
  • Inbound faxes, SMS messages, and emails arrived in the Spruce inbox with a delay
  • Inbound calls were operational during this time; however, voicemails arrived in the Spruce inbox with a delay
  • Workflow automation was executed, albeit with a delay

The outage was caused by CPU exhaustion on one of the core databases. The engineering team believes the CPU exhaustion was caused by a frequently run, inefficient query, which was optimized as part of the fix. The engineering team will proactively and closely monitor the system for the rest of the week to ensure that there are no signs of database CPU exhaustion during peak hours of the day.

Analysis

The Spruce engineering team reacted immediately to the monitoring alarms that were triggered and began investigating the issue. The issue is believed to have been caused by an inefficient query, executed frequently across the entire customer base, whose load built up over time and finally tipped one of the databases into CPU exhaustion. The engineering team deployed a fix for the inefficient query at around 9am PT.

It took 45 minutes to bring the system back up once the query optimization fix was deployed because the engineering team wanted to ensure that bringing the system back all at once did not cause resource exhaustion in various parts of the system. The team therefore brought components up one at a time while constantly monitoring CPU utilization on the impacted database.
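The one-at-a-time recovery described above can be illustrated with a minimal sketch. It is not the tooling the team actually used: the component names, the `start` and `cpu_percent` callbacks, and the 70% threshold are all hypothetical, chosen only to show the idea of checking database CPU before each step.

```python
from typing import Callable, Iterable, List


def staged_ramp_up(
    components: Iterable[str],
    start: Callable[[str], None],
    cpu_percent: Callable[[], float],
    max_cpu: float = 70.0,  # hypothetical safety threshold, not from the report
) -> List[str]:
    """Start components one at a time, stopping if database CPU is too high.

    `start` and `cpu_percent` are placeholders for whatever deployment and
    monitoring hooks a real system exposes.
    """
    started: List[str] = []
    for name in components:
        # Check database CPU before each step; pause the ramp-up if it is
        # already at or above the threshold.
        if cpu_percent() >= max_cpu:
            break
        start(name)
        started.append(name)
    return started
```

The function returns the components that were actually started, so an operator can see where the ramp-up paused and resume from that point once CPU utilization has settled.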

The backup communications system was activated around 8am PT so that any customer who had registered their contact information received notifications for inbound calls and SMS over email via a secure expiring URL.

Action items

  • Make the backup communications system self-service in the product so that anyone can register for it.
  • Automatically send an SMS in response to an inbound SMS when the backup system is activated. Currently, we only send an automated text if the provider has signed up for notifications from the backup system, rather than for any inbound SMS.
  • Install a global per-account rate limiter at the API layer so that we are protected against client applications constantly retrying and causing a large spike in traffic when the platform is experiencing issues.
  • Gain deeper insight into our asynchronous workers through metrics so that we can look at the overall health of the workers running across the platform and ensure there are no runaway workers causing platform-wide issues.
  • Make it part of the engineering on-call person’s responsibilities to proactively monitor database performance metrics and identify any database statements that need optimizing before they build up over time.
  • Clean up ever-growing tables to ensure that they are not impacting general database performance across key services.
  • Investigate how the AWS Performance Insights API can be leveraged to automatically notify the engineering team of database queries that take too long to execute or scan too many rows.
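
As an illustration of the per-account rate limiter action item, here is a minimal in-memory token-bucket sketch. It is an assumption about one possible design, not the limiter Spruce plans to build: the class names, the default capacity, and the refill rate are hypothetical, and a production limiter would normally keep its state in a shared store rather than in process memory.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # sustained requests per second
    tokens: float          # tokens currently available
    updated_at: float      # clock reading at the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AccountRateLimiter:
    """Keeps one token bucket per account ID (hypothetical design)."""

    def __init__(
        self,
        capacity: float = 20.0,        # hypothetical default
        refill_per_sec: float = 10.0,  # hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = capacity
        self._refill_per_sec = refill_per_sec
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, account_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(account_id)
        if bucket is None:
            # New accounts start with a full bucket.
            bucket = TokenBucket(self._capacity, self._refill_per_sec, self._capacity, now)
            self._buckets[account_id] = bucket
        return bucket.allow(now)
```

An API handler would call `limiter.allow(account_id)` before doing any work and reject the request (for example with an HTTP 429 response) when it returns `False`, so that one account's retry storm cannot consume capacity shared by every other account.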
Posted Jan 25, 2023 - 17:46 PST

Resolved
The Spruce system experienced an outage from 6:45am PT to 9:45am PT on January 25, 2023. During this time:

- The Spruce Inbox failed to load for patients and providers
- Providers could not place voice or video calls to patients, given that the inbox would not load
- Providers could not send SMS or secure messages to patients
- Patients could not send secure messages to practices
- Inbound fax, SMS messages, and email arrived to the Spruce inbox in a delayed fashion
- Inbound calls were operational during this time; however, voicemails arrived to the Spruce inbox in a delayed fashion
- Workflow automation was executed, albeit in a delayed fashion

The outage was caused by CPU exhaustion on one of the core databases. The engineering team believes the CPU exhaustion was the result of a frequently run, inefficient query whose load built up over time and which was optimized as part of the fix for this incident. The engineering team will proactively and closely monitor the system over the next few days to ensure that there is no sign of database CPU exhaustion, especially during peak hours of the day.

We know how important it is for Spruce to be fully operational at all times. Working to build and maintain a medical communications system gives us all immense energy on a daily basis and is not a job we take lightly. We're very sorry for the outage and the impact to practices and patients. We will continue to work hard in pursuit of a highly available and reliable service. If you have any questions at all, please don't hesitate to reach us at support@sprucehealth.com.
Posted Jan 25, 2023 - 17:44 PST
Monitoring
The system should be fully functional at this point. We are continuing to monitor the system. We will post an incident report once we've had a chance to investigate more deeply here.
Posted Jan 25, 2023 - 10:05 PST
Identified
We are starting to see some recovery and are slowly ramping the system back up to full service while watching the impact on the database and on CPU usage in general. We'll keep updating this page as we have more to share.
Posted Jan 25, 2023 - 09:45 PST
Update
We have made a database optimization for a high-frequency lookup. We have intentionally brought down the API layer that clients connect to while we work to ensure that the rest of the system is functional. Once all asynchronous work has been completed, we will turn the API layer back on slowly to ensure that we do not see CPU performance issues again.
Posted Jan 25, 2023 - 09:21 PST
Update
We continue to investigate the issue but unfortunately have not yet found a root cause. We are all hands on deck working to identify the reason for the outage.
Posted Jan 25, 2023 - 08:52 PST
Update
The backup system for notifying providers of incoming SMS, fax, call events, and voicemails has now been activated. Anyone who has registered contact information for our backup system will now be notified over email. You can read more about the backup system here: https://help.sprucehealth.com/article/424-spruce-backup-system
Posted Jan 25, 2023 - 08:20 PST
Update
We have not identified the root cause yet. The inbox continues to be down for most. We are actively investigating this issue.
Posted Jan 25, 2023 - 07:51 PST
Investigating
The Spruce inbox is unable to load at the moment, and consequently patients and providers are unable to view or send messages, send faxes, or make calls. Inbound calls should be working. Inbound faxes are likely delayed.
Posted Jan 25, 2023 - 07:17 PST
This incident affected: Web App, Mobile Apps, Phone Call Routing, SMS Routing, and Video Calling.