Delayed notifications
Incident Report for Spruce Health
Postmortem

Summary

Our cloud infrastructure provider, AWS, reported elevated errors and latency for the Systems Manager Parameter Store service in US-East-1. You can see their report here. This AWS incident prevented AWS Lambda functions from executing, since they could not read parameter values from the Parameter Store. That in turn impacted the Spruce platform as described below.

Context

The Spruce platform uses AWS Lambdas (serverless functions) to process inbound SMS, user-facing app-based push notifications, and badge count updates for smartphones and the web. The benefit of Lambdas is that they automatically scale up in times of high demand and scale back down to a minimum number of serverless functions when demand drops.

The Spruce platform has redundancy designed in so that inbound SMS and app-based push notifications can be processed without AWS Lambdas. If the Lambdas fail to execute, inbound SMS is received by an application API via a fallback webhook from our telecommunications infrastructure provider (Twilio). App-based push notifications are processed by application-level workers that run in the same tasks that service the rest of the platform and listen on the same distributed queues as the AWS Lambdas.

Every application-level task also depends on AWS Parameter Store for its configuration values. Each application instance reads the Parameter Store at startup to pick up its configuration values, and then uses those values for the lifetime of the task, until the next deployment or instance replacement.

Impact

Good news: Given the fallback logic described above, inbound SMS and user-facing app-based push notifications continued to function with no impact from the Lambdas failing. All existing application-level tasks continued to operate normally as well, since they rely on the AWS Parameter Store only at startup and not during normal operation.

Where the impact was felt: There is typically a limited number of application-level workers running to service badge-count updates and user-facing app-based push notifications. While user-facing app-based push notifications were not impacted, badge-count updates were delayed: they are high throughput (they also provide real-time updates to the web), and the limited number of workers available could not keep up with them.

Typically, the Spruce engineering team can easily increase the number of tasks to keep up with the badge-count updates. In this case that was not possible, because new tasks could not start up without reading their configuration values from AWS Parameter Store. The engineering team therefore decided to make no changes and accept the delayed badge-count updates, knowing that the platform was otherwise operating normally.

The user facing impact was as follows:

  • The unread badge count on the Spruce application on smartphones did not update in real time to reflect the right badge count
  • The web app did not update in real time as it typically does, because the web relies on the badge-count updates to refresh its state.

Action Items

We have one concrete action item that will improve the overall platform: reduce dependence on AWS Parameter Store by using an in-memory cache (Redis) as a fallback. Each time an application task starts up, it will write its configuration values to Redis. If a task cannot access AWS Parameter Store, it will fall back to reading the values from Redis. We are prioritizing this change given the immediate impact it can have on the system.

With this change alone, if AWS Parameter Store is impacted in the future, we can continue to bring up as many application-level tasks as we’d like, and AWS Lambdas would continue to function normally as well.
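
As a rough illustration of the startup fallback described in this action item, here is a minimal Python sketch. It is not our production code: the function and variable names (load_config, healthy_store, failing_store, redis_stand_in) are hypothetical, a plain dictionary stands in for Redis, and a simple callable stands in for the AWS Parameter Store read, so that the example runs on its own.

```python
from typing import Callable, Dict, MutableMapping

Config = Dict[str, str]


class ConfigUnavailable(Exception):
    """Raised when neither the primary store nor the cache can supply config."""


def load_config(
    fetch_primary: Callable[[], Config],
    cache: MutableMapping[str, str],
) -> Config:
    """Load configuration at task startup, falling back to the cache on failure."""
    try:
        config = fetch_primary()
    except Exception:
        # The primary store (standing in for Parameter Store) is unreachable:
        # fall back to the values an earlier task startup wrote to the cache.
        if not cache:
            raise ConfigUnavailable("primary store is down and the cache is empty")
        return dict(cache)
    # The primary read succeeded: refresh the cache so later startups can fall back.
    cache.update(config)
    return config


# Hypothetical stand-ins used only for this demonstration.
def healthy_store() -> Config:
    return {"queue_url": "example-queue", "worker_count": "4"}


def failing_store() -> Config:
    raise ConnectionError("parameter store unavailable")


redis_stand_in: Dict[str, str] = {}

# A first task starts while the primary store is healthy and seeds the cache.
first = load_config(healthy_store, redis_stand_in)

# A later task starts during an outage and still receives the same values.
second = load_config(failing_store, redis_stand_in)
```

One consequence worth noting in this sketch: the fallback only helps if at least one task has started successfully since the cache was last emptied, which is why the empty-cache case raises an error instead of returning nothing.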

Posted Sep 09, 2022 - 12:29 PDT

Resolved
We have started seeing recovery as of 10:34am PT. All badge count updates are processing normally and without delay now.

AWS just confirmed (as of 11:27am PT) what we are seeing, that they too are seeing recovery.

We will resolve this incident for now since the system is back to operating normally. If you have any questions or concerns, please don't hesitate to reach out to us via the Spruce Support conversation in the app or at support@sprucehealth.com.
Posted Sep 09, 2022 - 11:34 PDT
Update
We have identified that Spruce app notifications are actually processing normally without a delay. So app notifications and video calling notifications are working just fine. We had misdiagnosed the issue.

It is only badge count updates that are delayed at this point. So if you receive a new Page or if you receive a new message, your badge count will not update. But you will see the app notification on your smartphone.
Posted Sep 09, 2022 - 10:22 PDT
Update
We are continuing to investigate this issue.
Posted Sep 09, 2022 - 10:13 PDT
Investigating
Spruce app push notifications are currently delayed due to an issue with our underlying cloud service provider (AWS). We are continuing to monitor the situation and will post an update as soon as we have one to share.

Video call notifications may also be impacted as a result. So if you are engaging in video calls with your patients, please ask them to have the Spruce app open while waiting for a video call from you.

Inbound/outbound calls, secure message exchanges, email, fax and SMS all continue to operate normally.
Posted Sep 09, 2022 - 08:57 PDT
This incident affected: Video Calling and Spruce app notifications.