Our cloud infrastructure provider, AWS, reported elevated errors and latency for the Systems Manager Parameter Store service in US-East-1. You can see their report here. The AWS incident prevented AWS Lambda functions from executing because they could not read parameter values from Parameter Store. This in turn impacted the Spruce platform as described below.
The Spruce platform uses AWS Lambdas (serverless functions) to process inbound SMS, user-facing app-based push notifications, and badge-count updates for smartphones and the web. The benefit of Lambdas is that they automatically scale up in times of high demand and scale back down to a minimum number of serverless functions when demand drops.
The Spruce platform has redundancy designed in so that inbound SMS and app-based push notifications can be processed without AWS Lambdas. If AWS Lambdas fail to execute, inbound SMS is received by an application API via a fallback webhook from our telecommunications infrastructure provider (Twilio). App-based push notifications are processed by application-level workers that run in the same tasks that service the rest of the platform and listen on the same distributed queues as the AWS Lambdas.
Every application-level task also depends on AWS Parameter Store for its configuration values. Each application instance reads AWS Parameter Store at startup to pick up its configuration values, which are then used for the lifetime of the application task until the next deployment or instance replacement.
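To make the startup dependency concrete, here is a minimal sketch of the read-once pattern described above. It is illustrative only and is not the actual Spruce implementation: `AppConfig`, `start_task`, `fetch_parameters`, and `ParameterStoreUnavailable` are hypothetical names, and the fetcher is passed in as a plain callable standing in for a Parameter Store read.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


class ParameterStoreUnavailable(Exception):
    """Raised by the hypothetical fetcher when the store cannot be reached."""


@dataclass(frozen=True)
class AppConfig:
    """Configuration captured once at startup and never re-read."""

    values: Mapping[str, str]

    def get(self, name: str) -> str:
        return self.values[name]


def start_task(fetch_parameters: Callable[[], Mapping[str, str]]) -> AppConfig:
    """Start a task by reading configuration exactly once.

    If the fetch raises, the exception propagates and the task does not
    start. A task that has already started keeps its copy of the values
    and is unaffected by later outages of the store.
    """
    # Copy into a plain dict so the task no longer depends on the source.
    return AppConfig(values=dict(fetch_parameters()))
```

Under these assumptions, a task started before an outage keeps serving from its in-memory copy, while any call to `start_task` made during the outage fails.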
Good news: Given the fallback logic described above, inbound SMS and user-facing app-based push notifications continued to function without any impact from the Lambda failures. All existing application-level tasks continued to operate normally as well, since they rely on AWS Parameter Store only at startup and not during normal operation.
Where the impact was felt: There is typically a limited number of application-level workers running to service badge-count updates and user-facing app-based push notifications. While there was no impact to user-facing app-based push notifications, badge-count updates were delayed: they are high throughput (they also provide real-time updates to the web), and the limited number of workers available could not keep up with them.
Typically, the Spruce engineering team can easily increase the number of tasks to keep up with badge-count updates. In this case that was not possible, because new tasks could not start up without reading their configuration values from AWS Parameter Store. The engineering team therefore decided to leave the running tasks in place and accept the delayed badge-count updates, knowing that the platform was otherwise operating normally.
The user-facing impact was as follows:
We have one solid action item that will improve the overall platform here: reducing our dependence on AWS Parameter Store by using an in-memory cache (Redis) as a fallback. Each time an application task starts up, it will write its configuration values to Redis. If a task cannot access AWS Parameter Store, it will fall back to reading the values from Redis. We are prioritizing this change given the immediate impact it can have on the system.
With this change alone, if AWS Parameter Store is impacted in the future, we can continue to bring up as many application-level tasks as we’d like, and AWS Lambdas will continue to function normally as well.
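The fallback described above could look roughly like the sketch below. It is written under stated assumptions and is not the planned implementation: `load_config`, `ConfigCache`, `InMemoryCache`, `CACHE_KEY`, and `ParameterStoreUnavailable` are hypothetical names, the Parameter Store read is an injected callable, and a small in-memory class stands in for Redis so the example runs without external services.

```python
import json
from typing import Callable, Dict, Mapping, Optional, Protocol


class ParameterStoreUnavailable(Exception):
    """Raised by the hypothetical fetcher when the store cannot be reached."""


class ConfigCache(Protocol):
    """Minimal cache interface assumed here (a Redis-like get/set)."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryCache:
    """Stand-in for Redis so the sketch is runnable on its own."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


CACHE_KEY = "app-config"  # hypothetical key name


def load_config(
    fetch_parameters: Callable[[], Mapping[str, str]],
    cache: ConfigCache,
) -> Dict[str, str]:
    """Read config from the primary store, falling back to the cache."""
    try:
        values = dict(fetch_parameters())
    except ParameterStoreUnavailable:
        # Primary store is down: use the values a previous startup cached.
        cached = cache.get(CACHE_KEY)
        if cached is None:
            raise  # nothing cached yet, so the task still cannot start
        return json.loads(cached)
    # Primary read succeeded: refresh the cache for future startups.
    cache.set(CACHE_KEY, json.dumps(values))
    return values
```

One design point this sketch makes visible: the fallback only helps once at least one task has started successfully and populated the cache, and a task started from the cache gets the values as of the last successful write rather than the latest ones.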