Spruce inbox failing to load

Incident Report for Spruce Health

Postmortem

Analysis

One instance of a service entered an unhealthy state and became unresponsive, so every request routed to it timed out. It took ~10 minutes for the platform to self-heal by terminating the instance and replacing it with a healthy one.

Each client (a user’s web app or smartphone) sends requests to the API layer, which round-robins them across the instances of the appropriate service. The round-robin logic is currently not advanced enough to filter out unhealthy instances, so requests failed intermittently until the unhealthy instance was replaced.
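
To illustrate the gap described above, here is a minimal sketch in Python of plain round-robin next to a health-aware variant. It is not Spruce's actual routing code: the class name, the instance names "a", "b", "c", and the mark_unhealthy call are all hypothetical, and how an instance would be detected as unhealthy (health checks, timeouts, or otherwise) is left out as an assumption.

```python
from itertools import count


def plain_round_robin(instances, n):
    """Pick n targets in strict rotation, with no regard for instance health."""
    return [instances[i % len(instances)] for i in range(n)]


class HealthAwareRoundRobin:
    """Hypothetical round-robin picker that skips instances marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.unhealthy = set()
        self._counter = count()

    def mark_unhealthy(self, instance):
        # Assumption: some external signal (health check, timeout) calls this.
        self.unhealthy.add(instance)

    def mark_healthy(self, instance):
        self.unhealthy.discard(instance)

    def next_instance(self):
        # Advance the rotation, trying each instance at most once per call.
        for _ in range(len(self.instances)):
            candidate = self.instances[next(self._counter) % len(self.instances)]
            if candidate not in self.unhealthy:
                return candidate
        raise RuntimeError("no healthy instances available")


instances = ["a", "b", "c"]

# Plain rotation keeps sending every third request to "b", even if it is down.
plain_picks = plain_round_robin(instances, 6)

# The health-aware picker skips "b" once it has been marked unhealthy.
balancer = HealthAwareRoundRobin(instances)
balancer.mark_unhealthy("b")
aware_picks = [balancer.next_instance() for _ in range(4)]
```

With three instances and one of them unresponsive, the plain rotation sends one request in three to the bad instance, which matches the intermittent failures described above. The health-aware variant only helps if unhealthy instances are detected quickly, which is the part this sketch does not cover.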

Action items

Explore improving client-side load-balancing logic or using a service mesh to improve observability and reliability of inter-service communication.

Posted Jun 02, 2022 - 10:15 PDT

Resolved

The Spruce Inbox failed to load intermittently for many customers between 11:50am PT and 12:03pm PT. This was due to one instance of a service reaching an unhealthy state and becoming unresponsive. Any request from the client to that unhealthy instance resulted in a timeout, which the user experienced as a failed load of the inbox or of a particular conversation.

Inbound/outbound calling and inbound SMS, email, fax, and secure messages were not impacted during this time.

Users may have been unable to send a message from the Spruce app if their request hit the unhealthy instance, but a subsequent retry likely resolved the issue.

Posted Jun 01, 2022 - 12:15 PDT

Update

The system recovered automatically after identifying the unhealthy task and replacing it with a healthy one.

Posted Jun 01, 2022 - 12:03 PDT

Investigating

Spruce inbox is intermittently failing to load for users. Our engineering team is on it and investigating.

Posted Jun 01, 2022 - 11:50 PDT

This incident affected: Web App, Mobile Apps, and Video Calling.