Summary + customer impact
The Spruce system experienced degraded performance from 10:15am PT to 6:50pm PT on March 15, 2023. During this time:
- Searching for contacts or conversations resulted in errors, slowness, or stale results
- Searching contacts to create conversations resulted in errors or slowness
- Bulk messages were delayed, either remaining stuck in the Processing state or taking longer than usual to complete
- Opening contact lists may have returned errors or been slow to return results
- Contact exports were delayed, either remaining stuck in the Processing state or taking longer than usual to complete
The degraded performance was caused by inefficient data distribution across the search cluster: one of our data nodes came under heavy load and was unable to process new indexing and search operations.
Analysis
The Spruce engineering team immediately responded to the monitoring alarms that were triggered and began investigating the issue. The issue was caused by one of the data nodes storing a significantly larger amount of data than the other data nodes in the cluster, which made that node slower to process requests. Requests piled up over time, putting the node under increasingly heavy load, until its request queue was exhausted and new requests were rejected.
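The report does not name the underlying search technology. Assuming an Elasticsearch- or OpenSearch-compatible cluster, a minimal sketch of how the symptoms described above (uneven data distribution and an exhausted request queue) might be inspected could look like the following; the cluster URL and output handling are illustrative assumptions, not Spruce's actual setup:

```python
# Sketch only: assumes an Elasticsearch/OpenSearch-compatible cluster reachable
# at SEARCH_URL; the URL is a hypothetical placeholder.
import requests

SEARCH_URL = "http://localhost:9200"  # hypothetical cluster endpoint

# Disk and shard usage per data node; a large imbalance here would match the
# "one node storing significantly more data" symptom described above.
allocation = requests.get(f"{SEARCH_URL}/_cat/allocation?v&bytes=gb").text
print(allocation)

# Per-node thread pool stats for the write and search pools; growing "queue"
# and non-zero "rejected" counts correspond to the exhausted request queue and
# rejected requests described above.
thread_pools = requests.get(
    f"{SEARCH_URL}/_cat/thread_pool/write,search"
    "?v&h=node_name,name,active,queue,rejected"
).text
print(thread_pools)
```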
The engineering team reviewed and analyzed the cluster configuration and performance metrics in detail, and took the following steps to resolve the issue:
- Adjusted the data distribution strategy so that data is allocated evenly across the data nodes in the search cluster. The old strategy had become inefficient as the amount of stored data grew significantly over time, and it could no longer support the demands of indexing and search operations.
- Increased the number of data nodes and reallocated the data evenly across them. This was done in the background with minimal impact on the search and indexing of new data. With the new configuration, stored data is uniformly distributed across all nodes (a sketch of the kind of settings involved follows this list).
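As a hedged illustration of the kind of configuration such a rebalancing can involve (again assuming an Elasticsearch- or OpenSearch-compatible cluster; the index name, shard limit, and URL below are hypothetical, not the values Spruce used):

```python
# Sketch only: illustrates allocation settings that spread shards more evenly
# across data nodes. All names and values here are illustrative assumptions.
import requests

SEARCH_URL = "http://localhost:9200"  # hypothetical cluster endpoint

# Cap how many shards of a given index may live on a single data node, so a
# large index cannot concentrate on one node.
requests.put(
    f"{SEARCH_URL}/contacts_index/_settings",  # hypothetical index name
    json={"index": {"routing.allocation.total_shards_per_node": 2}},
)

# Ensure automatic shard rebalancing is enabled so data spreads onto newly
# added data nodes in the background.
requests.put(
    f"{SEARCH_URL}/_cluster/settings",
    json={"persistent": {"cluster.routing.rebalance.enable": "all"}},
)
```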
During the degraded performance, the other data nodes in the cluster remained fully operational, and any requests routed to them were processed successfully. All queued indexing requests were processed successfully after the issue was resolved.
Action items
- Create additional monitoring alerts on search and indexing operation latency, so that potential issues can be detected early (a sketch of one possible latency check follows this list).
- Review the cluster configuration and performance metrics in detail every 3-6 months.
- Improve the clean-up strategy for unused data to reduce space usage.
- Create an internal strategy with clearly defined steps for quickly troubleshooting and resolving issues like this one, to minimize the impact on clients.
- Increase the engineering team's general knowledge of the search cluster and its configuration.
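As a rough sketch of the latency monitoring mentioned in the first action item (assuming an Elasticsearch- or OpenSearch-compatible cluster; the endpoint, threshold, and averaging approach are illustrative assumptions, and a production alert would compare deltas between samples rather than lifetime totals):

```python
# Sketch only: derives average search and indexing latency per node from node
# stats as a basis for a latency alert. Values and URL are hypothetical.
import requests

SEARCH_URL = "http://localhost:9200"   # hypothetical cluster endpoint
LATENCY_THRESHOLD_MS = 200             # hypothetical alerting threshold

stats = requests.get(f"{SEARCH_URL}/_nodes/stats/indices/search,indexing").json()

for node in stats["nodes"].values():
    search = node["indices"]["search"]
    indexing = node["indices"]["indexing"]
    # Average time per operation since node start; a real alert would use
    # per-interval deltas instead of lifetime totals.
    search_avg = search["query_time_in_millis"] / max(search["query_total"], 1)
    index_avg = indexing["index_time_in_millis"] / max(indexing["index_total"], 1)
    if search_avg > LATENCY_THRESHOLD_MS or index_avg > LATENCY_THRESHOLD_MS:
        print(
            f"ALERT {node['name']}: "
            f"search {search_avg:.1f} ms, indexing {index_avg:.1f} ms"
        )
```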