Summary + customer impact
The Spruce system experienced degraded performance from 10:15am PT to 6:50pm PT on March 15, 2023. During this time:
- Searching for contacts or conversations resulted in errors, slowness, or stale results
- Searching contacts to create conversations resulted in errors or slowness
- Bulk messages were delayed, either remaining stuck in the Processing state or taking longer than usual to complete
- Opening contact lists may have returned errors or been slow to return results
- Contact exports were delayed, either remaining stuck in the Processing state or taking longer than usual to complete
The degraded performance was caused by inefficient data distribution across the search cluster: one of our data nodes came under heavy load and was unable to process new indexing and search operations.
Analysis
The Spruce engineering team immediately responded to the monitoring alarms that were triggered and began investigating the issue. The issue was caused by one of the data nodes storing a significantly larger amount of data than the other data nodes in the cluster, which made that node slower to process requests. Requests piled up over time, putting the node under increasingly heavy load, until its request queue was exhausted and new requests were rejected.
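The report does not name the underlying search technology. Assuming an Elasticsearch- or OpenSearch-compatible cluster, a minimal sketch of how the symptoms described above (uneven data distribution and an exhausted request queue) might be inspected could look like the following; the cluster URL and output handling are illustrative assumptions, not Spruce's actual setup:

```python
# Sketch only: assumes an Elasticsearch/OpenSearch-compatible cluster reachable
# at SEARCH_URL; the URL is a hypothetical placeholder.
import requests

SEARCH_URL = "http://localhost:9200"  # hypothetical cluster endpoint

# Disk and shard usage per data node; a large imbalance here would match the
# "one node storing significantly more data" symptom described above.
allocation = requests.get(f"{SEARCH_URL}/_cat/allocation?v&bytes=gb").text
print(allocation)

# Per-node thread pool stats for the write and search pools; growing "queue"
# and non-zero "rejected" counts correspond to the exhausted request queue and
# rejected requests described above.
thread_pools = requests.get(
    f"{SEARCH_URL}/_cat/thread_pool/write,search"
    "?v&h=node_name,name,active,queue,rejected"
).text
print(thread_pools)
```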
The engineering team reviewed and analyzed the cluster configuration and performance metrics in detail, and took the following steps to resolve the issue:
- Adjusted the data distribution strategy so that data is allocated evenly across the data nodes in the search cluster. The old strategy had become inefficient as the amount of stored data grew significantly over time, and it could no longer support the demands of indexing and search operations.
- Increased the number of data nodes and reallocated the data evenly across them. This was done in the background with minimal impact on the search and indexing of new data. With the new configuration, stored data is uniformly distributed across all nodes (a sketch of the kind of settings involved follows this list).
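As a hedged illustration of the kind of configuration such a rebalancing can involve (again assuming an Elasticsearch- or OpenSearch-compatible cluster; the index name, shard limit, and URL below are hypothetical, not the values Spruce used):

```python
# Sketch only: illustrates allocation settings that spread shards more evenly
# across data nodes. All names and values here are illustrative assumptions.
import requests

SEARCH_URL = "http://localhost:9200"  # hypothetical cluster endpoint

# Cap how many shards of a given index may live on a single data node, so a
# large index cannot concentrate on one node.
requests.put(
    f"{SEARCH_URL}/contacts_index/_settings",  # hypothetical index name
    json={"index": {"routing.allocation.total_shards_per_node": 2}},
)

# Ensure automatic shard rebalancing is enabled so data spreads onto newly
# added data nodes in the background.
requests.put(
    f"{SEARCH_URL}/_cluster/settings",
    json={"persistent": {"cluster.routing.rebalance.enable": "all"}},
)
```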
During the degraded performance, the other data nodes in the cluster remained fully operational, and any requests routed to them were processed successfully. All queued indexing requests were processed successfully after the issue was resolved.
Action items
- Create additional monitoring alerts on search and indexing operation latency, so that potential issues can be detected early (a sketch of one possible latency check follows this list).
- Review the cluster configuration and performance metrics in detail every 3-6 months.
- Improve the clean-up strategy for unused data to reduce space usage.
- Create an internal strategy with clearly defined steps for quickly troubleshooting and resolving issues like this one, to minimize the impact on clients.
- Increase the engineering team's general knowledge of the search cluster and its configuration.
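As a rough sketch of the latency monitoring mentioned in the first action item (assuming an Elasticsearch- or OpenSearch-compatible cluster; the endpoint, threshold, and averaging approach are illustrative assumptions, and a production alert would compare deltas between samples rather than lifetime totals):

```python
# Sketch only: derives average search and indexing latency per node from node
# stats as a basis for a latency alert. Values and URL are hypothetical.
import requests

SEARCH_URL = "http://localhost:9200"   # hypothetical cluster endpoint
LATENCY_THRESHOLD_MS = 200             # hypothetical alerting threshold

stats = requests.get(f"{SEARCH_URL}/_nodes/stats/indices/search,indexing").json()

for node in stats["nodes"].values():
    search = node["indices"]["search"]
    indexing = node["indices"]["indexing"]
    # Average time per operation since node start; a real alert would use
    # per-interval deltas instead of lifetime totals.
    search_avg = search["query_time_in_millis"] / max(search["query_total"], 1)
    index_avg = indexing["index_time_in_millis"] / max(indexing["index_total"], 1)
    if search_avg > LATENCY_THRESHOLD_MS or index_avg > LATENCY_THRESHOLD_MS:
        print(
            f"ALERT {node['name']}: "
            f"search {search_avg:.1f} ms, indexing {index_avg:.1f} ms"
        )
```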