Resolved -
Based on our active monitoring over the last 10 hours, the system is fully operational.
Summary:
From 10:15am PT March 15 to 6:50pm PT March 15, the following actions on Spruce experienced degraded performance:
- Searching for conversations and contacts either took a long time or failed
- Contact filters frequently failed to load when clicked into
- When starting a new conversation, contact suggestions either took a long time to load or failed to load, making it challenging to start new conversations
- Bulk actions (messaging, tagging, deleting) took longer than expected to complete, but eventually completed
- Contacts, conversations and messages created during this time period were not searchable. These new items became searchable from 6:50pm PT onwards
- Successful searches for contacts and conversations may have brought up stale results, where an update to a contact or conversation was not reflected in search results. The search results were eventually refreshed to reflect the latest versions of the updated items
There was no impact to calls, SMS, Fax, Secure Messaging or Email during this time.
We will post a postmortem to the incident soon.
Mar 16, 15:23 PDT
Update -
The redistribution of data in the cluster is still in progress (note that this happens in the background with minimal impact to searching and indexing of new data). We have been closely monitoring the situation throughout the night. We also increased the capacity of the cluster to accommodate the redistribution of data and to ensure that we are in better shape for today. We have ~20% of the redistribution remaining, which we believe will bring a long-standing improvement to overall performance.
The metrics so far look healthy, with no signs of poor performance or an increased error rate.
We will report back here once the redistribution completes or if we see any signs pointing to degraded performance.
Mar 16, 07:58 PDT
Update -
Indexing of data has now caught up, such that successful searches for any contacts, conversations and messages will bring up up-to-date results. We are continuing to work on better distributing the data across the cluster.
We are not currently experiencing poor performance or intermittent errors. This is likely due to the decreased overall traffic in the system given the time of day. That being said, we continue to work on reducing the likelihood of this problem continuing into the business day tomorrow.
Mar 15, 19:11 PDT
Update -
We have identified a potential cause for the intermittent failures with the search cluster. We are going to work towards better distributing the data across the cluster so as to increase overall performance and reduce the error rate.
To recap, due to the errors throughout the day:
- Searching for conversations, messages or contacts may have failed
- Bulk messages may have taken longer to complete than usual
- Newly created contacts, conversations and messages may not have shown up when searching
- Updates to contacts may not have been searchable
We will continue to work through the evening to reindex the data so as to better distribute it across the cluster, and we will keep this incident up to date as we make progress.
We're really sorry for the inconvenience this is causing to your workflows.
Mar 15, 18:16 PDT
Update -
We continue to investigate this issue. Note that some bulk message operations may take a long time to complete or may get stuck in a particular state, because bulk message operations face similar errors when querying contact lists.
Mar 15, 15:31 PDT
Update -
We continue to work on reducing the intermittent errors that occur while searching for contacts or loading contact lists. Note that there is no impact to phone calls, SMS routing, loading of inbox, exchanging secure messages, or video calling. Bulk messages will continue to send during this time, albeit in a delayed fashion given that bulk messages work off of contact lists.
We will update here as we make progress on the performance issue.
Mar 15, 13:29 PDT
Update -
We are continuing to work on a fix for this issue.
Mar 15, 12:31 PDT
Update -
We are continuing to work on a fix for this issue.
Mar 15, 12:27 PDT
Identified -
We are investigating an issue with contact and conversation search-related activities.
Mar 15, 10:15 PDT