Analysis
One instance of a service entered an unhealthy state and became unresponsive, so every request routed to it failed with a timeout. The platform took ~10 minutes to self-heal: it terminated the instance and replaced it with a healthy one.
Each client (a user's web app or smartphone) sends requests to the API layer, which round-robins them across the instances of the appropriate service. The round-robin logic does not currently filter out unhealthy instances, so the unresponsive instance kept receiving its share of traffic and requests failed intermittently until it was replaced.
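To illustrate the gap, here is a minimal sketch of a round-robin selector that skips instances marked unhealthy. It is not the API layer's actual code: the class name `HealthAwareRoundRobin`, the methods `mark` and `pick`, and the instance names are all hypothetical, and it assumes some external health check (not shown) reports instance status.

```python
from typing import Dict, List


class HealthAwareRoundRobin:
    """Hypothetical selector: round-robin that skips unhealthy instances."""

    def __init__(self, instances: List[str]) -> None:
        self._instances = list(instances)
        # Assumption: every instance starts healthy until a check says otherwise.
        self._healthy: Dict[str, bool] = {name: True for name in self._instances}
        self._next = 0  # index of the next instance to consider

    def mark(self, name: str, healthy: bool) -> None:
        """Record the latest health-check result for an instance."""
        self._healthy[name] = healthy

    def pick(self) -> str:
        """Return the next healthy instance, visiting each at most once."""
        n = len(self._instances)
        for _ in range(n):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % n
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy instances available")


if __name__ == "__main__":
    lb = HealthAwareRoundRobin(["svc-1", "svc-2", "svc-3"])
    lb.mark("svc-2", False)  # simulate the unresponsive instance
    print([lb.pick() for _ in range(4)])
```

With plain round-robin over three instances, roughly one request in three would reach the unresponsive instance and time out. With the filter above, traffic goes only to the remaining healthy instances while the platform replaces the bad one.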
Action items