From 8:55 AM PST through 9:08 AM PST, Files.com customers experienced much slower than normal response times from our core API, which in turn affected downstream services. Most API requests still completed successfully, albeit more slowly than normal.
The root cause of the slow response times was that one of our databases was running at dramatically higher load than normal. Upon investigation, Files.com determined that database queries were performing many orders of magnitude slower than intended due to a misconfigured index in that database. This misconfiguration, it turns out, had existed for over a decade, but this particular query pattern had never been seen in production before.
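To illustrate the failure mode, the minimal sketch below uses SQLite and an invented jobs table (not our actual schema, index names, or database engine) to show how a query pattern that no index covers forces a full table scan, while an index that matches the pattern lets the planner do an index search instead.

```python
import sqlite3

# Invented toy schema for illustration only -- not our actual tables or indexes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, site_id INTEGER, status TEXT, created_at TEXT)"
)

# A "misconfigured" index: it exists, but not on the columns the new query pattern filters by.
conn.execute("CREATE INDEX idx_jobs_created_at ON jobs (created_at)")

query = "SELECT * FROM jobs WHERE site_id = 1 AND status = 'pending'"

# With no matching index, the planner falls back to scanning the entire table.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # detail column reads roughly: SCAN jobs

# An index that matches the query pattern lets the planner search it directly.
conn.execute("CREATE INDEX idx_jobs_site_status ON jobs (site_id, status)")
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # detail column reads roughly: SEARCH jobs USING INDEX idx_jobs_site_status
```

On a large table, the difference between the scan and the index search is exactly the "orders of magnitude" gap described above.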
Files.com reacted immediately by first disabling the problematic jobs that were generating the unoptimized queries, which returned the system to normal performance. Files.com then fixed the database index configuration and re-enabled those jobs, which ran quickly to completion with no further impact on system performance.
As part of our incident post-mortem process, we identified and remedied two deficiencies that contributed to this incident taking 13 minutes to resolve. First, there was a 5-minute delay in importing the relevant time series data from one of our monitoring systems (Amazon CloudWatch) into another (InfluxDB), the latter of which triggers our internal alerting. We have made configuration changes to remedy this delay. Second, a poorly configured alert threshold introduced an additional 6 minutes of delay before an on-call engineer was paged. We have made configuration changes to remove this delay as well, ensuring that an on-call engineer will be paged immediately in any similar situation in the future.
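For readers curious about the shape of that pipeline, the sketch below shows one way such an importer can be run on a tighter schedule. It is a simplified illustration only: the metric names, dimensions, bucket, credentials, and polling interval are placeholders, not our production configuration or our actual fix.

```python
import time
from datetime import datetime, timedelta, timezone

import boto3
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

POLL_INTERVAL_SECONDS = 60  # placeholder: poll frequently so imported data lags by seconds, not minutes

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
influx = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="example-org")
write_api = influx.write_api(write_options=SYNCHRONOUS)


def import_recent_datapoints():
    """Copy the last few minutes of a CloudWatch metric into InfluxDB."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",                 # placeholder namespace and metric
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-db"}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    for datapoint in stats["Datapoints"]:
        point = (
            Point("rds_cpu")
            .tag("instance", "example-db")
            .field("average", float(datapoint["Average"]))
            .time(datapoint["Timestamp"])
        )
        write_api.write(bucket="metrics", record=point)


while True:
    import_recent_datapoints()
    time.sleep(POLL_INTERVAL_SECONDS)
```

Alerting rules evaluated against InfluxDB can only fire once the data lands there, which is why tightening this import window directly shortens time to page.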
In addition, as part of the post-mortem process, we implemented much stricter controls to detect and reject slow queries at the database itself. We conducted a simulated recreation of the incident in our staging environment and confirmed that the new controls are sufficient to prevent a recurrence.
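As an illustration only (we are not describing our actual database engine, settings, or thresholds), a PostgreSQL-style version of such a control might look like the following: a hard cap on statement runtime enforced by the database, plus logging of anything slower than a chosen threshold so slow queries are detected as well as rejected.

```python
import psycopg2
from psycopg2 import errors

# Sketch with placeholder connection details; assumes a PostgreSQL-style statement_timeout.
conn = psycopg2.connect("dbname=app user=app host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    # Reject any statement in this session that runs longer than 2 seconds.
    cur.execute("SET statement_timeout = '2s'")

    # Log every statement slower than 500 ms so slow queries are visible, not just cancelled.
    cur.execute("ALTER SYSTEM SET log_min_duration_statement = '500ms'")
    cur.execute("SELECT pg_reload_conf()")

    # Demonstration: a deliberately slow statement is cancelled by the database itself.
    try:
        cur.execute("SELECT pg_sleep(10)")
    except errors.QueryCanceled:
        print("slow query rejected by statement_timeout")
```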
Finally, after reviewing this incident, we built a new tool for our on-call engineers that provides a much faster, one-click action to quarantine a job type once it has been flagged as problematic. This will improve our ability to react quickly to newly discovered performance deficiencies in the future. We will begin training on-call engineers on this tool in our next recurrent training cycle.
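The sketch below is a rough illustration of that workflow, not the tool itself; the job names and the Redis-backed quarantine set are invented for the example. The idea is a single action that flags a job type so workers stop picking it up until it is explicitly released.

```python
import redis

# Hypothetical quarantine store -- the real tool and its backing store are internal to Files.com.
r = redis.Redis(host="localhost", port=6379)

QUARANTINE_KEY = "quarantined_job_types"


def quarantine_job_type(job_type: str) -> None:
    """One action for the on-call engineer: stop workers from picking up this job type."""
    r.sadd(QUARANTINE_KEY, job_type)


def release_job_type(job_type: str) -> None:
    """Re-enable the job type once the underlying issue (e.g., a missing index) is fixed."""
    r.srem(QUARANTINE_KEY, job_type)


def should_run(job_type: str) -> bool:
    """Checked by every worker before it dequeues a job of this type."""
    return not r.sismember(QUARANTINE_KEY, job_type)


# Example: quarantine the job family generating the unoptimized queries,
# then release it after the fix is deployed.
quarantine_job_type("example_report_backfill")
assert not should_run("example_report_backfill")
release_job_type("example_report_backfill")
```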
We greatly appreciate your patience and understanding as we resolved this issue.