On May 2nd, 2023, at 12:40 PM PDT, Files.com received automated alerts of elevated error rates on web services, which resulted in an incident being declared. The Incident Management Team (IMT) convened and immediately began investigating.
Files.com released an initial Status Page posting on May 2nd, 2023, at 1:11 PM PDT stating:
“US Region Only: Web Service Elevated Error Rates: US Web services only: We are investigating elevated error rates on the web service on Files.com in the US region. This is causing preview delays in the web interface. This incident does not impact other network services such as API, FTP, WebDAV, AS2, and others, nor does it impact regions other than US. At this time, we believe that all network services are currently up in our other regional locations.”
The incident was resolved on May 2nd, 2023, at 1:04 PM PDT, returning the platform to full functionality.
Files.com released a resolution Status Page posting on May 2nd, 2023, at 1:18 PM PDT stating:
“All services have been restored and are operating normally. All web services should be operating as normal. The issue with preview processing began at 12:35 PDT and was resolved completely by 1:04 PDT.”
This incident began when a deadlock occurred in one of Files.com’s backend job processing systems, specifically the system that generates image and PDF previews of large images and documents for web viewing. A recent code change caused the system to lock up and stop processing preview generation on 1 of 6 backend servers.
As a result of “backflow” caused by very high error rates, other jobs such as syncs were delayed by 5 minutes on two separate occasions.
The root cause of this incident was a failure of Files.com’s internal job scheduling system to properly route around the failed preview worker and prevent its failure from causing broader impact. Ultimately this was caused by a design failure in the internal job scheduling system, which we have now redesigned to avoid this type of issue. (See the improvements described below.)
A contributing cause was the failure of the preview worker itself, which was caused by Files.com’s failure to properly test the recent code change under high load.
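For readers who want a concrete picture of what "routing around" a stalled worker can mean, the following is a minimal, hypothetical sketch and not Files.com's actual scheduler. It assumes each worker periodically reports a heartbeat timestamp; the names `Worker`, `healthy_workers`, `pick_worker`, and the 30-second timeout are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    last_heartbeat: float  # time (in seconds) of the most recent progress report


def healthy_workers(workers, now, timeout=30.0):
    """Return the workers whose last heartbeat is at most `timeout` seconds old."""
    return [w for w in workers if now - w.last_heartbeat <= timeout]


def pick_worker(workers, job_id, now, timeout=30.0):
    """Choose a worker for a job, skipping any worker that appears stalled."""
    candidates = healthy_workers(workers, now, timeout)
    if not candidates:
        raise RuntimeError("no healthy workers available")
    # Spread jobs across the remaining healthy workers by job id.
    return candidates[job_id % len(candidates)]


# Hypothetical fleet of six workers; worker-3 stopped reporting long ago.
now = 1000.0
fleet = [Worker(f"worker-{i}", last_heartbeat=now - 5.0) for i in range(6)]
fleet[3].last_heartbeat = now - 600.0

assigned = {pick_worker(fleet, job_id, now).name for job_id in range(100)}
```

In this sketch a deadlocked worker stops updating its heartbeat, so new jobs are assigned only to the five workers that are still making progress.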
As a result of this incident and several other recent incidents, Files.com worked on dramatic improvements to its internal job scheduling code during the last week of April and first week of May, and those improvements have been tested in staging and are now in production.
These improvements provide multiple new protection mechanisms to prevent issues with specific customers, job types, or regions from “backflowing” and impacting other customers, job types, or regions.
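One common way to keep a single job type from "backflowing" into others is to give each job type its own concurrency budget (often called a bulkhead). The sketch below illustrates that general idea only; it is an assumption about the kind of mechanism involved, not a description of Files.com's code, and the class name `BulkheadScheduler`, the job type names, and the limits are hypothetical.

```python
from collections import defaultdict


class BulkheadScheduler:
    """Admit jobs only while their job type is under its own concurrency limit."""

    def __init__(self, limits, default_limit=1):
        self.limits = dict(limits)          # job type -> max concurrent jobs
        self.default_limit = default_limit  # used for job types not listed
        self.running = defaultdict(int)     # job type -> jobs currently running

    def try_start(self, job_type):
        """Return True and reserve a slot if the job type has capacity."""
        limit = self.limits.get(job_type, self.default_limit)
        if self.running[job_type] >= limit:
            return False  # this type is saturated; other types are unaffected
        self.running[job_type] += 1
        return True

    def finish(self, job_type):
        """Release a slot previously reserved by try_start."""
        if self.running[job_type] > 0:
            self.running[job_type] -= 1


# Hypothetical limits: previews may use 2 slots, syncs may use 4.
scheduler = BulkheadScheduler({"preview": 2, "sync": 4})

# Two stuck preview jobs consume the entire preview budget...
preview_results = [scheduler.try_start("preview") for _ in range(3)]

# ...but sync jobs still have their own, separate capacity.
sync_started = scheduler.try_start("sync")
```

Because each job type draws from a separate budget, stuck preview jobs can exhaust only the preview slots, and sync jobs continue to be admitted.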
Files.com staff conducted extensive review and testing to verify this resolution, and we have already taken steps internally to prevent this issue from recurring.
We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.