On May 1st, 2023, at 7:39 AM PST, Files.com received automated alerts of delays and errors related to certain background processing in the Canada region, which resulted in an incident being declared. The Incident Management Team (IMT) convened and immediately began investigation.
Files.com released an initial Status Page posting on May 1st, 2023, at 8:10 AM PST, stating:
“Canada Region Only: Delays and Errors Related to Certain Background Processing: Canada only: We are investigating elevated error rates related to certain background processing performed as part of the core Files.com file transfer pipeline in the [REGION] region. Impacted functions of Files.com include file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5. This situation should not impact customers at all unless they have files stored in the Canada region. This situation should not affect real-time operations such as the Files.com API, FTP, SFTP, AS2, and other operations where Files.com acts as a server.”
The delays and errors related to certain background processing in the Canada region were resolved on May 1st, 2023, at 11:46 AM PST, returning the platform to full functionality.
Files.com released a resolution Status Page posting on May 1st, 2023, at 12:18 PM PST stating:
“All services have been restored and are operating normally. Canada only: We have resolved an issue with certain background processing performed as part of the core Files.com file transfer pipeline in all regions. Impacted functions of Files.com included file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5. The issue with background processing began at 6:46 AM PST and was resolved completely by 11:46 AM PST. Resolution means that any background jobs that were previously delayed were now been processed successfully.”
This incident started when a customer uploaded a large amount of data via our web interface to our Canada region.
Due to the exact nature of the files uploaded, our Canada region’s worker servers became overloaded and unresponsive to any type of communication. As a result, all regional background jobs in Canada began failing for customers using our Canada region.
Upon investigation, we determined that the overload was caused by a design flaw in our checksum calculation code, which attempted to use only a single CPU core instead of all available CPU cores on the machine. The machine locked up because dozens of jobs were competing for the same core rather than spreading out across all available cores.
As part of the incident resolution, Files.com pushed an update that introduces more parallelism to this calculation and allows all available CPU cores to be used. Additionally, one CPU core is now reserved for communication with our job scheduling system, which will prevent these communication problems in high-load situations in the future.
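For readers who want a concrete picture of this kind of fix, the following is a minimal illustrative sketch in Python, not Files.com's actual code. It spreads checksum work across a pool of workers sized to leave one core's worth of headroom free. The names `worker_count`, `md5_of_file`, and `checksum_files` are hypothetical, and capping the pool size only approximates a true core reservation (it does not pin any work to specific cores). Threads are sufficient here because Python's `hashlib` releases the interpreter lock while hashing large buffers.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # read files in 1 MiB chunks to bound memory use


def worker_count(reserved_cores: int = 1) -> int:
    """Number of workers to run, leaving `reserved_cores` of headroom."""
    total = os.cpu_count() or 1
    return max(1, total - reserved_cores)


def md5_of_file(path: str) -> str:
    """Compute the MD5 hex digest of one file, streaming it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_files(paths: list[str], reserved_cores: int = 1) -> dict[str, str]:
    """Hash many files concurrently instead of queuing them on one core."""
    with ThreadPoolExecutor(max_workers=worker_count(reserved_cores)) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(md5_of_file, paths)))
```

Under these assumptions, calling `checksum_files(list_of_paths)` on an eight-core machine would run at most seven hashes at once, leaving capacity for other duties such as talking to a job scheduler.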
As a result of “back pressure” caused by very high error rates, other jobs on Files.com’s background job scheduling system outside of the Canada region were also impacted with delays.
The root cause of the broader delays (non-Canada) was a failure of Files.com’s internal job scheduling system to properly route around the failed Canada workers and prevent their failure from causing broader impact. Ultimately, this was caused by a design failure in our internal job scheduling system, which we have now redesigned to avoid this type of issue. (See next paragraph.)
As a result of this incident and several other recent incidents, Files.com worked on dramatic improvements to its internal job scheduling code during the last week of April and first week of May, and those improvements have been tested in staging and are now in production.
These improvements provide multiple new protection mechanisms to prevent issues with specific customers, job types, or regions from “backflowing” and impacting other customers, job types, or regions.
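One common way to achieve this kind of isolation is to give each region (or customer, or job type) its own queue and to stop dispatching to a partition after repeated failures, so that a failing partition cannot consume the capacity shared by the others. The sketch below is a simplified illustration of that general pattern and is not a description of Files.com's scheduler. The class `PartitionedDispatcher`, its methods, and the `failure_threshold` parameter are all hypothetical, and a production design would also re-probe a tripped partition after a cooldown period.

```python
from collections import defaultdict, deque


class PartitionedDispatcher:
    """Per-partition queues with a simple failure cutoff (hypothetical design)."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.queues = defaultdict(deque)
        self.consecutive_failures = defaultdict(int)

    def submit(self, partition: str, job) -> None:
        self.queues[partition].append(job)

    def is_open(self, partition: str) -> bool:
        """True when the partition has failed too often and is being skipped."""
        return self.consecutive_failures[partition] >= self.failure_threshold

    def run_once(self, execute) -> list:
        """One pass over all partitions: at most one job per healthy partition."""
        completed = []
        for partition, queue in list(self.queues.items()):
            if not queue or self.is_open(partition):
                continue  # skip empty or tripped partitions; others proceed
            job = queue[0]
            try:
                execute(partition, job)
            except Exception:
                # Count the failure and leave the job queued for a later retry.
                self.consecutive_failures[partition] += 1
                continue
            queue.popleft()
            self.consecutive_failures[partition] = 0
            completed.append((partition, job))
        return completed
```

In this sketch, if every job in a `"canada"` partition raises an error, that partition is skipped once it reaches the threshold, while jobs queued under other partitions keep completing on every pass.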
Extensive review and testing were conducted by Files.com staff to verify this resolution, and we have already taken steps internally to prevent this issue from recurring.
We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.