Canada Region Only: Delays and Errors Related to Certain Background Processing

Incident Report for Files.com

Postmortem

On May 1st, 2023, at 7:39 AM PDT, Files.com received automated alerts of delays and errors related to certain background processing in the Canada region, which resulted in an incident being declared. The Incident Management Team (IMT) convened and immediately began investigating.

Files.com released an initial Status Page posting on May 1st, 2023, at 8:10 AM PDT, stating:

“Canada Region Only: Delays and Errors Related to Certain Background Processing: Canada only: We are investigating elevated error rates related to certain background processing performed as part of the core Files.com file transfer pipeline in the Canada region. Impacted functions of Files.com include file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5. This situation should not impact customers at all unless they have files stored in the Canada region. This situation should not affect real-time operations such as the Files.com API, FTP, SFTP, AS2, and other operations where Files.com acts as a server.”

The delays and errors related to certain background processing in the Canada region were resolved on May 1st, 2023, at 11:46 AM PDT, returning the platform to full functionality.

Files.com released a resolution Status Page posting on May 1st, 2023, at 12:18 PM PDT, stating:

“All services have been restored and are operating normally. Canada only: We have resolved an issue with certain background processing performed as part of the core Files.com file transfer pipeline in Canada. Impacted functions of Files.com included file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5. The issue with background processing began at 6:46 AM PST and was resolved completely by 11:46 AM PST. Resolution means that any background jobs that were previously delayed have now been processed successfully.”

This incident started when a customer uploaded a large amount of data via our web interface to our Canada region.

Due to the exact nature of the files uploaded, our Canada region’s worker servers became overloaded and unresponsive to any type of communication. As a result, background jobs began failing for all customers using our Canada region.

Upon investigation, we determined that the overload was caused by a design flaw in our checksum calculation code, which attempted to use only a single CPU core instead of all available CPU cores on the machine. Basically, the machine locked up because dozens of jobs were attempting to use the same core, rather than spreading out to all available cores.

As part of the incident resolution, Files.com pushed an update to introduce more parallelism to this calculation and allow all available CPU cores to be used. Additionally, one CPU core is now reserved for communication with our job scheduling system, which will prevent these communication problems in high load situations in the future.
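To illustrate the general idea behind this fix, here is a minimal Python sketch of checksum work spread across all but one CPU core. It is not Files.com's actual code: the names `worker_count`, `md5_of_file`, and `checksum_files`, the chunk size, and the use of a thread pool are all assumptions made for illustration. Threads are used here because Python's `hashlib` releases the interpreter lock while hashing larger buffers, so separate files can be hashed on separate cores.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # hypothetical choice: read files in 1 MiB chunks


def worker_count(reserved_cores=1):
    """Number of workers to run, leaving `reserved_cores` free for other duties
    (for example, talking to a job scheduler). Never returns less than 1."""
    total = os.cpu_count() or 1
    return max(1, total - reserved_cores)


def md5_of_file(path):
    """Compute the MD5 hex digest of one file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_files(paths, reserved_cores=1):
    """Hash many files concurrently and return a {path: md5} mapping."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=worker_count(reserved_cores)) as pool:
        return dict(zip(paths, pool.map(md5_of_file, paths)))
```

Capping the pool at one less than the core count is the part that mirrors the reserved core described above: even when every worker is busy hashing, one core's worth of capacity is left for the process to answer its scheduler.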

As a result of “back pressure” caused by very high error rates, other jobs on Files.com’s background job scheduling system outside of the Canada region were also impacted with delays.   

The root cause of the broader delays (non-Canada) was a failure of Files.com’s internal job scheduling system to properly route around the failed Canada workers and prevent their failure from causing broader impact. Ultimately this was caused by a design failure in our internal job scheduling system, which we have now redesigned to avoid this type of issue. (See next paragraph.)

As a result of this incident and several other recent incidents, Files.com worked on dramatic improvements to its internal job scheduling code during the last week of April and first week of May, and those improvements have been tested in staging and are now in production.  

These improvements provide multiple new protection mechanisms to prevent issues with specific customers, job types, or regions from “backflowing” and impacting other customers, job types, or regions.  
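This report does not describe which protection mechanisms were implemented. One common pattern for this kind of isolation is a per-region circuit breaker, sketched below in Python purely as an illustration. The class names `CircuitBreaker` and `RegionDispatcher`, the thresholds, and the return values are all hypothetical and are not taken from Files.com's scheduler.

```python
import time


class CircuitBreaker:
    """Stops sending work to a target after repeated failures, then retries later."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed (healthy)

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has passed, let a trial job through.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


class RegionDispatcher:
    """Keeps one breaker per region so a failing region cannot stall the others."""

    def __init__(self, **breaker_options):
        self.breaker_options = breaker_options
        self.breakers = {}
        self.deferred = []  # jobs held back while their region is unhealthy

    def dispatch(self, region, job):
        if region not in self.breakers:
            self.breakers[region] = CircuitBreaker(**self.breaker_options)
        breaker = self.breakers[region]
        if not breaker.allow():
            self.deferred.append((region, job))
            return "deferred"
        try:
            job()
        except Exception:
            breaker.record_failure()
            return "failed"
        breaker.record_success()
        return "ok"
```

In this sketch, jobs for an unhealthy region are set aside quickly instead of being retried against unresponsive workers, so they do not consume capacity that jobs for healthy regions need.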

Extensive review and testing were conducted by Files.com staff to validate this resolution, and we have already taken steps internally to prevent this issue from recurring in the future.

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Jun 01, 2023 - 11:39 PDT

Resolved

All services have been restored and are operating normally.

Canada only: We have resolved an issue with certain background processing performed as part of the core Files.com file transfer pipeline in Canada. Impacted functions of Files.com included file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5.

The issue with background processing began at 6:46 AM PST and was resolved completely by 11:46 AM PST. Resolution means that any background jobs that were previously delayed have now been processed successfully.

We will follow up with an Incident Report within ten (10) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.
Posted May 01, 2023 - 12:18 PDT

Update

We are continuing to investigate this issue. Most delays and errors have been corrected. We continue to work on a small number of transactions that are still experiencing delays or errors. If you need additional assistance, please do not hesitate to contact our Customer Support team by email. Thank you for your continued patience.
Posted May 01, 2023 - 10:59 PDT

Update

We are continuing to investigate this issue with Canada region background processing. We will post an update as soon as the issue has been identified and a fix is being implemented. If you need additional assistance, please do not hesitate to contact our Customer Support team by email. Thank you for your continued patience.
Posted May 01, 2023 - 10:22 PDT

Update

We are continuing to investigate this issue with Canada region background processing. We will post an update as soon as the issue has been identified and a fix is being implemented. If you need additional assistance, please do not hesitate to contact our Customer Support team by email. Thank you for your continued patience.
Posted May 01, 2023 - 09:09 PDT

Investigating

Canada only: We are investigating elevated error rates related to certain background processing performed as part of the core Files.com file transfer pipeline in the Canada region. Impacted functions of Files.com include file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5.

This situation should not impact customers at all unless they have files stored in the Canada region.

This situation should not affect real-time operations such as the Files.com API, FTP, SFTP, AS2, and other operations where Files.com acts as a server.

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.
Posted May 01, 2023 - 08:10 PDT
This incident affected: Core Services / API and Web Interface.