Delays To Batch and Scheduled Operations
Incident Report for Files.com
Postmortem

On April 22nd, 2023, at 10:30 AM PST, Files.com received automated alerts of delays on batch and scheduled operations in the Canada region which resulted in an incident being declared.  The Incident Management Team (IMT) convened and immediately began investigation.  

Files.com released an initial Status Page posting on April 22nd, 2023, at 11:04 AM stating:   

“Canada Region Only: Delays and Errors Related to Certain Background Processing: Canada only: We are investigating elevated error rates related to certain background processing performed as part of the core Files.com file transfer pipeline in the Canada region. Impacted functions of Files.com include file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5. This situation should not impact customers at all unless they have files stored in the Canada region. This situation should not affect real-time operations such as the Files.com API, FTP, SFTP, AS2, and other operations where Files.com acts as a server.”

Files.com released an updated Status Page posting on April 22nd, 2023, at 11:31 AM PST stating:   

“We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.  This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.” 

The delays on batch and scheduled operations in the Canada region were resolved on April 22nd, 2023, at 11:32 AM PST, returning the platform to full functionality.

Files.com released a resolution Status Page posting on April 22nd, 2023, at 11:38 AM PST stating:   

“We have resolved a situation causing delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.  Operations were delayed beginning at 10:30 AM PST and ending by 11:32 AM PST.  All operations did successfully complete despite delays.  This situation was only a delay of scheduled and batch processing and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.” 

This incident was triggered by a customer-initiated regional migration of a fairly large amount of data from our USA region to our Canada region.  We typically process hundreds or thousands of such migrations daily without incident.   

During this process our Canada region worker servers became overloaded. As a result, all regional background jobs began failing for customers using our Canada region.

At the time the incident actually occurred, we incorrectly identified the root cause, but were able to resolve the issue anyway by resubmitting the failed jobs.   

On May 2nd, a similar incident occurred, also in Canada, and during that investigation we discovered the true root cause of this incident. We determined the overload to be caused by a design flaw in our checksum calculation code, which failed to use all available CPU cores on the machine and instead attempted to use only a single CPU core. In short, the machine locked up because dozens of jobs were attempting to use the same core rather than spreading out across all available cores.

As part of that incident’s resolution, Files.com pushed an update that introduces more parallelism to this calculation and allows all available CPU cores to be used. Additionally, one CPU core is now reserved for communication with our job scheduling system, which will prevent communication problems in high-load situations in the future.
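As an illustration only, the fix described above can be sketched in Python: spread checksum work across a pool sized to the machine's cores while holding one core back for other duties. This is not Files.com's actual code. The names `worker_count`, `md5_of_file`, and `checksum_files` are hypothetical, and the use of a thread pool is an assumption that relies on `hashlib` releasing the interpreter lock while hashing large buffers.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count(total_cores=None, reserved=1):
    """Return how many checksum workers to run, leaving `reserved` cores free.

    Hypothetical helper: the reserved core stands in for the core kept
    available for talking to the job scheduler.
    """
    total = total_cores if total_cores is not None else (os.cpu_count() or 1)
    return max(1, total - reserved)


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_files(paths, reserved=1):
    """Checksum many files concurrently instead of funnelling them through one core.

    hashlib releases the GIL while hashing buffers larger than about 2 KB,
    so threads can run on separate cores for this workload.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=worker_count(reserved=reserved)) as pool:
        return dict(zip(paths, pool.map(md5_of_file, paths)))
```

Calling `checksum_files(["a.bin", "b.bin"])` would return a dictionary mapping each path to its digest. Capping the pool at one fewer than the core count is the part that mirrors the reserved core mentioned above.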

As a result of “back pressure” caused by very high error rates, other jobs on Files.com’s background job scheduling system outside of the Canada region were also impacted with delays.   

The root cause of the broader (non-Canada) impact was a failure of Files.com’s internal job scheduling system to properly route around the failed Canada workers and prevent their failure from causing broader impact. Ultimately this was caused by a design failure in our internal job scheduling system, which we have now redesigned to avoid this type of issue. (See next paragraph.)

As a result of this incident and several other recent incidents, Files.com worked on dramatic improvements to its internal job scheduling code during the last week of April and first week of May, and those improvements have been tested in staging and are now in production.  

These improvements provide multiple new protection mechanisms to prevent issues with specific customers, job types, or regions from “backflowing” and impacting other customers, job types, or regions. 
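The report does not describe how these protection mechanisms work. One common pattern for this kind of isolation is a "bulkhead": each region gets its own bounded queue, so a backlog in one region cannot consume capacity belonging to another. The sketch below is a generic example of that pattern, not Files.com's design, and the class and method names are hypothetical.

```python
from collections import deque


class RegionBulkheads:
    """Keep a separate bounded job queue per region (hypothetical example)."""

    def __init__(self, max_pending_per_region):
        self.max_pending = max_pending_per_region
        self.queues = {}

    def submit(self, region, job):
        """Queue a job for its region; reject it if that region is saturated.

        Rejection affects only the saturated region, so other regions
        keep accepting work.
        """
        queue = self.queues.setdefault(region, deque())
        if len(queue) >= self.max_pending:
            return False
        queue.append(job)
        return True

    def next_job(self, region):
        """Pop the oldest pending job for a region, or None if it has none."""
        queue = self.queues.get(region)
        return queue.popleft() if queue else None
```

With a limit of two pending jobs per region, a third submission to a backed-up "canada" queue would be rejected while a submission to "usa" would still be accepted.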

Files.com staff conducted extensive review and testing to verify this resolution, and we have already taken steps internally to prevent this issue from recurring.

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted May 15, 2023 - 12:54 PDT

Resolved
We have resolved a situation causing delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.

Operations were delayed beginning at 10:30 AM PST and ending by 11:32 AM PST. All operations did successfully complete despite delays.

This situation was only a delay of scheduled and batch processing and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.
Posted Apr 22, 2023 - 11:38 PDT
Update
We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.

This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.
Posted Apr 22, 2023 - 11:31 PDT
Investigating
Canada only: We are investigating elevated error rates related to certain background processing performed as part of the core Files.com file transfer pipeline in the Canada region. Impacted functions of Files.com include file transformations such as zip/unzip, GPG, preview generation, and file statistics calculation such as MD5.

This situation should not impact customers at all unless they have files stored in the Canada region.

This situation should not affect real-time operations such as the Files.com API, FTP, SFTP, AS2, and other operations where Files.com acts as a server.
Posted Apr 22, 2023 - 11:04 PDT
This incident affected: Background Jobs, including Sync and Webhooks.