Delays To Batch and Scheduled Operations
Incident Report for Files.com
Postmortem

On May 2nd, 2023, at 12:40 PM PDT, Files.com received automated alerts of delays in batch and scheduled operations, and an incident was declared.  The Incident Management Team (IMT) convened and immediately began investigating.  

Files.com released an initial Status Page posting on May 2nd, 2023, at 1:07 PM PDT stating:   

“Delays To Batch and Scheduled Operations: We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.  This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.” 

The delays in batch and scheduled operations were resolved on May 2nd, 2023, at 12:55 PM PDT, returning the platform to full functionality.  

Files.com released a resolution Status Page posting on May 2nd, 2023, at 1:15 PM PDT stating:    

“We have resolved a situation causing delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations. Operations were delayed beginning at 12:35 PDT and ending by 12:55 PDT. All operations did successfully complete despite delays.  This situation was only a delay of scheduled and batch processing and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.”  

This incident began when a deadlock occurred in one of Files.com’s backend job processing systems, specifically the system that generates image and PDF previews of large images and documents for web viewing.  A recent code change caused the system to lock up and stop processing preview generation on one of six backend servers.   

As a result of “backflow” caused by very high error rates, other jobs such as syncs were delayed by 5 minutes on two separate occasions.  

The root cause of this incident was a failure of Files.com’s internal job scheduling system to properly route around the failed preview worker and prevent its failure from causing broader impact.  Ultimately this was caused by a design failure in the internal job scheduling system, which we have now redesigned to avoid this type of issue. (See below.)   
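
This report does not describe the scheduler's internals, so the following is only a minimal Python sketch of what "routing around" a failed worker can mean in general. The Worker and Dispatcher classes, the heartbeat timeout, and the worker names are all hypothetical and are not taken from Files.com's code. The idea is that a worker which has stopped reporting heartbeats, as a deadlocked process would, is excluded from dispatch so new jobs go to the remaining workers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    """Hypothetical worker record; last_heartbeat is in seconds."""
    name: str
    last_heartbeat: float


@dataclass
class Dispatcher:
    """Round-robin dispatcher that skips workers with stale heartbeats."""
    workers: List[Worker]
    heartbeat_timeout: float = 30.0  # assumed value, for illustration only
    _next: int = 0

    def healthy(self, now: float) -> List[Worker]:
        # A deadlocked worker stops sending heartbeats, so it ages out here.
        return [
            w for w in self.workers
            if now - w.last_heartbeat <= self.heartbeat_timeout
        ]

    def pick(self, now: float) -> Optional[Worker]:
        # Choose the next healthy worker, or None if none are healthy.
        candidates = self.healthy(now)
        if not candidates:
            return None
        worker = candidates[self._next % len(candidates)]
        self._next += 1
        return worker


# Six workers, one of which stopped reporting 95 seconds ago.
workers = [Worker(f"preview-{i}", last_heartbeat=100.0) for i in range(6)]
workers[2].last_heartbeat = 10.0
dispatcher = Dispatcher(workers)
picks = [dispatcher.pick(now=105.0).name for _ in range(10)]
```

In this sketch the stalled worker never receives a job, and the other five share the load evenly. A real scheduler would also need to decide what happens to jobs the stalled worker had already claimed.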

A contributing cause was the failure of the preview worker itself, which was caused by Files.com’s failure to properly test the recent code change in a high load situation.   

As a result of this incident and several other recent incidents, Files.com worked on dramatic improvements to its internal job scheduling code during the last week of April and first week of May, and those improvements have been tested in staging and are now in production.  

These improvements provide multiple new protection mechanisms to prevent issues with specific customers, job types, or regions from “backflowing” and impacting other customers, job types, or regions.  
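
The specific protection mechanisms are not detailed in this report. As a rough illustration of the general technique (often called a bulkhead), the Python sketch below keeps a separate bounded queue per job type, so that a backlog or stall in one job type cannot crowd out the others. The IsolatedQueues class, its method names, and the limit of two are assumptions made for this example only, not Files.com's implementation; the same pattern could be keyed by customer or region instead of job type.

```python
from collections import deque
from typing import Deque, Dict, FrozenSet, Optional, Tuple


class IsolatedQueues:
    """Hypothetical per-job-type bounded queues with round-robin draining."""

    def __init__(self, per_type_limit: int) -> None:
        self.per_type_limit = per_type_limit
        self.queues: Dict[str, Deque[str]] = {}
        self.order: Deque[str] = deque()  # rotation order of job types

    def submit(self, job_type: str, job: str) -> bool:
        """Enqueue a job; refuse it only if its own type's queue is full."""
        if job_type not in self.queues:
            self.queues[job_type] = deque()
            self.order.append(job_type)
        queue = self.queues[job_type]
        if len(queue) >= self.per_type_limit:
            return False  # only this job type is pushed back
        queue.append(job)
        return True

    def next_job(
        self, stalled: FrozenSet[str] = frozenset()
    ) -> Optional[Tuple[str, str]]:
        """Return the next (job_type, job), skipping stalled or empty types."""
        for _ in range(len(self.order)):
            job_type = self.order[0]
            self.order.rotate(-1)  # move this type to the back of the line
            if job_type in stalled or not self.queues[job_type]:
                continue
            return job_type, self.queues[job_type].popleft()
        return None


queues = IsolatedQueues(per_type_limit=2)
accepted = [queues.submit("preview", f"p{i}") for i in range(3)]
sync_accepted = queues.submit("sync", "s1")
# Preview workers are stalled, but sync jobs are still handed out.
first = queues.next_job(stalled=frozenset({"preview"}))
```

Here the third preview job is refused because the preview queue is full, while the sync job is accepted and dispatched even though previews are stalled.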

Files.com staff conducted extensive review and testing to verify this resolution, and we have already taken steps internally to prevent this issue from recurring.    

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Jun 01, 2023 - 11:41 PDT

Resolved
We have resolved a situation causing delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.

Operations were delayed beginning at 12:35 PDT and ending by 12:55 PDT. All operations did successfully complete despite delays.

This situation was only a delay of scheduled and batch processing and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.
Posted May 02, 2023 - 13:15 PDT
Investigating
We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.

This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.
Posted May 02, 2023 - 13:07 PDT
This incident affected: Background Jobs, including Sync and Webhooks.