Delays To Batch and Scheduled Operations
Incident Report for Files.com
Postmortem

On April 19th, 2023, at 6:52 AM PST, Files.com received automated alerting of delays to batch and scheduled operations, which resulted in an incident being declared.  The Incident Management Team (IMT) convened and immediately began investigation.  

Files.com released an initial Status Page posting on April 19th, 2023, at 7:22 AM PST, stating: 

“Delays To Batch and Scheduled Operations:  We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.  This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.”  

The delays to batch and scheduled operations were resolved on April 19th, 2023, at 7:51 AM PST, returning the platform to full functionality. 

Files.com released a resolution Status Page posting on April 19th, 2023, at 7:52 AM PST, stating: 

“We have resolved a situation causing delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.  Operations were delayed beginning at 6:35 AM PST and ending by 7:51 AM PST.  All operations did successfully complete despite delays.  This situation was only a delay of scheduled and batch processing and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.”  

Files.com regularly reboots machines during the early AM hours to install security updates.  This is performed on a routine basis and typically occurs without incident.   

This incident started when a Search cluster used for searching and sorting files by name on Files.com failed to come back up after a routine reboot.   

After investigation, it was determined that the failure was caused by a configuration change that was meant to target a different Search cluster at Files.com but was inadvertently installed on this cluster.  The change was actually made over a month ago, but due to the nature of the configuration change, it did not cause any issues until the reboot.  
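One way to guard against a change landing on the wrong cluster is to have each configuration declare its intended target and to refuse to apply it anywhere else. The sketch below is illustrative only: the `target_cluster` key, the cluster names, and the function name are hypothetical and are not taken from Files.com's tooling.

```python
class ConfigTargetMismatch(Exception):
    """Raised when a config is about to be applied to the wrong cluster."""


def check_config_target(cluster_name, config):
    """Return config unchanged if it declares cluster_name as its target.

    Assumes (hypothetically) that every config carries a 'target_cluster' key.
    """
    target = config.get("target_cluster")
    if target != cluster_name:
        raise ConfigTargetMismatch(
            f"config targets {target!r}, refusing to apply to {cluster_name!r}"
        )
    return config
```

A check like this only catches a mis-targeted change at apply time. A change that stays dormant until the next reboot would also need the service to be restarted or validated against the new configuration when the change is made.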

After reverting the inadvertent configuration change, the Search cluster came back online and service was restored.   

As a result of “back pressure” caused by very high error rates on search, other jobs on Files.com’s background job scheduling system unrelated to Search were also impacted with delays lasting about an hour.   

The root cause of the broader backup (i.e., jobs other than Search) was a failure of Files.com’s internal job scheduling system to properly route around the failed search indexing jobs and prevent their failure from causing broader impact.  
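A common pattern for routing around a failing dependency is a circuit breaker: after repeated failures, callers stop sending work to the dependency for a cool-down period instead of piling up behind it. The following is a minimal sketch of that general pattern, not a description of the Files.com scheduler; the class name, thresholds, and timings are assumptions.

```python
import time


class CircuitBreaker:
    """Stops calls to a dependency after repeated consecutive failures."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    def allow(self):
        """Return True if a call to the dependency may be attempted now."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, calls are allowed again as trials.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        """A successful call closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; at the threshold, (re)open and restart the cool-down."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A worker would check `allow()` before running a job that depends on Search and, when it returns False, defer that job and pick up unrelated work instead.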

Part of this was caused by a design failure in the internal job scheduling system, which we have now redesigned to avoid this type of issue (described below).  Further contributing to the problem was a software design flaw where the Files.com search code used a 1 second delay prior to retrying a failed query to the Search cluster.  1 second is an eternity on a service that normally serves thousands of requests per second.  This has also been fixed.  
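To illustrate why a fixed 1 second sleep is costly, a worker that sleeps for a full second on every failed query can attempt at most about one query per second during an outage. A frequently used alternative is a small number of retries with short, capped, jittered delays. The sketch below shows that general technique; the report does not say how the retry logic was actually changed, and the function name, attempt count, and delay values here are assumptions.

```python
import random
import time


def call_with_retry(op, attempts=3, base_delay=0.01, max_delay=0.1, sleep=time.sleep):
    """Call op(); on failure, retry with capped exponential backoff and jitter.

    Delays are in seconds. With the defaults, no single wait exceeds 0.1 s.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            last_exc = exc
            if attempt == attempts - 1:
                break  # out of attempts; do not sleep again
            # Exponential backoff, capped, with full jitter to avoid
            # many workers retrying in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise last_exc
```

Bounding both the number of attempts and the per-attempt delay keeps a worker from being tied up for long by a dependency that is down.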

As a result of this incident and several other recent incidents, Files.com worked on dramatic improvements to its internal job scheduling code during the last week of April and first week of May, and those improvements have been tested in staging and are now in production.  

These improvements provide multiple new protection mechanisms to prevent issues with specific customers, job types, or regions from “backflowing” and impacting other customers, job types, or regions.  
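One general technique for this kind of isolation is a bulkhead: each job type may hold only a limited share of the worker slots, so a stuck job type cannot consume all of them. The sketch below is a simplified illustration under assumed names and limits; it is not the actual Files.com implementation, and the same idea could be keyed by customer or region instead of job type.

```python
from collections import deque


class BulkheadScheduler:
    """Hands out jobs while capping how many slots one job type may hold."""

    def __init__(self, total_slots, per_type_limit):
        self.total_slots = total_slots
        self.per_type_limit = per_type_limit
        self.queues = {}     # job type -> deque of pending jobs
        self.in_flight = {}  # job type -> number of running jobs

    def submit(self, job_type, job):
        """Queue a job under its type."""
        self.queues.setdefault(job_type, deque()).append(job)
        self.in_flight.setdefault(job_type, 0)

    def next_job(self):
        """Return (job_type, job) for a runnable job, or None if none may run."""
        if sum(self.in_flight.values()) >= self.total_slots:
            return None
        # Types are scanned in the order they were first submitted.
        for job_type, queue in self.queues.items():
            if queue and self.in_flight[job_type] < self.per_type_limit:
                self.in_flight[job_type] += 1
                return job_type, queue.popleft()
        return None

    def done(self, job_type):
        """Release the slot held by a finished job of this type."""
        self.in_flight[job_type] -= 1
```

With 4 slots and a per-type limit of 2, ten stalled search indexing jobs can occupy at most 2 slots, leaving the other 2 available for webhooks or other job types.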

Extensive review and testing was conducted by Files.com staff to ensure this resolution, and we have already taken steps internally to prevent this issue from recurring in the future.

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted May 15, 2023 - 10:48 PDT

Resolved
This incident has been resolved.
Posted Apr 19, 2023 - 07:53 PDT
Update
We have resolved a situation causing delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.

Operations were delayed beginning at 6:35 AM PST and ending by 7:51 AM PST. All operations did successfully complete despite delays.

This situation was only a delay of scheduled and batch processing and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.
Posted Apr 19, 2023 - 07:52 PDT
Investigating
We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.

This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.
Posted Apr 19, 2023 - 07:22 PDT
This incident affected: Background Jobs, including Sync and Webhooks.