Delays To Scheduled Operations
Incident Report for Files.com
Postmortem

On May 15th, 2023, at 1:41 AM PDT, Files.com received customer reports of delays to batch and scheduled operations, and an incident was declared. The Incident Management Team (IMT) convened and immediately began investigating.

Files.com released an initial Status Page posting on May 15th, 2023, at 1:54 AM PDT stating:

“Delays To Batch and Scheduled Operations:  We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations. 

This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.” 

Files.com released an updated Status Page posting on May 15th, 2023, at 2:16 AM PDT stating:

“We are continuing to investigate this issue.”   

The delays to batch and scheduled operations were resolved on May 15th, 2023, at 2:16 AM PDT, returning the platform to full functionality.

Files.com released a resolution Status Page posting on May 15th, 2023, at 2:20 AM PDT stating:

“We have resolved a situation causing delays to scheduled operations around syncs. 

Operations were delayed beginning at 17:45 PDT and ending by 2:16 PDT. All operations did successfully complete despite delays.  

This situation was only a delay of scheduled syncs and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.”  

As part of the incident postmortem, the root cause was identified: 

Files.com uses internally developed job scheduling software to manage certain background tasks such as syncs. Due to a bug in the software, a value too large to fit in 32 bits was inadvertently written to a database column that can only hold 32-bit values. The software correctly identified the discrepancy and stopped processing Sync jobs specifically until the issue could be manually resolved.
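
To illustrate the class of bug described above, the following is a minimal Python sketch of a range check applied before a value is written to a 32-bit column. It is not the Files.com scheduler code: the function names, the column name, and the assumption that the column is a signed 32-bit integer are all hypothetical.

```python
# Bounds of a signed 32-bit integer (assumed column type, not confirmed by the report).
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit integer column."""
    return INT32_MIN <= value <= INT32_MAX


def validate_for_int32_column(value: int, column: str) -> int:
    """Return value unchanged, or raise ValueError if it would not fit in the column."""
    if not fits_int32(value):
        raise ValueError(f"{column}: {value} does not fit in a signed 32-bit integer")
    return value


# Hypothetical usage: the largest valid value passes, the next one is rejected.
validate_for_int32_column(2**31 - 1, "run_counter")
try:
    validate_for_int32_column(2**31, "run_counter")
except ValueError as exc:
    print(f"rejected: {exc}")
```

A check of this kind would reject an oversized value at write time instead of letting the mismatch surface later during job processing.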

After becoming aware of the issue, our team pushed several fixes to improve the robustness of this part of the scheduling software.  The root cause of this issue was insufficient testing of the job scheduling software against edge cases such as bad data in a database.   

This incident was further complicated by the fact that it was not detected by our internal monitoring systems at all.  We responded to this incident after being alerted by one of our Enterprise Support customers via our 24/7 contact line for Enterprise Support customers.   

Frankly, we are embarrassed about this. We conducted a full investigation into the alerting situation and determined that although we did have monitoring for Sync job processing, our alerts for Sync jobs were based on Error Count, not Success Count.

As a result, a situation like this one, where 0 Errors and 0 Successes occurred during a monitoring interval, did not trigger an alert.

We have updated our alerting rules to look at Success Count in addition to Error Count. Additionally, we will soon be rolling out a brand new internal monitoring system that will provide an independent source of alerting and monitoring for the entire Sync process.
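
The difference between the two kinds of alert rule can be sketched in a few lines of Python. This is an illustration only, not the actual Files.com alerting configuration: the IntervalStats type, the function names, and the thresholds are hypothetical, and the sketch assumes that at least one Sync job is expected to succeed in every monitoring interval.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    """Job outcomes counted during one monitoring interval (hypothetical type)."""
    successes: int
    errors: int


def error_only_alert(stats: IntervalStats, max_errors: int = 0) -> bool:
    """Rule based on Error Count alone: stays silent when no jobs run at all."""
    return stats.errors > max_errors


def error_and_success_alert(
    stats: IntervalStats, max_errors: int = 0, min_successes: int = 1
) -> bool:
    """Rule that also fires when too few jobs succeed in the interval."""
    return stats.errors > max_errors or stats.successes < min_successes


# A stalled queue produces neither errors nor successes.
stalled = IntervalStats(successes=0, errors=0)
print(error_only_alert(stalled))         # False
print(error_and_success_alert(stalled))  # True
```

With a minimum Success Count in the rule, an interval in which processing has stopped entirely is treated as a failure rather than as a quiet period.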

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Jun 01, 2023 - 11:19 PDT

Resolved
We have resolved a situation causing delays to scheduled operations around syncs.

Operations were delayed beginning at 17:45 PDT and ending by 2:16 PDT. All operations did successfully complete despite delays.

This situation was only a delay of scheduled syncs and did not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.
Posted May 15, 2023 - 02:20 PDT
Update
We are continuing to investigate this issue.
Posted May 15, 2023 - 02:16 PDT
Update
We are continuing to investigate this issue.
Posted May 15, 2023 - 01:55 PDT
Investigating
We are investigating reports of delays to batch and scheduled operations, including but not limited to scheduled and ad-hoc syncs, moves, copies, file transformations, automations, webhooks, batch deletes, email delivery, previews, and similar batch operations.

This situation should not affect real-time operations such as FTP, SFTP, AS2, and other operations where Files.com acts as a server.
Posted May 15, 2023 - 01:54 PDT
This incident affected: Background Jobs, including Sync and Webhooks.