SFTP Entirely Down – US East Region (Primary)
Incident Report for Files.com
Postmortem

On May 8th, 2023, at 1:39 PM PDT, Files.com received an automated alert that SFTP was entirely down in the US East region, and an incident was declared.  The Incident Management Team (IMT) convened and immediately began investigating.

Files.com published an initial Status Page posting on May 8th, 2023, at 1:47 PM PDT stating:

“SFTP Entirely Down – US East Region (Primary):  SFTP only: We are investigating a major outage of the SFTP service on Files.com in our primary USA region. 

This incident does not impact other network services such as API, FTP, WebDAV, AS2, and others.  

If you have an urgent need to access Files.com, we recommend using FTP in lieu of SFTP. If you must connect via SFTP, you should be able to immediately connect (and access your existing files and account) using the hostname of our Canada region, which is app-ca-central-1.files.com.   

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.” 

The SFTP outage in the US East region was resolved on May 8th, 2023, at 1:47 PM PDT, returning the platform to full functionality.

Files.com published a resolution Status Page posting on May 8th, 2023, at 1:51 PM PDT stating:

“All services have been restored and are operating normally.   

We have resolved a major outage of the SFTP service on Files.com in our primary USA region.  This incident did not impact other network services such as API, FTP, WebDAV, AS2, and others.  The SFTP service was down from 1:34 p.m. to 1:47 p.m., with a total downtime of 13 minutes, but only in the primary USA region. 

If you previously moved any workloads to another region in response to this incident, you are cleared to move those regional workloads back to the USA region.   

We will follow up with an Incident Report within ten (10) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.”  

This incident occurred during a period that also contained multiple other incidents, some of which overlapped.  This report focuses specifically on the symptoms described here, but many customers who experienced this incident also experienced one of the other incidents.

This incident had two distinct parts and root causes.  

First, Files.com deployed a change to its SFTP server as part of our overall project to dramatically improve the logging and handling of errors on SFTP.  The deployment of that change crashed our SFTP servers in several of our smaller regions due to an “out of memory” condition.   

Our SFTP server is developed in Java, and anyone familiar with Java can tell you how sensitive Java can be to memory configuration settings.  We immediately identified the issue with the Java memory settings and pushed a change to Chef, our infrastructure configuration management system, to tweak the SFTP memory settings and resolve the initial crash.   

The root cause of this first part was Files.com’s failure to monitor Java runtime parameters such as memory usage to defend against an out of memory condition.  We have added additional monitoring around Java memory usage and are optimistic that this situation will be avoided in the future.

One benefit of the Files.com architecture as compared with many of our peers is that on Files.com, SFTP is a completely isolated subsystem, so this incident did not impact other network services such as FTP, AS2, WebDAV, or API.    

Unfortunately, when we deployed the configuration change via Chef, we inadvertently deployed an unrelated configuration change at the same time.  That unrelated change had previously been merged but not yet deployed to the SFTP servers.  This happened because we use one unified Chef repository for server configuration, in which certain recipes can be shared by different server types.

That unrelated configuration change introduced an error into the upstream communication with our API, which left certain customers unable to connect via SFTP.

After investigating the issue, we were able to identify the bad configuration change and revert it.    

The root cause of the second part was Files.com’s failure to operate adequate change management procedures to prevent an unintended change from being deployed.
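To illustrate the kind of guard that addresses this failure mode, here is a minimal sketch. It is not Files.com's actual tooling: the function names, the change identifiers, and the approach of comparing merged-but-undeployed changes against the set the operator intends to ship are all assumptions made for this example.

```python
from typing import List, Set


def unexpected_changes(pending: List[str], intended: Set[str]) -> List[str]:
    """Return pending changes the operator did not explicitly intend to deploy."""
    return [change for change in pending if change not in intended]


def check_deploy(pending: List[str], intended: Set[str]) -> None:
    """Raise instead of deploying when anything unintended is pending."""
    extra = unexpected_changes(pending, intended)
    if extra:
        raise RuntimeError(f"refusing to deploy, unintended pending changes: {extra}")


if __name__ == "__main__":
    # Hypothetical change IDs: one intended fix plus one unrelated merged change.
    pending = ["sftp-jvm-memory", "shared-api-upstream"]
    try:
        check_deploy(pending, intended={"sftp-jvm-memory"})
    except RuntimeError as err:
        print(err)
```

In this sketch the deploy is blocked until the unrelated change is either reviewed and added to the intended set or held back, so an urgent fix cannot silently carry other changes with it.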

Our incident management team was quite disappointed to learn about the chain of events that led to this incident.    

We have already improved our internal synthetic monitoring systems with the ability to detect the situation that occurred during this incident and alert on it immediately.   
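As a rough illustration of what a synthetic SFTP check can look like, the sketch below opens a TCP connection and confirms that the server answers with an SSH identification string. It is an assumption-laden example, not a description of Files.com's monitoring: the function name, the default port, and the timeout are hypothetical, and the check only proves that a listener is up.

```python
import socket


def sftp_banner_ok(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if host:port accepts a connection and sends an SSH banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(256)  # e.g. b"SSH-2.0-..." from a healthy server
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
    return banner.startswith(b"SSH-")
```

A banner check like this would catch a crashed server process, as in the first part of this incident. It would not catch the second part, where the failure was in upstream communication with the API. Detecting that would require a probe that completes a real login and a file operation with a test account.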

Additionally, as a result of this incident, we are implementing major changes to our change management procedures designed to prevent this sort of configuration management error from happening again.  

Those changes are fairly complicated and will require a great deal of internal development.  As such, they will likely not be deployed until the middle of Q3.  It is our goal to have them implemented before our next SOC 2 Type II observation period (which runs from Q2-Q3 2023) and documented in our next SOC 2 Type II report.    

On a more general note, we have added a considerable amount of sophistication to our monitoring and routing systems as a result of the several incidents that occurred in May, and we are adding more.  These improvements amount to over 5,000 lines of code and we are optimistic that they will reduce the frequency and impact of incidents in the future.  We hope to share more about the improvements in our next SOC 2 Type II report. 

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Jun 01, 2023 - 11:33 PDT

Resolved
All services have been restored and are operating normally.

We have resolved a major outage of the SFTP service on Files.com in our primary USA region. This incident did not impact other network services such as API, FTP, WebDAV, AS2, and others. The SFTP service was down from 1:34 p.m. to 1:47 p.m., with a total downtime of 13 minutes, but only in the primary USA region.

If you previously moved any workloads to another region in response to this incident, you are cleared to move those regional workloads back to the USA region.

We will follow up with an Incident Report within ten (10) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.
Posted May 08, 2023 - 13:51 PDT

Investigating
SFTP only: We are investigating a major outage of the SFTP service on Files.com in our primary USA region.

This incident does not impact other network services such as API, FTP, WebDAV, AS2, and others.

If you have an urgent need to access Files.com, we recommend using FTP in lieu of SFTP. If you must connect via SFTP, you should be able to immediately connect (and access your existing files and account) using the hostname of our Canada region, which is app-ca-central-1.files.com.

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.
Posted May 08, 2023 - 13:47 PDT
This incident affected: SFTP.