SFTP, FTP/FTPS, WebDAV Service Degraded
Incident Report for Files.com
Postmortem

On May 8th and May 9th, 2023, Files.com received multiple automated alerts and customer reports of intermittent issues with the Files.com platform, which resulted in an incident being declared.  The Incident Management Team (IMT) convened and immediately began an investigation.  

Files.com released an initial Status Page posting on May 8th, 2023, at 5:12 PM PST stating:  

“SFTP, FTP/FTPS, WebDAV Service Degraded:  FTP/FTPS, SFTP, WebDAV only: We are investigating elevated error rates on these services on Files.com in all regions.  

This incident does not impact other network services such as API, AS2, and others.  

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.” 

Files.com released a resolution Status Page posting on May 8th, 2023, at 5:37 PM PST stating:  

“All services have been restored and are operating normally.   

Users connecting to accounts with a custom namespace, an ExaVault host key, a custom host key, or an enforced IP whitelist experienced authentication errors. Logins were impacted between 1:34 p.m. PST and 5:33 p.m. PST. Other users may have experienced elevated error rates as well.  

We will follow up with an Incident Report within ten (10) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.” 

Customers continued reporting other intermittent issues with the platform, which resulted in a second incident being declared on May 9th, 2023, at 6:47 AM PST.  The IMT convened and immediately began an investigation.  

The intermittent issues with the Files.com platform were resolved on May 9th, 2023, at 8:07 AM PST, returning the platform to full functionality.  

This incident occurred due to a complex set of circumstances with times that vary by region.  This narrative will focus on the overall story of what happened.  On May 5, Files.com experienced an incident that resulted in a 3+ hour service outage.   

Prior to that, on May 3, Files.com conducted a successful upgrade of certain regional proxy servers in certain regions from Intel architecture to ARM architecture as part of our overall transition from Intel to ARM across all of our services.    

As we explained in the RCA of the May 5 incident, our Incident Management Team originally misidentified the root cause of that incident as being related to the new ARM servers and made the decision to roll back from our new ARM servers to the old Intel servers in certain regions on May 5.   

Unfortunately, that rollback was not correctly performed.   

We use AWS (Amazon Web Services) EC2 (Elastic Compute Cloud) for all of our compute resources on Files.com.  Both the Intel and ARM servers being discussed run inside AWS EC2.  

The EC2 networking backplane has a long-standing bug, one we have been aware of for some time, where migrating an IP from one server to another can result in erroneous data reported by EC2 to our instances.  In short, if you live migrate an IP on EC2 from one server to another, EC2 can report to both servers that they still “own” the IP.    

Because of this bug, we have a complicated procedure for migrating IPs from one server to another.  This procedure is highly automated and provides that we always fully shut down servers after IPs are moved off of them.  This procedure works around the EC2 bug.   

When we performed the rollback from ARM to Intel servers on May 5, we did not fully follow our procedure: the ARM servers were never fully shut down.  Instead, they were “disabled” using a softer disabling mechanism.  At some point they rebooted, and once they rebooted, EC2 began to report conflicting information about which server “owned” the IPs related to this incident.    

In our architecture, servers report their internal and external IP list to our central routing system on a regular schedule.  As a result of the two sets of servers reporting conflicting information, our routing systems began to oscillate routing traffic between the Intel and ARM servers every few minutes, and only one set of servers would work at a given time.   
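
To illustrate the oscillation described above, here is a minimal sketch. It is not our routing code; it assumes a simplified "latest report wins" routing table, and the server names, IP address, and report times are all hypothetical. When two servers each periodically claim the same IP, the recorded owner flips on every report.

```python
from collections import defaultdict


def route_history(reports):
    """Replay (time, server, ip) reports in time order.

    Assumption for illustration: the most recent report for an IP
    decides where traffic for that IP is routed.
    Returns {ip: [(time, server), ...]} listing each ownership change.
    """
    owner = {}
    changes = defaultdict(list)
    for t, server, ip in sorted(reports):
        if owner.get(ip) != server:
            owner[ip] = server
            changes[ip].append((t, server))
    return dict(changes)


# Hypothetical data: an Intel and an ARM server both claim the same IP,
# reporting on interleaved schedules.
IP = "203.0.113.10"
reports = [
    (0, "intel-1", IP),
    (1, "arm-1", IP),
    (2, "intel-1", IP),
    (3, "arm-1", IP),
]

history = route_history(reports)
for t, server in history[IP]:
    print(f"t={t}: {IP} routed to {server}")
```

In this sketch every report changes the route, so traffic alternates between the two servers and only one of them serves the IP at any given moment.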

The root cause of this incident was our failure to follow our own procedure during the transition between ARM and Intel servers.  A major contributing factor was our failure to detect a situation where IP addresses appear to oscillate between multiple servers.  Another contributing factor was the AWS EC2 bug that results in incorrect IP address information being reported to instances.     

As a result of this incident, we have conducted remedial training with all of our Infrastructure team to re-train them on the procedure to migrate IPs from one server to another.  We have additionally added new protection to our routing system that will detect a situation where IP addresses oscillate between servers and raise an alarm when that happens in the future.    
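
The kind of oscillation check described above could look roughly like the following sketch. This is an illustration under stated assumptions, not the protection we actually deployed: the `FlapDetector` class, its window and threshold values, and the sample data are all hypothetical. It counts how often the reported owner of an IP changes within a sliding time window and signals once the count reaches a threshold.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Flag an IP whose reported owner changes too often in a time window.

    Hypothetical defaults: 3 or more ownership changes within
    600 seconds counts as flapping.
    """

    def __init__(self, window_seconds=600, max_changes=3):
        self.window_seconds = window_seconds
        self.max_changes = max_changes
        self._owner = {}                      # ip -> last reported server
        self._changes = defaultdict(deque)    # ip -> times of recent changes

    def observe(self, now, ip, server):
        """Record one ownership report; return True if the IP is flapping."""
        previous = self._owner.get(ip)
        self._owner[ip] = server
        recent = self._changes[ip]
        if previous is not None and previous != server:
            recent.append(now)
        # Drop changes that have aged out of the sliding window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_changes


# Hypothetical reports: two servers alternately claim the same IP.
detector = FlapDetector(window_seconds=600, max_changes=3)
samples = [(0, "intel-1"), (120, "arm-1"), (240, "intel-1"), (360, "arm-1")]
alarms = [detector.observe(t, "203.0.113.10", s) for t, s in samples]
print(alarms)
```

A single planned migration produces only one ownership change, so a threshold above one avoids raising an alarm during normal IP moves while still catching repeated flips.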

Furthermore, we have improved our internal synthetic monitoring systems with the ability to detect the situation that occurred during this incident and treat it as a failure.    

On a more general note, we have added a considerable amount of sophistication to our monitoring and routing systems as a result of the several incidents that occurred in May, and we are adding more.  These improvements amount to over 5,000 lines of code and we are optimistic that they will reduce the frequency and impact of incidents in the future. 

We greatly appreciate your patience and understanding as we resolved these issues. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Jun 01, 2023 - 11:36 PDT

Resolved
All services have been restored and are operating normally.

Users connecting to accounts with a custom namespace, an ExaVault host key, a custom host key, or an enforced IP whitelist experienced authentication errors. Logins were impacted between 1:34 p.m. PST and 5:33 p.m. PST. Other users may have experienced elevated error rates as well.

We will follow up with an Incident Report within ten (10) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.
Posted May 08, 2023 - 17:37 PDT
Investigating
FTP/FTPS, SFTP, WebDAV only: We are investigating elevated error rates on these services on Files.com in all regions.

This incident does not impact other network services such as API, AS2, and others.

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.
Posted May 08, 2023 - 17:12 PDT
This incident affected: FTP/FTPS, SFTP, and WebDAV.