SFTP Service Outage in the USA Region
Incident Report for Files.com
Postmortem

On December 5th, 2023, at 4:06 AM PST, Files.com correlated multiple customer tickets indicating ‘authentication errors when logging into SFTP’ and declared an incident. The Incident Management Team (IMT) convened and immediately began investigating.

The ‘authentication errors when logging into SFTP’ issue was resolved on December 5th, 2023 at 6:29 AM PST, returning the platform to full functionality.

In this incident, our SFTP servers became unstable and failed to process requests for certain customers. The cause was a bad configuration file that our automated configuration management system applied to the SFTP servers on December 4th, 2023, at 10:41 PM PST.

This failure affected only a small number of customers: those for whom our API was required to authenticate the provenance of the origin IP of the connecting SFTP user. This includes customers who use IP Whitelisting or IP Geolocation (such as country whitelisting or blacklisting). We use a sophisticated system to cryptographically authenticate the origin IP of the connecting SFTP user when making upstream calls to our internal API, and it was a configuration related to this system that was inadvertently misapplied.
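This report does not describe how the origin IP is authenticated. As an illustration only, the following is a minimal sketch of one common approach: an HMAC-signed, short-lived token that an edge server attaches to its upstream API call. The shared secret, the `ip|timestamp|mac` token layout, the 30-second freshness window, and both function names are assumptions made for this example, not details of the Files.com system.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical shared secret known to the SFTP edge and the internal API.
SECRET = b"example-shared-secret"
# Hypothetical freshness window; older tokens are rejected to limit replay.
MAX_AGE_SECONDS = 30


def _mac(ip: str, ts: str, secret: bytes) -> str:
    """HMAC-SHA256 over the IP and timestamp, hex encoded."""
    return hmac.new(secret, f"{ip}|{ts}".encode(), hashlib.sha256).hexdigest()


def sign_origin_ip(ip: str, now: Optional[float] = None,
                   secret: bytes = SECRET) -> str:
    """Edge side: build an 'ip|timestamp|mac' token for the observed origin IP."""
    ts = str(int(time.time() if now is None else now))
    return f"{ip}|{ts}|{_mac(ip, ts, secret)}"


def verify_origin_ip(token: str, now: Optional[float] = None,
                     secret: bytes = SECRET) -> Optional[str]:
    """API side: return the attested IP if the token is authentic and fresh."""
    try:
        ip, ts, mac = token.split("|")
        age = (time.time() if now is None else now) - int(ts)
    except ValueError:
        return None  # wrong number of fields or non-numeric timestamp
    expected = _mac(ip, ts, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    if not 0 <= age <= MAX_AGE_SECONDS:
        return None
    return ip
```

In a scheme like this, a configuration that gives the edge and the API different secrets makes every verification fail. Only logins that depend on the attested IP would then be rejected, which is consistent with the partial impact described above.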

The bad configuration file was deployed for the following reason:

A separate configuration change was correctly and successfully made to another system (our HTTP servers) via our configuration management system. Due to a logic error in the code of that change, it also targeted our SFTP servers, to which it should never have been deployed.

Internally, Files.com runs SFTP services on several dedicated servers in each service region. Our configuration management system deploys changes to servers one at a time, checking to ensure correct operation prior to continuing forward with the rollout of configuration changes.

Unfortunately, while this check did validate proper operation of SFTP in general, it did not specifically validate proper operation of the subsystem that provides for cryptographic authentication of IP addresses.
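The rollout behavior described above can be sketched as follows. This is a simplified model rather than the actual configuration management system: `rolling_deploy`, `RolloutHalted`, and the check names used in the usage note are hypothetical. It shows how a one-server-at-a-time rollout stops only on failures that its checks are able to observe.

```python
from typing import Callable, Dict, Iterable, List

# A check takes a server name and reports whether that server looks healthy.
Check = Callable[[str], bool]


class RolloutHalted(Exception):
    """Raised when a server fails one or more post-change checks."""

    def __init__(self, server: str, failed: List[str]) -> None:
        super().__init__(f"{server} failed checks: {', '.join(failed)}")
        self.server = server
        self.failed = failed


def rolling_deploy(servers: Iterable[str],
                   apply_change: Callable[[str], None],
                   revert_change: Callable[[str], None],
                   checks: Dict[str, Check]) -> List[str]:
    """Apply a change one server at a time, verifying each before continuing.

    On the first failed check, the change is reverted on that server and the
    rollout stops, so later servers are never touched.
    """
    completed: List[str] = []
    for server in servers:
        apply_change(server)
        failed = [name for name, check in checks.items() if not check(server)]
        if failed:
            revert_change(server)
            raise RolloutHalted(server, failed)
        completed.append(server)
    return completed
```

With only a general check such as a hypothetical `sftp_login`, a change that breaks IP authentication passes on every server and reaches the whole fleet. Adding a second entry to `checks` that exercises the IP authentication path would halt the same rollout at the first server.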

Upon discovery of the incident, Files.com reverted the inappropriate configuration change on the SFTP servers.

The root cause of this incident is twofold.

Firstly, Files.com failed to automatically monitor and validate the correct operation of the subsystem that provides for cryptographic authentication of IP addresses on SFTP servers. While an outage of this subsystem does not take SFTP down entirely, it is functionally equivalent to a full SFTP outage for customers who require IP Whitelisting or IP Geolocation.

Secondly, Files.com failed to provide feedback to the engineers who developed and deployed the original configuration change targeted at the HTTP servers to let them know that the change would also be applied to SFTP servers.

Files.com will be developing two major improvements to its processes as a result of this incident. First, Files.com will implement additional detection and monitoring around the subsystem that provides for cryptographic authentication of IP addresses on SFTP servers. Second, Files.com will develop a system to provide feedback to its infrastructure engineers about exactly which servers will be affected by a configuration change before that change is approved.
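As an illustration of the second improvement, the sketch below resolves a change's targeting rule against a server inventory before approval and flags any server roles the author did not intend to touch. The `Server` type, the `preview_targets` function, and the faulty selector in the usage note are all hypothetical. The actual logic error is not described in this report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Server:
    name: str
    role: str


def preview_targets(inventory: Iterable[Server],
                    selector: Callable[[Server], bool],
                    intended_roles: Set[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return the servers a selector matches, grouped by role, together with
    a sorted list of matched roles that were not intended."""
    by_role: Dict[str, List[str]] = {}
    for server in inventory:
        if selector(server):
            by_role.setdefault(server.role, []).append(server.name)
    unexpected = sorted(set(by_role) - intended_roles)
    return by_role, unexpected
```

For example, an invented selector such as `lambda s: s.role.endswith("tp")` is meant to match `http` servers but also matches `sftp` servers. Running `preview_targets` with `intended_roles={"http"}` would report `sftp` as unexpected, giving a reviewer the chance to reject the change before rollout.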

Both of these improvements will require substantial engineering work and are not yet complete. We look forward to completing them in the coming quarter. We are hugely disappointed by the downtime, and we will work hard to implement the additional layers of protection needed to avoid similar incidents in the future.

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Dec 21, 2023 - 16:18 PST

Resolved
All services have been restored and are operating normally.

We have resolved a major outage of the SFTP service on Files.com in all regions. This incident did not impact other network services such as API, FTP, WebDAV, AS2, and others. The SFTP service was partially down from 10:40 p.m. PST 12/04/23 to 4:20 a.m. PST on 12/05/23. A more extensive SFTP interruption occurred from 4:20 a.m. to 6:29 a.m. for a total of 129 minutes impacting some, but not all, customers. Customers with certain region, IP, custom namespace, or other requirements were most likely to be impacted.

We will follow up with an Incident Report within ten (10) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.
Posted Dec 05, 2023 - 06:41 PST
Identified
We are continuing to investigate this issue.

We identified a configuration error. We have made a change that we believe has solved this configuration error. SFTP issues may be resolved for some connections.

We will post an update as soon as the issue has been identified and a fix is being implemented. If you need additional assistance, please do not hesitate to contact our Customer Support team by email. Thank you for your continued patience.
Posted Dec 05, 2023 - 06:23 PST
Investigating
SFTP only: We are investigating a major outage of the SFTP service on Files.com in all regions.

This incident does not impact other network services such as API, FTP, WebDAV, AS2, and others.

If you have an urgent need to access Files.com, we recommend using FTP in lieu of SFTP.

We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.
Posted Dec 05, 2023 - 04:51 PST
This incident affected: SFTP.