Connection failures over SFTP, FTP, and WebDAV for recently logged in users attempting new connections
Incident Report for Files.com
Postmortem

On August 2nd, 2024, at 7:35 AM PT, Files.com correlated multiple customer tickets reporting ‘Connection failures over SFTP, FTP, and WebDAV for recently logged in users attempting new connections’ and declared an incident. The Incident Management Team (IMT) convened and immediately began investigating.

The ‘Connection failures over SFTP, FTP, and WebDAV for recently logged in users attempting new connections’ issue was resolved on August 2nd, 2024, at 7:41 AM PT, returning the platform to full functionality.

At 6:52 AM PT on August 2nd, Files.com made a routine code deployment that introduced a bug preventing more than one session from being opened via the SFTP, FTP, or WebDAV protocols. Files.com reverted this deployment at 7:41 AM PT, restoring proper functionality. This resulted in 49 minutes of degraded performance for many customer use cases.

It is common for automated and ad-hoc processes to use several connections at once when communicating via SFTP or FTP. During the degraded period, only one of those connections was likely to work. Depending on the exact software in use, your process might have failed to run, or it might have continued with only a single connection.

In either case, the situation was clearly unacceptable because it likely broke a number of critical customer workflows.

The fix for the bug was simple and involved a one-line change. The bug was not caught originally because we had not considered testing multiple simultaneous connections in our testing environment.

While we are disappointed by the original bug making it past our testing pipeline, our true disappointment relates to our systems that monitor and alert on the status of our production environment. If our monitoring had operated perfectly, we would have solved the original bug in 2 minutes, not 49 minutes.

This incident revealed an interesting set of weaknesses in our monitoring systems. First, our automated testing platform, which tests our production environment, did not attempt multiple simultaneous connections when testing SFTP.

In this incident, the failures occurred only when attempting multiple simultaneous connections. We will update our automated testing platform to attempt multiple simultaneous connections in the future.
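
As an illustration only, a check for multiple simultaneous connections could look like the Python sketch below. This is not Files.com's actual testing code. The function name, the FakeServer stand-in, and the session limit are all hypothetical, chosen so the sketch runs without a network. The idea is to open several sessions with the same credentials, hold them all open at once, and count how many succeeded.

```python
from typing import Callable, List


def count_simultaneous_sessions(open_session: Callable[[], object], attempts: int = 3) -> int:
    """Open `attempts` sessions, holding each one open, and return how many succeeded.

    `open_session` is any callable that returns an object with a close() method
    and raises an exception when the login or connection is refused.
    """
    held: List[object] = []
    try:
        for _ in range(attempts):
            try:
                held.append(open_session())
            except Exception:
                # A refused or failed login counts as a failed attempt.
                continue
        return len(held)
    finally:
        # Always release the sessions that were opened.
        for session in held:
            session.close()


class _FakeSession:
    def __init__(self, server: "FakeServer") -> None:
        self.server = server

    def close(self) -> None:
        self.server.open_count -= 1


class FakeServer:
    """Hypothetical stand-in for a real endpoint; `max_sessions` imitates a session cap."""

    def __init__(self, max_sessions: int) -> None:
        self.max_sessions = max_sessions
        self.open_count = 0

    def connect(self) -> _FakeSession:
        if self.open_count >= self.max_sessions:
            raise ConnectionError("session limit reached")
        self.open_count += 1
        return _FakeSession(self)


if __name__ == "__main__":
    buggy = FakeServer(max_sessions=1)    # imitates the behavior during the incident
    healthy = FakeServer(max_sessions=10)
    print(count_simultaneous_sessions(buggy.connect))    # fewer sessions than attempts
    print(count_simultaneous_sessions(healthy.connect))  # every attempt succeeds
```

A production check would pass a callable that opens a real SFTP, FTP, or WebDAV session with test credentials, and would alert whenever the returned count is lower than the number of attempts.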

Additionally, when responding to this incident, we discovered that the original bug occurred in a section of server-side code that was excluded from reporting to Sentry, a platform we use for exception tracking and real-time alerting. This exclusion was in error. As a result, our on-call team was not immediately paged as it should have been.

We have updated our code to ensure that future bugs in this part of the code are reported to Sentry immediately, so that our on-call team is notified right away if a similar incident occurs.

To cover the possibility of Sentry alerts failing to fire in the future, we have added belt-and-suspenders alerting that looks for spikes in 5xx HTTP error codes from our web proxy layer that don’t have a corresponding alert in Sentry. This provides a backup mechanism to ensure that our on-call team will be paged in the future in a situation like this one.
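
The logic of such a backup alert can be sketched in a few lines of Python. This is a simplified illustration, not our actual alerting rule. The function names, the per-minute counts, the spike factor, and the minimum count are all assumptions made for the example.

```python
from typing import Sequence


def is_5xx_spike(current_count: int, recent_counts: Sequence[int],
                 factor: float = 3.0, min_count: int = 20) -> bool:
    """Return True when the current 5xx count is well above the recent average.

    `factor` and `min_count` are hypothetical thresholds; `min_count` keeps
    a handful of errors during a quiet period from counting as a spike.
    """
    baseline = sum(recent_counts) / len(recent_counts) if recent_counts else 0.0
    return current_count >= min_count and current_count > factor * baseline


def backup_page_needed(current_count: int, recent_counts: Sequence[int],
                       sentry_alert_fired: bool) -> bool:
    """Page on-call only when a 5xx spike has no corresponding Sentry alert."""
    return is_5xx_spike(current_count, recent_counts) and not sentry_alert_fired


if __name__ == "__main__":
    quiet_minutes = [2, 3, 1, 2]  # hypothetical 5xx counts per minute from the proxy layer
    # Spike with no Sentry alert: the backup path should page.
    print(backup_page_needed(50, quiet_minutes, sentry_alert_fired=False))
    # Same spike, but Sentry already alerted: no second page is needed.
    print(backup_page_needed(50, quiet_minutes, sentry_alert_fired=True))
```

Paging only when Sentry is silent keeps the backup from duplicating pages that the primary path has already sent.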

The root cause of this issue was Files.com’s failure to have a robust, multi-layered monitoring system to detect production failures and alert our on-call team.  We have already implemented multiple mitigations at different layers to reduce the odds of a similar issue occurring in the future.

We promise a system that works perfectly, all of the time, and today we failed to deliver that to you.  Our entire engineering team is working hard to prevent issues like this one from occurring in the future. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Aug 09, 2024 - 10:22 PDT

Resolved
We have resolved an issue causing connection failures over SFTP, FTP, and WebDAV for a small segment of users who had very recently logged in successfully using those protocols.

This incident occurred between 6:52 AM PT and 7:41 AM PT on August 2nd, 2024, and was resolved by reverting a recent code deploy.

During this time window, only users who tried to log in to SFTP, FTP, and WebDAV with the same credentials multiple times would have experienced login failures when trying to open new connections.

We are still compiling a final Root Cause Analysis for this incident, which we will post here when it is complete.
Posted Aug 02, 2024 - 08:00 PDT