FTP Degraded – All Regions

Incident Report for Files.com

Postmortem

On November 19 from 8:49am to 10:30am PST, we experienced an issue with our FTP services that resulted in elevated error rates and, for some (but not all) customers, a complete inability to connect to FTP. This issue did not affect any other network services at Files.com, such as SFTP, WebDAV, AS2, or our API; it was specific to FTP.

This issue affected approximately 12.5% (1/8th) of all source IP addresses connecting to Files.com FTP.

To explain how this issue affected only 1/8th of source IP addresses, and to give a sense of the scale we operate at regarding FTP, some background on Files.com’s FTP services is helpful.

Files.com serves over 4,000 customers across a variety of file transfer protocols and paradigms from a globally distributed multi-tenant architecture.

Each file transfer service is implemented and deployed individually so that they can be updated, monitored, and restarted on an individual basis without affecting other network services. For example, our FTP service is not co-located in any way with our SFTP service.

For FTP specifically, we operate a total of 16 server machines in 7 global regions, and on each server machine we operate two different FTP daemons in a Blue/Green configuration. This allows us to deploy software changes to FTP without disrupting existing connections. And we do in fact regularly deploy software changes to FTP, most of which go entirely unnoticed by our customers.
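The Blue/Green pair on each machine can be sketched roughly as follows. This is an illustrative model only; the class and function names are assumed for the example and are not Files.com's actual implementation. The idea is that a new release goes to the idle daemon, which then begins taking new connections while the old daemon drains its existing ones.

```python
class DaemonSlot:
    """One of the two FTP daemon instances on a machine."""
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.accepting = True  # whether this slot receives NEW connections

def deploy(blue, green, new_version):
    """Deploy new_version to whichever slot is idle, then flip which slot
    receives new connections. Existing connections on the old slot are not
    dropped; they simply drain as clients finish."""
    old, idle = (blue, green) if blue.accepting else (green, blue)
    idle.version = new_version
    idle.accepting = True   # new connections now land on the new release
    old.accepting = False   # old release keeps serving until it drains
    return idle
```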

In front of those 32 FTP servers (16 machines x 2 instances per machine), we also operate 20 front-end proxy servers which serve as the termination point for the nearly 2,000 dedicated IP addresses that we host on behalf of our customers.

Background

This incident affected only certain combinations of connectivity between the front-end proxy servers and the back-end FTP servers, and that's why it only affected 12.5% of source IP addresses.

To explain this more, first you need to understand a little bit about the FTP protocol.

The FTP protocol, originally developed in 1971, is a legacy protocol. It was designed long before the idea of load balancing existed, and as a result it is difficult to load balance reliably.

This is one of the main reasons we generally recommend against using FTP at all. Instead, we recommend modern connectivity methods such as our CLI, SDKs, API, Files.com apps, and direct integrations with platforms such as Boomi, MuleSoft, Zapier, and more.

Nevertheless, we understand that a lot of customers have legacy business processes that are built on legacy technologies such as FTP, and therefore, we do our best to support it.

The specific reason that FTP is hard to load balance is that FTP does not use a single TCP connection between the client and server. Instead, FTP creates a control connection (one TCP connection between the client and server) and then separate data connections for each file being transferred.
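To make the control/data split concrete, here is a small sketch (illustrative only, not Files.com code) of how a passive-mode FTP client learns where to open a data connection: the server answers the PASV command on the control connection with an address and port, and the client then opens a separate TCP connection to that address for the actual transfer.

```python
import re

def parse_pasv(reply):
    """Parse a '227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)' reply
    from the FTP control connection into a (host, port) pair."""
    nums = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    h1, h2, h3, h4, p1, p2 = map(int, nums.groups())
    host = f"{h1}.{h2}.{h3}.{h4}"
    port = p1 * 256 + p2  # the port is encoded as two bytes
    return host, port

# The client would now open a NEW TCP connection to this host:port
# for the file transfer, separate from the control connection.
host, port = parse_pasv("227 Entering Passive Mode (192,0,2,10,197,143)")
# host == "192.0.2.10", port == 50575
```

Every one of those data connections must reach the same backend server that holds the client's control session, which is the crux of the load balancing problem described below.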

Incidentally, this design also hurts the performance of FTP, because it requires separate TLS and TCP connection initiations for every single file in a batch of transfers. More modern protocols, like the ones mentioned above, can multiplex multiple transfers over a single TCP connection.

FTP's separate data connections are what create the major challenge for load balancing: all of a client's data connections must reach the same backend FTP server as its control connection. This requires a load balancing strategy that pins each client to a single backend server. Most load balancers don't do this by default, because it isn't what you'd usually want in a load balancing arrangement.

We implement FTP load balancing using a capability called balance_source in the popular HAProxy software. It computes a hash of the client's source IP address and uses that hash to route the client's FTP connections to a specific backend server, drawn from a maintained list of backends.
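Conceptually, source-IP hashing works like the sketch below. This illustrates the general technique only; it is not HAProxy's actual hash function or our configuration, and the backend names are made up.

```python
import hashlib

# Hypothetical backend list; names are illustrative only.
BACKENDS = [f"ftp-backend-{i}" for i in range(32)]

def pick_backend(source_ip, backends=BACKENDS):
    """Map a source IP deterministically to one backend, so that the
    control connection and all data connections from that IP land on
    the same server."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(backends)
    return backends[index]
```

Because the mapping is deterministic, a backend that is broken but wrongly listed as healthy blackholes every connection from the source IPs that hash to it: a given fraction of misreported backends translates into roughly the same fraction of source IPs failing entirely, rather than a fraction of requests spread across all clients.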

We have a sophisticated health check apparatus that maintains that list of backend servers using health metrics about each backend server. This ensures that our load balancers generally only send traffic to FTP servers which are healthy.

Explanation of this Incident

In this incident, one eighth of our backend FTP servers were marked as healthy when, in fact, they were not. This caused about 1/8th of FTP traffic to be directed to those servers. With other protocols, this would have affected roughly 1/8th of all requests from all customers; due to FTP's load balancing design, however, it instead affected all FTP connections from 1/8th of source IP addresses.

How did these servers mistakenly report as healthy when they were not? The cause was a bug in our health check software, stemming from an invalid assumption about which FTP server process was running in the "most recent" deployment position of the Blue/Green pair.

When we wrote the software, we made the mistaken assumption in the code that, on Linux, server processes with higher process ID numbers would always be newer than processes with lower process ID numbers. In fact, that's not true. A Linux system only has about 65,000 process ID numbers available, and on long-running systems such as our servers, which often stay up for months at a time, the process IDs can wrap back around.
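One way to avoid this class of bug, sketched below as an assumed alternative (not necessarily the exact fix we shipped), is to compare process start times rather than PIDs. On Linux, field 22 of /proc/&lt;pid&gt;/stat is the process's start time in clock ticks since boot, which increases monotonically even when PIDs wrap around.

```python
def parse_start_ticks(stat_contents):
    """Extract field 22 (starttime, in clock ticks since boot) from the
    contents of /proc/<pid>/stat. The command name (field 2) may itself
    contain spaces or parentheses, so split on the LAST closing paren."""
    rest = stat_contents.rsplit(")", 1)[1].split()
    return int(rest[19])  # field 22 overall = 20th field of the remainder

def newer_pid(pid_a, pid_b):
    """Return whichever process actually started more recently,
    regardless of PID ordering (Linux only)."""
    def ticks(pid):
        with open(f"/proc/{pid}/stat") as f:
            return parse_start_ticks(f.read())
    return pid_a if ticks(pid_a) > ticks(pid_b) else pid_b
```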

So our health check software misinterpreted which of the two FTP services on each server was the newest, and this occurred on 4 out of our 32 FTP servers.

We corrected this problem in our health check software and we do not expect this issue to recur.

Monitoring and Incident Resolution Time

This incident lasted one hour and 41 minutes.

The selective impact of this issue unfortunately also caused it to escape our monitoring systems.

Because our monitoring systems experienced 100% success during this time, we did not detect that connections to FTP from one out of eight source IP addresses were failing. It is not clear how we could have better detected this situation through monitoring.

We ultimately became aware of this issue through our regular customer support channel. Unfortunately, it took our customer support team over 30 minutes to confirm and escalate this incident after first being alerted to it, and this increased the time to resolution.

Similarity to the Incident from November 13

Although this incident seems similar to the incident on November 13, which also affected FTP, this incident has a different root cause. Please read that incident’s Root Cause Analysis document to learn more.

Stability and Uptime, Generally

Any time we have two separate incidents affecting the same service within a short period of time, customers often ask whether some systemic problem with our business, our infrastructure, our software, or our processes makes the platform unstable.

One of the reasons that we write such detailed root cause analysis reports is to show that these two incidents were not caused by the same root cause. We also hope these reports help demonstrate that neither root cause stems from a cultural or systemic problem at Files.com.

When comparing Files.com to other systems, it's important to compare apples to apples. Files.com is a continuously updated, multi-tenant cloud service that never schedules downtime. We are designed for 24/7/365 operation, all while performing regular performance, feature, and security updates. It is not a fair comparison to compare the uptime of Files.com to the uptime of an on-premises system where updates are never installed.

We do our own independent monitoring of many of the cloud providers for file transfer, and according to our statistics, Files.com exceeds these other providers on many of the metrics we measure relating to uptime, speed, and performance.

We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.

Posted Nov 25, 2024 - 10:32 PST

Resolved

We have resolved elevated error rates on the FTP service on Files.com in all regions. This incident did not impact other network services such as API, SFTP, WebDAV, AS2, and others.

This incident occurred between the times of 8:49am PST and 10:30am PST.

We are compiling a Root Cause Analysis for this incident, which we will post here.
Posted Nov 19, 2024 - 08:49 PST