Building a user-space, failover TCP implementation
Before I bore you with my life story (for the past week) and all the gory details, if you just want to check out the code, you can find it at https://github.com/akmistry/ftcp.
Warning, it’s horrible. You might want to gouge your eyes out. Don’t say I didn’t warn you.
But why?
The impetus for this little project came a few weeks ago. I was writing a new-ish server using an old protocol (and existing library) which assumes long-lived connections. The protocol is stateful, with no ability to resume a session after a broken TCP connection. However, since I was writing a new server, I want to run on modern infrastructure, including as a pod on Kubernetes. In this world, restarting the server is a common occurence, whether it be for upgrading the binary, migrating to a node, etc. This is at odds with a stateful protocol which could have sessions which need to be maintained for days/weeks.
To avoid disrupting the session, I would need to find a way to snapshot the session state and migrate it to another server. I could do this for the application protocol, albeit by re-writing it, but this doesn’t address the fact that the session is tied to a specific TCP connection.
TCP is already a fairly robust protocol, and improvements such as MPTCP help cope with modern networks and issues of roaming and IP address changes. But sessions are still tied to stateful endpoints. So I wondered, could a TCP implementation be written which could migrate TCP state from one machine to another, maintaining the session. A cursory web search suggests there is a fair amount of research on this. And I came across TCPCP which does this on Linux. But of course, re-inventing the wheel is much more fun, so I ignored all the existing research and gave myself a week to try out an idea.
Overview
The goal is fairly simple: allow a hot replica to take over a TCP connection if the primary has failed for some reason. In this model, primary failure is random and unpredictable, so the replica needs to be able to take over when it detects the failure (say, through a heartbeat mechanism).
Since there’s no expectation of graceful handover, the connection state needs to be replicated in real-time as packets are received, and the the application interacts with the TCP stack. This means replicating the contents of the send and receive buffers, as well as any additional state, such as the TCP state machine (RFC 9293 3.3.2).
As you can imagine, replicating data received and sent in real-time is very expensive, and would require at least the sum of the application’s combined inbound/outbound bandwidth in outbound replication bandwidth. In addition, the replication needs to be done synchronously and block either application interaction with the stack, or responses to the client. I’ll provide an example later.
Replicating state
The model for state replication is inspired by Conflict-free replicated data types (CRDTs). CRDTs are a class of data structures which can be modified by multiple nodes and replicated, such that all conflicts have a well-defined resolution and all replicas will eventually converge to the same result.
As an example, imagine a simple integer, X. If that integer is changed on two different nodes and that change is replicated, which version wins?
Node 1 Node 2
X = 7 X = 13
<----------->
X = ?
Here, we can define the resolution as “largest value wins”. Hence, after replication, both Node 1 and Node 2 will determine that X = 13. Despite being independently modified by two different nodes, they both converge (eventually) on the same value. This is the grow-only counter example on Wikipedia.
Buffers
It turns out, parts of the TCP state map very well to this notion of well-defined conflict resolution. Take the receive buffer as an example:
Node 1 buffer:
0.......8.......16
XXXXXXX [5, 11]
Node 2 buffer:
0.......8.......16
YYYYYYYY [7, 14]
Node 1 has data between the sequence numbers [5,11], and node 2 has [7, 14]. If this state is replicated, what is the merged state? Once data has been received for a specific sequence number, it is immutable and won’t ever change. Therefore, the common data on both nodes, [7, 11], will be identical. So the merged buffer is just the union of the data in the two nodes:
Merged:
0.......8.......16
XXXXXXXYYY [5, 14]
If you’re paying attention, you’ll notice I just said something completely wrong. In TCP, sequence numbers are 32-bit integers, and will wrap around after 4GiB of data is sent/received. So the idea that data at a specific sequence number is immutable, is wrong. ftcp resolves this by tracking 64-bit logical sequence numbers, and truncating to 32-bits when written to the wire. This will still wrap around, at 16EiB, but I can live with that.
For the received buffer, the merged state above isn’t what ftcp does. Tracking a received buffer in TCP needs to account for not just any new data received, but also what data has been read (and drained) by the application. Data which has been read can be discarded, hence the merged buffer in the example above is just node 2’s buffer, discarding bytes [5, 6].
A similar story can be made about the send buffer. The merged buffer is just the union, but if either node has discarded data at the start of the buffer, this implies that that data has been ACK’s by the remote end, and can be discarded.
State machine
The TCP state machine also partially maps to a CRDT. If we look at RFC 9293, we can see that once the connection is established, the state transitions form a directed acyclic graph (DAG). If states are represented as integers, with larger integers as later states in the DAG, the “grow-only counter” can be used to synchronise the state transitions. There is complexity before the connection is established, with the SYN-ACK three-way handshake. But that can be resolved by dropping the connection, since we’re mainly concerned with maintaining active connections.
Isn’t this complicated?
Since the TCP state is being replicated from an active primary to a inactive replica, why bother dealing with conflict-free replication? Why not just let data from the primary “win” any conflict? A couple of reasons:
- It is possible for replication state to become de-synchronised. A trivial example is when a replica fails and a new one needs to take it’s place. If the new replica is receiving active replication, as well as back-filling state, the back-fill and active replication state can conflict.
- Having run globally distributed production systems, one important lesson I’ve learned is that anything can fail in ways you can’t predict. We could try and enumerate all the failure cases and possible replication states. But that can be difficult. It’s just easier to assume the worst and handle it.
When to replicate?
Because a TCP connection maintains a stateful session with a remote end, it’s not just a question of what state to replicate, but when?
As an example, take the situation when a packet a received from the remote end. The TCP stack will append the new data to the end of the received buffer, and then send an ACK to signal that data has been received. Here, the new data must be replicated BEFORE the ACK is sent. To illustrate why, consider the following example:
Client Server Replica
Seq=7, len=2 ------------>
<--------------------Ack=9
<server dies>
-- <replication, if the server hadn't died>
Seq=9, len=2 ------------>
<replica takes over>
<-------------------------------------------Ack=7
Here, if the replication was asynchronous and happened after the server sent Ack=9, the client has probably seen the ACK and discarded any data before Seq=9. But if the replica hasn’t seen the (Seq=7, len=2) data, it will respond to the client with Ack=7. Since the client has already discarded the data, it can’t retransmit, and doesn’t expect to be able to since it has received an ACK.
There are several cases of this need for synchronous replication. And each time synchronous replication is necessary, it delays packets to the client, increasing latency. TCP hides some of the effects of latency through increased window sizes (namely, throughput), but latency-sensitive applications such as terminals or remote desktops would be affected.
The cost of replication
The need for synchronous replication has a significant latency cost. But having to replicate both the send and receive buffers has a significant bandwidth cost. In the ideal case:
outbound replication bytes ~= inbound connection bytes + outbound connection bytes
Inbound replication is relatively negligible, consisting of replication responses which don’t contain any buffer data (in the common case).
Something that should be noted is that this asymmetry in required bandwidth is at odds with common network technology (read: Ethernet) which offer symmetric bandwidth.
What next?
Now that I’ve written a prototype and sorta proved the idea, what next? Well… nothing. Having thought about it more, I’m not sure this is a problem worth solving. Any new-ish application protocol (written in the past 10ish years) should have been written to be able to withstand network/server disruption, and include retry/resume logic, or have stateless request. Examples of the latter would be most internet-exposed APIs, which are REST-based and naturally stateless.
Legacy protocols are the only real use case I can think of, and I’m not sure it’s worth investing significant engineering effort to give those higher reliability beyond what is currently offered.
But if you want to explore this area, I’ve open-sourced my week-long project under a permissive 3-clause BSD license. Have a browse. Feedback can be posted on my LinkedIn post. If you share this elsewhere, let me know.