We develop Tempesta FW, a fast and scalable HTTPS reverse proxy. Tempesta FW works in the Linux kernel, as part of the TCP/IP stack, to achieve the highest performance and lowest response latency. The development started in 2014, when kernel bypass technologies for fast networking, such as DPDK and Netmap, were gaining popularity. We still believe that the decision was right for a reverse proxy, and it is quite unlikely that we will make a technological pivot to kernel bypass. Let's look at what is specific about reverse proxies and why they benefit from being in-kernel.
Recently there was a conversation about whether it makes sense to port Tempesta FW to F-Stack, a port of the FreeBSD networking stack to DPDK, or to a similar technology, e.g. mTCP. So we spent quite some time investigating the F-Stack project, and this article will reference it as an example of a user-space network stack.
The simple packet case
CloudFlare gave a good example of kernel bypass applicability in 2015. The article discusses high-speed firewall rules filtering out UDP packets on a particular port. That was well before XDP, so they compared iptables performance against DPDK and other kernel bypass approaches. Later, in 2017, they showed quite a similar scenario, but using XDP. XDP (the Linux eXpress Data Path) works in a network adapter driver hook, just after interrupt processing, even before a packet descriptor, sk_buff, is allocated. XDP programs are quite limited in their structure and size, but simple tasks like packet filtering or forwarding can be done very efficiently.
In fact, simple traffic processing, like filtering by TCP/UDP ports or IP source/destination addresses, can be done easily and efficiently with either DPDK or XDP.
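As an illustration of how little logic such a filter needs, here is a sketch of the per-packet decision: parse the Ethernet/IPv4/UDP headers of a raw frame and drop packets destined to a given port. This is our own simplified code, not DPDK's or XDP's actual API; it assumes untagged frames and ignores IPv6 and fragments.

```c
#include <stdint.h>
#include <stddef.h>

enum verdict { PASS, DROP };

/* Parse an Ethernet/IPv4/UDP frame and drop it if the UDP destination
 * port matches. A sketch of XDP/DPDK-style filtering logic; real code
 * must also handle VLAN tags, IPv6, IP fragments etc. */
static enum verdict filter_udp_port(const uint8_t *frame, size_t len,
				    uint16_t drop_port)
{
	if (len < 14 + 20 + 8)
		return PASS;			/* too short for UDP/IPv4 */
	/* EtherType at offset 12: 0x0800 is IPv4. */
	if (frame[12] != 0x08 || frame[13] != 0x00)
		return PASS;
	const uint8_t *ip = frame + 14;
	if ((ip[0] >> 4) != 4 || ip[9] != 17)	/* version 4, proto UDP */
		return PASS;
	size_t ihl = (ip[0] & 0x0f) * 4;	/* IP header length */
	if (14 + ihl + 8 > len)
		return PASS;
	const uint8_t *udp = ip + ihl;
	uint16_t dport = (uint16_t)(udp[2] << 8 | udp[3]);
	return dport == drop_port ? DROP : PASS;
}
```

A real XDP program expresses the same logic against struct xdp_md and returns XDP_DROP or XDP_PASS; a DPDK application runs it over mbufs polled from the NIC queues.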
TCP implementation maturity
OK, we're good with packet header processing and simple per-packet logic. But what if we need real TCP, for example to proxy HTTPS traffic? We can parse a TCP segment header, but we also need to handle TCP streams, including out-of-order segments, duplicated segments, overlapping segments and many other corner cases. Moreover, if we need a proxy, then we need to keep a TCP control block for both connection peers (typically a client and a server connection) to perform flow and congestion control. A robust TCP/IP stack is quite a huge task; for example, let's look at the Linux TCP/IP stack:
[linux-linus]$ find net/ipv4 net/ipv6 -name \*.[ch] |xargs wc -l|tail -1
172221 total
More than 170,000 lines of C code, and that's not even the whole code. You might reply that the Linux TCP/IP stack is a known hog and that this is why people are moving to user-space TCP/IP stacks. Having small code is good, but with TCP implementations we usually get not only tiny, but also immature, code.
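To get a feel for why robust TCP takes so much code, consider just one corner case mentioned above: trimming an arriving segment against data the receiver already holds. A hypothetical sketch (our illustration, not Linux's actual code):

```c
#include <stdint.h>

/* Given rcv_nxt (the next expected sequence number) and a segment
 * [seq, seq + len), compute how many bytes of genuinely new data it
 * carries and where they start. Handles full duplicates and partial
 * overlaps; sequence-number wraparound works out via unsigned 32-bit
 * arithmetic. */
struct seg_new {
	uint32_t off;	/* offset of new data inside the segment */
	uint32_t len;	/* number of new bytes */
};

static struct seg_new trim_segment(uint32_t rcv_nxt, uint32_t seq,
				   uint32_t len)
{
	struct seg_new r = { 0, 0 };
	uint32_t overlap = rcv_nxt - seq;	/* wraps correctly */

	if ((int32_t)overlap < 0) {
		/* Segment starts beyond rcv_nxt: out of order, all bytes
		 * are new (a real stack queues it for reassembly). */
		r.len = len;
		return r;
	}
	if (overlap >= len)
		return r;		/* full duplicate: nothing new */
	r.off = overlap;		/* skip already-received bytes */
	r.len = len - overlap;
	return r;
}
```

This handles one receive-path case; a real stack needs the same care for retransmission, SACK, PAWS, window management and dozens of other mechanisms, which is where the 170,000 lines come from.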
We can use a simple test to check a TCP/IP stack for maturity: just grep it for TCP delayed acknowledgment, quite a reasonable TCP feature to improve network performance. For example, for F-Stack:
[f-stack]$ grep -ri 'delayed.ack' *|wc -l
74
The Linux kernel:
[linux-linus]$ grep -ri 'delayed.ack' ./net/ipv4/|wc -l
51
lwIP:
[lwip]$ grep -ri 'delayed.ack' *|wc -l
13
VPP:
[vpp/src]$ grep -ri 'delayed.ack' *
vnet/tcp/tcp_types.h:#define TCP_ALWAYS_ACK 1 /**< On/off delayed acks */
[vpp/src]$ grep -r TCP_ALWAYS_ACK *
vnet/tcp/tcp_types.h:#define TCP_ALWAYS_ACK 1 /**< On/off delayed acks */
0
i.e. the constant doesn't seem to be used anywhere in the code. The size of the TCP code in VPP is quite modest compared with the Linux TCP implementation: just about 14 thousand lines of C code. More importantly, though, it doesn't provide any better performance than the Linux TCP: SPDK integrated VPP for the NVMe over TCP transport in 18.01, but since 20.07 it is deprecated, and the reason for the deprecation is that Linux io_uring is more efficient. Now let's check mTCP:
[mtcp]$ grep -ri 'delayed.ack' *|wc -l
0
Seastar:
[seastar]$ grep -ri 'delayed.ack' *|wc -l
0
[seastar]$ find . -name \*tcp\* |xargs wc -l
169 ./src/net/tcp.cc
50 ./include/seastar/net/tcp-stack.hh
2135 ./include/seastar/net/tcp.hh
75 ./demos/tcp_demo.cc
205 ./demos/tcp_sctp_server_demo.cc
279 ./demos/tcp_sctp_client_demo.cc
2913 total
OK… F-Stack uses the FreeBSD TCP/IP stack. lwIP is an old and well-developed TCP/IP stack. But Seastar is a new one, and its whole TCP code, including demos, is less than 3,000 lines. Let's look for TODO and FIXME comments in the source code of tcp.hh:
void do_time_wait() {
// FIXME: Implement TIME_WAIT state timer
...
// 3.4 fourth check the SYN bit
if (th->f_syn) {
...
if (th->f_ack) {
// TODO: clean retransmission queue
...
// FIN_WAIT_2 STATE
if (in_state(FIN_WAIT_2)) {
// In addition to the processing for the ESTABLISHED state, if
// the retransmission queue is empty, the user’s CLOSE can be
// acknowledged ("ok") but do not delete the TCB.
// TODO
...
// TIME_WAIT STATE
if (in_state(TIME_WAIT)) {
// The only thing that can arrive in this state is a
// retransmission of the remote FIN. Acknowledge it, and restart
// the 2 MSL timeout.
// TODO
...
// 4.6 sixth, check the URG bit
if (th->f_urg) {
// TODO
}
It seems a lot of TCP functionality isn’t implemented yet.
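For context, the delayed-acknowledgment policy we grepped for above is conceptually simple. Here is a rough sketch of the usual decision (our illustration, not code from any of these stacks): ACK immediately for every second full-sized segment or for out-of-order data, otherwise delay the ACK hoping to coalesce it or piggyback it on outgoing data.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified RFC 1122-style delayed-ACK policy: acknowledge at once
 * for every second full-sized segment or for out-of-order data;
 * otherwise delay (a real stack arms a short timer, typically tens of
 * milliseconds) so the ACK can ride on the next outgoing packet. */
struct ack_state {
	uint32_t unacked_segs;	/* full-sized segments since last ACK */
};

static bool must_ack_now(struct ack_state *st, bool full_sized,
			 bool out_of_order)
{
	if (out_of_order)
		return true;	/* immediate dup-ACK helps the sender's
				 * fast retransmit */
	if (full_sized && ++st->unacked_segs >= 2) {
		st->unacked_segs = 0;	/* ACK every second segment */
		return true;
	}
	return false;		/* delay: the timer will fire later */
}
```

A stack that lacks even this logic acknowledges every segment, roughly doubling the ACK traffic on bulk transfers.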
That's the reason why F-Stack also started with its own TCP/IP stack, but moved to FreeBSD's: “At the beginning of this work, F-Stack used a simple TCP/IP stack that developed by ourselves. However, with the growth of various services, this stack couldn’t meet the needs of these services while continue to develop and maintain a complete network stack will cost high. So the FreeBSD network stack was ported into F-Stack. The FreeBSD network stack provides complete features and can follow up the improvement from the community.”
Scaling and performance
While a normal operating system network stack ported to user space might show very scalable benchmarks (unfortunately, the F-Stack team didn't give precise details of their benchmark), it can deliver even worse performance than a kernel TCP/IP stack on a small number of connections.
A normal socket API implies data copying between user and kernel spaces. Being ported from the kernel to user space as is, a network stack still suffers from memory copies. As discussed in the referenced thread on the F-Stack bug tracker, the kernel bypass project is mostly about scaling on CPU cores rather than pure performance.
During our performance test of kernel TCP/IP stack scalability in a virtual environment, we hit the known issue with the Linux connection hash table. The issue was also reported and well described by CloudFlare. The core of the problem is an old-fashioned hash table with collision chains on linked lists, protected by a spin lock. Once you have too many TCP connections (in TIME-WAIT state in this case), all the CPUs get stuck locking the hash table. Even 4 CPUs spend more than 70% of their time on lock contention:
36.28% [kernel] [k] __inet_check_established
20.68% [kernel] [k] _raw_spin_lock_bh
14.76% [kernel] [k] _raw_spin_lock
11.17% [kernel] [k] native_queued_spin_lock_slowpath
9.23% [kernel] [k] __inet_hash_connect
3.14% [kernel] [k] inet_ehashfn
There is modern research on highly concurrent hash tables (see, for example, our recent study of a similar problem with high contention on a hash table on 64 cores in MariaDB). We wondered whether F-Stack did something different to make the connection hash table code more concurrent. The hash table is struct inpcbinfo, declared in freebsd/netinet/in_pcb.h and scanned, for example, by the in_pcblookup_mbuf() call from the tcp_input() function. We see quite the same read lock as in Linux in the hash lookup function:
static struct inpcb *
in_pcblookup_hash(...)
{
	struct inpcb *inp;

	INP_HASH_RLOCK(pcbinfo);
	inp = in_pcblookup_hash_locked(...);
	...
The socket API and, most importantly, the internal synchronization mechanisms must be reworked in a TCP/IP stack to deliver significantly better scalability and performance in a multi-core environment. Just using a kernel bypass technology, like DPDK or Netmap, doesn't fix the concurrency issues of an existing TCP/IP stack.
The good thing about VPP's TCP implementation is that it manages TCP connections in per-thread (per-CPU) hashes: as the network adapter chooses the queue for a particular TCP flow, the CPU bound to that queue is used for all operations on the corresponding TCP socket (see how the TCP session is retrieved using thread_index):
always_inline tcp_connection_t *
tcp_connection_get (u32 conn_index, u32 thread_index)
{
  tcp_worker_ctx_t *wrk = tcp_get_worker (thread_index);

  if (PREDICT_FALSE (pool_is_free_index (wrk->connections, conn_index)))
    return 0;
  return pool_elt_at_index (wrk->connections, conn_index);
}

static transport_connection_t *
tcp_session_get_transport (u32 conn_index, u32 thread_index)
{
  tcp_connection_t *tc = tcp_connection_get (conn_index, thread_index);

  if (PREDICT_FALSE (!tc))
    return 0;
  return &tc->connection;
}

const static transport_proto_vft_t tcp_proto = {
  // .....
  .get_connection = tcp_session_get_transport,
  // .....
};

static inline transport_connection_t *
transport_get_connection (transport_proto_t tp, u32 conn_index,
                          u8 thread_index)
{
  return tp_vfts[tp].get_connection (conn_index, thread_index);
}

transport_connection_t *
session_lookup_connection_wt4 (u32 fib_index, ip4_address_t * lcl,
                               ip4_address_t * rmt, u16 lcl_port,
                               u16 rmt_port, u8 proto, u32 thread_index,
                               u8 * result)
{
  // ......
  if (PREDICT_FALSE ((u32) (kv4.value >> 32) != thread_index))
    {
      *result = SESSION_LOOKUP_RESULT_WRONG_THREAD;
      return 0;
    }
  s = session_get (kv4.value & 0xFFFFFFFFULL, thread_index);
  return transport_get_connection (proto, s->connection_index,
                                   thread_index);
  // ....
}
The HTTP layer bottleneck
Some time ago we compared the in-kernel Tempesta FW with the DPDK-based HTTP server Seastar (see our Netdev talk). Basically, Tempesta FW provides similar speed, and there are two reasons for this:
- both servers work at the network application layer with heavy-weight and complex HTTP processing logic, so the bottleneck isn't in the network (TCP/IP) layer once the key socket API overheads are removed;
- Tempesta FW doesn't use the high-level, file-descriptor-based socket API, so many locks and queues are removed (the usual reason why Nginx and other fast user-space servers stop scaling on multi-core systems).
The same problem can be observed for Redis on top of F-Stack: a fast network layer doesn't contribute much to the final application performance.
While CloudFlare heavily adopts XDP and kernel bypass technologies, they still use the Linux TCP/IP stack for HTTP-layer processing: the real bottleneck is in the application logic, and there is no sense in moving away from a mature TCP/IP stack which provides many debugging and traffic management tools (e.g. tcpdump, tc, nftables, ipvs, eBPF and many others).
The reverse proxy case adds a lot of data structures which must be shared and updated by all the CPUs: the web cache, various statistics, connection tracking, HTTP message queues, and many, many others. You can use very efficient lock-free algorithms to access and update these data structures, but there is no magic: there are still very expensive cache coherency request-for-ownership messages between the CPUs.
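A small sketch of the usual mitigation for the statistics case (our illustration, not Tempesta FW's code): keep one counter shard per CPU, each in its own cache line, and sum the shards only on the rare read path, so the hot path never bounces a shared line between cores.

```c
#include <stdint.h>

#define MAX_CPUS  64
#define CACHELINE 64

/* Each shard lives in its own cache line, so increments from different
 * CPUs never fight over the same line (no false sharing, no storm of
 * request-for-ownership messages); only the rare read path walks all
 * the shards. */
struct percpu_counter {
	struct {
		uint64_t v;
		char pad[CACHELINE - sizeof(uint64_t)];
	} shard[MAX_CPUS];
};

static void counter_inc(struct percpu_counter *c, unsigned int cpu)
{
	c->shard[cpu % MAX_CPUS].v++;	/* local, uncontended write */
}

static uint64_t counter_read(const struct percpu_counter *c)
{
	uint64_t sum = 0;

	for (unsigned int i = 0; i < MAX_CPUS; i++)
		sum += c->shard[i].v;	/* slow path: sum all shards */
	return sum;
}
```

This trades exactness of the read (it may race with concurrent increments) for a write path that touches only CPU-local memory, which is the trade-off behind per-CPU statistics in both the Linux kernel and Tempesta FW.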
Mainstream performance extensions
With Tempesta FW we keep the Linux kernel patch as small as possible to be able to migrate to newer kernels easily. Following the mainstream code in user space seems to take quite a lot of time for the F-Stack team.
Since F-Stack doesn't seem to do much work on performance extensions of the FreeBSD TCP/IP stack, the next question is whether the FreeBSD TCP/IP stack is actually faster than the Linux one. It seems not: quick googling on the topic shows another performance comparison of Linux and FreeBSD. (Not to upset FreeBSD fans: FreeBSD can still deliver very serious network throughput; see, for example, the Netflix presentation with amazing performance numbers for FreeBSD networking.)
Back in 2009-2010 we did some work on FreeBSD performance improvements for web hosting needs. In most cases we just re-implemented mechanisms from the Linux kernel. We also considered FreeBSD as the platform for Tempesta FW (mostly because of the license), but ended up with Linux solely for performance reasons.
Kernel bypass
Speaking strictly from the performance point of view, the kernel bypass approach on its own doesn't introduce any additional features compared with the kernel space. On the other hand, the kernel space provides some benefits for reaching higher performance (most of the points, though, are about the convenience of developing high-performance network software rather than an absolute advantage):
- Interrupts, both hardware interrupts from a NIC or a disk and inter-processor interrupts (IPIs). Interrupts allow you to avoid constant 100% CPU burning and reduce power consumption. Google Snap solves this problem, but still with help from the kernel side.
- In the kernel space you can easily access the file system and any other system resources without extra copies and context switches. These days you can access the file system with io_uring in a zero-copy, asynchronous way; however, we haven't yet seen any real-life reverse proxy using this IO interface.
- As with DPDK and similar technologies, in modern Linux kernels with separate scheduling domains you can move the main network packet crunching work (done by ksoftirqd threads) to designated CPUs. This not only eliminates painful context switches but, more importantly for multi-core systems, allows you to write very efficient lock-free data structures. The kernel also provides preemption control, which makes the resulting system equally fast but more flexible.
- The kernel provides direct access to the same page table data structures that are used by the CPU memory management unit. This makes memory management more convenient, and the system can still hand unused memory to other tasks. For example, Tempesta FW targets not only big high-end servers, but also small virtual machines in a cloud, which can run Tempesta FW for content acceleration and protection alongside the backend web application.
Similar to the DPDK approach, the Linux kernel provides per-CPU threads for processing the main TCP/IP logic: the ksoftirqd kernel threads, which you can observe in ps output. Most of the code does its best not to work with shared data and to avoid contention as hard as possible. However, legacy, non-scalable code (as we saw with the TCP connection hash table) is the flip side of the maturity of operating system TCP/IP stacks, Linux and FreeBSD alike: there are big pieces of code which access remote CPU data, introducing high contention on multi-core systems.
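For example, NIC queue interrupts (and hence the ksoftirqd work they trigger) can be steered to designated CPUs with standard knobs. The IRQ number, interface name and CPU mask below are placeholders for illustration; check /proc/interrupts on your own machine:

```shell
# Keep ordinary tasks off CPUs 2-3, reserving them for packet processing
# (kernel boot parameter, takes effect after reboot):
#   isolcpus=2,3

# Steer a NIC queue interrupt (IRQ 44 here is a placeholder; see
# /proc/interrupts for the real number) to CPU 2 (bitmask 0x4):
echo 4 > /proc/irq/44/smp_affinity

# Verify where the interrupts land and where the softirq threads run:
grep eth0 /proc/interrupts
ps -e -o pid,psr,comm | grep ksoftirqd
```

With this setup the softirq processing for a given queue stays on one CPU, which is what makes per-CPU, lock-free data structures practical in the first place.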
The future
TCP is the today of web applications, but the tomorrow is QUIC. There are many QUIC stacks in user space, and it seems there is only one in-kernel implementation, from Microsoft (open source, by the way!). The list of kernel performance benefits above doesn't look so dramatic, so it does make sense to consider the kernel bypass approach for new network protocols like QUIC. On the other hand, a QUIC stack implemented in-kernel from scratch also won't have the scalability issues that TCP inherited from the old times.