Fast programming languages: C, C++, Rust, and Assembly

This article doesn’t answer which programming language is better. Instead, it discusses the most powerful tool set for development of the fastest server-side system software, such as database engines and HTTPS servers. There are several specific properties of this type of software:

Relatively large code base, 100,000 lines of C or C++ code and more. While it’s possible to write particular, the most ‘hot’ functions, in Assembly language, it’s impractical to write the whole program in Assembly.
Databases and web servers are mission-critical software – we all got used that our Linux systems with MySQL and Nginx processes work for months and years. There are simple high availability best practices mitigating the downtime due to possible crashes, but they’re the subject for another article. Meantime, it’s worth mentioning that if you really-really care about high availability, then you should build you infrastructure with an assumption that any component of your system may crash at any time, just like Facebook does this -the company deploys the recent versions of the Linux kernel as soon as they’re available.

We’ve been developing the fastest software in C, C++, and Assembly for ages. It’s not a surprise that since Rust is “focused on performance” we’re very interested in it. With a bit of skepticism though. Just remember the rise of Java programming language: there were a lot of reports that the JIT compilation produced code faster than C++. Now it’s hard to find a case, when C++ is slower than Java, see for example the benchmark game. It’s also worth mentioning that the memory garbage collection (GC) in Java leads to high tail latencies and it’s hard or even impossible to do anything with the problem. Golang can not be considered for high performance programming also due to the GC.

C or C++? Or both of them?

The C programming language dominates in the system programming. An operating system kernel is an example of one of the most sophisticated system software, not only because it deals with hardware directly, but also due to strict performance requirements. The Linux and FreeBSD kernels are written in C, as well as the other UNIX’es and Windows kernels. Let’s start the discussion from this bright example of high-performance system software.

C++ for operating systems kernel development

FreeBSD has been supporting C++ modules for a while. While the Linux kernel never supported C++, there was the Click modular router written in C++ and working as a Linux kernel module. If you’re interested in C++ applicability for the operating systems kernel development, then you can find quite good discussions in the C++ and Bare bones articles. However, there are fundamental reasons against using C++ for operating system kernel development:

You do not have libstdc++ with RTTI and exceptions in the kernel space. Actually, dynamic_cast isn’t so frequently used and there are a lot of C++ projects compiled without RTTI. If you need exceptions, then you have to port them into the kernel. libstdc++ uses basic C allocations, so it must be significantly reworked for the kernel.
You can’t use the STL and Boost libraries and, in fact, all kernels already have their own libraries. C++ introduces filesystem, threading and networking libraries, which are senseless in an OS kernel. From the other hand, the modern OSes provide advanced synchronization primitives, which are still not available in standard C++ (e.g. there is still no read-write spinlocks in C++).
The Linux kernel provides number of memory allocators (SLAB, page, vmalloc(), kmalloc(), and so on), thus you have to use placement new and/or just use the C functions for memory allocation and freeing. Aligned memory is crucial for the performance, but you need to write special wrappers to get aligned memory with new.
Strong type safety isn’t so comfortable for system programming when raw memory pointers are frequently casted to some data structures. This is debatable though: while some people are uncomfortable with frequent reinterpret_cast<Foo *>(ptr) instead of short (Foo *)ptr, the others are good with more typing and more type safety.
C++ name mangling, required for namespaces and function overloading, makes function hard to call from Assembly, so you need to use extern "C".
You have to make special code sections for static objects constructors and destructors, .ctor and .dtor correspondingly.
C++ exceptions can not cross context boundaries, i.e. you can not throw an exception in one thread and catch it in another. The operating system kernel deals with much more complex context model: there are kernel threads, user space processes entering into the kernel, deferred and hardware interruptions. The contexts can preempt each other in voluntarily or cooperative manner, so exception handling of current context could be preempted by another context. There are also memory management and contexts switching code which could conflict with exception handling code. Just like for RTTI, it’s possible to implement the mechanism in kernel, but the current standard library can not be used.
While Clang and G++ support __restrict__ extension, the official C++ standard does not support it.
Variable length arrays (VLA) are discouraged in the Linux kernel, they are still handy in some scenarios, but are completely unavailable in C++.

Thus, with C++ in the kernel space you basically have only templates, classes inheritance and some syntax sugar like lambda functions. Since system code is quite rarely requires complicated abstractions and inheritances, then does it still have sense to make effort to use C++ in the kernel space?

C++ exceptions

This is one of the most debatable C++ feature and it deserves a separate chapter. For example, the MySQL project, following to the Google coding style, doesn’t use exceptions. The Google coding style provides the good lists of pros and cons of using exceptions. Here we focus on performance aspects only.

Exceptions can improve performance when we have to handle error codes in too may places, e.g. (let the functions be inlined and very small)

        if (func_1())
            return -EINVAL;
        if (func_2())
            return -EINVAL;
        ....
        if (func_100())
            return -EINVAL;

The problem with the code is that there are extra conditional jumps. Modern CPU are pretty good with branch prediction, but it still hurts performance. In C++ we can just write

        try {
            func_1();
            func_2();
            ...
            func_100();
        } catch (...) {
            return -EINVAL;
        }

, so there are no extra conditions in the hot path. However, this isn’t for free: most of the functions in your C++ code have to have extra epilogues with a table of exceptions, which these functions can catch, and an appropriate cleanup table. The function epilogues aren’t executed in normal workflow, but they increase the size of code causing extra pollution in the CPU instruction cache. You can find great details about C++ exception handling internals in the Nico Brailovsky’s blog.

Is C++ still good?

Yes, it is. Firstly, not the whole code actually must be as fast as possible and in most of the places we don’t need custom memory allocation and don’t care about exceptions overhead. The most of the projects are developed in the user space and benefit, especially the new ones, from relatively rich C++ standard and Boost libraries (not so rich as Java’s though).

Secondly, the killing feature of C++ is that it is C. If you don’t want to use exceptions or RTTI, then you can just switch the features off. Most of C programs can be just compiled with a C++ compiler with very small changes or without any changes at all. As an example, we need only this trivial change

    $ diff -up nbody.c nbody-new.cc
        @@ -112,9 +112,9 @@ static void advance(body bodies[]){
             // ROUNDED_INTERACTIONS_COUNT elements to simplify one of the following
             // loops and to also keep the second and third arrays in position_Deltas
             // aligned properly.
        -    static alignas(__m128d) double
        -      position_Deltas[3][ROUNDED_INTERACTIONS_COUNT],
        -      magnitudes[ROUNDED_INTERACTIONS_COUNT];
        +    static double
        +      position_Deltas[3][ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16))),
        +      magnitudes[ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16)));

             // Calculate the position_Deltas between the bodies for each interaction.
             for(intnative_t i=0, k=0; i < BODIES_COUNT-1; ++i)

to compile the C program with G++ compiler. The modern C++ compilers provide C compatibility extensions like the __restrict__ keyword. You always can write the most performance critical code of a C++ program in C style. If you don’t like the STL containers with an overhead, then you can use Boost.intrusive or even port a similar container from the Linux kernel or other fast C project -in most of the cases this won’t be painful. See for example how a hash table from PostgreSQL, HTrie from Tempesta DB, and the Linux kernel read/write spinlocks (all are written in C) were used in a C++ benchmark.

The last thing which must be mentioned about development of high performance programs in C++ is template metaprogramming. It’s very exciting about the modern C++ standards that using templates you can write quite sophisticated logic which is fully computed in the compile time and costs nothing in the run time.

GOTO – the power of C

A professional tool must allow you to work with it in the most efficient way. The goal of the high-level and high-performance programming languages is to generate the most efficient machine code. Each hardware architecture supports jumps, which means that you can jump to any address by any condition. The closest abstraction for the jumps in the C and C++ programming languages is goto operator. It’s not so flexible as assembly jmp, but C compilers provide extensions which make the operator almost the full equivalent of the assembly jumps. Unfortunately, Rust doesn’t support goto, which makes it awkward in the whole class of performance crucial tasks.

We talk about parsers. Not configuration file parsers, which are perfectly done with bunch of switch and if statements, but about large and very fast parsers like an HTTP parser. You might think that this is “too narrow” or “too specific” task, but recall the parser generators, like Ragel or GNU Bison – if you develop such a parser generator, then you never know how big parsers will be generated. (By the way, Ragel extensively uses goto to generate very fast parsers.) Note also SQL parsers in every RDBMS. Actually, we can generalize the class of the tasks as large and fast finite state machines, which also includes, for example, regular expressions.

The HTTP parser in Tempesta FW is much larger than HTTP parsers in other web servers because, in addition to the basic HTTP parsing, it also does many security checks and strictly validates the input against the RFCs. Also our parser works with zero-copy data, so it also much care about data chunks. The technical details of the parser were described in our talk at the SCALE 17x conference and you can watch the talk video or the slides.

Typically, HTTP parsers are implemented as a loop over input characters and nested switch statements for allowed characters and available states. See for example ngx_http_parse_request_line() in the Nginx parser source code. In sake of brevity let’s consider a simplified version of code:

        while (++str_ptr) {
            switch (current_state) {
            case 0:
                ...
            case 1:
                ...
            ...
            case 100:
                switch (*str_ptr) {
                case 'a':
                    ...
                    current_state = 200;
                    break;
                case 'b':
                    ...
                    current_state = 101;
                    break;
                }
                break;
            case 101:
                ...
            ...
            }
        }

Assume that the parser has finished parsing of the previous chunk of data at state 100 and the current chunk of data starts from character 'b'. Regardless the switch statement optimization (it can be optimized by a compiler with a lookup table or binary search), there are 3 problems with the code:

Looking up the state 100 is still more expensive than a direct jump.
While the code for the state 101 is placed right after the code for state 100, we have to re-enter the while and switch statements, i.e. lookup the next state again, instead of just move one character further and just jump right to the next state.
Even if we always reach state 101 after the state 100, a compiler may reorganize the code in such a way that the state 101 is placed at the beginning of the switch statement while the state 100 is placed somewhere at the end.

Tempesta FW fixes all the problems using the goto operator and the GCC compiler extensions for labels as values and label attributes with the code like:

        // Use labels as values to remember the current state when we
        // exit the state machine at the end of current data chunk.
        parser->state = &&state_100;
        goto *parser->state;

        while (true) {
        state_0:
            ...
        state_1:
            ...
        // The state is placed by a compiler closer to the beginning
        // of the code.
        state_100: __attribute__((hot))
            // We still use small switches for small character sets. 
            switch (*str_ptr) {
            case 'a':
                ...
                ++str_ptr;
                goto state_200;
            case 'b':
                ...
                ++str_ptr;
                // Just fall through to the state 101.
            }
        // This state is placed by the compiler after state_100.
        state_101: __attribute__((cold))
            ...
        }

Since Rust doesn’t support the goto operator, we would need to use assembly language to implement the state machine with direct jumps and the optimal code layout.

When Assembly is easier than C

Now let’s have a look onto an example when the Assembly language not only generates faster code, but also allows to write programs in more productive way. This example is about multi-precision integer arithmetic.

Public key cryptography and elliptic curves in particular operates with big integers. The book BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic by Tom St Denis provides great details about the subject as well as C implementations for many algorithms, but for now let’s consider the basic addition of two big integers of 128 bits length on a 64-bit machine. The big integers consist of limbs, two 64-bit longs. To sum the integers we have to care about carry between the limbs, so resulting C code looks like (see 4.2.1 in the book):

        // a := a + b
        // x[0] is the less significant limb,
        // x[1] is the most significant limb.
        void s_mp_add(unsigned long *a, unsigned long *b)
        {
            unsigned long carry;

            a[0] += b[0];
            carry = (a[0] < b[0]);

            a[1] += b[1] + carry;
        }

The code is small and simple, but you probably had to think a little bit about correctness of the manipulations with carry. Hopefully, x86-64 is a CISC architecture, i.e. it provides us a lot of computational features and one of them is computations with carry, so the code above can be done in two instructions only and there is no need for the comparison:

        // Pointer to a is in %RDI, pointer to b is in %RSI
        movq    (%rdi), %r8
        movq    8(%rdi), %r9

        addq    (%rsi), %r8     // add with carry
        addc    8(%rsi), %r9    // use the carry in the next addition

        movq    (%r8), (%rdi)
        movq    (%r9), 8(%rdi)

If you look into any well-optimized cryptography library, like OpenSSL or Tempesta TLS, then you find a lot of assembly code (OpenSSL actually generates the assembly source code with Perl scripts).

Rust at a glance

At a first glance Rust is pretty well equipped to develop very efficient code: SIMD intrinsics, memory alignment, memory barriers, inline assembly. There are many comparisons of Rust with C or C++, e.g. Speed of Rust vs C or C++ Is Faster and Safer Than Rust: Benchmarked by Yandex. But, if you consider to use Rust to develop a benchmarks leading product, then you’ll probably face several obstacles plus to the absence of the goto operator:

Technically, Rust supports custom memory allocators, but there are serious limitations. It’s worth mentioning that any high-performance software uses number of ad-hoc memory allocators.
Just like C++, Rust doesn’t provide VLAs. But if C++ still can use alloca(3), Rust doesn’t provide stack allocations at all. That’s pity because the stack allocations are the cheapest ones and custom memory allocators aren’t an option due to previous point.
It seems likely/unlikely support is much less powerful than in the modern C or C++ compilers.
Reading and writing data structures from/into raw memory is doable in Rust, but requires more code than in C and even C++. Not a big deal though.
The Rust’s generics and macros are much less powerful than provided by C++ templates coupled with C macros. Although, this is also not so crucial.

The most crucial disappointments about Rust system programming is it’s limited abilities to work with raw memory, which are the other side of the memory safety.

Rust for the Linux kernel

Rust for the Linux kernel is a hot topic: any LWN thread about Rust has hundreds of comments, the last Linux Plumbers Conference 2021 had 9 talks about Rust for the Linux kernel! However, Rust is mostly considered for drivers code as discussed in LWN posts (e.g. Supporting Linux kernel development in Rust or Using Rust for kernel development). There are high-quality drivers with wide community support, but there are also many device drivers supported by the only one maintainer, the actual device manufacturer. Code of some of the drivers is even not compliant with the Linux kernel coding style. For now it looks like a particular hardware manufacturer can do with its driver whatever they want. This is already the part of the kernel which nobody cares about.

From the other hand, it’s quite unlikely that Rust can be used for the main Linux kernel code, e.g. memory management or networking. The first thing is that there are millions line of code carefully written, reviewed and tested. There is no reason just to throw all the code out and rewrite it from scratch (probably as there are also no such resources in the world). The second thing is that Rust in the default safe mode has multiple limitations for advanced concurrency code. Paul McKenny discussed many of the limitations in his So You Want to Rust the Linux Kernel? blog series.

One of the option to deal with the limitations is to use Rust in unsafe mode, so the most advantage of Rust vanishes.

Reliability and safety in C++ and Rust

This article would be incomplete without addressing reliability and safety provided by the Rust and C++ programming languages. Hopefully, Sunny Chatterjee from Microsoft recently addressed the topic on the CppCon 2020. The main benefits of Rust are memory and concurrency safety, but the modern C++ addresses the topics as well. In this presentation Sunny has addressed following 6 gaps between Rust and C++: casting, switch statements, smarter loops, smarter copying, lifetimes, and mutability. Let’s review the gaps.

Types casting is well handled by the modern C and C++ compilers with -Wall compiler option.
switch statements are also handled with the -Wall. Moreover, GCC also introduces -Wimplicit-fallthrough compiler option, which makes ‘fall through’s explicit.
smarter loops are addressed by the C++ range-based for loop since C++11.
const auto & references and fine-grained copy and move semantics take care about smart copying.
RAII provides robust lifetimes, but unfortunately doesn’t cover all the cases.
C++ const classes with or without mutable members and const references and variables provide fine-grained mutability, bust also don’t cover all the cases.

The presentation concludes with “C++ Core Guidelines has rules covering many of the big-ticket items” and modern C and C++ compilers are tending to implement the missed checks. It’s also worth mentioning that the C/C++ world effectively several powerful technologies to improve the code quality:

Address sanitizers (e.g. ASAN are built into the modern versions of LLVM and GCC compilers) and catch out-of-bound memory accesses.
GCC and Clang provide powerful static analyzers, which efficiently catch many bugs like out of bounds access, types casting and many more. The Clang’s one seems more mature though.
Both of the compilers also provide extensions like __builtin_object_size, which are helpful to make memory operations like memcpy()safe, read for example the LWN article about memcpy() hardening in the Linux kernel.
There is powerful fuzzer AFL++. Clang also provides the built-in LibFuzzer library. The GCC’s fuzzer was primary developed for the Linux kernel syzkaller and thus it isn’t so easy to use.

After all, you still can make bugs with the unsafe code in Rust, just like working with raw pointers in C++.

The Computer Language Benchmarks Game

Since we’re talking about performance, we must take a look at the Computer Language Benchmark Game. To compare performance of different languages you need to implement the same task in all the languages in the same fashion. An this isn’t what the people usually do, so it’s hard to find real life code examples in different languages which allow you to compare oranges with oranges, not oranges with apples. While the Benchmarks game is a game, comparing implementations of small specific tasks, it’s one of the best what we have. The C++11 vs Rust comparison is one more comparison of equal implementations in C++ and Rust. There is no Assembly language in the Benchmarks game, but there are Rust, C++ for G++ compiler, and two C, for Clang and GCC compilers correspondingly. At the moment of writing this article performance of the implementations are (in seconds, less is better):

Problem	G++	GCC	Clang	Rust
fannkuch-redux	8.07	7.53	9.45	6.88
spectral-norm	0.72	0.72	0.72	0.71
n-body	4.09	4.30	3.31	3.31
binary-trees	1.12	1.78	1.88	1.20
fasta	1.04	0.82	0.88	0.91
pidigits	0.71	0.73	0.81	0.74
mandelbrot	0.84	1.27	2.09	0.92
regex-redux	1.08	0.80	0.81	1.28
reverse-complement	0.63	0.87	0.98	0.75
k-nucleotide	1.93	3.71	6.19	3.29

There is only one test, the first one, where Rust is more or less significantly outperforms C and C++ implementations.

Performance analysis

You might be curious why the fannkuch-redux implementation in Rust is faster than the C implementation? So are we. The copies of both the programs are under the cuts.

The C program

// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Jeremy Zerfas
// Based on the Ada program by Jonathan Parker and Georg Bauhaus which in turn
// was based on code by Dave Fladebo, Eckehard Berns, Heiner Marxen, Hongwei Xi,
// and The Anh Tran and also the Java program by Oleg Mazurov.

// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
#define PREFERRED_NUMBER_OF_BLOCKS_TO_USE 12

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

// intptr_t should be the native integer type on most sane systems.
typedef intptr_t intnative_t;


int main(int argc, char ** argv){
   const intnative_t n=atoi(argv[1]);

   // Create and initialize factorial_Lookup_Table.
   intnative_t factorial_Lookup_Table[n+1];
   factorial_Lookup_Table[0]=1;
   for(intnative_t i=0; ++i<=n;)
      factorial_Lookup_Table[i]=i*factorial_Lookup_Table[i-1];

   // Determine the block_Size to use. If n! is less than
   // PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to prevent
   // block_Size from being set to 0. This also causes smaller values of n to
   // be computed serially which is faster and uses less resources for small
   // values of n.
   const intnative_t block_Size=factorial_Lookup_Table[n]/
     (factorial_Lookup_Table[n]<PREFERRED_NUMBER_OF_BLOCKS_TO_USE ?
     1 : PREFERRED_NUMBER_OF_BLOCKS_TO_USE);

   intnative_t maximum_Flip_Count=0, checksum=0;

   // Iterate over each block.
   #pragma omp parallel for \
     reduction(max:maximum_Flip_Count) reduction(+:checksum)
   for(intnative_t initial_Permutation_Index_For_Block=0;
     initial_Permutation_Index_For_Block<factorial_Lookup_Table[n];
     initial_Permutation_Index_For_Block+=block_Size){

      intnative_t count[n];
      int8_t temp_Permutation[n], current_Permutation[n];


      // Initialize count and current_Permutation.
      count[0]=0;
      for(intnative_t i=0; i<n; ++i)
         current_Permutation[i]=i;
      for(intnative_t i=n-1,
        permutation_Index=initial_Permutation_Index_For_Block; i>0; --i){
         const intnative_t d=permutation_Index/factorial_Lookup_Table[i];
         permutation_Index=permutation_Index%factorial_Lookup_Table[i];
         count[i]=d;

         for(intnative_t j=0; j<n; ++j)
            temp_Permutation[j]=current_Permutation[j];
         for(intnative_t j=0; j<=i; ++j)
            current_Permutation[j]= j+d<=i ?
              temp_Permutation[j+d] : temp_Permutation[j+d-i-1];
      }


      // Iterate over each permutation in the block.
      const intnative_t last_Permutation_Index_In_Block=
        initial_Permutation_Index_For_Block+block_Size-1;
      for(intnative_t permutation_Index=initial_Permutation_Index_For_Block; ;
        ++permutation_Index){

         // If the first value in the current_Permutation is not 1 (0) then
         // we will need to do at least one flip for the current_Permutation.
         if(current_Permutation[0]>0){

            // Make a copy of current_Permutation[] to work on. Note that we
            // don't need to copy the first value since that will be stored
            // in a separate variable since it gets used a lot.
            for(intnative_t i=0; ++i<n;)
               temp_Permutation[i]=current_Permutation[i];

            intnative_t flip_Count=1;

            // Flip temp_Permutation until the element at the first_Value
            // index is 1 (0).
            for(intnative_t first_Value=current_Permutation[0];
              temp_Permutation[first_Value]>0; ++flip_Count){

               // Record the new_First_Value and restore the old
               // first_Value at its new flipped position.
               const int8_t new_First_Value=temp_Permutation[first_Value];
               temp_Permutation[first_Value]=first_Value;

               // If first_Value is greater than 3 (2) then we are flipping
               // a series of four or more values so we will also need to
               // flip additional elements in the middle of the
               // temp_Permutation.
               if(first_Value>2){
                  intnative_t low_Index=1, high_Index=first_Value-1;
                  // Note that this loop is written so that it will run at
                  // most 16 times so that compilers will be more willing
                  // to unroll it. Consequently this won't work right when
                  // n is greater than 35. This would probably be the
                  // least of your concerns since 21! won't fit into 64
                  // bit integers and even if it did you probably wouldn't
                  // want to run this program with a value that large
                  // since it would take thousands of years to do on a
                  // modern desktop computer. ;-)
                  do{
                     const int8_t temp=temp_Permutation[high_Index];
                     temp_Permutation[high_Index]=
                       temp_Permutation[low_Index];
                     temp_Permutation[low_Index]=temp;
                  }while(low_Index+++3<=high_Index-- && low_Index<16);
               }

               // Update first_Value to new_First_Value that we recorded
               // earlier.
               first_Value=new_First_Value;
            }


            // Update the checksum.
            if(permutation_Index%2==0)
               checksum+=flip_Count;
            else
               checksum-=flip_Count;

            // Update maximum_Flip_Count if necessary.
            if(flip_Count>maximum_Flip_Count)
               maximum_Flip_Count=flip_Count;
         }


         // Break out of the loop when we get to the
         // last_Permutation_Index_In_Block.
         if(permutation_Index>=last_Permutation_Index_In_Block)
            break;

         // Generate the next permutation.
         int8_t first_Value=current_Permutation[1];
         current_Permutation[1]=current_Permutation[0];
         current_Permutation[0]=first_Value;
         for(intnative_t i=1; ++count[i]>i;){
            count[i++]=0;
            const int8_t new_First_Value=current_Permutation[0]=
              current_Permutation[1];

            for(intnative_t j=0; ++j<i;)
               current_Permutation[j]=current_Permutation[j+1];

            current_Permutation[i]=first_Value;
            first_Value=new_First_Value;
         }
      }
   }


   // Output the results to stdout.
   printf("%jd\nPfannkuchen(%jd) = %jd\n", (intmax_t)checksum, (intmax_t)n,
     (intmax_t)maximum_Flip_Count);

   return 0;
}

The Rust program

// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Cliff L. Biffle, translated from Jeremy Zerfas's C program.
//
// The C program was based on the Ada program by Jonathan Parker and Georg
// Bauhaus which in turn was based on code by Dave Fladebo, Eckehard Berns,
// Heiner Marxen, Hongwei Xi, and The Anh Tran and also the Java program by Oleg
// Mazurov.

extern crate rayon;

use rayon::prelude::*;
use std::mem::replace;

// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
const PREFERRED_NUMBER_OF_BLOCKS_TO_USE: usize = 12;

// One greater than the maximum `n` value. Used to size stack arrays.
const MAX_N: usize = 16;

fn main() {
    let n = std::env::args().nth(1).unwrap().parse().unwrap();

    // This assert eliminates several bounds checks.
    assert!(n < MAX_N);

    // Create and initialize factorial_lookup_table.
    let factorial_lookup_table = {
        let mut table: [usize; MAX_N] = [0; MAX_N];
        table[0] = 1;
        for i in 1..MAX_N {
            table[i] = i * table[i - 1];
        }
        table
    };

    // Determine the block_size to use. If n! is less than
    // PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to prevent
    // block_size from being set to 0. This also causes smaller values of n to
    // be computed serially which is faster and uses less resources for small
    // values of n.
    let block_size =
        1.max(factorial_lookup_table[n] / PREFERRED_NUMBER_OF_BLOCKS_TO_USE);
    let block_count = factorial_lookup_table[n] / block_size;

    // Iterate over each block.
    let (checksum, max_flip_count) = (0..block_count)
        .into_par_iter()
        .map(|bn| {
            let initial_permutation_index = bn * block_size;

            let mut count: [usize; MAX_N] = [0; MAX_N];
            let mut current_permutation: [u8; MAX_N] =
                [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

            // Initialize count and current_permutation.
            {
                let mut temp_permutation: [u8; MAX_N] = [0; MAX_N];
                let mut permutation_index = initial_permutation_index;
                for i in (1..n).rev() {
                    let f = factorial_lookup_table[i];
                    let d = permutation_index / f;

                    count[i] = d;

                    // Rotate the permutation left by d places. This is faster
                    // than using slice::rotate_left.
                    temp_permutation[0..=i - d]
                        .copy_from_slice(&current_permutation[d..=i]);
                    temp_permutation[i - d + 1..=i]
                        .copy_from_slice(&current_permutation[..d]);
                    current_permutation = temp_permutation;

                    permutation_index = permutation_index % f;
                }
            }

            let mut max_flip_count = 0;
            let mut checksum = 0;

            // Iterate over each permutation in the block.
            let last_permutation_index = initial_permutation_index + block_size;
            for permutation_index in
                initial_permutation_index..last_permutation_index
            {
                // If the first value in the current_permutation is not 1 (0)
                // then we will need to do at least one flip for the
                // current_permutation.
                if current_permutation[0] > 0 {
                    // Make a copy of current_permutation[] to work on.
                    let mut temp_permutation = current_permutation;

                    let mut flip_count: usize = 1;

                    // Flip temp_permutation until the element at the
                    // first_value index is 1 (0).
                    let mut first_value = current_permutation[0] as usize & 0xF;
                    while temp_permutation[first_value] > 0 {
                        // Record the new_first_value and restore the old
                        // first_value at its new flipped position.
                        let new_first_value = replace(
                            &mut temp_permutation[first_value],
                            first_value as u8,
                        );

                        // If first_value is greater than 3 (2) then we are
                        // flipping a series of four or more values so we will
                        // also need to flip additional elements in the middle
                        // of the temp_permutation.
                        if first_value > 2 {
                            for (low_index, high_index) in
                                (1..first_value).zip((1..first_value).rev())
                            {
                                temp_permutation.swap(high_index, low_index);

                                if low_index + 3 > high_index {
                                    break;
                                }
                            }
                        }

                        // Update first_value to new_first_value that we
                        // recorded earlier.
                        first_value = new_first_value as usize & 0xF;
                        flip_count += 1;
                    }

                    // Update the checksum.
                    if permutation_index % 2 == 0 {
                        checksum += flip_count;
                    } else {
                        checksum -= flip_count;
                    }

                    // Update max_flip_count if necessary.
                    max_flip_count = max_flip_count.max(flip_count);
                }

                // Generate the next permutation.
                current_permutation.swap(0, 1);
                let mut first_value = current_permutation[0];
                for i in 1..MAX_N - 2 {
                    count[i] += 1;
                    if count[i] <= i {
                        break;
                    }
                    count[i] = 0;

                    let new_first_value = current_permutation[1];

                    for j in 0..i + 1 {
                        current_permutation[j] = current_permutation[j + 1];
                    }

                    current_permutation[i + 1] = first_value;
                    first_value = new_first_value;
                }
            }
            (checksum, max_flip_count)
        })
        .reduce(
            || (0, 0),
            |(cs1, mf1), (cs2, mf2)| (cs1 + cs2, mf1.max(mf2)),
        );

    // Output the results to stdout.
    println!("{}", checksum);
    println!("Pfannkuchen({}) = {}", n, max_flip_count);
}

Let’s start the C program and collect performance profile of the program using the Linux perf tool. We can see with perf report or perf annotate what’s the most hot code in the program:

    0.46 |       movzbl    -0x9(%r15,%rax,1),%ecx
    0.96 |       movzbl    0x9(%r15),%r8d
         |       mov       %r8b,-0x9(%r15,%rax,1)
    2.31 |       mov       %cl,0x9(%r15)
         |       lea       -0xa(%rax),%rcx
   12.76 |       cmp       $0xb,%rdi

The performance killer getting the 12.76% of time is the part of unrolled loop

        do{
           const int8_t temp=temp_Permutation[high_Index];
           temp_Permutation[high_Index]=
             temp_Permutation[low_Index];
           temp_Permutation[low_Index]=temp;
        }while(low_Index+++3<=high_Index-- && low_Index<16);

And the cmp instruction is the part of the while loop condition. Actually, his loop just reverses bytes in the array. While the C implementation uses the naive and heavy operations with the arrays indexes, the Rust implementation uses the efficient double iterator:

        if first_value > 2 {
            for (low_index, high_index) in
                (1..first_value).zip((1..first_value).rev())
            {
                temp_permutation.swap(high_index, low_index);

                if low_index + 3 > high_index {
                    break;
                }
            }
        }

Fast array reversal with SIMD! describes several ways to improve performance of the C program (the article uses C++ by the way). The first one is to use only one index i and iterate only until the middle of the permuted part of the array with temp_Permutation[i] and temp_Permutation[high_Index - i]. That would be quite close the Rust double iterators. By the way, the more advanced way to improve performance of both the programs is to use PSHUFB SSSE3 instruction, or the _mm_shuffle_epi8() intrinsic, instead of the whole loop. Since there is very small number of shuffle masks, all of them can be defined on the compile time and immediately loaded into a control mask register for the instruction.

However, this isn’t the only differences between the implementations. The Rust program takes advantage of maximum input number const MAX_N: usize = 16. Probably this small improvement affects performance the most since the compiler can now do better optimizations with loops and static arrays. The program explicitly uses static array initialization

        let mut current_permutation: [u8; MAX_N] =
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

, while the C implementation does this in the run time with no assumptions on input data

        for(intnative_t i=0; i<n; ++i)
           current_Permutation[i]=i;

The Rust program copies arrays using built-in memory copying

        let mut temp_permutation = current_permutation;

, while the C program does this in a loop again

        for(intnative_t i=0; ++i<n;)
           temp_Permutation[i]=current_Permutation[i];

These aren’t all the inefficiencies in the C program, which were eliminated in the Rust implementation (both the programs are based on the same initial Ada program). In most of the places an optimized version of the program will be not only faster, but also shorter.

Thus, in this single case when a Rust implementation is more or less faster than C, the performance difference is not about better compiler, but about more efficient structure of the program, which allows a compiler to optimize code better.

Rust as a system programming language?

A real high-level system programming language must be compatible with C. Just consider the 2 examples of our real life projects.

The first one was a web application firewall (WAF). Such kind of software are typically built on top of Nginx or HAproxy HTTPS servers, which are written in C. It was easy to write C++ module for Nginx, but we would need extra glue code to develop the module in Rust and maintain all our patches for the C code of Nginx. The same developers were easily switching between C and C++ parts of the code.

In the second case our client wished to execute some external logic using MySQL user defined functions (UDF) to interact with the operating system. We could develop the logic in any programming language, but there was a constraint: we had to execute the program 5000 per second on each CPU core! That was impossible to achieve even using posix_spawnp(), the fastest way to execute a program in Linux. We ended up with development of a custom UDF for MySQL, which is a shared object loaded into MySQL server process. It was quite straightforward to do with C++.

An opposite example to use Rust for an Nginx module is CloudFlare’s Quiche, an Nginx extension to support QUIC and HTTP/3 protocols. While it’s definitely possible to use Rust for such kind of tasks, the guys, besides the FFI code for C/C++ bindings, still had to write some C code to patch Nginx. This means that

you had to write some extra boilerplate code for C/C++ bindings
and your still have to deal with C/C++ and the second language, which makes the project more complex.

(By the way, the same is also applicable to the D programming language, which also can not directly include C headers.) Both the FFI and Nginx patch in the Quiche project are just about 5,000 lines of code, i.e. 10% of the whole project, which is more than 40,000 lines of Rust code. If the project was developed in C or C++, then they would need the Nginx patch as well, but not in a second language though. But there are zero chances to adopt the code in the Nginx main code base. And this is what actually happen: having the production ready QUIC implementation from the big vendor, Nginx team developed their own C implementation. It’s hard to say whether the “binding” code is negligible or how much time the developers spend on the boilerplate code. The question is whether the Rust memory safety (which also can be reached with the modern core C++, static analysis, and address sanitizers) makes the development so productive that the extra code and maintaining the code base in two different languages become negligible?..

Conclusion

We realized that we reached the limits of the C language when we were developing the HTTP parser for Tempesta FW: we couldn’t jump directly to the required state of the parser without lookups in the switch statement as well as we couldn’t get satisfactory code layout. That time we considered to introduce inline Assembly into the code of the parser. Having that the zero-copy state machine was already so sophisticated, we weren’t happy about this idea. That was so nice surprise to find the computed labels and hot/cold attributes among the compiler extensions! Thanks to these features the compiler generated the optimal code for the parser.

The power of C++ is “There’s more than one way to do it”, TIMTOWTDI. Yes, this is Perl’s idea, but in many cases C++ allows you to write your program in plain C, in templates metaprogramming, using high-level STL or well-optimized custom algorithms and data structures. The modern C++ is sophisticated and requires years of experience to be fluent with the language, but it is a professional tool, which allows a professional developer to create the fastest and reliable software.

Not only Rust is immature, but it seems the language designers intentionally limited the language. There are a lot of poor programs misusing goto, so they just removed the operator: good for juniors, but too limited for professionals. It’s quite unlikely that the language and the compiler makes you a nice surprise, when you struggle on a complex technical task. Instead, it’s likely that when you need to do something simple, what you’ve been doing for ages in C or C++, you’ll be disappointed and start to fight with the compiler. As an example, likely and unlikely compiler hints are used in the Linux kernel for ages and they are so popular in the user space C/C++ programming, that they are included into the C++20 standard (before that programmers had to use compiler intrinsics). But with Rust you find that the API is experimental and works for if statements only.