Fast programming languages: C, C++, Rust, and Assembly

Posted on October 28, 2020

This article isn't about which programming language is better; instead, it discusses the most powerful tool set for developing the fastest server-side system software, such as database engines and HTTPS servers. This type of software has several specific properties which shape the choice of tools.

We've been developing the fastest software in C, C++, and Assembly for ages, so it's no surprise that we're very interested in Rust, since it is "focused on performance". With a bit of skepticism, though. Just remember the rise of the Java programming language: there were a lot of reports that JIT compilation produced code faster than C++. Now it's hard to find a case when C++ is slower than Java; see, for example, the benchmarks game. It's also worth mentioning that garbage collection (GC) in Java leads to high tail latencies, and it's hard or even impossible to do anything about the problem. Golang also can't be considered for high-performance programming, due to its GC.

C or C++? Or both of them?

The C programming language dominates system programming. An operating system kernel is an example of the most sophisticated system software, not only because it deals with hardware directly, but also due to its strict performance requirements. The Linux and FreeBSD kernels are written in C, as are the other UNIXes and the Windows kernel. Let's start the discussion with this bright example of high-performance system software.

C++ for operating systems kernel development

FreeBSD has supported C++ modules for a while. The Linux kernel has never supported C++, but the Click modular router was written in C++ and worked as a Linux kernel module. If you're interested in the applicability of C++ to operating system kernel development, you can find quite good discussions in the C++ and Bare bones articles. However, there are fundamental reasons against using C++ for operating system kernel development. As a result, with C++ in the kernel space you basically have only templates, class inheritance, and some syntactic sugar like lambda functions. Since system code rarely requires complicated abstractions and inheritance hierarchies, does it still make sense to make the effort to use C++ in the kernel space?

C++ exceptions

This is one of the most debatable C++ features, and it deserves a separate chapter. For example, the MySQL project, following the Google coding style, doesn't use exceptions. The Google coding style provides a good list of the pros and cons of using exceptions. Here we focus on the performance aspects only.

Exceptions can improve performance when we have to handle error codes in too many places, e.g. (assuming the functions are inlined and very small)


        if (func_1())
            return -EINVAL;
        if (func_2())
            return -EINVAL;
        ....
        if (func_100())
            return -EINVAL;
    
The problem with this code is the extra conditional jumps. Modern CPUs are pretty good at branch prediction, but it still hurts performance. In C++ we can just write

        try {
            func_1();
            func_2();
            ...
            func_100();
        } catch (...) {
            return -EINVAL;
        }
    
, so there are no extra conditions in the hot path. However, this isn't free: most of the functions in your C++ code have to carry extra epilogues with a table of the exceptions these functions can catch and an appropriate cleanup table. The epilogues aren't executed in the normal workflow, but they increase the code size, causing extra pollution of the CPU instruction cache. You can find great details about C++ exception handling internals in Nico Brailovsky's blog.

Is C++ still good?

Yes, it is. Firstly, not all code actually must be as fast as possible, and in most places we don't need custom memory allocation and don't care about exception overhead. Most projects are developed in user space and benefit, especially the new ones, from the relatively rich C++ standard and Boost libraries (not as rich as Java's, though).

Secondly, the killer feature of C++ is that it is C. If you don't want to use exceptions or RTTI, you can just switch the features off. Most C programs can be compiled with a C++ compiler with very small changes or no changes at all. As an example, we need only this trivial change


    $ diff -up nbody.c nbody-new.cc
        @@ -112,9 +112,9 @@ static void advance(body bodies[]){
             // ROUNDED_INTERACTIONS_COUNT elements to simplify one of the following
             // loops and to also keep the second and third arrays in position_Deltas
             // aligned properly.
        -    static alignas(__m128d) double
        -      position_Deltas[3][ROUNDED_INTERACTIONS_COUNT],
        -      magnitudes[ROUNDED_INTERACTIONS_COUNT];
        +    static double
        +      position_Deltas[3][ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16))),
        +      magnitudes[ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16)));

             // Calculate the position_Deltas between the bodies for each interaction.
             for(intnative_t i=0, k=0; i < BODIES_COUNT-1; ++i)
    
to compile the C program with the G++ compiler. Modern C++ compilers provide C compatibility extensions like the __restrict__ keyword. You can always write the most performance-critical code of a C++ program in C style. If you don't like the overhead of the STL containers, you can use Boost.intrusive or even port a similar container from the Linux kernel or another fast C project; in most cases this won't be painful. See, for example, how a hash table from PostgreSQL, the HTrie from Tempesta DB, and the Linux kernel read/write spinlocks (all written in C) were used in a C++ benchmark.
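For instance, here is a minimal sketch (our own illustration, with an arbitrary function name) of the __restrict__ extension in a C++ program: the no-aliasing promise lets the compiler vectorize the loop just as it would in C.

```cpp
#include <cassert>
#include <cstddef>

// With __restrict__ we promise the compiler that dst and src never
// alias, so it can vectorize the loop without run-time overlap checks.
static void
scale(double *__restrict__ dst, const double *__restrict__ src,
      size_t n, double k)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

Without the qualifier, the compiler must either emit an overlap check or give up on vectorizing the loop.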

The last thing that must be mentioned about the development of high-performance programs in C++ is template metaprogramming. It's very exciting that with the modern C++ standards you can use templates to write quite sophisticated logic which is fully computed at compile time and costs nothing at run time.
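A classic minimal illustration (our own sketch, not from any of the projects above): a factorial fully evaluated by the compiler, so the result is a plain constant in the generated code.

```cpp
// Compile-time factorial via template metaprogramming: the recursion
// is fully evaluated by the compiler, so N! costs nothing at run time.
template <unsigned N>
struct Factorial {
    static constexpr unsigned long value = N * Factorial<N - 1>::value;
};

template <>
struct Factorial<0> {
    static constexpr unsigned long value = 1;
};

// The whole computation happens during compilation.
static_assert(Factorial<12>::value == 479001600UL,
              "12! is computed at compile time");
```

In the object code, Factorial<12>::value is just the immediate constant 479001600; no run-time multiplication survives.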

GOTO - the power of C

A professional tool must allow you to work with it in the most efficient way. The goal of high-level, high-performance programming languages is to generate the most efficient machine code. Every hardware architecture supports jumps: you can jump to any address on any condition. The closest abstraction to these jumps in the C and C++ programming languages is the goto statement. It's not as flexible as the assembly jmp, but C compilers provide extensions which make the statement almost a full equivalent of assembly jumps. Unfortunately, Rust doesn't support goto, which makes it awkward for a whole class of performance-crucial tasks.

We're talking about parsers. Not configuration file parsers, which are perfectly done with a bunch of switch and if statements, but large and very fast parsers like an HTTP parser. You might think that this is a "too narrow" or "too specific" task, but recall parser generators like Ragel or GNU Bison: if you develop such a parser generator, you never know how big the generated parsers will be. (By the way, Ragel extensively uses goto to generate very fast parsers.) Note also the SQL parsers in every RDBMS. Actually, we can generalize this class of tasks as large and fast finite state machines, which also includes, for example, regular expressions.

The HTTP parser in Tempesta FW is much larger than the HTTP parsers in other web servers because, in addition to basic HTTP parsing, it also performs many security checks and strictly validates the input against the RFCs. Our parser also works with zero-copy data, so it must take a lot of care about data chunk boundaries. The technical details of the parser were described in our talk at the SCALE 17x conference; you can watch the video or read the slides.

Typically, HTTP parsers are implemented as a loop over input characters with nested switch statements over the allowed characters and available states. See, for example, ngx_http_parse_request_line() in the Nginx source code. For brevity, let's consider a simplified version of the code:


        for ( ; str_ptr < end; ++str_ptr) {
            switch (current_state) {
            case 0:
                ...
            case 1:
                ...
            ...
            case 100:
                switch (*str_ptr) {
                case 'a':
                    ...
                    current_state = 200;
                    break;
                case 'b':
                    ...
                    current_state = 101;
                    break;
                }
                break;
            case 101:
                ...
            ...
            }
        }
    
Assume that the parser finished parsing the previous chunk of data in state 100 and the current chunk of data starts with the character 'b'. Regardless of switch statement optimizations (a compiler can optimize a switch with a lookup table or a binary search), there are three problems with this code:
  1. Looking up state 100 is still more expensive than a direct jump.
  2. While the code for state 101 is placed right after the code for state 100, we have to re-enter the while and switch statements, i.e. look up the next state again, instead of just moving one character further and jumping right to the next state.
  3. Even if we always reach state 101 after state 100, a compiler may reorganize the code in such a way that state 101 is placed at the beginning of the switch statement while state 100 is placed somewhere at the end.
Tempesta FW fixes all these problems using the goto statement and the GCC extensions for labels as values and label attributes, with code like:

        // Use labels as values to remember the current state when we
        // exit the state machine at the end of current data chunk.
        parser->state = &&state_100;
        goto *parser->state;

        while (true) {
        state_0:
            ...
        state_1:
            ...
        // The state is placed by a compiler closer to the beginning
        // of the code.
        state_100: __attribute__((hot))
            // We still use small switches for small character sets. 
            switch (*str_ptr) {
            case 'a':
                ...
                ++str_ptr;
                goto state_200;
            case 'b':
                ...
                ++str_ptr;
                // Just fall through to the state 101.
            }
        // This state is placed by the compiler after state_100.
        state_101: __attribute__((cold))
            ...
        }
    
Since Rust doesn't support the goto operator, we would need to use assembly language to implement the state machine with direct jumps and the optimal code layout.
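To make the technique concrete outside of Tempesta FW, here is a minimal, self-contained sketch of a resumable state machine using the GCC/Clang labels-as-values extension. This toy parser (our own illustration, not Tempesta FW code) accepts runs of 'a' terminated by 'b' and can stop at a chunk boundary and resume in the next chunk with a direct jump.

```cpp
#include <cassert>

// Toy resumable parser built on the GCC/Clang "labels as values"
// extension: the saved state is simply the address of a label.
struct Parser {
    void *state = nullptr;
};

// Consume a chunk of input; accept a run of 'a' terminated by 'b'.
// Returns true on 'b', false if the chunk ends before the terminator.
static bool
parse_chunk(Parser *p, const char *str, const char *end)
{
    // Resume exactly where the previous chunk left off.
    if (p->state)
        goto *p->state;

state_a:
    if (str == end) {
        p->state = &&state_a;   // remember the state for the next chunk
        return false;
    }
    if (*str == 'a') {
        ++str;
        goto state_a;           // direct jump, no state lookup
    }
    // Fall through to the terminator check.
state_b:
    if (str == end) {
        p->state = &&state_b;
        return false;
    }
    return *str == 'b';
}
```

Note that resuming costs a single indirect jump, while the switch-based version has to re-enter the loop and look the state up again.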

When Assembly is easier than C

Now let's look at an example where the Assembly language not only produces faster code, but also allows us to write programs in a more productive way. This example is about multiple-precision integer arithmetic.

Public key cryptography, and elliptic curves in particular, operates on big integers. The book BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic by Tom St Denis provides great detail on the subject, as well as C implementations of many algorithms, but for now let's consider the basic addition of two 128-bit big integers on a 64-bit machine. The big integers consist of limbs, here two 64-bit words each. To sum the integers we have to take care of the carry between the limbs, so the resulting C code looks like this (see 4.2.1 in the book):


        // a := a + b
        // a[0] is the least significant limb,
        // a[1] is the most significant limb.
        void s_mp_add(unsigned long *a, unsigned long *b)
        {
            unsigned long carry;

            a[0] += b[0];
            carry = (a[0] < b[0]);

            a[1] += b[1] + carry;
        }
    
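As a sanity check, the limb arithmetic above can be compared against the compiler's native 128-bit addition (our own sketch, assuming a 64-bit unsigned long as on Linux x86-64, and the GCC/Clang __int128 extension):

```cpp
#include <cassert>

// a := a + b; a[0] is the least significant limb, a[1] the most
// significant (assumes a 64-bit unsigned long, as on Linux x86-64).
static void
s_mp_add(unsigned long *a, const unsigned long *b)
{
    a[0] += b[0];
    // Unsigned addition wrapped around iff the sum is below an operand.
    unsigned long carry = (a[0] < b[0]);
    a[1] += b[1] + carry;
}

// Cross-check the limb arithmetic against native 128-bit addition
// (__int128 is a GCC/Clang extension).
static bool
check(unsigned long a0, unsigned long a1, unsigned long b0, unsigned long b1)
{
    unsigned long a[2] = { a0, a1 }, b[2] = { b0, b1 };
    s_mp_add(a, b);

    unsigned __int128 x = ((unsigned __int128)a1 << 64) | a0;
    unsigned __int128 y = ((unsigned __int128)b1 << 64) | b0;
    unsigned __int128 s = x + y;

    return a[0] == (unsigned long)s && a[1] == (unsigned long)(s >> 64);
}
```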
The code is small and simple, but you probably had to think a little about the correctness of the carry manipulations. Fortunately, x86-64 is a CISC architecture, i.e. it provides a lot of computational features, one of which is arithmetic with carry, so the addition above can be done in just two instructions and there is no need for the comparison:

        // Pointer to a is in %RDI, pointer to b is in %RSI
        movq    (%rdi), %r8
        movq    8(%rdi), %r9

        addq    (%rsi), %r8     // add, setting the carry flag
        adcq    8(%rsi), %r9    // add with the carry from the previous addition

        movq    %r8, (%rdi)
        movq    %r9, 8(%rdi)
    
If you look into any well-optimized cryptography library, like OpenSSL or Tempesta TLS, you'll find a lot of assembly code (OpenSSL actually generates its assembly source code with Perl scripts).

Rust at a glance

At first glance Rust is pretty well equipped for developing very efficient code: SIMD intrinsics, memory alignment, memory barriers, inline assembly. There are many comparisons of Rust with C or C++, e.g. Speed of Rust vs C or C++ Is Faster and Safer Than Rust: Benchmarked by Yandex. But if you consider using Rust to develop a benchmark-leading product, you'll probably face several obstacles in addition to the absence of the goto statement. The most crucial disappointment about Rust for system programming is its limited ability to work with raw memory, which is the other side of memory safety.

Reliability and safety in C++ and Rust

This article would be incomplete without addressing the reliability and safety provided by the Rust and C++ programming languages. Fortunately, Sunny Chatterjee from Microsoft recently addressed the topic at CppCon 2020. The main benefits of Rust are memory and concurrency safety, but modern C++ addresses these topics as well. In the presentation Sunny covered the following six gaps between Rust and C++: casting, switch statements, smarter loops, smarter copying, lifetimes, and mutability. The presentation concludes that the "C++ Core Guidelines has rules covering many of the big-ticket items", and modern C and C++ compilers are gradually implementing the missing checks. It's also worth mentioning that the C/C++ world effectively uses address sanitizers (e.g. ASAN is built into modern versions of the LLVM and GCC compilers) to catch out-of-bounds memory accesses. After all, you can still introduce bugs in unsafe Rust code, just as when working with raw pointers in C++.

The Computer Language Benchmarks Game

Since we're talking about performance, we must take a look at the Computer Language Benchmarks Game. To compare the performance of different languages you need to implement the same task in all the languages in the same fashion. And this isn't what people usually do, so it's hard to find real-life code examples in different languages which let you compare oranges with oranges, not oranges with apples. While the Benchmarks Game is a game, comparing implementations of small specific tasks, it's one of the best tools we have. The C++11 vs Rust comparison is one more comparison of equal implementations in C++ and Rust. There is no Assembly language in the Benchmarks Game, but there are Rust, C++ for the G++ compiler, and two C implementations, for the Clang and GCC compilers respectively. At the moment of writing this article the results of the implementations are (in seconds, lower is better):
    Problem              G++    GCC   Clang   Rust
    fannkuch-redux      8.07   7.53    9.45   6.88
    spectral-norm       0.72   0.72    0.72   0.71
    n-body              4.09   4.30    3.31   3.31
    binary-trees        1.12   1.78    1.88   1.20
    fasta               1.04   0.82    0.88   0.91
    pidigits            0.71   0.73    0.81   0.74
    mandelbrot          0.84   1.27    2.09   0.92
    regex-redux         1.08   0.80    0.81   1.28
    reverse-complement  0.63   0.87    0.98   0.75
    k-nucleotide        1.93   3.71    6.19   3.29
There is only one test, the first one, where Rust more or less significantly outperforms the C and C++ implementations.

Performance analysis

You might be curious why the fannkuch-redux implementation in Rust is faster than the C implementation. So are we. Copies of both programs are under the cuts.

The C program


// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Jeremy Zerfas
// Based on the Ada program by Jonathan Parker and Georg Bauhaus which in turn
// was based on code by Dave Fladebo, Eckehard Berns, Heiner Marxen, Hongwei Xi,
// and The Anh Tran and also the Java program by Oleg Mazurov.

// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
#define PREFERRED_NUMBER_OF_BLOCKS_TO_USE 12

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

// intptr_t should be the native integer type on most sane systems.
typedef intptr_t intnative_t;


int main(int argc, char ** argv){
   const intnative_t n=atoi(argv[1]);

   // Create and initialize factorial_Lookup_Table.
   intnative_t factorial_Lookup_Table[n+1];
   factorial_Lookup_Table[0]=1;
   for(intnative_t i=0; ++i<=n;)
      factorial_Lookup_Table[i]=i*factorial_Lookup_Table[i-1];

   // Determine the block_Size to use. If n! is less than
   // PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to prevent
   // block_Size from being set to 0. This also causes smaller values of n to
   // be computed serially which is faster and uses less resources for small
   // values of n.
   const intnative_t block_Size=factorial_Lookup_Table[n]/
     (factorial_Lookup_Table[n]<PREFERRED_NUMBER_OF_BLOCKS_TO_USE ?
     1 : PREFERRED_NUMBER_OF_BLOCKS_TO_USE);

   intnative_t maximum_Flip_Count=0, checksum=0;

   // Iterate over each block.
   #pragma omp parallel for \
     reduction(max:maximum_Flip_Count) reduction(+:checksum)
   for(intnative_t initial_Permutation_Index_For_Block=0;
     initial_Permutation_Index_For_Block<factorial_Lookup_Table[n];
     initial_Permutation_Index_For_Block+=block_Size){

      intnative_t count[n];
      int8_t temp_Permutation[n], current_Permutation[n];


      // Initialize count and current_Permutation.
      count[0]=0;
      for(intnative_t i=0; i<n; ++i)
         current_Permutation[i]=i;
      for(intnative_t i=n-1,
        permutation_Index=initial_Permutation_Index_For_Block; i>0; --i){
         const intnative_t d=permutation_Index/factorial_Lookup_Table[i];
         permutation_Index=permutation_Index%factorial_Lookup_Table[i];
         count[i]=d;

         for(intnative_t j=0; j<n; ++j)
            temp_Permutation[j]=current_Permutation[j];
         for(intnative_t j=0; j<=i; ++j)
            current_Permutation[j]= j+d<=i ?
              temp_Permutation[j+d] : temp_Permutation[j+d-i-1];
      }


      // Iterate over each permutation in the block.
      const intnative_t last_Permutation_Index_In_Block=
        initial_Permutation_Index_For_Block+block_Size-1;
      for(intnative_t permutation_Index=initial_Permutation_Index_For_Block; ;
        ++permutation_Index){

         // If the first value in the current_Permutation is not 1 (0) then
         // we will need to do at least one flip for the current_Permutation.
         if(current_Permutation[0]>0){

            // Make a copy of current_Permutation[] to work on. Note that we
            // don't need to copy the first value since that will be stored
            // in a separate variable since it gets used a lot.
            for(intnative_t i=0; ++i<n;)
               temp_Permutation[i]=current_Permutation[i];

            intnative_t flip_Count=1;

            // Flip temp_Permutation until the element at the first_Value
            // index is 1 (0).
            for(intnative_t first_Value=current_Permutation[0];
              temp_Permutation[first_Value]>0; ++flip_Count){

               // Record the new_First_Value and restore the old
               // first_Value at its new flipped position.
               const int8_t new_First_Value=temp_Permutation[first_Value];
               temp_Permutation[first_Value]=first_Value;

               // If first_Value is greater than 3 (2) then we are flipping
               // a series of four or more values so we will also need to
               // flip additional elements in the middle of the
               // temp_Permutation.
               if(first_Value>2){
                  intnative_t low_Index=1, high_Index=first_Value-1;
                  // Note that this loop is written so that it will run at
                  // most 16 times so that compilers will be more willing
                  // to unroll it. Consequently this won't work right when
                  // n is greater than 35. This would probably be the
                  // least of your concerns since 21! won't fit into 64
                  // bit integers and even if it did you probably wouldn't
                  // want to run this program with a value that large
                  // since it would take thousands of years to do on a
                  // modern desktop computer. ;-)
                  do{
                     const int8_t temp=temp_Permutation[high_Index];
                     temp_Permutation[high_Index]=
                       temp_Permutation[low_Index];
                     temp_Permutation[low_Index]=temp;
                  }while(low_Index+++3<=high_Index-- && low_Index<16);
               }

               // Update first_Value to new_First_Value that we recorded
               // earlier.
               first_Value=new_First_Value;
            }


            // Update the checksum.
            if(permutation_Index%2==0)
               checksum+=flip_Count;
            else
               checksum-=flip_Count;

            // Update maximum_Flip_Count if necessary.
            if(flip_Count>maximum_Flip_Count)
               maximum_Flip_Count=flip_Count;
         }


         // Break out of the loop when we get to the
         // last_Permutation_Index_In_Block.
         if(permutation_Index>=last_Permutation_Index_In_Block)
            break;

         // Generate the next permutation.
         int8_t first_Value=current_Permutation[1];
         current_Permutation[1]=current_Permutation[0];
         current_Permutation[0]=first_Value;
         for(intnative_t i=1; ++count[i]>i;){
            count[i++]=0;
            const int8_t new_First_Value=current_Permutation[0]=
              current_Permutation[1];

            for(intnative_t j=0; ++j<i;)
               current_Permutation[j]=current_Permutation[j+1];

            current_Permutation[i]=first_Value;
            first_Value=new_First_Value;
         }
      }
   }


   // Output the results to stdout.
   printf("%jd\nPfannkuchen(%jd) = %jd\n", (intmax_t)checksum, (intmax_t)n,
     (intmax_t)maximum_Flip_Count);

   return 0;
}
            

The Rust program


// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Cliff L. Biffle, translated from Jeremy Zerfas's C program.
//
// The C program was based on the Ada program by Jonathan Parker and Georg
// Bauhaus which in turn was based on code by Dave Fladebo, Eckehard Berns,
// Heiner Marxen, Hongwei Xi, and The Anh Tran and also the Java program by Oleg
// Mazurov.

extern crate rayon;

use rayon::prelude::*;
use std::mem::replace;

// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
const PREFERRED_NUMBER_OF_BLOCKS_TO_USE: usize = 12;

// One greater than the maximum `n` value. Used to size stack arrays.
const MAX_N: usize = 16;

fn main() {
    let n = std::env::args().nth(1).unwrap().parse().unwrap();

    // This assert eliminates several bounds checks.
    assert!(n < MAX_N);

    // Create and initialize factorial_lookup_table.
    let factorial_lookup_table = {
        let mut table: [usize; MAX_N] = [0; MAX_N];
        table[0] = 1;
        for i in 1..MAX_N {
            table[i] = i * table[i - 1];
        }
        table
    };

    // Determine the block_size to use. If n! is less than
    // PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to prevent
    // block_size from being set to 0. This also causes smaller values of n to
    // be computed serially which is faster and uses less resources for small
    // values of n.
    let block_size =
        1.max(factorial_lookup_table[n] / PREFERRED_NUMBER_OF_BLOCKS_TO_USE);
    let block_count = factorial_lookup_table[n] / block_size;

    // Iterate over each block.
    let (checksum, max_flip_count) = (0..block_count)
        .into_par_iter()
        .map(|bn| {
            let initial_permutation_index = bn * block_size;

            let mut count: [usize; MAX_N] = [0; MAX_N];
            let mut current_permutation: [u8; MAX_N] =
                [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

            // Initialize count and current_permutation.
            {
                let mut temp_permutation: [u8; MAX_N] = [0; MAX_N];
                let mut permutation_index = initial_permutation_index;
                for i in (1..n).rev() {
                    let f = factorial_lookup_table[i];
                    let d = permutation_index / f;

                    count[i] = d;

                    // Rotate the permutation left by d places. This is faster
                    // than using slice::rotate_left.
                    temp_permutation[0..=i - d]
                        .copy_from_slice(&current_permutation[d..=i]);
                    temp_permutation[i - d + 1..=i]
                        .copy_from_slice(&current_permutation[..d]);
                    current_permutation = temp_permutation;

                    permutation_index = permutation_index % f;
                }
            }

            let mut max_flip_count = 0;
            let mut checksum = 0;

            // Iterate over each permutation in the block.
            let last_permutation_index = initial_permutation_index + block_size;
            for permutation_index in
                initial_permutation_index..last_permutation_index
            {
                // If the first value in the current_permutation is not 1 (0)
                // then we will need to do at least one flip for the
                // current_permutation.
                if current_permutation[0] > 0 {
                    // Make a copy of current_permutation[] to work on.
                    let mut temp_permutation = current_permutation;

                    let mut flip_count: usize = 1;

                    // Flip temp_permutation until the element at the
                    // first_value index is 1 (0).
                    let mut first_value = current_permutation[0] as usize & 0xF;
                    while temp_permutation[first_value] > 0 {
                        // Record the new_first_value and restore the old
                        // first_value at its new flipped position.
                        let new_first_value = replace(
                            &mut temp_permutation[first_value],
                            first_value as u8,
                        );

                        // If first_value is greater than 3 (2) then we are
                        // flipping a series of four or more values so we will
                        // also need to flip additional elements in the middle
                        // of the temp_permutation.
                        if first_value > 2 {
                            for (low_index, high_index) in
                                (1..first_value).zip((1..first_value).rev())
                            {
                                temp_permutation.swap(high_index, low_index);

                                if low_index + 3 > high_index {
                                    break;
                                }
                            }
                        }

                        // Update first_value to new_first_value that we
                        // recorded earlier.
                        first_value = new_first_value as usize & 0xF;
                        flip_count += 1;
                    }

                    // Update the checksum.
                    if permutation_index % 2 == 0 {
                        checksum += flip_count;
                    } else {
                        checksum -= flip_count;
                    }

                    // Update max_flip_count if necessary.
                    max_flip_count = max_flip_count.max(flip_count);
                }

                // Generate the next permutation.
                current_permutation.swap(0, 1);
                let mut first_value = current_permutation[0];
                for i in 1..MAX_N - 2 {
                    count[i] += 1;
                    if count[i] <= i {
                        break;
                    }
                    count[i] = 0;

                    let new_first_value = current_permutation[1];

                    for j in 0..i + 1 {
                        current_permutation[j] = current_permutation[j + 1];
                    }

                    current_permutation[i + 1] = first_value;
                    first_value = new_first_value;
                }
            }
            (checksum, max_flip_count)
        })
        .reduce(
            || (0, 0),
            |(cs1, mf1), (cs2, mf2)| (cs1 + cs2, mf1.max(mf2)),
        );

    // Output the results to stdout.
    println!("{}", checksum);
    println!("Pfannkuchen({}) = {}", n, max_flip_count);
}
            

Let's run the C program and collect its performance profile using the Linux perf tool. With perf report or perf annotate we can see the hottest code in the program:


    0.46 |       movzbl    -0x9(%r15,%rax,1),%ecx
    0.96 |       movzbl    0x9(%r15),%r8d
         |       mov       %r8b,-0x9(%r15,%rax,1)
    2.31 |       mov       %cl,0x9(%r15)
         |       lea       -0xa(%rax),%rcx
   12.76 |       cmp       $0xb,%rdi
    
The performance killer, taking 12.76% of the time, is part of the unrolled loop

        do{
           const int8_t temp=temp_Permutation[high_Index];
           temp_Permutation[high_Index]=
             temp_Permutation[low_Index];
           temp_Permutation[low_Index]=temp;
        }while(low_Index+++3<=high_Index-- && low_Index<16);
    
And the cmp instruction is part of the while loop condition. Actually, this loop just reverses bytes in the array. While the C implementation uses naive and heavy operations with array indexes, the Rust implementation uses an efficient double iterator:

        if first_value > 2 {
            for (low_index, high_index) in
                (1..first_value).zip((1..first_value).rev())
            {
                temp_permutation.swap(high_index, low_index);

                if low_index + 3 > high_index {
                    break;
                }
            }
        }
    
Fast array reversal with SIMD! describes several ways to improve the performance of the C program (the article uses C++, by the way). The first is to use only one index i and iterate only to the middle of the permuted part of the array with temp_Permutation[i] and temp_Permutation[high_Index - i]. That would be quite close to the Rust double iterator. By the way, a more advanced way to improve the performance of both programs is to use the PSHUFB SSSE3 instruction, or the _mm_shuffle_epi8() intrinsic, instead of the whole loop. Since there is a very small number of shuffle masks, all of them can be defined at compile time and immediately loaded into the control mask register for the instruction.
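The single-index variant can be sketched as follows (our own illustration, not the benchmark code): one loop counter walks from both ends of the reversed prefix toward the middle, so the compiler sees a single trip count instead of two coupled indexes.

```cpp
#include <cassert>
#include <cstdint>

// Reverse elements 1 .. first_value-1 of the permutation with a single
// loop index: element i is swapped with its mirror first_value - i.
static void
reverse_middle(int8_t *perm, int first_value)
{
    for (int i = 1; i < first_value - i; ++i) {
        int8_t tmp = perm[i];
        perm[i] = perm[first_value - i];
        perm[first_value - i] = tmp;
    }
}
```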

However, this isn't the only difference between the implementations. The Rust program takes advantage of the maximum input number const MAX_N: usize = 16. Probably this small improvement affects performance the most, since the compiler can now better optimize loops and static arrays. The program explicitly uses static array initialization


        let mut current_permutation: [u8; MAX_N] =
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    
, while the C implementation does this at run time with no assumptions about the input data

        for(intnative_t i=0; i<n; ++i)
           current_Permutation[i]=i;
    
The Rust program copies arrays using built-in memory copying

        let mut temp_permutation = current_permutation;
    
, while the C program does this in a loop again

        for(intnative_t i=0; ++i<n;)
           temp_Permutation[i]=current_Permutation[i];
    
These aren't the only inefficiencies in the C program that were eliminated in the Rust implementation (both programs are based on the same original Ada program). In most places an optimized version of the program would be not only faster, but also shorter.

Thus, in this single case where a Rust implementation is somewhat faster than C, the performance difference is not about a better compiler, but about a more efficient program structure, which allows the compiler to optimize the code better.

Rust as a system programming language?

A real high-level system programming language must be compatible with C. Consider two examples from our real-life projects.

The first one was a web application firewall (WAF). This kind of software is typically built on top of the Nginx or HAProxy HTTPS servers, which are written in C. It was easy to write a C++ module for Nginx, but we would have needed extra glue code to develop the module in Rust and to maintain all our patches for the C code of Nginx. The same developers easily switched between the C and C++ parts of the code.

In the second case our client wished to execute some external logic through MySQL user-defined functions (UDF) to interact with the operating system. We could develop the logic in any programming language, but there was a constraint: we had to execute the program 5,000 times per second on each CPU core! That was impossible even with posix_spawnp(), the fastest way to execute a program on Linux. We ended up developing a custom UDF for MySQL, a shared object loaded into the MySQL server process. It was quite straightforward to do in C++.

An opposite example of using Rust for an Nginx module is Cloudflare's Quiche, an Nginx extension supporting the QUIC and HTTP/3 protocols. While it's definitely possible to use Rust for this kind of task, the developers, besides the FFI code for C/C++ bindings, still had to write some C code to patch Nginx. This means that extending C software in Rust still requires writing and maintaining both the FFI bindings and the C code.

(By the way, the same is also applicable to the D programming language, which also cannot directly include C headers.) Both the FFI and the Nginx patch in the Quiche project are about 5,000 lines of code, i.e. 10% of the whole project, which is more than 40,000 lines of Rust code. If the project had been developed in C or C++, they would have needed the Nginx patch as well, but not in a second language. But there were zero chances to get the code adopted into the Nginx main code base. And this is what actually happened: despite the production-ready QUIC implementation from the big vendor, the Nginx team developed their own C implementation. It's hard to say whether the "binding" code is negligible or how much time the developers spent on the boilerplate code. The question is whether Rust's memory safety (which can also be achieved with modern C++, static analysis, and address sanitizers) makes development so productive that the extra code and maintaining the code base in two different languages become negligible.

Conclusion

We realized that we had reached the limits of the C language when we were developing the HTTP parser for Tempesta FW: we couldn't jump directly to the required state of the parser without lookups in the switch statement, and we couldn't get a satisfactory code layout. At that time we considered introducing inline Assembly into the code of the parser. Given that the zero-copy state machine was already so sophisticated, we weren't happy with the idea. It was a nice surprise to find computed goto labels and the hot/cold attributes among the compiler extensions! Thanks to these features the compiler generated optimal code for the parser.

The power of C++ is "There's more than one way to do it" (TIMTOWTDI). Yes, this is Perl's motto, but in many cases C++ allows you to write your program in plain C, with template metaprogramming, using the high-level STL, or with well-optimized custom algorithms and data structures. Modern C++ is sophisticated and requires years of experience to become fluent in, but it is a professional tool, which allows a professional developer to create the fastest and most reliable software.

Not only is Rust immature, but it seems the language designers intentionally limited the language. There are a lot of poor programs misusing goto, so they simply removed the operator: good for juniors, but too limited for professionals. It's quite unlikely that the language and the compiler will give you a nice surprise when you struggle with a complex technical task. Instead, it's likely that when you need to do something simple, which you've been doing for ages in C or C++, you'll be disappointed and start to fight with the compiler. As an example, the likely and unlikely compiler hints have been used in the Linux kernel for ages, and they are so popular in user-space C/C++ programming that they were included in the C++20 standard (before that, programmers had to use compiler intrinsics). But in Rust you find that the API is experimental and works for if statements only.
