*Posted on October 28, 2020*

This article isn't about which programming language is better; instead, it discusses the most powerful tool set for developing the fastest server-side system software, such as database engines and HTTPS servers. This type of software has several specific properties:

- Relatively large code base: 100,000 lines of C or C++ code and more. While it's possible to write particular 'hot' functions in assembly language, it's impractical to write the whole program in assembly.
- Databases and web servers are mission-critical software: we're all used to our Linux systems with MySQL and Nginx processes working for months and years. There are simple high-availability best practices mitigating the downtime due to possible crashes, but they're a subject for another article. Meanwhile, it's worth mentioning that if you really care about high availability, then you should build your infrastructure with the assumption that any component of your system may crash at any time, just like Facebook does: the company deploys recent versions of the Linux kernel as soon as they're available.

We've been developing the fastest software in C, C++, and assembly for ages. It's no surprise that since Rust is "focused on performance" we're very interested in it, with a bit of skepticism though. Just remember the rise of the Java programming language: there were a lot of reports that JIT compilation produced code faster than C++. Now it's hard to find a case where C++ is slower than Java; see for example the benchmarks game. It's also worth mentioning that garbage collection (GC) in Java leads to high tail latencies, and it's hard or even impossible to do anything about the problem. Golang cannot be considered for high-performance programming either, due to the GC.

- You do not have `libstdc++` with RTTI and exceptions in the kernel space. Actually, `dynamic_cast` isn't so frequently used and there are a lot of C++ projects compiled without RTTI. If you need exceptions, then you have to port them into the kernel. `libstdc++` uses basic C allocations, so it must be significantly reworked for the kernel.
- You can't use the STL and Boost libraries and, in fact, all kernels already have their own libraries. C++ introduces filesystem, threading, and networking libraries, which are senseless in an OS kernel. On the other hand, modern OSes provide advanced synchronization primitives which are still not available in standard C++ (e.g. there are still no read-write spinlocks in C++).
- The Linux kernel provides a number of memory allocators (SLAB, page, `vmalloc()`, `kmalloc()`, and so on), thus you have to use placement `new` and/or just use the C functions for memory allocation and freeing. Aligned memory is crucial for performance, but you need to write special wrappers to get aligned memory with `new`.
- Strong type safety isn't so comfortable for system programming, where raw memory pointers are frequently cast to some data structure. This is debatable though: while some people are uncomfortable with the frequent `reinterpret_cast<Foo *>(ptr)` instead of the short `(Foo *)ptr`, others are fine with more typing and more type safety.
- C++ name mangling, required for namespaces and function overloading, makes functions hard to call from assembly, so you need to use `extern "C"`.
- You have to make special code sections for static object constructors and destructors, `.ctors` and `.dtors` correspondingly.
- C++ exceptions cannot cross *context* boundaries, i.e. you cannot throw an exception in one thread and catch it in another. An operating system kernel deals with a much more complex context model: there are kernel threads, user-space processes entering the kernel, deferred and hardware interrupts. The contexts can preempt each other in a voluntary or cooperative manner, so the exception handling of the current context could be preempted by another context. There is also memory management and context-switching code which could conflict with exception handling code. Just like with RTTI, it's possible to implement the mechanism in the kernel, but the current standard library cannot be used.
- While Clang and G++ support the `__restrict__` extension, the official C++ standard does not support it.
- Variable length arrays (VLA) are discouraged in the Linux kernel; they are still handy in some scenarios, but are completely unavailable in C++.

Exceptions can improve performance when we have to handle error codes in too many places, e.g. (let the functions be inlined and very small):

```
if (func_1())
        return -EINVAL;
if (func_2())
        return -EINVAL;
....
if (func_100())
        return -EINVAL;
```

The problem with this code is the extra conditional jumps. Modern CPUs are pretty good at branch prediction, but it still hurts performance. In C++ we can just write:

```
try {
        func_1();
        func_2();
        ...
        func_100();
} catch (...) {
        return -EINVAL;
}
```

so there are no extra conditions in the code. Secondly, the killer feature of C++ is that **it is C**. If you don't want to use exceptions or RTTI, you can just switch the features off. Most C programs can be compiled with a C++ compiler with very small changes or without any changes at all. As an example, we need only this trivial change

```
$ diff -up nbody.c nbody-new.cc
@@ -112,9 +112,9 @@ static void advance(body bodies[]){
// ROUNDED_INTERACTIONS_COUNT elements to simplify one of the following
// loops and to also keep the second and third arrays in position_Deltas
// aligned properly.
- static alignas(__m128d) double
- position_Deltas[3][ROUNDED_INTERACTIONS_COUNT],
- magnitudes[ROUNDED_INTERACTIONS_COUNT];
+ static double
+ position_Deltas[3][ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16))),
+ magnitudes[ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16)));
// Calculate the position_Deltas between the bodies for each interaction.
for(intnative_t i=0, k=0; i < BODIES_COUNT-1; ++i)
```

to compile the C program with the G++ compiler. Modern C++ compilers provide C compatibility extensions like the `__restrict__` keyword. You can always write the most performance-critical code of a C++ program in C style. If you don't like the STL containers with their overhead, then you can use Boost.Intrusive or even port a similar container from the Linux kernel or another fast C project; in most cases this won't be painful. See for example how a hash table from PostgreSQL, the HTrie from Tempesta DB, and the Linux kernel read/write spinlocks (all written in C) were used in a C++ benchmark. The last thing which must be mentioned about the development of high-performance programs in C++ is template metaprogramming. It's very exciting that with the modern C++ standards you can use templates to write quite sophisticated logic which is fully computed at compile time and costs nothing at run time.

C has the `goto` operator. It's not as flexible as the assembly `jmp`, but C compilers provide extensions which make the operator almost a full equivalent of assembly jumps. Unfortunately, Rust doesn't support `goto`, which makes it awkward for a whole class of performance-crucial tasks. We're talking about parsers. Not configuration file parsers, which are perfectly done with a bunch of `switch` and `if` statements, but large and very fast parsers like an HTTP parser. You might think that this is a "too narrow" or "too specific" task, but recall the parser generators, like Ragel or GNU Bison: if you develop such a parser generator, then you never know how big the generated parsers will be. (By the way, Ragel extensively uses `goto` to generate very fast parsers.) Note also the SQL parsers in every RDBMS. Actually, we can generalize the class of tasks as *large and fast finite state machines*, which also includes, for example, regular expressions.

The HTTP parser in Tempesta FW is much larger than the HTTP parsers in other web servers because, in addition to basic HTTP parsing, it also does many security checks and strictly validates the input against the RFCs. Also, our parser works with zero-copy data, so it must take much care about data chunks. The technical details of the parser were described in our talk at the SCALE 17x conference; you can watch the talk video or read the slides.

Typically, HTTP parsers are implemented as a loop over input characters and nested `switch` statements for allowed characters and available states. See for example `ngx_http_parse_request_line()` in the Nginx parser source code. For the sake of brevity, let's consider a simplified version of the code:

```
while (++str_ptr) {
        switch (current_state) {
        case 0:
                ...
        case 1:
                ...
        ...
        case 100:
                switch (*str_ptr) {
                case 'a':
                        ...
                        current_state = 200;
                        break;
                case 'b':
                        ...
                        current_state = 101;
                        break;
                }
                break;
        case 101:
                ...
        ...
        }
}
```

Assume that the parser has finished parsing the previous chunk of data at state `100` and the current chunk of data starts with the character `'b'`. Regardless of the switch statement optimization (it can be optimized by a compiler with a lookup table or a binary search), there are 3 problems with the code:

- Looking up state `100` is still more expensive than a direct jump.
- While the code for state `101` is placed right after the code for state `100`, we have to re-enter the `while` and `switch` statements, i.e. look up the next state again, instead of just moving one character further and jumping right to the next state.
- Even if we always reach state `101` after state `100`, a compiler may reorganize the code in such a way that state `101` is placed at the beginning of the `switch` statement while state `100` is placed somewhere at the end.

These problems can be solved with the `goto` operator and the GCC compiler extensions for labels as values and label attributes, with code like:

```
// Use labels as values to remember the current state when we
// exit the state machine at the end of the current data chunk.
parser->state = &&state_100;
goto *parser->state;

while (true) {
state_0:
        ...
state_1:
        ...
// The state is placed by the compiler closer to the beginning
// of the code.
state_100: __attribute__((hot));
        // We still use small switches for small character sets.
        switch (*str_ptr) {
        case 'a':
                ...
                ++str_ptr;
                goto state_200;
        case 'b':
                ...
                ++str_ptr;
                // Just fall through to state 101.
        }
// This state is placed by the compiler after state_100.
state_101: __attribute__((cold));
        ...
}
```

Since Rust doesn't support the `goto` operator, we would need to use assembly language to implement the state machine with direct jumps and the optimal code layout.

Public key cryptography, and elliptic curves in particular, operates on big integers. The book BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic by Tom St Denis provides great detail about the subject as well as C implementations of many algorithms, but for now let's consider the basic addition of two 128-bit big integers on a 64-bit machine. The big integers consist of *limbs*, two 64-bit `long`s. To sum the integers we have to take care of the carry between the limbs, so the resulting C code looks like (see 4.2.1 in the book):

```
// a := a + b
// x[0] is the least significant limb,
// x[1] is the most significant limb.
void s_mp_add(unsigned long *a, unsigned long *b)
{
        unsigned long carry;

        a[0] += b[0];
        carry = (a[0] < b[0]);
        a[1] += b[1] + carry;
}
```

The code is small and simple, but you probably had to think a little about the correctness of the manipulations with `carry`. Fortunately, x86-64 is a CISC architecture, i.e. it provides a lot of computational features, and one of them is computation with carry, so the addition above can be done in only two instructions and there is no need for the comparison:

```
# Pointer to a is in %RDI, pointer to b is in %RSI.
movq    (%rdi), %r8
movq    8(%rdi), %r9
addq    (%rsi), %r8     # add, setting the carry flag
adcq    8(%rsi), %r9    # use the carry in the next addition
movq    %r8, (%rdi)
movq    %r9, 8(%rdi)
```

If you look into any well-optimized cryptography library, like OpenSSL or Tempesta TLS, you'll find a lot of assembly code (OpenSSL actually generates the assembly source code with Perl scripts).

Besides the missing `goto` operator, Rust has further limitations:

- Technically, Rust supports custom memory allocators, but there are serious limitations. It's worth mentioning that any high-performance software uses a number of ad-hoc memory allocators.
- Just like C++, Rust doesn't provide VLAs. But while C++ can still use `alloca(3)`, Rust doesn't provide stack allocations at all. That's a pity, because stack allocations are the cheapest ones, and custom memory allocators aren't an option due to the previous point.
- It seems Rust's likely/unlikely support is much less powerful than in the modern C or C++ compilers.
- Reading and writing data structures from/into raw memory is doable in Rust, but requires more code than in C and even C++. Not a big deal though.
- Rust's generics and macros are much less powerful than C++ templates coupled with C macros, although this is also not so crucial.

- Type **casting** is well handled by the modern C and C++ compilers with the `-Wall` compiler option. `switch` statements are also handled with `-Wall`. Moreover, GCC also introduces the `-Wimplicit-fallthrough` compiler option, which makes 'fall through's explicit.
- **Smarter loops** are addressed by the C++ range-based for loop since C++11.
- `const auto &` references and fine-grained copy and move semantics take care of **smart copying**.
- RAII provides robust **lifetimes**, but unfortunately *doesn't cover all the cases*.
- C++ `const` classes with or without `mutable` members, and `const` references and variables, provide fine-grained **mutability**, but also *don't cover all the cases*.

The uncovered cases require `unsafe` code in Rust, just like working with raw pointers does in C++.

Benchmarks game results (run time in seconds; lower is better):

| Problem | G++ | GCC | Clang | Rust |
|---|---|---|---|---|
| fannkuch-redux | 8.07 | 7.53 | 9.45 | 6.88 |
| spectral-norm | 0.72 | 0.72 | 0.72 | 0.71 |
| n-body | 4.09 | 4.30 | 3.31 | 3.31 |
| binary-trees | 1.12 | 1.78 | 1.88 | 1.20 |
| fasta | 1.04 | 0.82 | 0.88 | 0.91 |
| pidigits | 0.71 | 0.73 | 0.81 | 0.74 |
| mandelbrot | 0.84 | 1.27 | 2.09 | 0.92 |
| regex-redux | 1.08 | 0.80 | 0.81 | 1.28 |
| reverse-complement | 0.63 | 0.87 | 0.98 | 0.75 |
| k-nucleotide | 1.93 | 3.71 | 6.19 | 3.29 |

The C program

```
// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Jeremy Zerfas
// Based on the Ada program by Jonathan Parker and Georg Bauhaus which in turn
// was based on code by Dave Fladebo, Eckehard Berns, Heiner Marxen, Hongwei Xi,
// and The Anh Tran and also the Java program by Oleg Mazurov.

// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
#define PREFERRED_NUMBER_OF_BLOCKS_TO_USE 12

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

// intptr_t should be the native integer type on most sane systems.
typedef intptr_t intnative_t;

int main(int argc, char ** argv){
    const intnative_t n=atoi(argv[1]);

    // Create and initialize factorial_Lookup_Table.
    intnative_t factorial_Lookup_Table[n+1];
    factorial_Lookup_Table[0]=1;
    for(intnative_t i=0; ++i<=n;)
        factorial_Lookup_Table[i]=i*factorial_Lookup_Table[i-1];

    // Determine the block_Size to use. If n! is less than
    // PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to
    // prevent block_Size from being set to 0. This also causes smaller values
    // of n to be computed serially which is faster and uses less resources for
    // small values of n.
    const intnative_t block_Size=factorial_Lookup_Table[n]/
      (factorial_Lookup_Table[n]<PREFERRED_NUMBER_OF_BLOCKS_TO_USE ?
      1 : PREFERRED_NUMBER_OF_BLOCKS_TO_USE);

    intnative_t maximum_Flip_Count=0, checksum=0;

    // Iterate over each block.
    #pragma omp parallel for \
      reduction(max:maximum_Flip_Count) reduction(+:checksum)
    for(intnative_t initial_Permutation_Index_For_Block=0;
      initial_Permutation_Index_For_Block<factorial_Lookup_Table[n];
      initial_Permutation_Index_For_Block+=block_Size){

        intnative_t count[n];
        int8_t temp_Permutation[n], current_Permutation[n];

        // Initialize count and current_Permutation.
        count[0]=0;
        for(intnative_t i=0; i<n; ++i)
            current_Permutation[i]=i;
        for(intnative_t i=n-1,
          permutation_Index=initial_Permutation_Index_For_Block; i>0; --i){
            const intnative_t d=permutation_Index/factorial_Lookup_Table[i];
            permutation_Index=permutation_Index%factorial_Lookup_Table[i];
            count[i]=d;

            for(intnative_t j=0; j<n; ++j)
                temp_Permutation[j]=current_Permutation[j];
            for(intnative_t j=0; j<=i; ++j)
                current_Permutation[j]= j+d<=i ?
                  temp_Permutation[j+d] : temp_Permutation[j+d-i-1];
        }

        // Iterate over each permutation in the block.
        const intnative_t last_Permutation_Index_In_Block=
          initial_Permutation_Index_For_Block+block_Size-1;
        for(intnative_t permutation_Index=initial_Permutation_Index_For_Block;
          ; ++permutation_Index){

            // If the first value in the current_Permutation is not 1 (0) then
            // we will need to do at least one flip for the
            // current_Permutation.
            if(current_Permutation[0]>0){

                // Make a copy of current_Permutation[] to work on. Note that
                // we don't need to copy the first value since that will be
                // stored in a separate variable since it gets used a lot.
                for(intnative_t i=0; ++i<n;)
                    temp_Permutation[i]=current_Permutation[i];

                intnative_t flip_Count=1;

                // Flip temp_Permutation until the element at the first_Value
                // index is 1 (0).
                for(intnative_t first_Value=current_Permutation[0];
                  temp_Permutation[first_Value]>0; ++flip_Count){

                    // Record the new_First_Value and restore the old
                    // first_Value at its new flipped position.
                    const int8_t new_First_Value=
                      temp_Permutation[first_Value];
                    temp_Permutation[first_Value]=first_Value;

                    // If first_Value is greater than 3 (2) then we are
                    // flipping a series of four or more values so we will also
                    // need to flip additional elements in the middle of the
                    // temp_Permutation.
                    if(first_Value>2){
                        intnative_t low_Index=1, high_Index=first_Value-1;
                        // Note that this loop is written so that it will run
                        // at most 16 times so that compilers will be more
                        // willing to unroll it. Consequently this won't work
                        // right when n is greater than 35. This would probably
                        // be the least of your concerns since 21! won't fit
                        // into 64 bit integers and even if it did you probably
                        // wouldn't want to run this program with a value that
                        // large since it would take thousands of years to do
                        // on a modern desktop computer. ;-)
                        do{
                            const int8_t temp=temp_Permutation[high_Index];
                            temp_Permutation[high_Index]=
                              temp_Permutation[low_Index];
                            temp_Permutation[low_Index]=temp;
                        }while(low_Index+++3<=high_Index-- && low_Index<16);
                    }

                    // Update first_Value to new_First_Value that we recorded
                    // earlier.
                    first_Value=new_First_Value;
                }

                // Update the checksum.
                if(permutation_Index%2==0)
                    checksum+=flip_Count;
                else
                    checksum-=flip_Count;

                // Update maximum_Flip_Count if necessary.
                if(flip_Count>maximum_Flip_Count)
                    maximum_Flip_Count=flip_Count;
            }

            // Break out of the loop when we get to the
            // last_Permutation_Index_In_Block.
            if(permutation_Index>=last_Permutation_Index_In_Block)
                break;

            // Generate the next permutation.
            int8_t first_Value=current_Permutation[1];
            current_Permutation[1]=current_Permutation[0];
            current_Permutation[0]=first_Value;
            for(intnative_t i=1; ++count[i]>i;){
                count[i++]=0;
                const int8_t new_First_Value=current_Permutation[0]=
                  current_Permutation[1];
                for(intnative_t j=0; ++j<i;)
                    current_Permutation[j]=current_Permutation[j+1];
                current_Permutation[i]=first_Value;
                first_Value=new_First_Value;
            }
        }
    }

    // Output the results to stdout.
    printf("%jd\nPfannkuchen(%jd) = %jd\n", (intmax_t)checksum, (intmax_t)n,
      (intmax_t)maximum_Flip_Count);

    return 0;
}
```

The Rust program

```
// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Cliff L. Biffle, translated from Jeremy Zerfas's C program.
//
// The C program was based on the Ada program by Jonathan Parker and Georg
// Bauhaus which in turn was based on code by Dave Fladebo, Eckehard Berns,
// Heiner Marxen, Hongwei Xi, and The Anh Tran and also the Java program by
// Oleg Mazurov.

extern crate rayon;
use rayon::prelude::*;
use std::mem::replace;

// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
const PREFERRED_NUMBER_OF_BLOCKS_TO_USE: usize = 12;

// One greater than the maximum `n` value. Used to size stack arrays.
const MAX_N: usize = 16;

fn main() {
    let n = std::env::args().nth(1).unwrap().parse().unwrap();
    // This assert eliminates several bounds checks.
    assert!(n < MAX_N);

    // Create and initialize factorial_lookup_table.
    let factorial_lookup_table = {
        let mut table: [usize; MAX_N] = [0; MAX_N];
        table[0] = 1;
        for i in 1..MAX_N {
            table[i] = i * table[i - 1];
        }
        table
    };

    // Determine the block_size to use. If n! is less than
    // PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to
    // prevent block_size from being set to 0. This also causes smaller values
    // of n to be computed serially which is faster and uses less resources
    // for small values of n.
    let block_size =
        1.max(factorial_lookup_table[n] / PREFERRED_NUMBER_OF_BLOCKS_TO_USE);
    let block_count = factorial_lookup_table[n] / block_size;

    // Iterate over each block.
    let (checksum, max_flip_count) = (0..block_count)
        .into_par_iter()
        .map(|bn| {
            let initial_permutation_index = bn * block_size;

            let mut count: [usize; MAX_N] = [0; MAX_N];
            let mut current_permutation: [u8; MAX_N] =
                [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

            // Initialize count and current_permutation.
            {
                let mut temp_permutation: [u8; MAX_N] = [0; MAX_N];
                let mut permutation_index = initial_permutation_index;
                for i in (1..n).rev() {
                    let f = factorial_lookup_table[i];
                    let d = permutation_index / f;
                    count[i] = d;
                    // Rotate the permutation left by d places. This is faster
                    // than using slice::rotate_left.
                    temp_permutation[0..=i - d]
                        .copy_from_slice(&current_permutation[d..=i]);
                    temp_permutation[i - d + 1..=i]
                        .copy_from_slice(&current_permutation[..d]);
                    current_permutation = temp_permutation;
                    permutation_index = permutation_index % f;
                }
            }

            let mut max_flip_count = 0;
            let mut checksum = 0;

            // Iterate over each permutation in the block.
            let last_permutation_index =
                initial_permutation_index + block_size;
            for permutation_index in
                initial_permutation_index..last_permutation_index
            {
                // If the first value in the current_permutation is not 1 (0)
                // then we will need to do at least one flip for the
                // current_permutation.
                if current_permutation[0] > 0 {
                    // Make a copy of current_permutation[] to work on.
                    let mut temp_permutation = current_permutation;

                    let mut flip_count: usize = 1;

                    // Flip temp_permutation until the element at the
                    // first_value index is 1 (0).
                    let mut first_value =
                        current_permutation[0] as usize & 0xF;
                    while temp_permutation[first_value] > 0 {
                        // Record the new_first_value and restore the old
                        // first_value at its new flipped position.
                        let new_first_value = replace(
                            &mut temp_permutation[first_value],
                            first_value as u8,
                        );

                        // If first_value is greater than 3 (2) then we are
                        // flipping a series of four or more values so we will
                        // also need to flip additional elements in the middle
                        // of the temp_permutation.
                        if first_value > 2 {
                            for (low_index, high_index) in
                                (1..first_value).zip((1..first_value).rev())
                            {
                                temp_permutation.swap(high_index, low_index);
                                if low_index + 3 > high_index {
                                    break;
                                }
                            }
                        }

                        // Update first_value to new_first_value that we
                        // recorded earlier.
                        first_value = new_first_value as usize & 0xF;
                        flip_count += 1;
                    }

                    // Update the checksum.
                    if permutation_index % 2 == 0 {
                        checksum += flip_count;
                    } else {
                        checksum -= flip_count;
                    }

                    // Update max_flip_count if necessary.
                    max_flip_count = max_flip_count.max(flip_count);
                }

                // Generate the next permutation.
                current_permutation.swap(0, 1);
                let mut first_value = current_permutation[0];
                for i in 1..MAX_N - 2 {
                    count[i] += 1;
                    if count[i] <= i {
                        break;
                    }
                    count[i] = 0;
                    let new_first_value = current_permutation[1];
                    for j in 0..i + 1 {
                        current_permutation[j] = current_permutation[j + 1];
                    }
                    current_permutation[i + 1] = first_value;
                    first_value = new_first_value;
                }
            }
            (checksum, max_flip_count)
        })
        .reduce(
            || (0, 0),
            |(cs1, mf1), (cs2, mf2)| (cs1 + cs2, mf1.max(mf2)),
        );

    // Output the results to stdout.
    println!("{}", checksum);
    println!("Pfannkuchen({}) = {}", n, max_flip_count);
}
```

Let's run the C program and collect a performance profile using the Linux perf tool. With `perf report` or `perf annotate` we can see the hottest code in the program:

```
0.46 | movzbl -0x9(%r15,%rax,1),%ecx
0.96 | movzbl 0x9(%r15),%r8d
| mov %r8b,-0x9(%r15,%rax,1)
2.31 | mov %cl,0x9(%r15)
| lea -0xa(%rax),%rcx
12.76 | cmp $0xb,%rdi
```

The performance killer, taking 12.76% of the time, is part of the unrolled loop:

```
do{
    const int8_t temp=temp_Permutation[high_Index];
    temp_Permutation[high_Index]=
      temp_Permutation[low_Index];
    temp_Permutation[low_Index]=temp;
}while(low_Index+++3<=high_Index-- && low_Index<16);
```

The `cmp` instruction is part of the `while` loop condition. Actually, this loop just reverses bytes in the array. While the C implementation uses naive and heavy operations with array indexes, the Rust implementation uses an efficient double iterator:

```
if first_value > 2 {
    for (low_index, high_index) in
        (1..first_value).zip((1..first_value).rev())
    {
        temp_permutation.swap(high_index, low_index);
        if low_index + 3 > high_index {
            break;
        }
    }
}
```

The article Fast array reversal with SIMD! describes several ways to improve the performance of the C program (the article uses C++, by the way). The first one is to use only one index `i` and iterate only until the middle of the permuted part of the array, with `temp_Permutation[i]` and `temp_Permutation[high_Index - i]`. That would be quite close to the Rust double iterators. By the way, a more advanced way to improve the performance of both programs is to use the `PSHUFB` SSSE3 instruction, or the `_mm_shuffle_epi8()` intrinsic, instead of the whole loop. Since there is a very small number of shuffle masks, all of them can be defined at compile time and immediately loaded into a control mask register for the instruction. However, this isn't the only difference between the implementations. The Rust program takes advantage of the maximum input number, `const MAX_N: usize = 16`. Probably this small improvement affects performance the most, since the compiler can now do better optimizations with loops and static arrays. The program explicitly uses static array initialization

```
let mut current_permutation: [u8; MAX_N] =
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
```

, while the C implementation does this at run time with no assumptions about the input data:

```
for(intnative_t i=0; i<n; ++i)
current_Permutation[i]=i;
```

The Rust program copies arrays using built-in memory copying:

```
let mut temp_permutation = current_permutation;
```

, while the C program does this in a loop again:

```
for(intnative_t i=0; ++i<n;)
temp_Permutation[i]=current_Permutation[i];
```

These aren't all the inefficiencies in the C program which were eliminated in the Rust implementation (both programs are based on the same initial Ada program). In most places an optimized version of the program will be not only faster but also shorter. Thus, in this single case where a Rust implementation is somewhat faster than C, the performance difference is not about a better compiler, but about a more efficient structure of the program, which allows a compiler to optimize the code better.

**A real high-level system programming language must be compatible with C.** Just consider the 2 examples of our real life projects.

The first one was a web application firewall (WAF). Such software is typically built on top of the Nginx or HAProxy HTTPS servers, which are written in C. It was easy to write a C++ module for Nginx, but we would need extra glue code to develop the module in Rust and maintain all our patches for the C code of Nginx. The same developers easily switched between the C and C++ parts of the code.

In the second case our client wished to execute some external logic using MySQL user defined functions (UDF) to interact with the operating system. We could develop the logic in any programming language, but there was a constraint: we had to execute the program 5,000 times per second on each CPU core! That was impossible to achieve even using `posix_spawnp()`, the fastest way to execute a program in Linux. We ended up developing a custom UDF for MySQL, which is a shared object loaded into the MySQL server process. It was quite straightforward to do with C++.

An opposite example of using Rust for an Nginx module is CloudFlare's Quiche, an Nginx extension to support the QUIC and HTTP/3 protocols. While it's definitely possible to use Rust for this kind of task, the developers, besides the FFI code for C/C++ bindings, still had to write some C code to patch Nginx. This means that

- you have to write some extra boilerplate code for C/C++ bindings,
- and you still have to deal with C/C++ **and** the second language, which makes the project more complex.

In our HTTP parser we couldn't achieve fast state transitions with the `switch` statement, and we couldn't get a satisfactory code layout either. At that time we considered introducing inline assembly into the code of the parser. Given that the zero-copy state machine was already so sophisticated, we weren't happy about the idea. It was such a nice surprise to find computed labels and hot/cold attributes among the compiler extensions! Thanks to these features, the compiler generated the optimal code for the parser. The power of C++ is "There's more than one way to do it", TIMTOWTDI. Yes, this is Perl's idea, but in many cases C++ allows you to write your program in plain C, with template metaprogramming, using the high-level STL, or with well-optimized custom algorithms and data structures. Modern C++ is sophisticated and requires years of experience to be fluent with the language, but it is a professional tool which allows a professional developer to create the fastest and most reliable software.

Not only is Rust immature, but it seems the language designers intentionally limited the language. There are a lot of poor programs misusing `goto`, so they just removed the operator: good for juniors, but too limited for professionals. It's quite unlikely that the language and the compiler will give you a nice surprise when you struggle with a complex technical task. Instead, it's likely that when you need to do something simple, something you've been doing for ages in C or C++, you'll be disappointed and start fighting the compiler. As an example, the `likely` and `unlikely` compiler hints have been used in the Linux kernel for ages, and they are so popular in user-space C/C++ programming that they were included in the C++20 standard (before that, programmers had to use compiler intrinsics). But with Rust you find that the API is experimental and works for `if` statements only.

Web application firewall acceleration

From our experience in developing custom core logic of Web Application Firewalls (WAF), we learned several performance issues typical for the most, or even all, modern WAFs which may lead to high cost of ownership and/or denial of service. In this article we introduce a WAF accelerator, which like a web accelerator, improves performance of WAFs and protects them against DDoS attacks.

Recap NatSys Lab. blog

We recap the most interesting posts since 2011 from our old NatSys Laboratory blog: effect of the recent CPU vulnerabilities onto Linux system calls performance, deep dive into HTTP proxies functionality and performance comparison of Tempesta FW with Nginx and HAProxy, fast strings processing algorithms, lock-free data structures, and memory allocators. A lot of technical details!