Posted on October 28, 2020
This article doesn't answer which programming language is better. Instead, it discusses the most powerful tool set for developing the fastest server-side system software, such as database engines and HTTPS servers. This type of software has several specific properties that shape the choice of tools.
We've been developing the fastest software in C, C++, and Assembly for ages. It's no surprise that since Rust is "focused on performance" we're very interested in it, albeit with a bit of skepticism. Just remember the rise of the Java programming language: there were a lot of reports that JIT compilation produced code faster than C++. Now it's hard to find a case when C++ is slower than Java; see for example the benchmarks game. It's also worth mentioning that memory garbage collection (GC) in Java leads to high tail latencies, and it's hard or even impossible to do anything about the problem. Golang also cannot be considered for high performance programming due to its GC.
However, it is problematic to use C++ in the operating system kernel:

- It's hard to use libstdc++ with RTTI and exceptions in the kernel space. Actually, dynamic_cast isn't so frequently used and there are a lot of C++ projects compiled without RTTI. If you need exceptions, then you have to port them into the kernel. libstdc++ uses basic C allocations, so it must be significantly reworked for the kernel.
- The kernel provides its own memory allocators (vmalloc(), kmalloc(), and so on), thus you have to use placement new and/or just use the C functions for memory allocation and freeing. Aligned memory is crucial for performance, but you need to write special wrappers to get aligned memory with new.
- Some people dislike the long reinterpret_cast<Foo *>(ptr) instead of the short (Foo *)ptr; others are fine with more typing and more type safety.
- C++ name mangling requires extern "C" wrappers around code called from C.
- Global constructors and destructors are placed into the sections .ctor and .dtor correspondingly, and the kernel doesn't execute them for you.
- While all modern compilers support the __restrict__ extension, the official C++ standard does not support it.

Exceptions can improve performance when we have to handle error codes in too many places, e.g. (let the functions be inlined and very small):
if (func_1())
return -EINVAL;
if (func_2())
return -EINVAL;
....
if (func_100())
return -EINVAL;
The problem with this code is that there are extra conditional jumps. Modern CPUs are pretty good at branch prediction, but it still hurts performance. In C++ we can just write:
try {
func_1();
func_2();
...
func_100();
} catch (...) {
return -EINVAL;
}
, so there are no extra conditions in the hot path. However, this isn't free: most of the functions in your C++ code have to have extra epilogues with a table of the exceptions which these functions can catch, and an appropriate cleanup table. The function epilogues aren't executed in the normal workflow, but they increase the size of the code, causing extra pollution of the CPU instruction cache. You can find great details about C++ exception handling internals in Nico Brailovsky's blog. Secondly, the killer feature of C++ is that it is C. If you don't want to use exceptions or RTTI, then you can just switch the features off. Most C programs can be compiled with a C++ compiler with very small changes or without any changes at all. As an example, we need only this trivial change
$ diff -up nbody.c nbody-new.cc
@@ -112,9 +112,9 @@ static void advance(body bodies[]){
// ROUNDED_INTERACTIONS_COUNT elements to simplify one of the following
// loops and to also keep the second and third arrays in position_Deltas
// aligned properly.
- static alignas(__m128d) double
- position_Deltas[3][ROUNDED_INTERACTIONS_COUNT],
- magnitudes[ROUNDED_INTERACTIONS_COUNT];
+ static double
+ position_Deltas[3][ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16))),
+ magnitudes[ROUNDED_INTERACTIONS_COUNT] __attribute__((aligned(16)));
// Calculate the position_Deltas between the bodies for each interaction.
for(intnative_t i=0, k=0; i < BODIES_COUNT-1; ++i)
to compile the C program with the G++ compiler. Modern C++ compilers provide C compatibility extensions like the __restrict__ keyword. You can always write the most performance-critical code of a C++ program in C style. If you don't like the overhead of the STL containers, then you can use Boost.intrusive or even port a similar container from the Linux kernel or another fast C project; in most cases this won't be painful. See for example how a hash table from PostgreSQL, the HTrie from Tempesta DB, and the Linux kernel read/write spinlocks (all written in C) were used in a C++ benchmark. The last thing which must be mentioned about development of high performance programs in C++ is template metaprogramming. A very exciting aspect of the modern C++ standards is that, using templates, you can write quite sophisticated logic which is fully computed at compile time and costs nothing at run time.
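As a minimal sketch of such compile-time computation (a toy example, not code from any of the projects mentioned above):

```cpp
#include <cassert>

// A compile-time computation in modern C++: the whole factorial is
// evaluated by the compiler, so it costs nothing at run time.
constexpr unsigned long factorial(unsigned n)
{
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// The result can be used wherever a compile-time constant is required.
static_assert(factorial(12) == 479001600UL, "evaluated at compile time");
```

Before constexpr, the same effect required recursive template specializations; today the compiler simply folds the call into a constant.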
Another C feature important for performance is the goto operator. It's not as flexible as the assembly jmp, but C compilers provide extensions which make the operator almost a full equivalent of assembly jumps. Unfortunately, Rust doesn't support goto, which makes it awkward for a whole class of performance-crucial tasks. We're talking about parsers. Not configuration file parsers, which are perfectly done with a bunch of switch and if statements, but large and very fast parsers like an HTTP parser. You might think that this is a "too narrow" or "too specific" task, but recall the parser generators, like Ragel or GNU Bison: if you develop such a parser generator, then you never know how big the generated parsers will be. (By the way, Ragel extensively uses goto to generate very fast parsers.) Note also the SQL parsers in every RDBMS. Actually, we can generalize this class of tasks as large and fast finite state machines, which also includes, for example, regular expressions.
The HTTP parser in Tempesta FW is much larger than the HTTP parsers in other web servers because, in addition to the basic HTTP parsing, it also performs many security checks and strictly validates the input against the RFCs. Also, our parser works with zero-copy data, so it must take extra care of data chunks. The technical details of the parser were described in our talk at the SCALE 17x conference; you can watch the talk video or the slides.
Typically, HTTP parsers are implemented as a loop over input characters and nested switch
statements for allowed characters and available states. See for example ngx_http_parse_request_line()
in the Nginx parser source code. For the sake of brevity, let's consider a simplified version of the code:
while (++str_ptr) {
switch (current_state) {
case 0:
...
case 1:
...
...
case 100:
switch (*str_ptr) {
case 'a':
...
current_state = 200;
break;
case 'b':
...
current_state = 101;
break;
}
break;
case 101:
...
...
}
}
Assume that the parser has finished parsing the previous chunk of data at state 100 and the current chunk of data starts with the character 'b'. Regardless of the switch statement optimization (a compiler can optimize it with a lookup table or a binary search), there are 3 problems with the code:

1. To continue parsing at state 100 the code has to go through the switch statement dispatch, which is still more expensive than a direct jump.
2. Even if the code for state 101 is placed right after the code for state 100, we have to re-enter the while and switch statements, i.e. look up the next state again, instead of just moving one character further and jumping right to the next state.
3. Although we placed the code for state 101 after state 100, a compiler may reorganize the code in such a way that state 101 is placed at the beginning of the switch statement while state 100 is placed somewhere at the end.

These problems can be solved with the goto operator and the GCC compiler extensions for labels as values and label attributes, with code like:
// Use labels as values to remember the current state when we
// exit the state machine at the end of current data chunk.
parser->state = &&state_100;
goto *parser->state;
while (true) {
state_0:
...
state_1:
...
// The state is placed by a compiler closer to the beginning
// of the code.
state_100: __attribute__((hot))
// We still use small switches for small character sets.
switch (*str_ptr) {
case 'a':
...
++str_ptr;
goto state_200;
case 'b':
...
++str_ptr;
// Just fall through to the state 101.
}
// This state is placed by the compiler after state_100.
state_101: __attribute__((cold))
...
}
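A tiny self-contained illustration of the same technique (using the GCC/Clang "labels as values" extension; a toy sketch, not the Tempesta FW parser):

```cpp
#include <cassert>

// A minimal state machine dispatching with computed goto instead of a
// switch lookup: count the characters 'a' and 'b' in a string.
static int count_ab(const char *s)
{
    int n = 0;
    for (;;) {
        // Compute the jump target for the current character and jump
        // to it directly (the &&label syntax is a GCC extension).
        void *target = (*s == '\0') ? &&done
                     : (*s == 'a' || *s == 'b') ? &&match : &&skip;
        goto *target;
    match:
        ++n;
        // Fall through to the next state without re-entering a switch.
    skip:
        ++s;
    }
done:
    return n;
}
```

In a real parser each state keeps its own label, and the saved label value lets the parser resume exactly where the previous data chunk ended.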
Since Rust doesn't support the goto
operator, we would need to use assembly language to implement the state machine with direct jumps and the optimal code layout.

Public key cryptography, and elliptic curves in particular, operates on big integers. The book BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic by Tom St Denis provides great details about the subject as well as C implementations of many algorithms, but for now let's consider the basic addition of two 128-bit big integers on a 64-bit machine. The big integers consist of limbs, in this case two 64-bit longs. To sum the integers we have to take care of the carry between the limbs, so the resulting C code looks like this (see 4.2.1 in the book):
// a := a + b
// x[0] is the less significant limb,
// x[1] is the most significant limb.
void s_mp_add(unsigned long *a, unsigned long *b)
{
unsigned long carry;
a[0] += b[0];
carry = (a[0] < b[0]);
a[1] += b[1] + carry;
}
The code is small and simple, but you probably had to think a little about the correctness of the carry manipulations. Fortunately, x86-64 is a CISC architecture, i.e. it provides a lot of computational features, and one of them is addition with carry, so the additions above can be done in two instructions only and there is no need for the comparison:
// Pointer to a is in %RDI, pointer to b is in %RSI
movq (%rdi), %r8
movq 8(%rdi), %r9
addq (%rsi), %r8 // the addition sets the carry flag
adcq 8(%rsi), %r9 // use the carry in the next addition
movq %r8, (%rdi)
movq %r9, 8(%rdi)
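The same add-with-carry pattern can also be expressed without hand-written assembly: GCC and Clang provide the __builtin_add_overflow() built-in, which returns the carry and lets the compiler emit the add/adc pair itself. A sketch equivalent to s_mp_add() above:

```cpp
#include <cassert>
#include <cstdint>

// a := a + b, same semantics as s_mp_add() above: a[0] and b[0] are
// the least significant limbs. __builtin_add_overflow() returns the
// carry of the low-limb addition, so GCC and Clang can compile this
// down to an add/adc instruction pair.
static void s_mp_add_builtin(uint64_t *a, const uint64_t *b)
{
    uint64_t carry = __builtin_add_overflow(a[0], b[0], &a[0]);
    a[1] += b[1] + carry;
}
```

For longer big integers the same chain simply continues limb by limb, each step consuming the previous carry.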
If you look into any well-optimized cryptography library, like OpenSSL or Tempesta TLS, then you'll find a lot of assembly code (OpenSSL actually generates the assembly source code with Perl scripts). Rust is limited here as well: besides the missing goto operator, there is no equivalent of alloca(3); Rust doesn't provide stack allocations at all. That's a pity, because stack allocations are the cheapest ones and custom memory allocators aren't an option due to the previous point. On the other hand, it's quite unlikely that Rust can be used for the main Linux kernel code, e.g. memory management or networking. The first thing is that there are millions of lines of code carefully written, reviewed, and tested. There is no reason to just throw all this code out and rewrite it from scratch (and there are probably no such resources in the world anyway). The second thing is that Rust in the default safe mode has multiple limitations for advanced concurrency code. Paul McKenney discussed many of the limitations in his So You Want to Rust the Linux Kernel? blog series.
One of the options to deal with the limitations is to use Rust in unsafe mode, but then most of the advantage of Rust vanishes.
What about safety? Modern C++ and the compilers provide a lot in this area too:

- Many bugs, such as uninitialized variables, are caught with the -Wall compiler option. Implicit fall-throughs in switch statements are also handled with -Wall. Moreover, GCC also introduces the -Wimplicit-fallthrough compiler option, which makes fall-throughs explicit.
- const auto & references and fine-grained copy and move semantics take care of smart copying.
- const classes with or without mutable members, plus const references and variables, provide fine-grained mutability, but also don't cover all the cases.
- The compilers provide built-ins, like __builtin_object_size, which are helpful to make memory operations like memcpy() safe; read for example the LWN article about memcpy() hardening in the Linux kernel.
- The most dangerous low-level code requires unsafe code in Rust, just like working with raw pointers in C++.

Now let's look at the numbers for several programs from the benchmarks game:

Problem | G++ | GCC | Clang | Rust |
---|---|---|---|---|
fannkuch-redux | 8.07 | 7.53 | 9.45 | 6.88 |
spectral-norm | 0.72 | 0.72 | 0.72 | 0.71 |
n-body | 4.09 | 4.30 | 3.31 | 3.31 |
binary-trees | 1.12 | 1.78 | 1.88 | 1.20 |
fasta | 1.04 | 0.82 | 0.88 | 0.91 |
pidigits | 0.71 | 0.73 | 0.81 | 0.74 |
mandelbrot | 0.84 | 1.27 | 2.09 | 0.92 |
regex-redux | 1.08 | 0.80 | 0.81 | 1.28 |
reverse-complement | 0.63 | 0.87 | 0.98 | 0.75 |
k-nucleotide | 1.93 | 3.71 | 6.19 | 3.29 |
The C program
// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Jeremy Zerfas
// Based on the Ada program by Jonathan Parker and Georg Bauhaus which in turn
// was based on code by Dave Fladebo, Eckehard Berns, Heiner Marxen, Hongwei Xi,
// and The Anh Tran and also the Java program by Oleg Mazurov.
// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
#define PREFERRED_NUMBER_OF_BLOCKS_TO_USE 12
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
// intptr_t should be the native integer type on most sane systems.
typedef intptr_t intnative_t;
int main(int argc, char ** argv){
const intnative_t n=atoi(argv[1]);
// Create and initialize factorial_Lookup_Table.
intnative_t factorial_Lookup_Table[n+1];
factorial_Lookup_Table[0]=1;
for(intnative_t i=0; ++i<=n;)
factorial_Lookup_Table[i]=i*factorial_Lookup_Table[i-1];
// Determine the block_Size to use. If n! is less than
// PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to prevent
// block_Size from being set to 0. This also causes smaller values of n to
// be computed serially which is faster and uses less resources for small
// values of n.
const intnative_t block_Size=factorial_Lookup_Table[n]/
(factorial_Lookup_Table[n]<PREFERRED_NUMBER_OF_BLOCKS_TO_USE ?
1 : PREFERRED_NUMBER_OF_BLOCKS_TO_USE);
intnative_t maximum_Flip_Count=0, checksum=0;
// Iterate over each block.
#pragma omp parallel for \
reduction(max:maximum_Flip_Count) reduction(+:checksum)
for(intnative_t initial_Permutation_Index_For_Block=0;
initial_Permutation_Index_For_Block<factorial_Lookup_Table[n];
initial_Permutation_Index_For_Block+=block_Size){
intnative_t count[n];
int8_t temp_Permutation[n], current_Permutation[n];
// Initialize count and current_Permutation.
count[0]=0;
for(intnative_t i=0; i<n; ++i)
current_Permutation[i]=i;
for(intnative_t i=n-1,
permutation_Index=initial_Permutation_Index_For_Block; i>0; --i){
const intnative_t d=permutation_Index/factorial_Lookup_Table[i];
permutation_Index=permutation_Index%factorial_Lookup_Table[i];
count[i]=d;
for(intnative_t j=0; j<n; ++j)
temp_Permutation[j]=current_Permutation[j];
for(intnative_t j=0; j<=i; ++j)
current_Permutation[j]= j+d<=i ?
temp_Permutation[j+d] : temp_Permutation[j+d-i-1];
}
// Iterate over each permutation in the block.
const intnative_t last_Permutation_Index_In_Block=
initial_Permutation_Index_For_Block+block_Size-1;
for(intnative_t permutation_Index=initial_Permutation_Index_For_Block; ;
++permutation_Index){
// If the first value in the current_Permutation is not 1 (0) then
// we will need to do at least one flip for the current_Permutation.
if(current_Permutation[0]>0){
// Make a copy of current_Permutation[] to work on. Note that we
// don't need to copy the first value since that will be stored
// in a separate variable since it gets used a lot.
for(intnative_t i=0; ++i<n;)
temp_Permutation[i]=current_Permutation[i];
intnative_t flip_Count=1;
// Flip temp_Permutation until the element at the first_Value
// index is 1 (0).
for(intnative_t first_Value=current_Permutation[0];
temp_Permutation[first_Value]>0; ++flip_Count){
// Record the new_First_Value and restore the old
// first_Value at its new flipped position.
const int8_t new_First_Value=temp_Permutation[first_Value];
temp_Permutation[first_Value]=first_Value;
// If first_Value is greater than 3 (2) then we are flipping
// a series of four or more values so we will also need to
// flip additional elements in the middle of the
// temp_Permutation.
if(first_Value>2){
intnative_t low_Index=1, high_Index=first_Value-1;
// Note that this loop is written so that it will run at
// most 16 times so that compilers will be more willing
// to unroll it. Consequently this won't work right when
// n is greater than 35. This would probably be the
// least of your concerns since 21! won't fit into 64
// bit integers and even if it did you probably wouldn't
// want to run this program with a value that large
// since it would take thousands of years to do on a
// modern desktop computer. ;-)
do{
const int8_t temp=temp_Permutation[high_Index];
temp_Permutation[high_Index]=
temp_Permutation[low_Index];
temp_Permutation[low_Index]=temp;
}while(low_Index+++3<=high_Index-- && low_Index<16);
}
// Update first_Value to new_First_Value that we recorded
// earlier.
first_Value=new_First_Value;
}
// Update the checksum.
if(permutation_Index%2==0)
checksum+=flip_Count;
else
checksum-=flip_Count;
// Update maximum_Flip_Count if necessary.
if(flip_Count>maximum_Flip_Count)
maximum_Flip_Count=flip_Count;
}
// Break out of the loop when we get to the
// last_Permutation_Index_In_Block.
if(permutation_Index>=last_Permutation_Index_In_Block)
break;
// Generate the next permutation.
int8_t first_Value=current_Permutation[1];
current_Permutation[1]=current_Permutation[0];
current_Permutation[0]=first_Value;
for(intnative_t i=1; ++count[i]>i;){
count[i++]=0;
const int8_t new_First_Value=current_Permutation[0]=
current_Permutation[1];
for(intnative_t j=0; ++j<i;)
current_Permutation[j]=current_Permutation[j+1];
current_Permutation[i]=first_Value;
first_Value=new_First_Value;
}
}
}
// Output the results to stdout.
printf("%jd\nPfannkuchen(%jd) = %jd\n", (intmax_t)checksum, (intmax_t)n,
(intmax_t)maximum_Flip_Count);
return 0;
}
The Rust program
// The Computer Language Benchmarks Game
// https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
//
// Contributed by Cliff L. Biffle, translated from Jeremy Zerfas's C program.
//
// The C program was based on the Ada program by Jonathan Parker and Georg
// Bauhaus which in turn was based on code by Dave Fladebo, Eckehard Berns,
// Heiner Marxen, Hongwei Xi, and The Anh Tran and also the Java program by Oleg
// Mazurov.
extern crate rayon;
use rayon::prelude::*;
use std::mem::replace;
// This value controls how many blocks the workload is broken up into (as long
// as the value is less than or equal to the factorial of the argument to this
// program) in order to allow the blocks to be processed in parallel if
// possible. PREFERRED_NUMBER_OF_BLOCKS_TO_USE should be some number which
// divides evenly into all factorials larger than it. It should also be around
// 2-8 times the amount of threads you want to use in order to create enough
// blocks to more evenly distribute the workload amongst the threads.
const PREFERRED_NUMBER_OF_BLOCKS_TO_USE: usize = 12;
// One greater than the maximum `n` value. Used to size stack arrays.
const MAX_N: usize = 16;
fn main() {
let n = std::env::args().nth(1).unwrap().parse().unwrap();
// This assert eliminates several bounds checks.
assert!(n < MAX_N);
// Create and initialize factorial_lookup_table.
let factorial_lookup_table = {
let mut table: [usize; MAX_N] = [0; MAX_N];
table[0] = 1;
for i in 1..MAX_N {
table[i] = i * table[i - 1];
}
table
};
// Determine the block_size to use. If n! is less than
// PREFERRED_NUMBER_OF_BLOCKS_TO_USE then just use a single block to prevent
// block_size from being set to 0. This also causes smaller values of n to
// be computed serially which is faster and uses less resources for small
// values of n.
let block_size =
1.max(factorial_lookup_table[n] / PREFERRED_NUMBER_OF_BLOCKS_TO_USE);
let block_count = factorial_lookup_table[n] / block_size;
// Iterate over each block.
let (checksum, max_flip_count) = (0..block_count)
.into_par_iter()
.map(|bn| {
let initial_permutation_index = bn * block_size;
let mut count: [usize; MAX_N] = [0; MAX_N];
let mut current_permutation: [u8; MAX_N] =
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
// Initialize count and current_permutation.
{
let mut temp_permutation: [u8; MAX_N] = [0; MAX_N];
let mut permutation_index = initial_permutation_index;
for i in (1..n).rev() {
let f = factorial_lookup_table[i];
let d = permutation_index / f;
count[i] = d;
// Rotate the permutation left by d places. This is faster
// than using slice::rotate_left.
                    temp_permutation[0..=i - d]
                        .copy_from_slice(&current_permutation[d..=i]);
                    temp_permutation[i - d + 1..=i]
                        .copy_from_slice(&current_permutation[..d]);
current_permutation = temp_permutation;
permutation_index = permutation_index % f;
}
}
let mut max_flip_count = 0;
let mut checksum = 0;
// Iterate over each permutation in the block.
let last_permutation_index = initial_permutation_index + block_size;
for permutation_index in
initial_permutation_index..last_permutation_index
{
// If the first value in the current_permutation is not 1 (0)
// then we will need to do at least one flip for the
// current_permutation.
if current_permutation[0] > 0 {
// Make a copy of current_permutation[] to work on.
let mut temp_permutation = current_permutation;
let mut flip_count: usize = 1;
// Flip temp_permutation until the element at the
// first_value index is 1 (0).
let mut first_value = current_permutation[0] as usize & 0xF;
while temp_permutation[first_value] > 0 {
// Record the new_first_value and restore the old
// first_value at its new flipped position.
let new_first_value = replace(
&mut temp_permutation[first_value],
first_value as u8,
);
// If first_value is greater than 3 (2) then we are
// flipping a series of four or more values so we will
// also need to flip additional elements in the middle
// of the temp_permutation.
if first_value > 2 {
for (low_index, high_index) in
(1..first_value).zip((1..first_value).rev())
{
temp_permutation.swap(high_index, low_index);
if low_index + 3 > high_index {
break;
}
}
}
// Update first_value to new_first_value that we
// recorded earlier.
first_value = new_first_value as usize & 0xF;
flip_count += 1;
}
// Update the checksum.
if permutation_index % 2 == 0 {
checksum += flip_count;
} else {
checksum -= flip_count;
}
// Update max_flip_count if necessary.
max_flip_count = max_flip_count.max(flip_count);
}
// Generate the next permutation.
current_permutation.swap(0, 1);
let mut first_value = current_permutation[0];
for i in 1..MAX_N - 2 {
count[i] += 1;
if count[i] <= i {
break;
}
count[i] = 0;
let new_first_value = current_permutation[1];
for j in 0..i + 1 {
current_permutation[j] = current_permutation[j + 1];
}
current_permutation[i + 1] = first_value;
first_value = new_first_value;
}
}
(checksum, max_flip_count)
})
.reduce(
|| (0, 0),
|(cs1, mf1), (cs2, mf2)| (cs1 + cs2, mf1.max(mf2)),
);
// Output the results to stdout.
println!("{}", checksum);
println!("Pfannkuchen({}) = {}", n, max_flip_count);
}
Let's run the C program and collect a performance profile using the Linux perf tool. With perf report or perf annotate we can see the hottest code in the program:
0.46 | movzbl -0x9(%r15,%rax,1),%ecx
0.96 | movzbl 0x9(%r15),%r8d
| mov %r8b,-0x9(%r15,%rax,1)
2.31 | mov %cl,0x9(%r15)
| lea -0xa(%rax),%rcx
12.76 | cmp $0xb,%rdi
The performance killer, taking 12.76% of the time, is this part of the unrolled loop:
do{
const int8_t temp=temp_Permutation[high_Index];
temp_Permutation[high_Index]=
temp_Permutation[low_Index];
temp_Permutation[low_Index]=temp;
}while(low_Index+++3<=high_Index-- && low_Index<16);
And the cmp instruction is part of the while loop condition. Actually, this loop just reverses bytes in the array. While the C implementation uses naive and heavy operations with array indexes, the Rust implementation uses an efficient double iterator:
if first_value > 2 {
for (low_index, high_index) in
(1..first_value).zip((1..first_value).rev())
{
temp_permutation.swap(high_index, low_index);
if low_index + 3 > high_index {
break;
}
}
}
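The same reversal can also be written in C-style code with a single index running to the middle of the permuted prefix. A hypothetical rewrite of the C loop (not code from the benchmark itself):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical rewrite of the C reversal loop: one index i runs up to
// the middle, swapping perm[i] with perm[first_value - i]. This
// reverses elements 1..first_value-1, exactly like the Rust double
// iterator above, but without the separate low/high index bookkeeping.
static void reverse_prefix(int8_t *perm, int first_value)
{
    for (int i = 1; i < first_value - i; ++i) {
        const int8_t tmp = perm[i];
        perm[i] = perm[first_value - i];
        perm[first_value - i] = tmp;
    }
}
```

With one monotonically increasing index the loop condition is a single comparison, which is easier for the compiler to unroll and vectorize.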
The article Fast array reversal with SIMD! describes several ways to improve the performance of the C program (the article uses C++, by the way). The first one is to use only one index i and iterate only until the middle of the permuted part of the array, with temp_Permutation[i] and temp_Permutation[high_Index - i]. That would be quite close to the Rust double iterators. By the way, a more advanced way to improve the performance of both programs is to use the PSHUFB SSSE3 instruction, or the _mm_shuffle_epi8() intrinsic, instead of the whole loop. Since there is a very small number of shuffle masks, all of them can be defined at compile time and immediately loaded into a control mask register for the instruction. However, these aren't the only differences between the implementations. The Rust program takes advantage of the maximum input number, const MAX_N: usize = 16. This small improvement probably affects performance the most, since the compiler can do better optimizations with loops and static arrays. The program explicitly uses static array initialization
let mut current_permutation: [u8; MAX_N] =
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
, while the C implementation does this at run time with no assumptions about the input data:
for(intnative_t i=0; i<n; ++i)
current_Permutation[i]=i;
The Rust program copies arrays using built-in memory copying
let mut temp_permutation = current_permutation;
, while the C program does this in a loop again
for(intnative_t i=0; ++i<n;)
temp_Permutation[i]=current_Permutation[i];
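A hypothetical tweak to the C program could use built-in memory copying in the same way:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical tweak to the C program: copy the permutation with one
// memcpy instead of the element-by-element loop, matching Rust's
// built-in array copy. Element 0 is skipped because the original
// program keeps it in a separate variable.
static void copy_permutation(int8_t *dst, const int8_t *src, size_t n)
{
    memcpy(dst + 1, src + 1, n - 1);
}
```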
These aren't all the inefficiencies in the C program which were eliminated in the Rust implementation (both programs are based on the same initial Ada program). In most places an optimized version of the program will be not only faster, but also shorter. Thus, in this single case where a Rust implementation is somewhat faster than C, the performance difference is not about a better compiler, but about a more efficient structure of the program, which allows the compiler to optimize the code better.
A real high-level system programming language must be compatible with C. Just consider the 2 examples of our real life projects.
The first one was a web application firewall (WAF). This kind of software is typically built on top of the Nginx or HAProxy HTTPS servers, which are written in C. It was easy to write a C++ module for Nginx, but we would have needed extra glue code to develop the module in Rust and to maintain all our patches for the C code of Nginx. The same developers easily switched between the C and C++ parts of the code.
In the second case our client wished to execute some external logic using MySQL user-defined functions (UDFs) to interact with the operating system. We could develop the logic in any programming language, but there was a constraint: we had to execute the program 5000 times per second on each CPU core! That was impossible to achieve even using posix_spawnp(), the fastest way to execute a program in Linux. We ended up developing a custom UDF for MySQL, which is a shared object loaded into the MySQL server process. It was quite straightforward to do with C++.
An opposite example of using Rust for an Nginx module is Cloudflare's Quiche, an Nginx extension to support the QUIC and HTTP/3 protocols. While it's definitely possible to use Rust for this kind of task, the developers, besides the FFI code for C/C++ bindings, still had to write some C code to patch Nginx. This means that even with Rust you still have to write and maintain C code.

When we developed the HTTP parser for Tempesta FW, at first we couldn't get rid of the overhead of the nested switch statement, and we couldn't get a satisfactory code layout either. At that time we considered introducing inline Assembly into the code of the parser. Given that the zero-copy state machine was already so sophisticated, we weren't happy about this idea. It was a very nice surprise to find the computed labels and hot/cold attributes among the compiler extensions! Thanks to these features the compiler generated the optimal code for the parser. The power of C++ is "There's more than one way to do it" (TIMTOWTDI). Yes, this is Perl's idea, but in many cases C++ allows you to write your program in plain C, with template metaprogramming, using the high-level STL or well-optimized custom algorithms and data structures. Modern C++ is sophisticated and requires years of experience to be fluent in, but it is a professional tool, which allows a professional developer to create the fastest and most reliable software.
Not only is Rust immature, but it seems the language designers have intentionally limited the language. There are a lot of poor programs misusing goto, so they just removed the operator: good for juniors, but too limited for professionals. It's quite unlikely that the language and the compiler will give you a nice surprise when you struggle with a complex technical task. Instead, it's likely that when you need to do something simple, something you've been doing for ages in C or C++, you'll be disappointed and have to fight the compiler. As an example, the likely and unlikely compiler hints have been used in the Linux kernel for ages and are so popular in user-space C/C++ programming that they were included in the C++20 standard (before that, programmers had to use compiler intrinsics). But in Rust you find that the corresponding API is experimental and works for if statements only.
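For comparison, the C++20 hints look like this (a minimal sketch; with pre-C++20 compilers the unknown attribute is simply ignored):

```cpp
#include <cassert>

// C++20 branch-prediction hints: the hot path is annotated with
// [[likely]], mirroring the kernel's likely()/unlikely() macros.
static int parse_digit(int c)
{
    if (c >= '0' && c <= '9') [[likely]]
        return c - '0';
    return -1; // cold error path
}
```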
This post was discussed on Hacker News and Reddit.