Zero Cost Abstractions: An Addendum

On Reddit, Icarium-Lifestealer asked me how the disassembly looked if I used -C panic=abort instead of extern "C". Curious, I tried and came out with the following assembly, which is again exactly the same as the C example

        push    rbx
        mov     rbx, rdi
        cmp     dword ptr [rdi], 43
        jl      .LBB3_2
        mov     rdi, rbx
        call    example::bar
        mov     dword ptr [rbx], 42
.LBB3_2:
        mov     rdi, rbx
        pop     rbx
        jmp     example::baz

Zero cost abstractions: Rust vs C++

I watched a video of a C++ talk from the always excellent Chandler Carruth yesterday, entitled “There are no Zero Cost Abstractions“. In it, he talks about the cost of different abstractions that are labelled zero cost. It’s a really interesting talk and I’d advise people interested in low-level programming to give it a look.

He made an observation that C++’s std::unique_ptr is actually not a zero cost abstraction at runtime, and showed a simple C++ example using raw pointers, and then a semantically equivalent one using std::unique_ptr. Then he compared the assembly outputs taken from compiler explorer.

Now this was really interesting, and particularly because Rust boasts about having very similar zero cost abstractions. So I wondered about the cost of the equivalent code in Rust.

Fortunately compiler explorer supports Rust as well, so we can compare them!

So for reference, here are the C++ examples that Chandler included in his talk:

Raw pointer version:
void bar(int* ptr);
void baz(int *ptr);

void foo(int *ptr) {
    if (*ptr > 42) {
        bar(ptr);
        *ptr = 42;
    }
    baz(ptr);
}
With std::unique_ptr
#include <memory>

using namespace std;

void bar(int* ptr) noexcept;
void baz(unique_ptr<int> &&ptr) noexcept;

void foo(unique_ptr<int> ptr) noexcept {
    if (*ptr > 42) {
        bar(ptr.get());
        *ptr = 42;
    }
    baz(std::move(ptr));
}

And here are the x64 assembly language outputs. I’ve compiled using clang 10.0 and -O2 for optimisation.

As not everyone is familiar with the beauty of x64 assembly language, I’ve annotated the the output.

Raw pointer version:
        push    rbx                      # register preservation for the calling convention
        mov     rbx, rdi                  
        cmp     dword ptr [rdi], 43      # if (ptr > 42)
        jl      .LBB0_2                  # 
        mov     rdi, rbx                 # calling convention set up
        call    bar(int*)
        mov     dword ptr [rbx], 42      # *ptr = 42
.LBB0_2:
        mov     rdi, rbx                 # reverse out the register/stack preservation
        pop     rbx
        jmp     baz(int*)                # TAILCALL
With std::unique_ptr
        push    rbx                      # register preservation for the calling convention
        mov     rbx, rdi                
        mov     rdi, qword ptr [rdi]     # what's this? we are dereferencing the ptr in rdi!
        cmp     dword ptr [rdi], 43      # if (ptr > 42)
        jl      .LBB0_2
        call    bar(int*)                
        mov     rax, qword ptr [rbx]     # another dereference!
        mov     dword ptr [rax], 42      # *ptr = 42
.LBB0_2:
        mov     rdi, rbx                 # reverse out the register/stack preservation
        pop     rbx
        jmp     baz(unique_ptr<int>&&)   # TAILCALL

Now we have two extra fetches from memory in the std::unique_ptr version. I’m not going to go into detail for the reason behind them, as Chandler does this expertly in the talk with multiple revisions of the code, and I think you should all watch it, as the rest of the talk is highly relevant for even non C++ programmers.

Now std::unique_ptr<T> is equivalent to Rust’s Box<T>. Rust’s move semantics and ownership model enforce the same behaviour (actually even more strongly than C++ does).

So my first pass at a rust equivalent of the unique_ptr example was as follows:

#[inline(never)]
fn bar(ptr: &i32) {
    print!("{}", ptr);
}

#[inline(never)]
fn baz(ptr: Box<i32>) {
    print!("{}", ptr);
}

pub fn foo(mut ptr: Box<i32>) {
    if *ptr > 42 {
        bar(&*ptr);
        *ptr = 42;
    }
    baz(ptr)
}

In Rust, we can’t get away with a forward reference to a function in another translation unit, so we define bar and baz trivially, and mark them as non-inlinable, so that the compiler will generate calls to them instead of trying to be clever – we aren’t trying to check how the compiler inlines code and we want it to be a fair comparison.

So what does the generated output look like? Well slightly disappointing:

        push    r14                  # Lots of calling convention set up
        push    rbx
        push    rax
        mov     rbx, rdi
        cmp     dword ptr [rdi], 43  # if *ptr > 42
        jl      .LBB5_3
        mov     rdi, rbx             # calling convention set up
        call    example::bar
        mov     dword ptr [rbx], 42
.LBB5_3:
        mov     rdi, rbx             # Lots of calling convention unwind too
        add     rsp, 8
        pop     rbx
        pop     r14
        jmp     example::baz
        mov     r14, rax             # This is panic unwinding code.
        mov     rdi, rbx             # The jmp above means this is never reached!
        call    alloc::alloc::box_free
        mov     rdi, r14
        call    _Unwind_Resume@PLT
        ud2

Rust’s calling convention means there is a lot more setup. Of particular note was the
panic unwind. LLVM does a tail call optimisation, so instead of a call, it does it’s
cleanup and a jump. That means that when baz returns, it returns directly to foo’s caller
rather than foo. It’s a neat optimisation and saves on stack space and instruction calls.
However, that does mean that all the code after that is unused. This looks like a bug, albeit
not a serious one, either in LLVM or in the rustc backend.

So we can make the comparison more fair, by using the C calling convention in Rust to compare like for like. extern "C" functions need no panic unwinding (as C functions can’t panic) and the setup and tear down should be the same as for the C examples above, so this really would be a like for like comparison.

Here is the code:

#[inline(never)]
extern "C" fn bar(ptr: &i32) {
    print!("{}", ptr);
}

#[inline(never)]
extern "C" fn baz(ptr: Box<i32>) {
    print!("{}", ptr);
}

pub extern "C" fn foo(mut ptr: Box<i32>) {
    if *ptr > 42 {
        bar(&*ptr);
        *ptr = 42;
    }
    baz(ptr)
}

and the output:

        push    rbx
        mov     rbx, rdi
        cmp     dword ptr [rdi], 43
        jl      .LBB4_2
        mov     rdi, rbx
        call    example::bar
        mov     dword ptr [rbx], 42
.LBB4_2:
        mov     rdi, rbx
        pop     rbx
        jmp     example::baz

This is identical to the C raw pointer example. Rust’s compiler enforced semantics enable it to better optimise code without you even thinking about it. This is why I’ve come to love Rust – whilst C++ is too entrenched to ever disappear, I strongly believe that Rust is becoming a much better alternative to C++ in many areas where the only alternative to C++ was pure C. There is a long way to go yet, but it’s already very impressive.

As an aside, and true to Chandler’s talk – this is not zero cost, we pay for this performance at compile time. Rust is contending with C++ to be the slowest language to compile, although there is a lot of work going on to optimise it.

Anyway check out Rust and check out Chandler Carruth’s talk!

Edit: The comment on zero cost generated a bit of controversy on Reddit – and I think looking back it’s because it was poorly worded. I’m not attempting to hijack, change or rubbish the use of the term “zero cost abstraction”, merely observe that we pay for that performance elsewhere. With the power of hindsight I’d have written something like: “True to Chandler’s talk, and like all zero cost abstractions, this is only zero runtime cost – we pay for this performance at compile time.”