Static Hermes: How to Speed Up a Micro-benchmark by 300x Without Cheating

This is the first of a series of light blog posts about Static Hermes, covering topics that I find interesting or amusing. It is not intended to be a systematic introduction to Static Hermes design. Consider it more like a backstage pass to the quirks, features, and “aha!” moments that make Static Hermes unique.

If you’re not familiar with Static Hermes, it’s an evolving project we’re working on to explore the confluence of static typing and JavaScript. It is a work in progress, but we’re excited about the possibilities it opens up for performance improvements and more.


Meet interp-dispatch.js

interp-dispatch.js is a Hermes micro-benchmark written to evaluate the interpreter dispatch overhead and to catch regressions in it. When we say dispatch overhead, we mean the extra work the interpreter must do in addition to executing the actual instructions: specifically, switching to each instruction’s implementation based on its opcode. So, this test deliberately uses cheap instructions, but lots of them.

Of course, to some degree it also evaluates the performance of the implementations of the instructions themselves, since it is impossible to separate the dispatch overhead from instruction execution.
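
To make “dispatch overhead” concrete, here is a minimal sketch of a switch-based dispatch loop. This is our own toy illustration, not Hermes code; the opcodes and bytecode layout are invented:

// A toy bytecode interpreter: the switch is the "dispatch".
function run(code) {
    let pc = 0, acc = 0;
    for (;;) {
        switch (code[pc++]) {                  // dispatch: paid on every instruction
            case 0: acc += code[pc++]; break;  // ADD_IMM
            case 1: acc -= code[pc++]; break;  // SUB_IMM
            case 2: return acc;                // RET
        }
    }
}

print(run([0, 5, 1, 2, 2])); // (0 + 5 - 2) => 3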

It is a fun benchmark because its minimalistic and loop-heavy design makes it highly sensitive to minor changes, offering insights into the impact of such changes on running time.

interp-dispatch.js does not claim any higher meaning than that, such as representing useful classes of computation. You can make up your own mind about it.

Without further ado, this is it:

function bench (lc, fc) {
    var n, fact;
    var res = 0;
    while (--lc >= 0) {
        n = fc;
        fact = n;
        while (--n > 1)
            fact *= n;
        res += fact;
    }
    return res;
}

let t1 = Date.now();
var res = bench(4e6, 100);
print(Date.now() - t1, "ms,", res);

It calculates the same factorial in a loop a specified number of times and adds the results.
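
In other words, the result is simply lc multiplied by the factorial of fc. Written directly (our illustration; computing it this way would of course defeat the benchmark):

// The same math without the benchmark loop: res = lc * factorial(fc).
function factorial(n) {
    let f = n;
    while (--n > 1)
        f *= n;
    return f;
}
print(4e6 * factorial(100)); // ≈ 3.733e+164, the value the benchmark prints
                             // (up to the last bits of floating-point rounding)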

Let’s look at the compiled bytecode:

Function<bench>(3 params, 10 registers, 0 symbols):
    LoadParam         r5, 2
    LoadConstZero     r4
    LoadConstUInt8    r3, 1
    LoadParam         r0, 1
    Dec               r2, r0
    LoadConstZero     r1
    LoadConstZero     r0
    JNotGreaterEqual  L1, r2, r0
L4:
    Dec               r8, r5
    Mov               r7, r5
    Mov               r6, r7
    JNotGreater       L2, r8, r3
L3:
    Mul               r7, r7, r8
    Dec               r8, r8
    Mov               r6, r7
    JGreater          L3, r8, r3
L2:
    Add               r1, r1, r6
    Dec               r2, r2
    Mov               r0, r1
    JGreaterEqual     L4, r2, r4
L1:
    Ret               r0

It is pretty tight. The inner loop consists of the instructions from L3: to JGreater. Not bad, though we could eliminate the extra Mov, shrinking the loop from four instructions to three.

The outer loop is from L4: to JGreaterEqual.

Let’s Run It!

Running it with the Hermes interpreter, we get the following:

$ hermes-rel -w bench.js
2086 ms, 3.7330486176528693e+164

2086 ms is a very respectable result for an interpreter. This isn’t too surprising, given the considerable effort put into optimizing the Hermes interpreter, with plans for even more significant improvements.

Now, let’s examine the impact of native compilation:

$ shermes-rel -w bench.js -o bench

$ ./bench
2388 ms, 3.7330486176528693e+164

Well, that’s a bit underwhelming: 2388 ms is actually slightly slower than the interpreter. How is that possible?

The answer lies in the nature of untyped code. Intuitively, Static Hermes can’t do much for completely untyped code. It has to assume that lc and fc can be any type, including string or BigInt. Let’s dig a little deeper.
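
For instance, all of the following calls are legal JavaScript, and the compiled function must be prepared for each of them (these examples are ours, not part of the benchmark):

bench(4e6, 100);    // the intended use: plain numbers
bench("4", "100");  // strings: --lc and --n coerce them to numbers
bench(1n, 2n);      // BigInts flow through -- and * as bigints
                    // (res += fact then throws: 0 + 2n mixes the types)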

Static Hermes with Untyped Code

First, let’s check out the assembly code generated by Static Hermes:

$ shermes-rel -S bench.js 
$ cat bench.s

Remember the inner JS loop?

    while (--n > 1)
        fact *= n;

This is it compiled to armv8 assembly:

LBB2_4:
        mov     x0, x19
        mov     x1, x23
        mov     x2, x25
        bl      __sh_ljs_mul_rjs
        str     x0, [sp, #72]
        mov     x0, x19
        mov     x1, x25
        bl      __sh_ljs_dec_rjs
        ldr     x8, [sp, #72]
        str     x0, [sp, #80]
        str     x8, [sp, #64]
        add     x2, sp, #8
        mov     x0, x19
        mov     x1, x25
        bl      __sh_ljs_greater_rjs
        tbnz    w0, #0, LBB2_4

You can clearly see why it’s not fast: multiplication, decrement, and comparison are all function calls. This confirms our intuition.

To dig deeper, let’s see what Static Hermes thinks the types are, using the -dump-lra option:

$ shermes -dump-lra bench.js

Here’s the relevant part of the inner loop’s internal representation:

%BB4:
  $loc5 = BinaryMultiplyInst (:number|bigint) $loc5, $loc6
  $loc6 = UnaryDecInst (:number|bigint) $loc6
  $loc4 = MovInst (:number|bigint) $loc5
  $loc5 = CmpBrGreaterThanInst $loc6, $np0, %BB4, %BB3

The types following each instruction show that Static Hermes infers n and fact to be either number or bigint. This is neat, but unfortunately still requires a function call for each operation.

But how does it know that they can be only number or bigint? What if the inputs were strings or undefined or even something crazy like regular expressions? Let’s look at the beginning of the function:

function bench(lc: any, fc: any): string|number
frame = []
%BB0:
  ...
  $loc3      = LoadParamInst (:any) %fc: any
  $loc0      = LoadParamInst (:any) %lc: any
  $loc2      = UnaryDecInst (:number|bigint) $loc0

It is loading fc into $loc3 and lc into $loc0. At this point we can see that Static Hermes still thinks they can be anything: any. But then it decrements $loc0, and the result of UnaryDecInst narrows to number|bigint. Static Hermes knows that the result of the JavaScript -- operator is always number or bigint, regardless of its input, and it can take advantage of that.

In other words, Static Hermes uses the rules of JavaScript to reason about and infer the possible types of every value. This is very powerful, but unfortunately it is not always enough to generate efficient code. Often, we need to pin down a single specific type, not merely know that the value is one of several possibilities.
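
Here is the rule in action (our examples); whatever the operand, -- produces a number or a bigint:

let a = "5"; --a;  print(typeof a, a);  // number 4
let b = 7n;  --b;  print(typeof b, b);  // bigint 6
let c = {};  --c;  print(typeof c, c);  // number NaN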

So, with this understanding, could a minimal change to the code make it faster?

Helping the Compiler

Static Hermes infers types based on JavaScript language rules. For example, it knows that the result of the -- operator is always number|bigint. Can we help it narrow down the type even further?

Absolutely. In JavaScript, the unary + operator is designed to always return a number. So, when we write:

    x = +x;

Both we and the compiler can be confident that x is now a number.
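
A few examples of the guarantee (ours). Note in particular that unary + throws on BigInt, which is exactly why it narrows the type all the way down to number rather than number|bigint:

print(+42);         // 42
print(+"3.5");      // 3.5: the string is converted to a number
print(+undefined);  // NaN: still a number!
// print(+10n);     // TypeError: unary + never yields a bigint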

The most crucial variable in this benchmark is fc, which controls the inner loop. Let’s apply this optimization technique to it:

function bench (lc, fc) {
    fc = +fc;
    var n, fact;
    var res = 0;
    // ... rest of the code is unchanged.
}

Now, let’s run it through Static Hermes again:

$ shermes-rel -w bench.js -o bench

$ ./bench
278 ms 3.7330486176528693e+164

Impressive! The execution time is down to 278 ms, making it 7.5 times faster than the interpreter and 8.6 times faster than our previous run—all from a single-line change.

To confirm our understanding, let’s check the IR again:

$ shermes -dump-lra bench.js

...
...
%BB4:
  $np8       = BinaryMultiplyInst (:number) $np8, $np7
  $np7       = UnaryDecInst (:number) $np7
  $np6       = MovInst (:number) $np8
  $loc1      = CmpBrGreaterThanInst $np7, $np4, %BB4, %BB3

As expected, all operations are now strictly of type number. The assembly reflects this optimization:

LBB2_4:
	fmul	d0, d0, d1
	fadd	d1, d1, d12
	fcmp	d1, d10
	b.gt	LBB2_4

It’s a tight loop, devoid of function calls, just as we’d hoped.

Other Ways to Help the Compiler

While wrapping up this post, I began pondering other ways to assist the compiler that are worth mentioning.

An observant reader might wonder: if we’re only calling bench with numbers, why doesn’t Static Hermes recognize lc and fc as such? The reason is that bench() is a global function. It is assigned to a global variable, or more accurately, to a property of the global object: globalThis.bench. Static Hermes can’t make assumptions about how it might be called from elsewhere in the code.
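
A quick way to see this (our illustration): at the top level of a script, a function declaration becomes a property of the global object, so code anywhere else could observe it, call it with anything, or even replace it:

function bench(lc, fc) { /* ... */ }

print(globalThis.bench === bench);  // true
// Nothing stops other code from doing this, so the compiler can't
// assume bench is only ever called with numbers:
globalThis.bench = function (lc, fc) { return "surprise"; };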

In reality, most code resides in modules, be it CommonJS or ESM. While we won’t get into the whole module system here, we can simulate this behavior by enclosing our code in a self-invoking function. This allows Static Hermes to make more precise inferences about function call sites.

(function module(exports) {

function bench (lc, fc) {
    var n, fact;
    var res = 0;
    while (--lc >= 0) {
        n = fc;
        fact = n;
        while (--n > 1)
            fact *= n;
        res += fact;
    }
    return res;
}

let t1 = Date.now();
var res = bench(4e6, 100);
print(Date.now() - t1, "ms", res);

})({});

Let’s run it:

$ shermes-rel -w bench-mod.js -o bench-mod

$ ./bench-mod
6 ms 3.7330486176528693e+164

Wait, what? A jaw-dropping 6 ms! That’s 348 times faster than the interpreter and 46 times faster than our earlier optimization. What led to this dramatic improvement?

It turns out that Static Hermes is now able to inline bench() directly into the main function, replacing the parameters with their actual values. Not only does it know their types, it knows their values!

For clarity, I’ve trimmed down the code to focus on the compiler’s output:

(function module(exports) {
function bench (lc, fc) {
  // ...    
}
return bench(4e6, 100);
})({});

Here is the generated IR:

function module(exports: any): number [allCallsitesKnownInStrictMode]
frame = []
%BB0:
  $np5       = HBCLoadConstInst (:number) 0: number
  $np2       = HBCLoadConstInst (:number) 1: number
  $np6       = HBCLoadConstInst (:number) 3999999: number
  $np1       = HBCLoadConstInst (:number) 0: number
%BB1:
  $np8       = HBCLoadConstInst (:number) 99: number
  $np7       = HBCLoadConstInst (:number) 100: number
%BB4:
  $np0       = BinaryMultiplyInst (:number) $np7, $np8
  $np8       = UnaryDecInst (:number) $np8
  $np7       = MovInst (:number) $np0
  $loc0      = CmpBrGreaterThanInst $np8, $np2, %BB4, %BB3
%BB3:
  $np0       = BinaryAddInst (:number) $np1, $np0
  $np6       = UnaryDecInst (:number) $np6
  $np1       = MovInst (:number) $np0
  $np6       = MovInst (:number) $np6
  $loc0      = CmpBrGreaterThanOrEqualInst $np6, $np5, %BB1, %BB2
%BB2:
  $loc0      = ReturnInst $np0
function_end

As you can see, the parameters are now constants. The assembly code is even more remarkable:

LBB2_1:
	fmov	d1, x9
	fadd	d0, d0, d1
	fadd	d0, d0, d1
	fadd	d0, d0, d1
	fadd	d0, d0, d1
	subs	w8, w8, #4
	b.ne	LBB2_1

The compiler (LLVM) has determined that the result of the inner loop is a constant and has pre-calculated its value, keeping it in register x9. The outer loop has been unrolled by a factor of four: its body has been copied four times to decrease the number of iterations.
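
Translated back into JavaScript, the machine code is roughly doing this (our sketch of what LLVM produced, not literal compiler output):

// fact (100!) was computed at compile time; the loop only sums it,
// four additions per iteration thanks to the 4x unrolling.
const fact = 9.332621544394415e157; // 100! (approximate double value)
let res = 0;
for (let i = 4e6; i !== 0; i -= 4) {
    res += fact; res += fact; res += fact; res += fact;
}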

This is a fascinating result, demonstrating a series of compiler optimizations compounding each other to produce something unexpected and incredibly fast.

Some Observations

A discerning reader might observe that the original code isn’t exactly a model of efficiency — it calculates the same factorial 4 million times. Why? Because it’s a micro-benchmark designed to evaluate interpreter dispatch overhead, not to be a paragon of smart coding. The value of the factorial is irrelevant here.

One could rewrite the code to calculate the factorial just once and then multiply it by 4 million. However, this would defeat the purpose of the benchmark:

function bench (lc, fc) {
    var fact = fc;
    while (--fc > 1)
        fact *= fc;
    return fact * lc;
}

Could a compiler theoretically optimize this for us? Sure, but compiler optimizations involve trade-offs and are often designed with a specific set of use-cases in mind. In our case, we’re focused on React Native performance, and an optimization like this wouldn’t be particularly beneficial.

Similarly, LLVM could technically optimize away the entire outer loop in the previous section, since the result of bench() is a compile-time constant when the parameters are constants. However, such an optimization isn’t generally useful, especially not in the context we care about.

Revisiting The Original Untyped Code

Let’s circle back to the original, unmodified code. We already know that in the inner loop — the most computationally intensive part of the function — the variables could only be of type number or bigint. So, could we compile two separate versions of the loop and dynamically switch between them based on the variable type?

Technically, yes. However, applying this approach more broadly presents challenges. Imagine a loop with multiple variables, each having different potential types. This scenario could lead to a combinatorial explosion of type combinations, necessitating a separate compiled version for each. Even if we were to emit just two versions for most scenarios, that would still double the size of the output code.
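
At the source level, the idea would look something like this hypothetical sketch of ours (a compiler would emit the equivalent in native code, not in JavaScript):

// Dispatch once on the observed types, then run a specialized copy of the loop.
function innerLoop(n, fact) {
    if (typeof n === "number" && typeof fact === "number") {
        // Fast path: here a compiler could use plain double arithmetic.
        while (--n > 1)
            fact *= n;
    } else {
        // Generic path: same source, but compiled to also handle bigint.
        while (--n > 1)
            fact *= n;
    }
    return fact;
}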

For those interested in this approach, Manuel Serrano’s HopC compiler employs a similar strategy. It generates two versions of every function: one optimized based on heuristics for the most likely type combinations, and a generic “fallback” version for all other types.

This is an avenue we might explore down the line. However, it’s worth noting that its utility may be somewhat limited in our specific use-case. This approach mainly benefits basic types like number and bigint, but doesn’t offer much for more complex types such as objects.

What About Type Annotations?

You might be asking, “Given that Static Hermes is designed to leverage type annotations, why didn’t we use them in this benchmark?” Good question. We chose not to use them for a couple of reasons. First, skipping annotations allowed us to explore the type inference capabilities of this nascent compiler technology. Second, in this particular benchmark, type annotations wouldn’t make a difference. Why? Because we employed other techniques—like the unary + operator—that conveyed the same type information to the compiler. This approach gives us insight into what Static Hermes can deduce and optimize without explicit annotations.
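
For the curious, the annotated version would look roughly like this (our sketch, using the Flow-style syntax Static Hermes consumes; we have not benchmarked this exact variant):

function bench(lc: number, fc: number): number {
    var n, fact;
    var res = 0;
    while (--lc >= 0) {
        n = fc;
        fact = n;
        while (--n > 1)
            fact *= n;
        res += fact;
    }
    return res;
}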

Conclusion

So there you have it. We kicked the tires on Static Hermes a bit, played around with a micro-benchmark, and even managed to soup it up — way up. Turns out, understanding a little bit about how Static Hermes (or any compiler, really) thinks can go a long way in making your code faster. But it’s not like you need to pull out all the stops for every piece of code you write.

If you’re into this stuff, keep an eye out for more posts like this; we’re just scratching the surface.

Comments

  1. This is awesome. I have been hoping for something like this in the JS ecosystem. The ramifications of the compiler being able to optimise so much based on these sort of general type information will be massive

  2. 😎 cool, please what syntax highlighter are you using?

  3. This is wild! I copied the code and ran these myself. The numbers were slightly different but the effect was as you said "jaw dropping". I then crossed my heart and ran the optimized code with just hermes (not static hermes), because I was thinking it will also get fast. And it did, just not jaw droppingly though😅.

    The commands you have given had me looking for `hermes-rel` though. For anyone else who is also stuck on that part, if you compile the `static_h` branch of hermes from github then you can use the following commands:

    1. for the bytecode output from hermes
    build_release/bin/hermes -dump-bytecode bench.js
    2. for all other commands just use the `build_release/bin/shermes` binary wherever `shermes-rel` is instructed above.

  4. > You can clearly see why it’s not fast: multiplication, decrement, and comparison are all function calls. This confirms our intuition.

    hehe.... well we are javascript developers here come to seek refuge with hermes. And I think I speak for a large portion of us when I say that, we in fact do not clearly see😂. I googled the `armv8` instruction and how `bl` works. But I don't understand what those functions are. For example `__sh_ljs_greater_rjs` seems to be a function that will take something in two registers and compare if ljs is greater than rjs or something. But I cannot find the definition of that function in the file generated by shermes. Is this some kind of standard function that shermes provides? From the name I'm guessing it's the assembly implementation of how to compare two javascript values. But how can I see the sourcecode of this function and others like it in bench.s?

