Static Hermes: How to Speed Up a Micro-benchmark by 300x Without Cheating
This is the first of a series of light blog posts about Static Hermes, covering topics that I find interesting or amusing. It is not intended to be a systematic introduction to the design of Static Hermes. Consider it more like a backstage pass to the quirks, features, and “aha!” moments that make Static Hermes unique.
If you’re not familiar with Static Hermes, it’s an evolving project we’re working on to explore the confluence of static typing and JavaScript. It is a work in progress, but we’re excited about the possibilities it opens up for performance improvements and more.
For more background:
Contents:
- Meet interp-dispatch.js
- Let’s Run It!
- Static Hermes with Untyped Code
- Helping the Compiler
- Other Ways to Help the Compiler
- Some Observations
- Revisiting The Original Untyped Code
- What About Type Annotations?
- Conclusion
Meet interp-dispatch.js
interp-dispatch.js is a Hermes micro-benchmark written to evaluate the interpreter dispatch overhead and to catch regressions in it. When we say dispatch overhead, we mean the extra work the interpreter needs to do in addition to executing the actual instructions: specifically, switching to the implementation of each instruction based on its opcode. So, this test deliberately uses cheap instructions, but lots of them.
Of course, to some degree it also evaluates the performance of the implementations of the instructions themselves, since it is impossible to separate the dispatch overhead from instruction execution.
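To make “dispatch overhead” concrete, here is a toy switch-based interpreter loop (a hand-written illustration, not Hermes code; the opcode names are made up). The switch on each opcode is the dispatch; the case bodies are the actual work:

```javascript
// Toy interpreter: the switch on the opcode is the "dispatch overhead";
// the case bodies are the cheap instructions themselves.
function run(code) {
    let acc = 0, pc = 0;
    while (pc < code.length) {
        switch (code[pc++]) {          // dispatch: extra work per instruction
            case "inc": acc += 1; break; // cheap instruction
            case "dec": acc -= 1; break;
            case "halt": return acc;
        }
    }
    return acc;
}
console.log(run(["inc", "inc", "dec", "halt"])); // 1
```

With instructions this cheap, most of the time goes into the switch itself, which is exactly what the benchmark is designed to measure.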
It is a fun benchmark because its minimalistic and loop-heavy design makes it highly sensitive to minor changes, offering insights into the impact of such changes on running time.
interp-dispatch.js does not claim to have any higher meaning than that, such as representing useful classes of calculations. You can make up your own mind about it.
Without further ado, this is it:
function bench (lc, fc) {
    var n, fact;
    var res = 0;
    while (--lc >= 0) {
        n = fc;
        fact = n;
        while (--n > 1)
            fact *= n;
        res += fact;
    }
    return res;
}
let t1 = Date.now();
var res = bench(4e6, 100);
print(Date.now() - t1, "ms,", res);
It calculates the same factorial in a loop a specified number of times and adds the results.
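Since the loop body does not depend on lc, the result is simply lc times the factorial of fc (up to floating-point rounding in the summation). A quick sanity-check sketch, not part of the benchmark:

```javascript
// bench(lc, fc) sums the same factorial lc times, so it is equivalent
// (up to floating-point rounding) to lc * fc!.
function factorial(n) {
    var fact = n;
    while (--n > 1)
        fact *= n;
    return fact;
}
console.log(4e6 * factorial(100)); // ~3.733e+164, matching the value bench prints
```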
Let’s look at the compiled bytecode:
Function<bench>(3 params, 10 registers, 0 symbols):
    LoadParam r5, 2
    LoadConstZero r4
    LoadConstUInt8 r3, 1
    LoadParam r0, 1
    Dec r2, r0
    LoadConstZero r1
    LoadConstZero r0
    JNotGreaterEqual L1, r2, r0
L4:
    Dec r8, r5
    Mov r7, r5
    Mov r6, r7
    JNotGreater L2, r8, r3
L3:
    Mul r7, r7, r8
    Dec r8, r8
    Mov r6, r7
    JGreater L3, r8, r3
L2:
    Add r1, r1, r6
    Dec r2, r2
    Mov r0, r1
    JGreaterEqual L4, r2, r4
L1:
    Ret r0
It is pretty tight. The inner loop is the instructions from L3: to JGreater. Not bad, though we can eliminate the extra Mov, shrinking the loop from four to three instructions. The outer loop is from L4: to JGreaterEqual.
Let’s Run It!
Running it with the Hermes interpreter, we get the following:
$ hermes-rel -w bench.js
2086 ms, 3.7330486176528693e+164
2086 ms is a very respectable result for an interpreter. This isn’t too surprising, given the considerable effort put into optimizing the Hermes interpreter, with plans for even more significant improvements.
Now, let’s examine the impact of native compilation:
$ shermes-rel -w bench.js -o bench
$ ./bench
2388 ms 3.7330486176528693e+164
Well, that’s a bit underwhelming: 2388 ms, a bit slower than the interpreter. How is that possible?
The answer lies in the nature of untyped code. Intuitively, Static Hermes can’t do much for completely untyped code. It has to assume that lc and fc can be any type, including string or BigInt. Let’s dig a little deeper.
Static Hermes with Untyped Code
First, let’s check out the assembly code generated by Static Hermes:
$ shermes-rel -S bench.js
$ cat bench.s
Remember the inner JS loop?
while (--n > 1)
    fact *= n;
This is it compiled to armv8 assembly:
LBB2_4:
    mov x0, x19
    mov x1, x23
    mov x2, x25
    bl __sh_ljs_mul_rjs
    str x0, [sp, #72]
    mov x0, x19
    mov x1, x25
    bl __sh_ljs_dec_rjs
    ldr x8, [sp, #72]
    str x0, [sp, #80]
    str x8, [sp, #64]
    add x2, sp, #8
    mov x0, x19
    mov x1, x25
    bl __sh_ljs_greater_rjs
    tbnz w0, #0, LBB2_4
You can clearly see why it’s not fast: multiplication, decrement, and comparison are all function calls. This confirms our intuition.
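To get a feel for why such calls are expensive, here is a hypothetical JavaScript model of what a helper like __sh_ljs_mul_rjs has to do. The real helpers are part of the Static Hermes native runtime; the logic below is assumed purely for illustration:

```javascript
// Hypothetical sketch: an untyped "multiply" helper must check the dynamic
// types of both operands on every single iteration before doing any work.
function sh_mul(a, b) {
    if (typeof a === "number" && typeof b === "number")
        return a * b;  // fast path: one machine multiply
    if (typeof a === "bigint" && typeof b === "bigint")
        return a * b;  // arbitrary-precision multiply
    // Slow path: full JS coercion (strings, objects with valueOf, etc.),
    // including TypeError cases such as mixing bigint and number.
    return a * b;
}
```

Even on the fast path, the type checks, the call itself, and the boxing/unboxing of values add up when the actual operation is a single multiply.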
To dig deeper, let’s see what Static Hermes thinks the types are, using the -dump-lra option:
$ shermes -dump-lra bench.js
Here’s the relevant part of the inner loop’s internal representation:
%BB4:
  $loc5 = BinaryMultiplyInst (:number|bigint) $loc5, $loc6
  $loc6 = UnaryDecInst (:number|bigint) $loc6
  $loc4 = MovInst (:number|bigint) $loc5
  $loc5 = CmpBrGreaterThanInst $loc6, $np0, %BB4, %BB3
The types following each instruction show that Static Hermes infers n and fact to be either number or bigint. This is neat, but unfortunately each operation still requires a function call.
But how does it know that they can only be number or bigint? What if the inputs were strings or undefined, or even something crazy like regular expressions? Let’s look at the beginning of the function:
function bench(lc: any, fc: any): string|number
frame = []
%BB0:
  ...
  $loc3 = LoadParamInst (:any) %fc: any
  $loc0 = LoadParamInst (:any) %lc: any
  $loc2 = UnaryDecInst (:number|bigint) $loc0
It is loading fc into $loc3 and lc into $loc0. At this point we can see Static Hermes still thinks that they can be anything: any. But then it decrements $loc0, and we can see that the result of UnaryDecInst changes to number|bigint. Static Hermes knows that the result of the JavaScript -- operator is always number or bigint, regardless of its inputs, and it can take advantage of that.
In other words, Static Hermes uses the rules of JavaScript to reason about and infer the possible types of every value. This is very powerful, but unfortunately it is not always enough to generate efficient code. Often, we need to pin down a specific type, not merely know that it is one of several possibilities.
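You can check this rule in any JavaScript engine: whatever the operand’s original type, the -- operator leaves behind a number (or a bigint, if the operand was one):

```javascript
let s = "10";
--s;                      // the string is coerced (ToNumeric), then decremented
console.log(typeof s, s); // "number 9"

let b = 10n;
--b;
console.log(typeof b, b); // "bigint 9n"
```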
So, with this understanding, could a minimal change to the code make it faster?
Helping the Compiler
Static Hermes infers types based on JavaScript language rules. For example, it knows that the result of the -- operator is always number|bigint. Can we help it narrow down the type even further?
Absolutely. In JavaScript, the unary + operator is designed to always return a number. So, when we write:
x = +x;
Both we and the compiler can be confident that x is now a number.
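Unlike --, which may also produce a bigint, unary + produces a number for every input that doesn’t throw; notably, applying it to a BigInt throws a TypeError rather than returning one. That is exactly what lets the compiler narrow the type all the way down to number:

```javascript
console.log(typeof +"41");      // "number"
console.log(typeof +true);      // "number"
console.log(typeof +undefined); // "number" (the value is NaN)

try {
    +10n; // unary + refuses to convert BigInt to number...
} catch (e) {
    console.log(e instanceof TypeError); // true
}
```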
The most crucial variable in this benchmark is fc, which controls the inner loop. Let’s apply this optimization technique to it:
function bench (lc, fc) {
    fc = +fc;
    var n, fact;
    var res = 0;
    // ... rest of the code is unchanged.
}
Now, let’s run it through Static Hermes again:
$ shermes-rel -w bench.js -o bench
$ ./bench
278 ms 3.7330486176528693e+164
Impressive! The execution time is down to 278 ms, making it 7.5 times faster than the interpreter and 8.6 times faster than our previous run, all from a single-line change.
To confirm our understanding, let’s check the IR again:
$ shermes -dump-lra bench.js
...
...
%BB4:
  $np8 = BinaryMultiplyInst (:number) $np8, $np7
  $np7 = UnaryDecInst (:number) $np7
  $np6 = MovInst (:number) $np8
  $loc1 = CmpBrGreaterThanInst $np7, $np4, %BB4, %BB3
As expected, all operations are now strictly of type number
. The assembly reflects
this optimization:
LBB2_4:
    fmul d0, d0, d1
    fadd d1, d1, d12
    fcmp d1, d10
    b.gt LBB2_4
It’s a tight loop, devoid of function calls, just as we’d hoped.
Other Ways to Help the Compiler
While wrapping up this post, I began pondering other ways to assist the compiler that are worth mentioning.
An observant reader might wonder: if we’re only calling bench() with numbers, why doesn’t Static Hermes recognize lc and fc as such? The reason is that bench() is a global function. It is assigned to a global variable, or more accurately to a property of the global object, globalThis.bench. Static Hermes can’t make assumptions about how it might be called from elsewhere in the code.
In reality, most code resides in modules, be it CommonJS or ESM. While we won’t get into the whole module system here, we can simulate this behavior by enclosing our code in a self-invoking function. This allows Static Hermes to make more precise inferences about function call sites.
(function module(exports) {
    function bench (lc, fc) {
        var n, fact;
        var res = 0;
        while (--lc >= 0) {
            n = fc;
            fact = n;
            while (--n > 1)
                fact *= n;
            res += fact;
        }
        return res;
    }

    let t1 = Date.now();
    var res = bench(4e6, 100);
    print(Date.now() - t1, "ms", res);
})({});
Let’s run it:
$ shermes-rel -w bench-mod.js -o bench-mod
$ ./bench-mod
6 ms 3.7330486176528693e+164
Wait, what? A jaw-dropping 6 ms! That’s 348 times faster than the interpreter and 46 times faster than our earlier optimization. What led to this dramatic improvement?
It turns out that Static Hermes is now able to inline bench() directly into the main function, replacing the parameters with their actual values. Not only does it know their types, it knows their values!
For clarity, I’ve trimmed down the code to focus on the compiler’s output:
(function module(exports) {
    function bench (lc, fc) {
        // ...
    }
    return bench(4e6, 100);
})({});
Here is the generated IR:
function module(exports: any): number [allCallsitesKnownInStrictMode]
frame = []
%BB0:
  $np5 = HBCLoadConstInst (:number) 0: number
  $np2 = HBCLoadConstInst (:number) 1: number
  $np6 = HBCLoadConstInst (:number) 3999999: number
  $np1 = HBCLoadConstInst (:number) 0: number
%BB1:
  $np8 = HBCLoadConstInst (:number) 99: number
  $np7 = HBCLoadConstInst (:number) 100: number
%BB4:
  $np0 = BinaryMultiplyInst (:number) $np7, $np8
  $np8 = UnaryDecInst (:number) $np8
  $np7 = MovInst (:number) $np0
  $loc0 = CmpBrGreaterThanInst $np8, $np2, %BB4, %BB3
%BB3:
  $np0 = BinaryAddInst (:number) $np1, $np0
  $np6 = UnaryDecInst (:number) $np6
  $np1 = MovInst (:number) $np0
  $np6 = MovInst (:number) $np6
  $loc0 = CmpBrGreaterThanOrEqualInst $np6, $np5, %BB1, %BB2
%BB2:
  $loc0 = ReturnInst $np0
function_end
As you can see, the parameters are now constants. The assembly code is even more remarkable:
LBB2_1:
    fmov d1, x9
    fadd d0, d0, d1
    fadd d0, d0, d1
    fadd d0, d0, d1
    fadd d0, d0, d1
    subs w8, w8, #4
    b.ne LBB2_1
The compiler (LLVM) has determined that the result of the inner loop is a constant and has pre-calculated its value (held in x9). The outer loop has been unrolled four times: its body has been copied four times to decrease the number of iterations.
This is a fascinating result, demonstrating a series of compiler optimizations compounding each other to produce something unexpected and incredibly fast.
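In JavaScript terms, the machine code above behaves roughly like the sketch below. The constant and the unroll factor are read off the assembly; this is an illustration, not compiler output:

```javascript
// Rough JS model of LLVM's output: the inner factorial loop is gone,
// replaced by its precomputed constant, and the outer loop is unrolled 4x.
var fact = 9.332621544394415e157; // ~100!, precomputed at compile time
var res = 0;
for (var i = 4e6; i > 0; i -= 4) {
    res += fact;
    res += fact;
    res += fact;
    res += fact;
}
```

All that remains at run time is 4 million additions of a constant, done four at a time, which is why the running time collapses to a few milliseconds.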
Some Observations
A discerning reader might observe that the original code isn’t exactly a model of efficiency — it calculates the same factorial 4 million times. Why? Because it’s a micro-benchmark designed to evaluate interpreter dispatch overhead, not to be a paragon of smart coding. The value of the factorial is irrelevant here.
One could rewrite the code to calculate the factorial just once and then multiply it by 4 million. However, this would defeat the purpose of the benchmark:
function bench (lc, fc) {
    var fact = fc;
    while (--fc > 1)
        fact *= fc;
    return fact * lc;
}
Could a compiler theoretically optimize this for us? Sure, but compiler optimizations involve trade-offs and are often designed with a specific set of use-cases in mind. In our case, we’re focused on React Native performance, and an optimization like this wouldn’t be particularly beneficial.
Similarly, LLVM could technically optimize away the entire outer loop in the previous section, since the result of bench() is a compile-time constant when the parameters are constants. However, such an optimization isn’t generally useful, especially not in the context we care about.
Revisiting The Original Untyped Code
Let’s circle back to the original, unmodified code. We already know that in the inner loop, the most computationally intensive part of the function, the variables can only be of type number or bigint. So, could we compile two separate versions of the loop and dynamically switch between them based on the variable types?
Technically, yes. However, applying this approach more broadly presents challenges. Imagine a loop with multiple variables, each having different potential types. This scenario could lead to a combinatorial explosion of type combinations, necessitating a separate compiled version for each. Even if we were to emit just two versions for most scenarios, that would still double the size of the output code.
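To illustrate in plain JavaScript what such a transformation could look like for this benchmark (a hypothetical sketch, not something Static Hermes emits today):

```javascript
// Hypothetical: one specialized copy of the hot loop per inferred type,
// selected by a single check before the loop instead of one per operation.
function innerNumber(n) { // n is known to be a number here
    var fact = n;
    while (--n > 1)
        fact *= n;
    return fact;
}
function innerBigint(n) { // n is known to be a bigint here
    var fact = n;
    while (--n > 1n)
        fact *= n;
    return fact;
}
function inner(n) {
    return typeof n === "number" ? innerNumber(n) : innerBigint(n);
}
```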
For those interested in this approach, Manuel Serrano’s HopC compiler employs a similar strategy. It generates two versions of every function: one optimized based on heuristics for the most likely type combinations, and a generic “fallback” version for all other types.
This is an avenue we might explore down the line. However, it’s worth noting that its utility may be somewhat limited in our specific use-case. This approach mainly benefits primitive types like number and bigint, but doesn’t offer much for more complex types such as objects.
What About Type Annotations?
You might be asking: given that Static Hermes is designed to leverage type annotations, why didn’t we use them in this benchmark? Good question. We chose not to use them for a couple of reasons. First, skipping annotations allowed us to explore the type inference capabilities of this nascent compiler technology. Second, in this particular benchmark, type annotations wouldn’t make a difference, because we employed other techniques, like the unary + operator, that conveyed the same type information to the compiler. This approach gives us insight into what Static Hermes can deduce and optimize without explicit annotations.
Conclusion
So there you have it. We kicked the tires on Static Hermes a bit, played around with a micro-benchmark, and even managed to soup it up — way up. Turns out, understanding a little bit about how Static Hermes (or any compiler, really) thinks can go a long way in making your code faster. But it’s not like you need to pull out all the stops for every piece of code you write.
If you’re into this stuff, keep an eye out for more posts like this; we’re just scratching the surface.
Comments
This is awesome. I have been hoping for something like this in the JS ecosystem. The ramifications of the compiler being able to optimize so much based on this sort of general type information will be massive.
😎 Cool! What syntax highlighter are you using?
This is wild! I copied the code and ran these myself. The numbers were slightly different, but the effect was, as you said, “jaw-dropping”. I then crossed my heart and ran the optimized code with plain Hermes (not Static Hermes), because I was thinking it would also get fast. And it did, just not jaw-droppingly. 😅
The commands you have given had me looking for `hermes-rel`, though. For anyone else who is stuck on that part: if you compile the `static_h` branch of Hermes from GitHub, you can use the following commands:
1. For the bytecode output from Hermes:
build_release/bin/hermes -dump-bytecode bench.js
2. For all other commands, just use the `build_release/bin/shermes` binary wherever `shermes-rel` is instructed above.
> You can clearly see why it’s not fast: multiplication, decrement, and comparison are all function calls. This confirms our intuition.
Hehe... well, we are JavaScript developers here, come to seek refuge with Hermes, and I think I speak for a large portion of us when I say that we in fact do not clearly see. 😂 I googled the armv8 instructions and how `bl` works, but I don't understand what those functions are. For example, `__sh_ljs_greater_rjs` seems to be a function that takes something in two registers and compares whether ljs is greater than rjs, or something. But I cannot find the definition of that function in the file generated by shermes. Is this some kind of standard function that shermes provides? From the name I'm guessing it's the implementation of how to compare two JavaScript values. But how can I see the source code of this function and the others like it in bench.s?