How To Make a Fast Dynamic Language Interpreter

Z zef-lang.dev ↗

▲ 252 points • 60 comments • by pizlonator • 3mo ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

1 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 7 of 7

SEGMENTS · AI 0 of 7

WORD COUNT 1,890

PEAK AI % 14% · §3

Analyzed

Apr 21

backend: pangram/v3.3

Segments scanned

7 windows

avg 270 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,890 words · 7 segments analyzed

Human AI-generated

§1 Human · 1%

This post is about optimizing an extremely simple AST-walking interpreter for a dynamic language called Zef that I created for fun to the point where it is competitive with the likes of Lua, QuickJS, and CPython. Why? Most of what gets written about making language implementations fast focuses on the work you'd do when you already have a stable foundation, like writing yet another JIT (just in time) compiler or fine tuning an already pretty good garbage collector. I've written a lot of posts about crazy optimizations in a mature JS runtime. This post is different. It's about the case where you're starting from scratch, and you're nowhere near writing a JIT and your GC isn't your top problem. The techniques in this post are easy to understand - there's no SSA, no GC, no bytecodes, no machine code - yet they achieve a massive 16x speed-up (67x if you include the incomplete port to Yolo-C++) and bring my tiny interpreter into the ballpark of QuickJS, CPython, and Lua. The techniques I'll focus on in this post are: Value representation. Inline caching. Object model. Watchpoints. Grinding through common sense optimizations. Evaluation Methodology To evaluate my progres, I created a benchmark suite called ScriptBench1. This has ports of classic language benchmarks to Zef: Richards (OS scheduler) DeltaBlue (constraint solver) N-Body (physics simulation) Splay (binary tree test) These benchmarks are also available in a wide variety of other languages. I found existing ports of these benchmarks to JavaScript, Python, and Lua. For Splay, there weren't existing Python and Lua ports, so I used Claude to port them. All experiments run on Ubuntu 22.04.5 running on a Intel Core Ultra 5 135U with 32GB RAM and Fil-C++ version 0.677. Lua 5.4.7 is compiled with GCC 11.4.0. QuickJS-ng 0.14.0 is the binary from QuickJS's GitHub releases page. CPython 3.10 is just what came with Ubuntu. All experiments use the average of 30 randomly interleaved runs.

§2 Human · 6%

To be clear: for most of this post, I'll be comparing my interpreter compiled with Fil-C++ to other folks' interpreters compiled with Yolo-C compilers. Table Of Contents This post starts with a high-level description of the original AST-walking, hashtable-heavy Zef interpreter, followed by a section for each optimization that I landed on my journey to a 16.6x speed-up. Implementation vs Zef Baseline vs Python 3.10 vs Lua 5.4.7 vs QuickJS-ng 0.14.0 Zef Baseline 1x faster 35.448x slower 79.588x slower 22.562x slower Zef Change #1: Direct Operators 1.175x faster 30.161x slower 67.716x slower 19.196x slower Zef Change #2: Direct RMWs 1.219x faster 29.081x slower 65.291x slower 18.509x slower Zef Change #3: Avoid IntObject 1.23x faster 28.82x slower 64.705x slower 18.343x slower Zef Change #4: Symbols 1.456x faster 24.338x slower 54.643x slower 15.491x slower Zef Change #5: Value Inline 1.497x faster 23.673x slower 53.15x slower 15.067x slower Zef Change #6: Object Model and Inline Caches 6.818x faster 5.199x slower 11.674x slower 3.309x slower Zef Change #7: Arguments 9.047x faster 3.918x slower 8.798x slower 2.494x slower Zef Change #8: Getters 9.55x faster 3.712x slower 8.333x slower

§3 Human · 14%

2.362x slower Zef Change #9: Setters 9.874x faster 3.59x slower 8.06x slower 2.285x slower Zef Change #10: callMethod inline 10.193x faster 3.478x slower 7.808x slower 2.213x slower Zef Change #11: Hashtable 11.758x faster 3.015x slower 6.769x slower 1.919x slower Zef Change #12: Avoid std::optional 11.963x faster 2.963x slower 6.653x slower 1.886x slower Zef Change #13: Specialized Arguments 12.39x faster 2.861x slower 6.423x slower 1.821x slower Zef Change #14: Improved Value Slow Paths 13.642x faster 2.598x slower 5.834x slower 1.654x slower Zef Change #15: Deduplicated DotSetRMW::evaluate 13.609x faster 2.605x slower 5.848x slower 1.658x slower Zef Change #16: Fast sqrt 13.824x faster 2.564x slower 5.757x slower 1.632x slower Zef Change #17: Fast toString 14.197x faster 2.497x slower 5.606x slower 1.589x slower Zef Change

§4 Human · 1%

#18: Array Literal Specialization 15.351x faster 2.309x slower 5.184x slower 1.47x slower Zef Change #19: Value callOperator Optimization 16.344x faster 2.169x slower 4.87x slower 1.38x slower Zef Change #20: Better C++ Configuration 16.639x faster 2.13x slower 4.783x slower 1.356x slower Zef Change #21: No Asserts 16.646x faster 2.13x slower 4.781x slower 1.355x slower Zef in Yolo-C++ 66.962x faster 1.889x faster 1.189x slower 2.968x faster Original Zef Interpreter The original Zef interpreter was written with almost no regard for performance. Only two performance-aware choices were made: The value representation is a 64-bit tagged value that may hold a double, a 32-bit integer, or a Object*. Doubles are represented by offsetting them by 0x1000000000000 (a technique I learned from JavaScriptCore; the literature has taken to calling this NuN tagging). Integers and pointers are represented natively, and I'm relying on the fact that no pointer will have a value below 0x100000000 (a dangerous choice, but one that you could force to be true; note that I could have represented integers by giving them a high bit tag of 0xffff000000000000 if I was worried about this). This makes it easy to have fast paths for operations on numbers (because you can detect if you have a number, and what kind, with a bit test). Even more importantly, this avoids heap allocations for numbers. If you're building an interpreter from scratch, it's good to start by making good choices about the fundamental value representation, since it's super hard to change later!

§5 Human · 0%

32-bit or 64-bit tagged values are a standard place to start, if you're implementing a dynamically typed language. I used some kind of C++. It's important to pick a language that allows me to do all of the optimizations that language implementations eventually grow to have, and C++ is such a language. Notably, I would not pick something like Java, since there's a ceiling to how many low level optimizations you can do. I would also not pick Rust, since a garbage collected language requires a heap representation that has global mutable state and cyclic references (though you could use Rust for some parts of the interpreter, if you were happy with being multilingual; or you could use Rust if you were happy with lots of unsafe code). I also made tons of expedient choices that were wrong from a performance engineering standpoint: I used Fil-C++. This did allow me to move very quickly - for example, I get a garbage collector for free. Also, it meant that I spent zero time debugging memory safety issues (Fil-C++ reports memory safety violations with a pretty stack trace and lots of diagnostics) or undefined behavior (Fil-C++ does not have undefined behavior). Fil-C++ costs about 4x performance typically, so I'm starting with that 4x handicap, on top of all of the other suboptimal choices. Recursive AST walking interpreter. The interpreter is implemented as a virtual Node::evaluate method that gets overridden in a bunch of places. Strings everywhere. For example, the Get AST node holds a std::string to describe the name of the variable that it's getting, and that string is used each time a variable is accessed. Hashtables everywhere. When that Get executes, the string is used as a key to a std::unordered_map, which contains the variable value. Chains of recursive calls to crawl the scope chain. Zef allows almost all constructs to be nested and nesting leads to closures; for example, class A nested in function F nested in class B nested in function G means that member functions of class A can see A's fields, F's locals, B's fields, and G's locals. The original interpreter achieved this by recursing in C++ over functions that can query different scope objects. That said, those poor choices allowed me to implement an interpreter for a fairly sophisticated language with very little code. The largest module by far is the parser.

§6 Human · 1%

Everything else is simple and crisp. This interpreter was 35x slower than CPython 3.10, 80x slower than Lua 5.4.7, and 23x slower than QuickJS-ng 0.14.0. Let's see how far we can get by implementing a bunch of optimizations! Optimization #1: Directly Call Operators The first optimization is to have the parser generate distinct AST nodes for each operator as opposed to using the DotCall node with the name of the operator. In Zef, this: a + b Is identical to this: a.add(b) So, the original interpreter would parse a + b to DotCall(a, "add") with b as an argument. That lead to slow execution since every since math operation involved a string lookup of the operator's method name: DotCall would pass the string to Value::callMethod. Value::callMethod would cascade through multiple string comparisons. With this optimization, we have the parser create Binary<> and Unary<> nodes. With the help of some template and lambda magic, these nodes have separate virtual overrides for Node::evaluate per operator. These call directly into the corresponding Value fast paths for those operators. Hence, doing a + b now results in a call to Binary<lambda for add>::evaluate, which then calls Value::add. This change is a 17.5% speed-up. At this point, Zef is 30x slower than CPython 3.10, 67x slower than Lua 5.4.7, and 19x slower than QuickJS-ng 0.14.0. Optimization #2: Directly Call RMWs (Read Modify Write Operators) In the previous optimization, we made operators fast by avoiding string comparison based dispatch. But that change didn't affect all operators! The RMW forms of those operators, like: a += b still used string based dispatch. So, the second optimization is to have the parser generate distinct nodes for each of the RMW cases.

§7 Human · 0%

What's happening here is that the parser requests LValue nodes to replace themselves with an RMW via the makeRMW virtual call: Get - corresponds to getting a variable, i.e. just id Dot - corresponds to expr.id Subscript - corresponds to subscript, i.e. expr[index] Each of these virtual calls use the SPECIALIZE_NEW_RMW macro to create template specialized forms of: SetRMW - corresponds to id += value DotSetRMW - corresponds to expr.id += value SubscriptRMW - corresponds to expr[index] += value Note that while the rest of the operator specialization (from change #1) uses lambdas to dispatch to the appropriate operator function Value, for RMWs we use an enumeration. This is a practical choice because of the number of places we have to thread the enum through to handle the fact that we may arrive at an RMW three different ways (get, dot, and subscript). All of this magic then bottoms out in the Value::callRMW<> template function, which dispatches the actual RMW operator call. This change is a 3.7% speed-up. At this point, Zef is 29x slower than CPython 3.10, 65x slower than Lua 5.4.7, and 18.5x slower than QuickJS-ng 0.14.0. We're now 1.22x faster than where we started. Optimization #3: Avoid IntObject Checks The Value fast paths have a small problem: they use isInt(), which uses isIntSlow(), which does a virtual call to Object::isInt() to check if we're really dealing with an int. This is happening because the Zef value representation in the original interpreter had four distinct cases: A tagged int32 value. A tagged double value. An IntObject for int64's that cannot be represented as int32's. All other objects. In the IntObject case, Value still drove the dispatch for all integer methods, since that allowed the interpreter to just have one implementation of all math operators (and that implementation was alwayts in Value). This simple optimization causes Value fast paths to only consider int32 and double, and puts all IntObject handling in IntObject itself. Additionally, this change avoids the isInt() call on every method dispatch. This is a 1% speed-up.