Pangram verdict · v3.3
We believe that this document is fully human-written
AI likelihood · overall
HumanArticle text · 1,663 words · 5 segments analyzed
Compilers, especially method just-in-time compilers, operate on one function at a time. It is a natural code unit size, especially for a dynamic language JIT: at a given point in time, what more information can you gather about other parts of a running, changing system?
I don’t have any data to back this up—maybe I should go gather some—but on average, methods are small. Especially in languages such as Ruby that use method dispatch for everything, even instance variable (attribute, field, …) lookups, they are small. And everywhere.
This makes the compiler sad. If we are to continue to anthropomorphize them, compilers like having more context so they can optimize better. Consider the following silly-looking example that is actually representative of a surprising amount of real-world code:
class Point attr_reader :x, :y
def initialize(x, y) @x = x @y = y end
def distance(other) Math.sqrt((@x - other.x)**2 + (@y - other.y)**2) end end
def distance_from_origin(x, y) Point.new(x, y).distance(Point.new(0, 0)) end
Right now, in the distance_from_origin method, I count 8 different method calls:
Point.new Point#initialize Point.new Point#initialize Point#distance Float#** Float#** Math.sqrt
(Technically more, but the ivar lookups (including attr_reader!), addition, and subtraction are generally specialized and don’t push a frame, even in the interpreter.)
Furthermore, there are at least two heap allocations: one for each Point instance.
Last, there is a bunch of memory traffic to and from Point instances.
This all is a huge bummer! What should be a simple math operation is now overwhelmed with a bunch of other stuff. Point is certainly not a zero-cost abstraction.
Even if we had a bunch of other optimizations such as load-store elimination or escape analysis, they would not be able to do much: pretty much everything escapes and is effectful. That is, unless we inline. Inlining is the lever that enables a bunch of other optimization passes to kick in.
Inlining: the “easy” part
I wrote about the design and implementation of Cinder’s inliner (FB link, personal blog link) a couple of years ago. I wrote about arguably the simplest part, which is copying the callee body into the caller. It took me at least a week to get working. Probably closer to months if you consider all the plumbing through the rest of the JIT. In February during a small hackathon, I watched my colleague k0kubun prototype that bit of the inliner inside ZJIT in about 30 minutes.
There is more to do when pretty much every part of the VM is observable from the guest language: both Python and Ruby allow inspecting the state of the locals, the call stack, etc from user code. Sampling profilers also expect some amount of breadcrumbs to work with to inspect the stack. So there’s some more machinery still required to pretend like the callee function was not inlined. I talk about this a little bit in the Cinder blog post.
Even so, all of that can probably be designed and wired together in a couple of months. Then you will find yourself tuning the inliner for the next 10 years. This is much harder.
When: the harder part
The thing that makes inlining difficult, especially in a method JIT, is that you are trying to make an entire (dynamic!) system faster but you are only looking through a microscope and only capable of local reasoning1. Whereas other optimizations such as strength reduction, inline caches, and value numbering are an un-alloyed good for the generated code, inlining can have negative effects. It is also perhaps the first optimization people add that has non-local impact.
If you inline wrong, your code size might blow up. This might thrash your CPU’s caches. Bummer, but happens to the best of us.
But also, if you inline wrong, you might get in the way of other helpful optimizations: if you hit some size limit after inlining method A, you might never get to inline B, which is the key to unlocking the performance of the method you are trying to optimize.
Last, inlining might hurt compile time.
In situations where latency is paramount (think: interactive client JavaScript), adding tons more code into the fray might add noticeable hiccups, even if the long-term throughput improves. As always, in-band compilation is a trade-off because any time you spend compiling, you are not executing code.
You have to write your compiler to reason about all of this stuff. So you have heuristics. For example, here is Michael Pollan’s inliner heuristic:
Inline methods. Mostly small. Not too many.
I did a survey of a bunch of compilers, mostly JIT compilers, to see what their inlining heuristics look like. I also read (skimmed) some papers to see what those folks had to say. I wonder if they agree.
This post was a long time coming. I started working on it about five years ago but then when I quit working at Facebook I accidentally left behind all of the inliner research I did for Cinder’s inliner. So then I kind of just thought about it aimlessly for a while before redoing it this year. Anyway, here’s wonderwall.
The heuristics
Spoiler alert: all in all, people tend to look at:
Profiles of call target Cumulative caller size (increasing as callees get inlined) Callee size Inline depth Number of inlined calls at a certain depth If recursion is present Callee/caller call count ratio (if callee only called less than K% of calls to caller, don’t inline callee) Callee stack usage Polymorphism in callee What mode the compiler is in (baseline vs more aggressive) If the callee looks like it always raises/throws
And also have different interesting ways to pipe in profile information.
Last, some newer papers do some wild stuff:
Train neural networks to make inlining decisions Let inlining drive the entire optimization pipeline, treating it as a search heuristic over a BFS walk of the call graph Use AOT-gathered information to aid in JIT heuristics
Another thing to consider in inlining is how you gather and interpret profiles.
Call context and profiles: the other harder part
When you compile a function, you tend to specialize it based on the input it has historically been given. For a monomorphic input, maybe you guard that the type is still the same and otherwise jump into the interpreter. For a polymorphic input, maybe you check the top K (~4) common cases and otherwise jump into the interpreter. Fine.
But sometimes you can be compiling a polymorphic method bar that is actually monomorphic in its caller foo. That is, foo might only ever pass one kind of input to bar, but other callers pass all kinds of stuff. Here is a bit of a silly example to show what I mean:
class HashWithIndifferentAccess def initialize @hash = {} end
# Allow reading from the Hash with either a String or a Symbol def [](key) = @hash[key.to_sym]
# ... end
# some method... some_hash = HashWithIndifferentAccess.new # ... some_hash["abc"]
# some other method... another_hash = HashWithIndifferentAccess.new # ... another_hash[:xyz]
Just kidding, not so silly at all. It’s a super common pattern in Rails. It makes key polymorphic in HashWithIndifferentAccess#[] even though for many of its callers, it may well be monomorphic (or even a constant).
In order to plumb this information through to the compiler, you have to figure out this call context relationship. There are a couple of common ways to do it.
Splitting
YJIT, for example, though it does not inline, splits methods based on the types of the arguments going in. This means that it clones the compiled code, generating a new version for each context. This does not give call context (“A calls B”) but gives type context (“B is called with integers, B’ is called with strings”).
A compiler could do type-based splitting in the interpreter or a baseline tier.
Profile splitting
If you don’t fancy duplicating the code, you can instead duplicate the profiles. You could either do this using type context (as above) or using call context.
SpiderMonkey, for example, does “trial inlining” that allows callers to pass down a bit of memory for potential inline candidate callees to record their inline caches. Instead of each function holding its own ICScript, the caller allocates a unique ICScript for that potential-inline call-site. This gives each callee function (at least?) one level of call context.
Later, when inlining the callee into the caller, we don’t have other callers’ type information polluting the IR builder (or whatever reads the profiles).
Bytecode inlining
JavaScriptCore handles this by inlining bytecode into other bytecode. This is a gnarly transformation but gives the interpreter, even (!) access to call context. On tier-up to the compiler, all the inlining decisions have been made already.
Early tier with counters
HotSpot handles this with multiple tiers. The interpreter tiers up to the client compiler, C1. C1 profiles branch and call targets in compiled code. C1 may eventually recompile based on this new information. C1 may eventually tier up to C2, which copies C1 inlining decisions. This way, we get call context in profiles via inlining.
Inline and analyze and hope
One last thing you could do is just trust your type inference and branch folding in the optimizer. You could inline and do polymorphic specialization in the callee when building the IR, then hope that your branch pruning monomorphizes the inlined callee. It’s a little wasteful because the polymorphic code is built “for nothing”, but it might work fine?
Okay, onto the collected notes and half-baked commentary. Here’s a survey of a bunch of JIT compilers and how they reason about inlining heuristics.
Thanks
But before we get into that, thanks to Iain Ireland, CF Bolz-Tereick, and Ian Rogers for feedback on this blog post!
The survey: bits and bobbles
What follows is mostly a “bits and bobbles” section a la Phil Zucker.
We’ll start with Cinder, because when I wrote Cinder’s inliner I added only the simplest heuristics, mostly “don’t inline” signals. Over time, after I left, people tuned it a bit more.