Theseus, a static Windows emulator

neugierig.org

April 19, 2026

This post is likely the end of my series on retrowin32.

I bring you: Theseus, a new Windows/x86 emulator that translates programs statically, solving a bunch of emulation problems while surely introducing new ones.

What happened to retrowin32?

I haven't been working on retrowin32, my win32 emulator, in part due to life stuff and in part because I haven't been sure where I wanted to go with it. And then someone who had contributed to it in the past posted retrotick, their own web-based Windows emulator that looks better than my years of work, and commented on HN that it took them an hour with Claude.

This is not a post about AI, both because there are too many of those already and because I'm not yet sure of my own feelings on it. But one small thing I have been thinking about is that (1) AI has been slowly but surely climbing the junior to senior engineer ladder; and (2) one of the main pieces of being a senior engineer is better understanding what you ought to be building, as distinct from how to build it.

(Is that just the Innovator's Dilemma's concept of "retreating upmarket", applied to my own utility as a human? Not even sure. I am grateful I do this work for the journey, to satisfy my own curiosity, because that means I am not existentially threatened like a business would be in this situation. As Benny Feldman says: "I cheat at the casino by secretly not having an attachment to material wealth!")

So, Mr. Senior Engineer, what ought we build? What problem are we even solving with emulators, and how do our approaches meet that? I came to a kind of unorthodox solution that I'd like to tell you about!

Emulators and JITs

The simplest CPU emulator is very similar to an interpreter. An input program, after parsing, becomes x86 instructions like:

    mov eax, 3
    add eax, 4
    call ...  ; some Windows system API

An interpreting emulator is a big loop that steps through the instructions.
It looks like:

    loop {
        let instr = next_instruction();
        match instr {
            // e.g. `mov eax, 3`
            Mov => { set(argument_1(), argument_2()); }
            // e.g. `add eax, 4`
            Add => { set(argument_1(), argument_1() + argument_2()); }
            ...
        }
    }
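
To make the loop above concrete, here is a minimal runnable sketch of an interpreting emulator. Everything in it (the `Reg`, `Operand`, `Instr` and `Cpu` types, and the choice to treat the program as a list of already-decoded instructions) is an assumption made for illustration, not code from retrowin32 or Theseus.

```rust
/// The registers this toy machine knows about (a real x86 has many more).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Reg {
    Eax,
    Ebx,
}

/// An operand is either a register or an immediate constant.
#[derive(Clone, Copy, Debug)]
pub enum Operand {
    Reg(Reg),
    Imm(u32),
}

/// Already-decoded instructions; a real emulator would decode these from bytes.
#[derive(Clone, Copy, Debug)]
pub enum Instr {
    Mov(Reg, Operand),
    Add(Reg, Operand),
}

/// Hypothetical CPU state for the sketch.
#[derive(Default, Debug)]
pub struct Cpu {
    pub eax: u32,
    pub ebx: u32,
    /// Index into the instruction list, standing in for x86's EIP.
    pub eip: usize,
}

impl Cpu {
    fn get(&self, reg: Reg) -> u32 {
        match reg {
            Reg::Eax => self.eax,
            Reg::Ebx => self.ebx,
        }
    }

    fn set(&mut self, reg: Reg, value: u32) {
        match reg {
            Reg::Eax => self.eax = value,
            Reg::Ebx => self.ebx = value,
        }
    }

    fn read(&self, op: Operand) -> u32 {
        match op {
            Operand::Reg(r) => self.get(r),
            Operand::Imm(v) => v,
        }
    }

    /// Run until execution falls off the end of the program.
    pub fn run(&mut self, program: &[Instr]) {
        while let Some(&instr) = program.get(self.eip) {
            self.eip += 1;
            // This dispatch is repeated every time an instruction executes.
            match instr {
                Instr::Mov(dst, src) => {
                    let value = self.read(src);
                    self.set(dst, value);
                }
                Instr::Add(dst, src) => {
                    // Flags are ignored here; x86 `add` also updates them.
                    let value = self.get(dst).wrapping_add(self.read(src));
                    self.set(dst, value);
                }
            }
        }
    }
}
```

Running `mov eax, 3` followed by `add eax, 4` through `Cpu::run` leaves 7 in `eax`, but only after two rounds of fetching, matching on the instruction kind, and matching again on each operand.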

Like an interpreter, this approach is slow. At a high level interpreters are slow because they are doing a bunch of dynamic work for each instruction. Imagine emulating a program that runs the same add instruction in a loop; the above emulator loop has all these function calls to repeatedly ask "what instruction am I running now?" and inspect the arguments, only to eventually do the same add on each iteration.

x86 memory references are extra painful because they are very flexible. Further, on x86 the add instruction not only adds the numbers but also computes six derived values, including things like the parity flag: whether the result contains an even number of 1 bits(!). A correct emulator needs to either compute all of these as well, or perform some sort of side analysis of the code to decide how to run it efficiently.

There are various fun techniques to improve emulators. But if you want to go fast what you really need is some combination of analyzing the code and generating native machine code from it: a JIT.

JITs are famously hard to write! They are effectively optimizing compilers, which means all the complexity of optimization and generating machine code, but also where the runtime of the compilation itself is in the critical performance path. I liked this post's discussion of why JITs are hard which mentions there have been more than 15 attempts at a Python JIT.

Static binary translation

So suppose you want to generate efficient machine code, but you don't want to write a JIT. You know what's really good at analyzing code and generating efficient machine code from it? A compiler!

So here's the main idea.

Given code like the above input x86 snippet, we can process it into source code that looks like:

    regs.eax = 3;
    regs.eax = add(regs.eax, 4);
    windows_api();  // some native implementation of the API that was called
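
As a rough illustration of what such generated source could look like once it has to compile, here is a small self-contained sketch. The `Regs` and `Flags` layouts, the extra flags parameter on `add`, and the empty `windows_api` stub are all assumptions made for this example; the real Theseus output may be shaped quite differently. Only three of the six status flags that x86 `add` updates are modeled.

```rust
/// Hypothetical subset of the x86 status flags (SF, OF and AF are omitted).
#[derive(Default, Debug)]
pub struct Flags {
    pub zf: bool, // zero flag
    pub cf: bool, // carry flag
    pub pf: bool, // parity flag
}

/// Hypothetical x86 state carried around inside the translated program.
#[derive(Default, Debug)]
pub struct Regs {
    pub eax: u32,
    pub flags: Flags,
}

/// 32-bit add that also updates the modeled flags.
#[inline(always)]
pub fn add(flags: &mut Flags, x: u32, y: u32) -> u32 {
    let (result, carry) = x.overflowing_add(y);
    flags.zf = result == 0;
    flags.cf = carry;
    // x86 defines PF over the low 8 bits of the result:
    // it is set when those bits contain an even number of 1s.
    flags.pf = (result as u8).count_ones() % 2 == 0;
    result
}

/// Stand-in for a native implementation of a Windows API; does nothing here.
pub fn windows_api(_regs: &mut Regs) {}

/// What a translator might emit for `mov eax, 3; add eax, 4; call ...`.
pub fn translated_block(regs: &mut Regs) {
    regs.eax = 3;
    regs.eax = add(&mut regs.flags, regs.eax, 4);
    windows_api(regs);
}
```

There is no instruction fetch or dispatch left in `translated_block`: the sequence of operations is fixed in the source, so an optimizing compiler sees plain assignments and one inlinable function call.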

We then feed this code back in to an optimizing compiler to get a program native to your current architecture, x86 no longer needed.

In other words, instead of handing an .exe file directly to an emulator that might JIT code out, we instead have a sort of compiler that statically translates the .exe (via a second compiler in the middle) directly into a "native" executable.

(I write native in scare quotes because while the resulting executable is a native binary, it is a binary that is carrying around a sort of inner virtual machine representing the x86 state, like the regs struct in the above code. More on this in a bit.)

I think I came up with this basic idea on my own just by thinking hard about what I was trying to achieve, but it turns out this approach is known as static binary translation and is well studied. It has some nice properties, and also some big problems.

Decompilation

I'll go into those, but first, a minor detour about how I ended up here.

Have you heard of decompilation? These madmen (madpeople?) are manually recreating the source code to old video games, one function at a time. They take the game binary, extract the machine code of one function, then use a fancy UI (click one of the entries under "Recent activity") to iteratively tinker on reproducing the higher-level code that generates the exact same machine code. It's kind of amazing.

(To do this, they need to even run the same original compiler that was used to compile the target game. Those compilers are often Windows programs, which means implementing the above fancy UI involves running old Windows binaries on their Linux servers. This is how I first learned about them: they need a Windows emulator!)

Decompilation is not only just a weird and fascinating (and likely tedious?)
human endeavor. It also highlighted something important for me: I don't so much care about having an emulator that can run any random program, I care about running a few very specific programs and I'm willing to go to even some manual lengths to help out.

In practice, if you look at a person building a Windows emulator, they end up as surgeons needing to kind of manually reach in and pump the heart of the target program themselves anyway, including debugging the target program and working around its individual bugs. It's common for emulators to even manually curate a list of programs that are known to work or fail.

An old idea

Statically translating machine code is not a new idea. Why isn't it more popular? My impression in trying to read about it is that it is often dismissed because it can't work, but at least so far it's worked well. Maybe I haven't yet encountered some impossible problem that I've so far overlooked?

(When trying to look up related work for this blog post, I saw this attempt at statically translating NES that concluded it can't be done, but then also these people seem to be succeeding at it so it's hard to say.)

I think there are two main problems, a technical one and a more cultural one.

The technical part is that the simple idea has complex details. To start with, any program that generates code at runtime (e.g. itself containing a JIT) won't work, but it's easy for me to just dismiss those programs as out of scope. There are also challenges around things like how control flow works, but those are small and interesting and I might go into them in future posts.

A common topic of research is that it's in the limit impossible to statically find all of the code that might be executed even in a program that doesn't generate code at runtime, because of dynamic control flow from vtables or jump tables. In particular, while there are techniques to find most of the code, no approach is guaranteed to work perfectly.

This is where decompilation changed my view: if I'm willing to manually help out a bit on a specific program, then this problem might be fine?
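
To illustrate why manual help can paper over the code-discovery problem, here is one plausible shape for an indirect jump in a translated program. This is a guess for illustration only, not a description of how Theseus handles it: the addresses, the `block_*` functions, and the table lookup are all hypothetical. The idea is that an unknown target fails loudly with its address, so a human can add that address to the list of entry points and re-run the translator.

```rust
use std::collections::HashMap;

/// Hypothetical x86 state, reduced to one register for the sketch.
pub struct Regs {
    pub eax: u32,
}

/// A translated basic block is just a native function over the x86 state.
pub type Block = fn(&mut Regs);

// Made-up blocks standing in for code the translator found at these addresses.
fn block_401000(regs: &mut Regs) {
    regs.eax = 1;
}

fn block_401020(regs: &mut Regs) {
    regs.eax = 2;
}

/// Map from original x86 addresses to translated code. Entries could come
/// from automatic analysis or from a hand-maintained list of extra targets.
pub fn block_table() -> HashMap<u32, Block> {
    let mut table: HashMap<u32, Block> = HashMap::new();
    table.insert(0x0040_1000, block_401000);
    table.insert(0x0040_1020, block_401020);
    table
}

/// Translation of something like `jmp eax` or a vtable call: the target is
/// only known at runtime, so look it up among the blocks found ahead of time.
pub fn indirect_jump(
    table: &HashMap<u32, Block>,
    regs: &mut Regs,
    target: u32,
) -> Result<(), String> {
    match table.get(&target).copied() {
        Some(block) => {
            block(regs);
            Ok(())
        }
        None => Err(format!(
            "no translated code at {:#x}; add it as a known entry point",
            target
        )),
    }
}
```

A jump to `0x401020` runs the second block, while a jump to an address the translator never saw returns an error naming that address instead of silently misbehaving.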

The main cultural reason I think binary translation isn't more common is that it's not as convenient as a generic emulator that handles most programs already. Users aren't likely to want to run a compiler toolchain, though I have seen projects embed the compiler (e.g. LLVM) directly to avoid this.

The other cultural problem is there are legal ramifications if you intend to distribute translated programs. Every video game emulator relies on the legal fiction of "first, copy the game data from the physical copy you already own and pass that in as an input", so they get to plausibly remain non-derivative works.

But I'm not solving for users, I'm solving for my own interest. These cultural problems don't matter to me.

Benefits

Again consider the snippet above, which is adding 3 and 4. In a static translator world we parse the instruction stream ahead of time, so the compiler gets to see that we want to put a 3 in eax and not (as an interpreter would) spend runtime considering what values we are reading and writing where. A compiler will not only generate the correct machine code for the target architecture, it even will optimize code like the above to just store the resulting value 7. And a compiler is capable of eliminating unneeded code like parity computations if you frame things right.

Because the Theseus code generation happens "offline", separately from the execution of the program, I can worry less than a JIT might about spending runtime analyzing the code to try to help.

When I started this I had thought that performance would be the whole benefit of this approach, but it turns out to be easier to develop as well because it brings in all of the other developer tools:

- The translated instructions appear as regular code in the output program, which means the native debugger can step translated instructions, which appear as regular source code.
- If the program crashes, the native stack trace traces back in to the (translated assembly of the) original program.
- I haven't tried it yet, but CPU profiling ought to have the same benefit.

In retrowin32 I ended up building a whole debugger UI to help track down problems, but in Theseus I've just used my system debugger so far and it's been fine.
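
Returning to the remark under Benefits that a compiler can drop unneeded parity computations "if you frame things right": one well-known framing is lazy flags, where `add` only records its result and each flag is a pure function computed on demand. The sketch below assumes that technique purely for illustration; it is not confirmed to be what Theseus does, and `LazyFlags`, `add` and `translated_snippet` are hypothetical names.

```rust
/// Hypothetical flag state: remember the last arithmetic result instead of
/// eagerly computing every flag after every instruction.
#[derive(Default, Debug)]
pub struct LazyFlags {
    last_result: u32,
}

impl LazyFlags {
    /// Zero flag: set when the last result was zero.
    pub fn zf(&self) -> bool {
        self.last_result == 0
    }

    /// Parity flag: x86 looks only at the low 8 bits of the result and sets
    /// the flag when they contain an even number of 1 bits.
    pub fn pf(&self) -> bool {
        (self.last_result as u8).count_ones() % 2 == 0
    }
}

/// 32-bit add that records just enough to answer flag queries later.
#[inline]
pub fn add(flags: &mut LazyFlags, x: u32, y: u32) -> u32 {
    let result = x.wrapping_add(y);
    flags.last_result = result;
    result
}

/// The `mov eax, 3; add eax, 4` snippet in this style.
pub fn translated_snippet() -> (u32, LazyFlags) {
    let mut flags = LazyFlags::default();
    let eax = 3;
    let eax = add(&mut flags, eax, 4);
    (eax, flags)
}
```

If no later code calls `pf()`, the parity computation never appears in the program at all, and after inlining an optimizer is free to fold `3 + 4` into a constant 7.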