z386: An Open-Source 80386 Built Around Original Microcode
This is the fifth installment of the 80386 series. The FPGA CPU is now far enough along to run real software, and this post is about how it works. z386 is a 386-class CPU built around the original Intel microcode, in the same spirit as z8086. The core is not an instruction-by-instruction emulator in RTL. The goal is to recreate enough of the original machine that the recovered 386 control ROM can drive it. Today z386 boots DOS 6 and DOS 7, runs protected-mode programs like DOS/4GW and DOS/32A, and plays games like Doom and Cannon Fodder. Here are some rough numbers against ao486:
| Metric | z386 | ao486 |
| --- | --- | --- |
| Lines of code (cloc) | 8K | 17.6K |
| ALUTs | 18K | 21K |
| Registers | 5K | 6.5K |
| BRAM | 116K | 131K |
| FPGA clock | 85MHz | 90MHz |
| 3DBench FPS | 34 | 43 |
| Doom (original) FPS, max details | 16.5 | 21.0 |
In current builds, z386 performs like a fast (~70MHz) cached 386-class machine, or a low-end 486. It runs at a much higher clock than historical 386 CPUs, but with somewhat worse CPI (cycles per instruction). The current cache is a 16 KB, 4-way set-associative unified L1, chosen partly to keep the clock high. Real high-end 386 systems often used larger external caches, typically in the 32 KB to 128 KB range.
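To put the benchmark rows in perspective, here is a small sketch that turns them into ratios. It uses only the figures from the comparison table; the "per clock" number simply divides out the FPGA clock difference and is an illustrative normalization of mine, not a measurement from the project.

```python
# Figures copied from the comparison table above.
z386 = {"clock_mhz": 85, "3dbench_fps": 34.0, "doom_fps": 16.5}
ao486 = {"clock_mhz": 90, "3dbench_fps": 43.0, "doom_fps": 21.0}


def relative(metric):
    """z386 result as a fraction of the ao486 result."""
    return z386[metric] / ao486[metric]


def relative_per_clock(metric):
    """Same ratio with the FPGA clock difference divided out (an assumption:
    it treats frame rate as scaling linearly with core clock)."""
    return relative(metric) / relative("clock_mhz")


for metric in ("3dbench_fps", "doom_fps"):
    print(metric, round(relative(metric), 2), round(relative_per_clock(metric), 2))
```

By this rough measure z386 reaches a little under 80% of ao486's frame rate in both tests, and somewhat more once the clock difference is factored out.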
Doom II running on z386.
Much of this 386 microarchitecture archaeology has already been covered in the previous four posts: the multiplication/division datapath, the barrel shifter, protection and paging, and the memory pipeline. z386 tries to be both an educational reconstruction and a usable FPGA CPU.
It keeps many 386-like structures: a 32-entry paging TLB, a barrel shifter shaped like the original, ROM/PLA-style decoding, the Protection PLA model, and most importantly the 37-bit-wide, 2,560-entry microcode ROM. At the same time, it uses FPGA-friendly shortcuts where they make sense, such as DSP blocks for multiplication and the small fast L1 cache. In this post, I will fill in the rest of the design: instruction prefetch, decode, the microcode sequencer, cache design, testing, how z386 differs from ao486, and some lessons from the bring-up.
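For a sense of scale, the microcode ROM dimensions quoted above can be turned into a storage figure. This is plain arithmetic on the two numbers from the text (37-bit words, 2,560 entries); the variable names are only for illustration.

```python
# Dimensions taken from the text: 37-bit words, 2,560 entries.
WORD_BITS = 37
ENTRIES = 2560

total_bits = WORD_BITS * ENTRIES         # raw ROM capacity in bits
total_bytes = total_bits / 8             # same capacity expressed in bytes
addr_bits = (ENTRIES - 1).bit_length()   # address width needed to index every entry

print(total_bits, total_bytes, addr_bits)
```

That comes to 94,720 bits, or about 11.6 KiB, addressed with 12 bits.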
From z8086 to z386

A little background first. Last year I wrote z8086, an original-microcode-driven 8086, based on reenigne's disassembly work. That project showed that it was possible to build a working CPU around recovered microcode. Towards the end of the year, I learned that 80386 microcode had recently been extracted, and that reenigne and several others (credited at the end of this post) were working on a disassembly. They generously shared their work with me, and z386 started from there. The 386 is a very different problem from the 8086. The instruction set is larger, the internal state is much richer, and the machine has to enforce protection, paging, privilege checks, and precise faults.
More importantly, the 80386 micro-operations are denser and more contextual. If the 8086 microcode reads like a straightforward C program, the 386 microcode reads more like hand-tuned assembly: short, subtle, and full of assumptions about hidden hardware. That puzzle took about four months of evenings and weekends. The result is not a perfect 386 yet, but it is now far enough along to run real protected-mode DOS software.

z386 - high-level view

At a high level, the 386 is organized around eight major units. z386 follows the same division closely enough that the original Intel block diagram is still a useful map.
The 80386 as eight cooperating units. Source: Intel, The Intel 80386 - Architecture and Implementation, Figure 8.
The diagram maps quite well onto the real 386 die shot, although the relative positions of the units are different.
The same eight-unit organization on the 80386 die. Base image: Intel 80386 DX die, Wikimedia Commons.
Here is what those units do in z386:

1. Prefetch unit. Keeps a 16-byte code queue filled from memory. Branches, faults, interrupts, and segment changes can flush and restart it.
2. Decoder. Consumes instruction bytes, tracks prefixes, recognizes ModR/M and SIB forms, gathers immediates and displacements, and maps instructions to microcode entry points.
3. Microcode sequencer. Fetches expanded microcode words, handles jumps, delay slots, faults, and run-next-instruction behavior.
4. ALU and shifter. Implements arithmetic, logic, flags, bit operations, shifts, rotates, multiplication, and division support.
5. Segmentation unit. Computes logical-to-linear addresses, applies segment bases and limits, and stores the hidden descriptor-cache state.
6. Protection unit. Recreates the 386 Protection PLA behavior for selector and descriptor validation.
7. Paging unit. Handles TLB lookup, page walks, Accessed/Dirty updates, page faults, and the transition from linear to physical addresses.
8. BIU/cache/memory path. Connects CPU memory operations to paging, cache, SDRAM, ROM, I/O, and the surrounding PC system.
This organization is quite different from the tidy pipelines usually shown for modern RISC-style CPUs. The 386 is better thought of as several large, partly independent state machines that overlap. Prefetch can run while the execution unit is busy. Decode can prepare later instructions. Address translation can start before the bus is needed. Protection tests can redirect the sequencer a few cycles later. Intel's papers describe up to six instructions being in different phases of processing at once, but the execution unit still consumes one micro-instruction per cycle. Unlike the 486 and later processors, which reorganized the design into a finer-grained pipeline aimed at one instruction per clock, the 386 still needs at least two microcode cycles for even simple register-register instructions.

Previous posts covered units 4 through 8 in some depth. Here let's start with the front end: prefetch, decode, and the microcode sequencer.

Instruction prefetch

The original 8086 can move one byte at a time from its instruction queue into the execution side. For the 386, the bandwidth math changed. Jim Slager's ICCD 1986 paper, "Performance Optimizations of the 80386", gives a useful back-of-the-envelope calculation: the average 80386 instruction is about four bytes long, and the weighted average instruction takes about four clocks, so steady-state execution needs about one byte of code per clock.
In practice, the prefetcher needs burst bandwidth above that average. It has to smooth over variable-length instructions, taken branches, and data cycles that steal bus slots from prefetch. The external bus can support this: it can read four bytes every two clocks, or two bytes per clock. The 386 prefetch unit therefore fills a 16-byte code queue with 32-bit fetches, taking advantage of the full non-multiplexed 32-bit bus.
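The bandwidth argument can be written out as a few lines of arithmetic. The sketch below uses only the figures quoted above (four-byte average instruction, four clocks per instruction, four bytes per two-clock bus cycle, 16-byte queue). The fill-time figure assumes back-to-back prefetch cycles with no data accesses competing for the bus, which is my simplification.

```python
from fractions import Fraction

# Demand side (Slager's estimate as quoted in the text).
avg_instr_bytes = 4
avg_instr_clocks = 4
demand = Fraction(avg_instr_bytes, avg_instr_clocks)    # bytes of code needed per clock

# Supply side: a 32-bit bus read every two clocks.
bus_bytes_per_cycle = 4
clocks_per_bus_cycle = 2
supply = Fraction(bus_bytes_per_cycle, clocks_per_bus_cycle)  # peak bytes per clock

headroom = supply / demand                              # peak supply over average demand

# Time to refill an empty queue, assuming uninterrupted prefetch cycles.
queue_bytes = 16
fetches_to_fill = queue_bytes // bus_bytes_per_cycle
clocks_to_fill = fetches_to_fill * clocks_per_bus_cycle

print(demand, supply, headroom, clocks_to_fill)
```

Peak supply is twice the average demand, which is the slack the prefetcher uses to recover after a flush or after data cycles have taken the bus.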
The z386 front end keeps byte-at-a-time structure decode, but exposes a wider window for displacement and immediate fields.
The next question is the interface between the prefetcher and the instruction decoder. The 8086 side is again byte-at-a-time. On the 80386, the interface is more subtle: the structure-deciding part of decode still proceeds byte by byte, while literal fields such as displacement and immediate data can be consumed in 1-, 2-, or 4-byte chunks. This is a small but important difference from the 8086 model. A 386 instruction may contain prefixes, an opcode, a ModR/M byte, a SIB byte, a displacement, and an immediate. The prefix/opcode/ModR/M part controls what the instruction is, so reading it one byte at a time keeps the logic compact.
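As a rough software model of this interface, the sketch below shows a queue that can be read one structure byte at a time or as a wider literal field. The class and method names (`CodeQueue`, `next_byte`, `window32`, `consume`) are hypothetical and are not taken from the z386 RTL. The example bytes `A1 78 56 34 12` encode `MOV EAX, [0x12345678]` in 32-bit mode.

```python
from typing import Optional


class CodeQueue:
    """Toy model of a 16-byte prefetch queue; all names here are hypothetical."""

    CAPACITY = 16

    def __init__(self) -> None:
        self._buf = bytearray()

    def fill(self, data: bytes) -> int:
        """Accept as many fetched bytes as fit; return how many were taken."""
        room = self.CAPACITY - len(self._buf)
        taken = data[:room]
        self._buf += taken
        return len(taken)

    def next_byte(self) -> Optional[int]:
        """Narrow view: the byte at the current offset, or None if empty."""
        return self._buf[0] if self._buf else None

    def window32(self) -> bytes:
        """Wide view: up to four bytes starting at the current offset."""
        return bytes(self._buf[:4])

    def consume(self, n: int) -> None:
        """Advance past n bytes (1 for structure bytes; 1, 2 or 4 for literals)."""
        if n not in (1, 2, 4) or n > len(self._buf):
            raise ValueError("cannot consume %d bytes" % n)
        del self._buf[:n]

    def flush(self) -> None:
        """Drop everything, as on a taken branch or fault."""
        self._buf.clear()


q = CodeQueue()
q.fill(bytes.fromhex("a178563412"))             # MOV EAX, [0x12345678]
opcode = q.next_byte()                          # structure byte, read singly
q.consume(1)
disp = int.from_bytes(q.window32(), "little")   # whole 32-bit literal in one step
q.consume(4)
```

Here the opcode costs one step and the four-byte address costs one more, instead of five byte-sized steps.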
But once the decoder knows that the next four bytes are just a displacement, there is no architectural reason to spend four separate cycles collecting them. To implement this, z386 provides two views of the code queue: the next byte, and a 32-bit window starting at the current byte offset.

Decode

x86 instruction decoding is hard because the instruction boundary is not obvious. There may be several prefixes, then an opcode, maybe a 0F escape opcode, maybe a ModR/M byte, maybe a SIB byte, then displacement and immediate fields whose sizes depend on mode bits and earlier bytes. A decoder has to discover the structure of the instruction while it is still reading it.

The decoder input is the byte stream from the prefetch queue, plus current mode state: operand-size default, address-size default, protected-mode state, accumulated prefixes, and whether the instruction is in the 0F extended-opcode space. The output is not just an opcode. Slager's paper describes the 386 instruction unit as producing a 111-bit decoded instruction word and inserting it into a three-entry instruction queue. That decoded word is the contract between the front end and the execution side.

Conceptually, the decoded word contains the execution entry point and everything the microcode should not have to rediscover from raw bytes: opcode, prefix state, operand/address size, ModR/M and SIB bytes, immediate and displacement values, instruction length, selected source and destination register fields, segment override, memory-form bits, and special control flags such as stack operation or flag-update behavior. z386 represents this as a decoded-instruction record and pushes it into a small FIFO for the microcode sequencer.

To build that word, the decoder is a state machine supported by two PLA-style tables. The Control PLA answers the structural question: what comes next? 
It classifies the current byte as a prefix, an opcode that is complete, an opcode that needs ModR/M, or an opcode with an immediate-size class. In the ModR/M state, the same PLA helps decide whether SIB and displacement bytes follow. The Entry PLA answers the execution question: where does the microcode start? The first pass uses operand size, opcode, REP state, protected-mode state, and the 0F escape flag.
Some opcodes need a second pass after ModR/M is known, because the ModR/M reg field or memory/register form selects the final routine. For example, decode of 8B 44 24 08 in 32-bit mode is not one lookup. The decoder learns the instruction's meaning as it goes:
| Byte | Meaning | Decoder action |
| --- | --- | --- |
| 8B | MOV r32, r/m32 | Control PLA says ModR/M follows. Entry PLA first pass says this is the MOV r,rm class. |
| 44 | ModR/M: mod=01, reg=000, r/m=100 | Select destination register EAX, mark memory form, and run the Entry PLA second pass. Because this is memory MOV r,rm, the final microcode entry is 0x019. |
| 24 | SIB: scale 1, no index, base ESP | Capture SIB and select the effective-address base form. |
| 08 | disp8 | Capture displacement +8, compute instruction length 4, and push the decoded record. |
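The field values in this walk-through can be checked with a few lines of bit manipulation. The sketch below is a software illustration only: it handles just opcode 8B with 32-bit addressing and no prefixes, it does not model the Control or Entry PLAs or the microcode entry lookup, and the function and key names are my own.

```python
REG32 = ("EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI")


def fields(b):
    """Split a ModR/M or SIB byte into its 2-bit, 3-bit, 3-bit fields."""
    return (b >> 6) & 0b11, (b >> 3) & 0b111, b & 0b111


def decode_8b(code):
    """Decode MOV r32, r/m32 (opcode 8B), 32-bit addressing, no prefixes."""
    assert code[0] == 0x8B
    mod, reg, rm = fields(code[1])
    rec = {"dest": REG32[reg], "memory_form": mod != 0b11,
           "base": None, "index": None, "scale": 1, "disp": 0}
    pos = 2
    if mod == 0b11:                        # register form: r/m names a register
        rec["src"] = REG32[rm]
        rec["length"] = pos
        return rec
    base_code = rm
    if rm == 0b100:                        # r/m=100 means a SIB byte follows
        ss, idx, base_code = fields(code[pos])
        pos += 1
        rec["scale"] = 1 << ss
        rec["index"] = None if idx == 0b100 else REG32[idx]
    no_base = mod == 0b00 and base_code == 0b101   # disp32 with no base register
    rec["base"] = None if no_base else REG32[base_code]
    disp_size = 1 if mod == 0b01 else 4 if (mod == 0b10 or no_base) else 0
    # Displacements are little-endian and sign-extended.
    rec["disp"] = int.from_bytes(code[pos:pos + disp_size], "little", signed=True)
    rec["length"] = pos + disp_size
    return rec


rec = decode_8b(bytes.fromhex("8b442408"))
print(rec)
```

For `8B 44 24 08` this yields destination EAX, base ESP, no index, displacement 8, and length 4, matching the table.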
The resulting decoded entry says, in effect: opcode 8B, ModR/M 44, SIB 24, displacement 8, operand/address size 32-bit, destination register EAX, memory operand based on ESP+8, instruction length 4, and microcode entry 0x019. The microcode engine can then start at 0x019 without re-reading the raw byte stream. The PLA structure keeps decode compact: a few dense ROM/PLA tables are much smaller than large groups of separate gates.

There are still open questions here. z386 uses the Control PLA lines needed by the current decoder, but many recovered lines are still unused or only partly understood. Understanding more of the PLAs may let the decoder become smaller and faster.

Microcode sequencer - the control program

z386 uses the original Intel 386 microcode as its main control program. The ROM decides which internal values move, when the ALU runs, when memory cycles start, when the sequencer branches, and when the next x86 instruction may begin. The RTL does not implement ADD, IRET, or SGDT instructions as large behavioral blocks. It implements the hardware that the microcode expects to control.