Pangram verdict · v3.3
We believe that this document is fully human-written
AI likelihood · overall
HumanArticle text · 1,667 words · 5 segments analyzed
This is a DRAFT of the first part of Chapter 4 - On CPU Physics and CPU Cycles, of the Vol.1 of an upcoming book "Efficient C++ Programming for Modern 64-bit CPUs" by Sherry Ignatchenko and Dmytro Ivanchykhin. Feel free to comment on it - especially if you see some factual inconsistencies, we'll be happy to fix them.Another benefit of striving for efficiency is thatthe process forces you to understand the problem in more depth— Alex Stepanov, author of the original STLNow, let’s get down to some specifics. Let’s see Fig. 1 for a diagram of a “representative” motherboard, which has a “representative” CPU chip, which in turn has a “representative” CPU core. Here, we’re intentionally NOT referring to any specific motherboard or CPU design, just saying they’re “representative” in the sense that “it's more or less common to see similar things in 2026 CPUs”.Before delving into all the exciting details of the picture, let’s postulate one thing: the farther physically our electrical signal has to go, the slower our access is.characteristic times of electronic signals are restricted by so-called parasitic capacitances, and parasitic capacitances in general are proportional to the length of the connectionThis is a pretty much universal law in the realm of well-designed electronics; however, for smaller distances, it is not directly related to a fundamental restriction imposed by the speed of light. Indeed, for light to go across mere 0.5mm and back would take mere 3e-12 seconds; and in practice, even an R-R operation takes like 3e-10 seconds, two orders of magnitude more. Rather, at least for the CPU core, this relation between distance and speed is a manifestation of an observation that characteristic times of electronic signals are restricted by so-called parasitic capacitances, and parasitic capacitances in general are proportional to the length of the connection. CoreNow, to the details on Fig.1. Usually, when we’re writing something like a += b;, it is a register-register (R-R) operation, so the data has to go from Registers to ALU (or SIMD) and back.
However, all our 64-bit CPUs are actually pipelined, so that by the time the ALU needs this information, it is often already at the ALU entrance, so in most cases we can avoid the latency penalty of going from registers to ALU and back. Simple operations such as additions/subtractions and bitwise ops, take 1 CPU cycle; multiplication takes around 3-6 cycles, and division – up to 20 cycles (note that compared to [Fog], where 64-bit division was taking up to 100+ CPU cycles, during few last years division times have improved significantly, see a table in Cold Hard Numbers section below).Fig. 1L1D/L1IL1 cache is commonly split into L1 Data cache (L1D) and L1 Instruction cache (L1I). One interesting thing about ALU which can be seen on the diagram is that there is usually more than one ALU (and more than one SIMD unit too); this indicates that our CPU is a superscalar one, and can process more than one ALU-level operation at once 1 This, in turn, means that it is perfectly possible to have more than one operation completed at the same CPU cycle; this can lead to some counter-intuitive observations that in micro-benchmarking some operations can take some non-integer number of cycles (such as 0.75); it doesn’t mean that some operation literally took less than one cycle; it rather means that statistically CPU has managed to arrange things so that on average it performs 4 operations within 3 cycles. More generally, modern superscalar CPUs can have so-called Retired Instructions Per Cycle (RIPC) all the way up to 10-12, though in RL achieving even 4 is quite a challenge and is usually limited to ALU-bound algorithms (opposed to RAM-bound ones).If we need to access some information outside of the CPU Registers – CPU has to go to the memory, but as the memory is cached, it goes to the closest one L1 (actually – L1D) cache first; reading from L1D cache usually takes like 3 CPU cycles these days.
Note though that writing is “almost instant"2 from the point of view of CPU; indeed, all CPU has to do is to issue an instruction to L1D “just write it”, and it doesn’t really need to wait for anything here. Going to L2 (if L1D is missed) is significantly more expensive than to L1, usually at 10-15 cycles. Branch Mispredictions and [[likely]]/[[unlikely]]Let’s keep in mind that while there is a lot of stuff in the “instruction” part of the core, we at our app-level don’t need to care too much about it; all of it is near-perfectly optimized, and unless branching is involved, modern compilers will ensure that our code works without delaying things – i.e. by the time our data comes in, all the instructions are ready to execute, and vast majority of the delays and stalls are data-related and not instructions-related. One notable exception, however, is related to branching. Whenever CPU reaches a point of some branching instruction (such as JZ/BEQ or JNC/BLO instructions in x64/ARM) – logically, it cannot really go ahead until it calculates the condition (which means the need to get the data, which can be rather long). To avoid this stalling, modern CPUs (AFAWU, all of them) speculate about the instruction, and make a guess what the result will be; after making such a guess – CPU will assume that the guess stands and will proceed with this assumption in mind. Then, if the result will finally match this guess, everything is fine, and if not, the CPU will discard everything calculated based on the guess, and will restart processing right after the branching instruction. Of course, such discarding is relatively expensive; this is known as the cost of branch misprediction (which takes on the order of 15-25 cycles [Fog04, section 7.12]).
The interesting part we as devs can deal with, is about “how to affect choosing which branch is more likely”, which in turn affects number of branch mispredictions (if 99% of the time the ‘right’ branch is chosen, we’re avoiding branch mispredictions penalties almost entirely; note, however, is that whenever branch has a 50-50 chance, we cannot possibly do much about it – whatever we or dynamic branch predictor do, it will mispredict 50% of the time). Here, the situation becomes quite complicated. At the instruction level, there are some pretty well-known heuristics; for example, if JNZ goes back (points to an instruction before itself) – it is considered a likely loop, and loops are considered to go back more likely than not, so CPU hardware will consider the outcome that goes back as a more likely thing; similar heuristics exist for other cases too. From a C++ perspective, this can be affected by [[likely]]/[[unlikely]] attributes.3modern CPUs use so-called “dynamic branch prediction”But this is not the end of the story; to make things even more elaborated, modern CPUs use so-called “dynamic branch prediction”; it means that for frequently-used branch operations, CPU stores runtime statistics on “how often this branch was chosen in the past”, and makes branch predictions accordingly. Given this, efficiency of [[likely]]/[[unlikely]] attributes is not as high as it might have seemed. Sure, these attributes are still working when a totally new branch instruction (the one for which there is no statistics yet) is encountered, but this is by definition fairly rare. In addition, developers are notoriously bad in guesstimating what is likely and what is not; overall, we would not recommend using [[likely]]/[[unlikely]] attributes, unless you are at least 200% sure that the branch is indeed very unlikely. Examples of “when it does make sense” include error handling (hey, errors are expected to be rare), and outright edge cases such as going into complicated handling of huge values of x within sin(x), or into big-int special case when parsing double within strtod().On TLBsOne thing which didn’t make it to our diagram of CPU core, is TLB (Translation Lookaside Buffer), which is usually divided into ITLB for instructions and DTLB for data.
TLB comes into play whenever our CPU has an MMU, and is all about translating virtual address into a physical one. These translations occur on each and every memory access, so obviously they are extremely performance-critical. In a sense, TLB is a cache (these days – usually a 2-level one) which effectively sits on top of L14 so that translation information can be read right within core without even going to L1. for a de-pessimized C++ program relying mostly on vectors, we expect effects related to DTLB misses to be observably smaller to that of the programs written in C-style and relying on node-based data structuresWhile in theory TLB misses can be quite expensive (in the extreme case, page-table load can go all the way to main memory, ouch) – in practice, we’ve never encountered TLB to be an issue for application level.5 BTW, pressure on DTLB is minimized by using linear data structures (such as vectors, see Hint [TODL: hint # re using vectors by default]) opposed to classical C-style node-based data structures such as linked lists or trees; as a result, for a de-pessimized C++ program relying mostly on vectors, we expect effects related to DTLB misses to be observably smaller to that of the programs written in C-style and relying on node-based data structures (and we didn’t see concerns related to long-running programs articulated in [Drepper, section 6.2.4] to materialize in RL – that is, for long-running vector-oriented C++ programs6). OTOH, [Bakhvalov, section 8.4] speaks about DTLB misses being a problem in case of apps using enormous amounts of RAM (think databases and JVM); also it was noted in [JiaEtAl] that TLB costs grow disproportionally in virtualized environments. In any case, whenever TLB costs become significant, it usually can be handled fairly easily by switching to so-called huge CPU-supported pages (see e.g. [JiaEtAl] and [Bakhvalov, section 8.4]), phew.As for ITLB misses, they’re discussed in [Bakhvalov, section 11.8] and also tend to apply only to seriously large apps (think Clang) and are also handled by huge pages.