Skip to content
HN On Hacker News ↗

Notes from Optimizing CPU-Bound Go Hot Paths

▲ 34 points 5 comments by nnx 2w ago HN discussion ↗

Pangram verdict · v3.3

We believe that this document is primarily human-written, with some AI-generated content detected

12 %

AI likelihood · overall

Mixed
90% human-written 10% AI-generated
SEGMENTS · HUMAN 8 of 8
SEGMENTS · AI 0 of 8
WORD COUNT 1,654
PEAK AI % 2% · §8
Analyzed
May 14
backend: pangram/v3.3
Segments scanned
8 windows
avg 207 words each
Distribution
90 / 10%
human / AI fraction
Verdict
Mixed
Pangram v3.3

Article text · 1,654 words · 8 segments analyzed

Human AI-generated
§1 Human · 0%

Andrii's Blog May 03, 2026Go does a lot of things right. And I love go because of that. But while porting Brotli to pure Go for go-brrr, I kept hitting the same pattern: idiomatic abstractions made hot paths slower, and the fastest version was often hand-duplicated and specialized.Lack of zero-cost abstractionsIn the hot loops I was optimizing, generics, polymorphic dispatch (via interface), and closures often prevented the compiler from producing the same code as the concrete version. The reason is that go doesn't inline these calls in the shapes I was using (we will see problems with inlining very often in this post because inlining is quite important). Yes, the compiler can sometimes inline a direct closure call or devirtualize an interface call, but in the patterns I actually ran into it didn't, and I ate the call overhead. It's clear why interface calls are not inlined. They enable swapping of implementation at runtime rather than compile time. But generics allow to swapping implementation at compile time. If you are coming from languages like c++ or rust you'd expect the generic functions to be monomorphized (all variants pre-generated as concrete functions at compile time) but in go it doesn't happen,at least not in that form. Go uses approach they called GC Shape Stenciling where some parts are pre-generated at compile time but method calls on type parameters end up going through interface-style dispatch (technically the itab is reached via a generics dictionary rather than an ordinary interface argument, but the effect on the hot path is the same). The impossibility of inlining is addressed in the proposal: The one exception is that method calls won't be fully resolvable at compile time... inlining won't happen in situations where it could happen with a fully stenciled implementation. So what do we do? Actually, no problem. We just don't use abstractions like generics and duplicate concrete functions. We just take a concrete function duplicate it completely with parts that we wanted to parametrize changed. Needless to say, this will cause lots of duplication. In the Brotli port there were 16 almost identical functions and the only difference between them was that they were calling different versions of the hash function. The 16 variants couldn't be collapsed into one via usage of abstractions because the function is used in hot path.

§2 Human · 1%

So performance problem is solved by duplication but this introduces a potentially big problem of maintenance. This can be somewhat mitigated by code generation of course but it's very likely that you will have big number of occurrences where you have only 2-3 duplicating variants which will not justify introduction of codegen.The next section is a deep dive into benchmarking of the concrete-vs-generic-vs-interface approaches with some exploration of underlying assembly which you can happily skip.Deep diveLet's illustrate all described above by example. Here is the function I used in a real codebase reduced from unimportant stuff.func StoreConcrete(t *Table, data []byte) { end := uint32(len(data)) for i := uint32(0); i+4 <= end; i++ { v := binary.LittleEndian.Uint32(data[i:]) key := (v * HashMul32) >> (32 - BucketBits) minor := uint32(t.Num[key]) & (BlockSize - 1) t.Buckets[minor+key<<BlockBits] = i t.Num[key]++ } } Now imagine that we need to use several versions of hash functions. Which hash function to use is known at compile time. There are several options how we can parametrize.One option is to use generics:type Hasher interface { Hash(v uint32) uint32 } type H5Hasher struct{} func (H5Hasher) Hash(v uint32) uint32 { return (v * HashMul32) >> (32 - BucketBits) } func StoreGeneric[H Hasher](t *Table, data []byte) { // ... key := h.Hash(v) // ... } Another option is to use polymorphic dispatch:func StoreInterface(t *Table, data []byte, h Hasher) { ... key := h.Hash(v) ... } Another option is to pass a closure to the function:func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) { // ... key := hash(v) // ... } Expand to see full code import "encoding/binary" const HashMul32 = 0x1E35A7BD const ( BucketBits = 14 BlockBits = 4 BucketSize = 1 << BucketBits

§3 Human · 1%

BlockSize = 1 << BlockBits ) type Table struct { Num [BucketSize]uint16 Buckets [BucketSize * BlockSize]uint32 } func StoreConcrete(t *Table, data []byte) { end := uint32(len(data)) for i := uint32(0); i+4 <= end; i++ { v := binary.LittleEndian.Uint32(data[i:]) key := (v * HashMul32) >> (32 - BucketBits) minor := uint32(t.Num[key]) & (BlockSize - 1) t.Buckets[minor+key<<BlockBits] = i t.Num[key]++ } } type Hasher interface { Hash(v uint32) uint32 } type H5Hasher struct{} func (H5Hasher) Hash(v uint32) uint32 { return (v * HashMul32) >> (32 - BucketBits) } func StoreGeneric[H Hasher](t *Table, data []byte) { var h H end := uint32(len(data)) for i := uint32(0); i+4 <= end; i++ { v := binary.LittleEndian.Uint32(data[i:]) key := h.Hash(v) minor := uint32(t.Num[key]) & (BlockSize - 1) t.Buckets[minor+key<<BlockBits] = i t.Num[key]++ } } func StoreInterface(t *Table, data []byte, h Hasher) { end := uint32(len(data)) for i := uint32(0); i+4 <= end; i++ { v := binary.LittleEndian.Uint32(data[i:]) key := h.Hash(v) minor := uint32(t.Num[key]) & (BlockSize - 1) t.Buckets[minor+key<<BlockBits] = i t.Num[key]++ } } func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) { end := uint32(len(data)) for i := uint32(0); i+4 <= end; i++ { v := binary.LittleEndian.Uint32(data[i:]) key := hash(v) minor := uint32(t.

§4 Human · 1%

Num[key]) & (BlockSize - 1) t.Buckets[minor+key<<BlockBits] = i t.Num[key]++ } } As it's known at compile time what hash function is used the compiler can produce optimal code, right? Wrong!Let's benchmark it first.Environment:go version go1.26.2-X:nodwarf5 linux/amd64 goos: linux goarch: amd64 pkg: hashdemo cpu: 12th Gen Intel(R) Core(TM) i5-12500 Run with:go test -bench=. -benchmem -count 6 -cpu 1 | tee bench.txt benchstat -filter '.unit:B/s' -col .name bench.txt The throughput numbers in the table below are the benchstat-reported values across 6 runs. Expand to see the full benchmarking code import "testing" const benchSize = 1 << 16 func BenchmarkConcrete(b *testing.B) { data := makeData(benchSize) t := new(Table) b.SetBytes(int64(len(data))) for b.Loop() { StoreConcrete(t, data) } } func BenchmarkGeneric(b *testing.B) { data := makeData(benchSize) t := new(Table) b.SetBytes(int64(len(data))) for b.Loop() { StoreGeneric[H5Hasher](t, data) } } func BenchmarkInterface(b *testing.B) { data := makeData(benchSize) t := new(Table) var h Hasher = H5Hasher{} b.SetBytes(int64(len(data))) for b.Loop() { StoreInterface(t, data, h) } } func BenchmarkClosure(b *testing.B) { data := makeData(benchSize) t := new(Table) hash := func(v uint32) uint32 { return (v * HashMul32) >> (32 - BucketBits) } b.SetBytes(int64(len(data))) for b.Loop() { StoreClosure(t, data, hash) } } func makeData(n int) []byte { b := make([]byte,

§5 Human · 1%

n) x := uint32(0xDEADBEEF) for i := range b { x = x*1664525 + 1013904223 b[i] = byte(x >> 24) } return b } VariantThroughputΔ vs Concrete Concrete378.0 MiB/s Generic320.6 MiB/s-15.18% Closure322.0 MiB/s-14.82% Interface274.3 MiB/s-27.44% Whoa! That's pretty dramatic difference. Assembly related to the Concrete function // func StoreConcrete(t *Table, data []byte) { PUSHQ BP // 0x52e4a0 MOVQ SP, BP MOVQ BX, 0x18(SP) // for i := uint32(0); i+4 <= end; i++ { XORL DX, DX JMP 0x52e4ed // minor := uint32(t.Num[key]) & (BlockSize - 1) TESTB AL, 0(AX) // 0x52e4ad // return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24 MOVL 0(BX)(DX*1), R9 // key := (v * HashMul32) >> (32 - BucketBits) IMULL $0x1e35a7bd, R9, R9 SHRL $0x12, R9 // minor := uint32(t.Num[key]) & (BlockSize - 1) MOVZX 0(AX)(R9*2), R10 ANDL $0xf, R10 // t.Buckets[minor+key<<BlockBits] = i MOVL R9, R11 SHLL $0x4, R9 ADDL R10, R9 MOVL R8, 0x8000(AX)(R9*4) // t.Num[key]++

§6 Human · 2%

MOVZX 0(AX)(R11*2), R9 INCL R9 MOVW R9, 0(AX)(R11*2) // for i := uint32(0); i+4 <= end; i++ { LEAL 0x1(R8), DX MOVQ SI, CX LEAL 0x4(DX), SI // 0x52e4ed CMPL CX, SI JB 0x52e514 // v := binary.LittleEndian.Uint32(data[i:]) CMPQ CX, DX JB 0x52e51b MOVQ CX, SI SUBQ DX, CX MOVL DX, R8 SUBQ DI, DX SARQ $0x3f, DX ANDQ R8, DX // _ = b[3] // bounds check hint to compiler; see golang.org/issue/14808 CMPQ CX, $0x3 JA 0x52e4ad JMP 0x52e516 // } POPQ BP // 0x52e514 RET // _ = b[3] // bounds check hint to compiler; see golang.org/issue/14808 CALL runtime.panicBounds(SB) // 0x52e516 // v := binary.LittleEndian.Uint32(data[i:]) NOPL 0(AX)(AX*1) // 0x52e51b CALL runtime.panicBounds(SB) NOPL // 0x52e525 Assembly related to the Generic function //TEXT hashdemo.H5Hasher.Hash(SB) /data/devel/my/blog-demo/hashdemo/hash.go // return (v * HashMul32) >> (32 - BucketBits) IMULL $0x1e35a7bd, AX, AX SHRL $0x12, AX RET //TEXT hashdemo.

§7 Human · 2%

StoreGeneric[go.shape.struct {}](SB) /data/devel/my/blog-demo/hashdemo/hash.go //func StoreGeneric[H Hasher](t *Table, data []byte) { CMPQ SP, 0x10(R14) JBE 0x52f0f0 PUSHQ BP MOVQ SP, BP SUBQ $0x10, SP // for i := uint32(0); i+4 <= end; i++ { MOVQ AX, 0x20(SP) MOVQ BX, 0x28(SP) MOVQ CX, 0x30(SP) MOVQ DI, 0x38(SP) MOVQ SI, 0x40(SP) XORL DX, DX JMP 0x52f072 // minor := uint32(t.Num[key]) & (BlockSize - 1) MOVZX 0(CX)(BX*2), R8 // 0x52f02f ANDL $0xf, R8 // t.Buckets[minor+key<<BlockBits] = i SHLL $0x4, AX ADDL AX, R8 MOVL 0xc(SP), DX MOVL DX, 0x8000(CX)(R8*4) // t.Num[key]++ MOVZX 0(CX)(BX*2), R8 INCL R8 MOVW R8, 0(CX)(BX*2) // for i := uint32(0); i+4 <= end; i++ { INCL DX // key := h.Hash(v) MOVQ 0x20(SP), AX // return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24 MOVQ 0x30(SP), CX // minor := uint32(t.Num[key]) & (BlockSize - 1) MOVQ 0x28(SP), BX // v := binary.

§8 Human · 2%

LittleEndian.Uint32(data[i:]) MOVQ 0x40(SP), SI // for i := uint32(0); i+4 <= end; i++ { MOVQ 0x38(SP), DI LEAL 0x4(DX), R8 // 0x52f072 CMPL DI, R8 JB 0x52f0cf NOPL 0(AX)(AX*1) // v := binary.LittleEndian.Uint32(data[i:]) CMPQ DI, DX JB 0x52f0ea SUBQ DX, DI MOVL DX, R9 SUBQ SI, DX SARQ $0x3f, DX ANDQ R9, DX // _ = b[3] // bounds check hint to compiler; see golang.org/issue/14808 CMPQ DI, $0x3 JBE 0x52f0e5 // for i := uint32(0); i+4 <= end; i++ { MOVL R9, 0xc(SP) // key := h.Hash(v) MOVQ 0(AX), BX // return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24 MOVL 0(CX)(DX*1), CX // key := h.Hash(v) MOVQ AX, DX MOVL CX, AX CALL BX // minor := uint32(t.Num[key]) & (BlockSize - 1) MOVQ 0x28(SP), CX TESTB AL, 0(CX) MOVL AX, BX NOPW 0(AX)(AX*1) NOPL CMPQ BX, $0x4000 JB 0x52f02f JMP 0x52f0d5 //} ADDQ $0x10, SP // 0x52f0cf POPQ BP RET // minor := uint32(t.