Remote Cache CDC: Reusing Bytes | BuildBuddy

buildbuddy.io

▲ 63 points • 10 comments • by siggi • 2mo ago • HN discussion ↗


The goal: move the changed bytes, not the whole output.

BuildBuddy's Remote Cache uses Content-Defined Chunking (CDC) to make large build outputs behave more incrementally. When a binary, bundle, package, or archive is mostly unchanged, BuildBuddy can reuse chunks it has already seen instead of re-uploading or re-downloading the entire file.

In our Bazel chunking implementation PR, we observed 40% less data uploaded and a 40% smaller disk cache when benchmarked on BuildBuddy's own repo. To enable client-side CDC with BuildBuddy, use Bazel 8.7 or 9.1+ and pass --experimental_remote_cache_chunking.

Setting the Scene

The next frontier for build caching is not just skipping actions. It is skipping bytes.

Build caching has come a long way. Instead of rebuilding the world after every edit, Bazel and remote caching let teams reuse action outputs across machines and CI jobs. In practice, builds have moved from something closer to O(size of repo) toward O(size of change).

But "size of change" can be misleading. What really matters is the size of the transitive actions affected by the edit. A small source change can still ripple into many binaries, packages, bundles, and other large outputs, even when only a small part of each output actually changes.

That invalidation is expected. Build systems should rerun an action when its inputs change. The remote-cache problem is what happens next: the cache sees a new digest and moves the whole blob, even if that blob is mostly the same bytes as the previous version.

Transitive Actions

Linking, bundling, packaging, and archiving are where this shows up most often. They combine many transitive inputs into one output.

That makes them different from actions that operate on a small, direct set of files. A typical compile action might compile one source file using a smaller set of direct inputs. A transitive action, on the other hand, often consumes the accumulated outputs of many dependencies and produces one final binary, bundle, package, or archive.

In Bazel rules, this often shows up as a rule collecting files through a transitive depset and passing that accumulated set into a single action.


For example, a simplified compile action might look like this:

    # Compile one source file; the inputs are the source plus its direct headers.
    ctx.actions.run(
        inputs = [src] + direct_headers,
        outputs = [obj],
        executable = compiler,
        arguments = ["-c", src.path, "-o", obj.path],
    )

A bundling or packaging action often looks more like this:

    # Collect this rule's own files plus the files of every dependency.
    transitive_inputs = depset(
        direct = direct_files,
        transitive = [dep[MyInfo].files for dep in ctx.attr.deps],
    )

    # A single action consumes the whole accumulated set.
    ctx.actions.run(
        inputs = transitive_inputs,
        outputs = [bundle],
        executable = bundler,
        arguments = ["--output", bundle.path],
    )

That second shape is where small source changes can fan out into large output changes. The source edit might only change a small sequence of bytes in the final output, but the output digest is still new.

Without CDC, the cache treats that as a completely new blob, even when most of the binary, bundle, package, or archive is byte-for-byte identical to the previous version. If many final outputs depend on that changed input, they can all get new digests.

For remote caching, the expensive part is not just that the output is large. It is that the output is large and mostly similar to something the cache already has, but the whole-blob digest is new.

That creates two problems:

- Uploads and downloads move the whole blob, even when only a small part changed.
- Storage keeps another whole blob, even when most bytes are duplicates.

One workaround is to disable remote caching for these actions. That avoids uploading huge outputs when the expected cache hit is not worth the write cost, but it creates a different problem: the action now has to run every time. It can also make the action harder to move to remote execution, because RBE depends on moving action inputs and outputs efficiently.


So the build avoids one expensive cache write, but gives up reuse entirely.

A small source change can invalidate the final transitive action.

Case study: Go tests

A common example is a shared go_library, say foo, that is imported by many other libraries: bar1, bar2, through barN. Each bar library may also have its own go_test.

An implementation-only change in foo might only rebuild foo's own GoCompilePkg action. The downstream compile actions can often still hit cache because Go compilation depends on direct dependency export data, like foo.x, not the full transitive archive graph.

Linking is different. Each go_test needs a test binary, produced by a GoLink action, and that link action consumes the transitive set of Go archives, like foo.a. If foo.a changes, many downstream test binaries can get new digests even when their source and compile actions did not change. Finally, the TestRunner action needs that test binary as an input in order to run it.

That means one small source edit can create many new test binary digests. Those test binaries are often large, and many of them are mostly the same bytes as before. Without CDC, each one is still transferred and stored as a new whole blob.

Treating This as an Output Problem

One option would be to make the actions themselves incremental: incremental linking, runtime linking, smarter bundling, smarter packaging, and so on. But this is usually very difficult, and requires extensive changes to the linkers and tools themselves.

And even if we solved that for one tool, we would still need separate solutions for GoLink, C++ linkers, JavaScript bundlers, app packagers, generated archives, and every other action that can produce a large output. That does not scale.

Instead, we can treat this as a generic output problem: these actions create large files, where only a small amount of content is changing. With Content-Defined Chunking (CDC), we can leave the actions themselves untouched, while still getting many of the wins of making those actions incremental.

Content-Defined Chunking

CDC is a repeatable process for splitting a file into chunks based on its contents rather than fixed byte offsets.

The TL;DR is: run a rolling hash over a small window of bytes, and split when the hash matches a rare pattern. The hash behaves randomly enough that this happens only occasionally, but the process is still deterministic: the same content produces the same chunk boundaries.
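
The rolling-hash idea can be sketched in a few lines of Python. This is a toy illustration, not BuildBuddy's or Bazel's chunker: the polynomial hash, the 16-byte window, the 6-bit split pattern, and the names chunk and digests are all assumptions picked so the example stays small. A production chunker would typically use a much rarer pattern and also enforce minimum and maximum chunk sizes.

```python
import hashlib
import random

WINDOW = 16           # bytes covered by the rolling hash (toy value)
MASK = (1 << 6) - 1   # cut when the low 6 bits are zero: about 1 in 64 positions
BASE = 257            # polynomial base (arbitrary choice for this sketch)
MOD = (1 << 61) - 1   # large prime modulus so the hash bits stay well mixed


def chunk(data: bytes) -> list[bytes]:
    """Split data wherever the hash of the last WINDOW bytes matches the pattern."""
    chunks, start, h = [], 0, 0
    drop = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * drop) % MOD  # slide: remove the oldest byte
        h = (h * BASE + b) % MOD                     # slide: add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])         # cut right after this byte
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks


def digests(chunks: list[bytes]) -> set[str]:
    """Content address of every chunk."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


# Deterministic pseudo-random "file", then the same file with two bytes inserted.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
updated = original[:5_000] + b"XX" + original[5_000:]

old, new = chunk(original), chunk(updated)
changed = digests(new) - digests(old)
print(f"{len(new)} chunks after the edit, {len(changed)} of them new")
```

Because the hash depends only on the last WINDOW bytes, the two inserted bytes can disturb at most WINDOW + 2 window positions. Every cut point past that range lands on the same content as before, so the later chunks keep their digests even though their byte offsets have shifted.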


If you want chunks around 512 KiB on average, choose a pattern that has about a 1 in 524,288 (512 × 1024) chance of matching at each byte. If the pattern does not match, shift the window and try again. Over time, this gives you the average chunk size you wanted while keeping the boundaries content-defined.

Smaller chunks improve deduplication but increase metadata overhead and RPC cost, so CDC implementations balance chunk size against efficiency.

For a toy example, imagine the rolling window is 4 bytes wide and we split whenever the hash of that 4-byte window ends in 00. Suppose the windows bbbb and cccc both happen to match that pattern (the exact hash values do not matter):

    original: aaaabbbbccccdddd
    windows:       bbbb cccc
    cuts:     aaaa|bbbb|cccc|dddd

If we insert a few bytes inside bbbb, the nearby windows change, so that chunk changes:

    updated:  aaaabbXXbbccccdddd

But once the rolling window moves past the inserted bytes and reaches cccc again, it sees the same 4-byte sequence as before. That sequence produces the same hash, so the algorithm finds the same cut point again. The later chunks can keep the same boundaries and hashes.

Real CDC uses a larger rolling window and a much rarer split pattern, but the idea is the same. A few bytes added or removed somewhere in a large file usually change only the nearby chunk(s); every chunk after the next unchanged cut point is found again with the same contents.

One common CDC algorithm is FastCDC. The FastCDC presentation slides are also a helpful visual overview.

Only the changed chunk needs to be uploaded again.

How does this benefit remote caching?

If an action creates a large output, like GoLink or CppLink, a small input change may still produce a new output that is mostly identical to the previous one.

With CDC, the cache can split that output into chunks and discover that many of them already exist. Instead of uploading the whole output again, it uploads only the missing chunks.

This works especially well for CI and developer builds, where nearby commits often produce outputs that are mostly similar.
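
The "upload only the missing chunks" flow can be sketched with an in-memory, content-addressed store. Everything here is hypothetical and for illustration only: ChunkStore, find_missing, upload, and download are made-up names, not BuildBuddy's API and not the SplitBlob / SpliceBlob protocol. The splitter recomputes a CRC32 over each 8-byte window instead of rolling a hash, which is slower but shorter to read.

```python
import hashlib
import random
import zlib

WINDOW = 8    # bytes hashed at each position (toy value)
MASK = 0xFF   # cut when the low 8 bits are zero: about 1 in 256 positions


def split(data: bytes) -> list[bytes]:
    """Toy content-defined splitter: cut after any window whose CRC32 matches."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """Hypothetical cache that stores chunks keyed by their SHA-256 digest."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def find_missing(self, digests: list[str]) -> set[str]:
        return {d for d in digests if d not in self.chunks}

    def put(self, digest: str, chunk: bytes) -> None:
        self.chunks[digest] = chunk


def upload(store: ChunkStore, blob: bytes) -> tuple[list[str], int]:
    """Send only chunks the store lacks; return the chunk list and bytes sent."""
    chunks = split(blob)
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]
    missing = store.find_missing(digests)
    sent = 0
    for digest, chunk in zip(digests, chunks):
        if digest in missing:
            store.put(digest, chunk)
            missing.discard(digest)  # a chunk repeated within one blob is sent once
            sent += len(chunk)
    return digests, sent


def download(store: ChunkStore, digests: list[str]) -> bytes:
    """Reassemble a blob from its ordered chunk digests."""
    return b"".join(store.chunks[d] for d in digests)


# Two "builds" of the same output: the second has a small patch in the middle.
rng = random.Random(7)
v1 = bytes(rng.getrandbits(8) for _ in range(100_000))
v2 = v1[:50_000] + b"patched" + v1[50_000:]

store = ChunkStore()
_, sent_v1 = upload(store, v1)
manifest_v2, sent_v2 = upload(store, v2)
print(f"first upload: {sent_v1} bytes, second upload: {sent_v2} bytes")
```

In this sketch the second upload sends only the chunks around the patch, and the ordered digest list is enough to rebuild the full blob from chunks the store already holds.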


Once a chunk has been uploaded, future builds can reuse it across related outputs.

Most of the output can still map to chunks that already exist in the cache.

Results

In this recent window, CDC deduplicated about 85% of written bytes across eligible BuildBuddy cache writes. In other words, most large-output writes were already present as reusable chunks, so only the remaining changed chunks needed to be uploaded.

Over this two-week window, CDC skipped uploading ~300 TiB of duplicate chunk data on the write path, with peaks over 4 TiB per hour. This comes from write-side chunk deduplication across BuildBuddy-managed cache writes and executor output uploads. Total network savings should be higher, since this does not include read-side savings when chunks are served from disk caches, regional caches, or executor file caches.

In production, CDC has already skipped hundreds of TiB of duplicate chunk uploads. Because BuildBuddy stores less duplicate data, effective cache retention has also improved.

The Bazel implementation PR benchmarked 50 commits of the BuildBuddy repo and saw about 40% less data uploaded, about 40% smaller disk cache, and faster builds in that benchmark.

BuildBuddy currently applies chunking to blobs larger than 2 MiB. In one test, only about 4.2% of objects were above that threshold, so most blobs are not chunked.

Within that eligible subset, CDC deduplicated about 85% of written bytes. Across all cache traffic, overall savings are typically in the 20 to 40% range.

As a rule of thumb, CDC works best for outputs that are large and byte-stable across revisions. Linking and packaging tend to be good fits, and most large outputs we see reuse most of their bytes. Bundling is also a good fit when the output is not compressed, obfuscated, or randomized.

Compression is not terrible, but it usually causes more churn. Compressed formats like tar.gz archives and Docker image layers are often less chunkable because a small input change can rewrite more of the compressed byte stream. The key property is byte-level similarity, not the file extension.

Implementation

To make this work end to end, the change lands in three places:

- Remote APIs define the shared SplitBlob / SpliceBlob protocol so clients and caches can talk about chunks.