Corrupting a ZFS File on Purpose

Source: oshogbo.com

Most of the time, the whole point of ZFS is that your data does not get corrupted. But during development you sometimes need the opposite: a controlled, reproducible corruption, so you can watch self-healing kick in, see what a scrub reports, or just understand how a file maps onto the physical disk. There is no better exercise than breaking one byte on purpose and watching ZFS struggle.

The safe rule is simple: do this only on throwaway pools backed by throwaway files. Pointing these commands at a real disk would be less of a lesson and more of a confession.

This is the story of doing exactly that on Linux, the lazy way and the educational way.

The lazy way

If you just want a corrupted file and you do not care how it happened, ZFS has a tool for that. After you create a file on a ZFS filesystem, zinject can make its data blocks come back with a checksum error:

# zinject -t data -e checksum -a /tmp/zfs-blog-flow/single-mnt/file.bin
Added handler 1 with the following properties:
  pool: zblog1
  objset: 54
  object: 3
  type: 0
  level: 0
  range: all
  dvas: 0x0

You can list the active handlers:

# zinject
 ID  POOL             OBJSET  OBJECT  TYPE      LVL  DVAs  RANGE
---  ---------------  ------  ------  --------  ---  ----  ---------------
  1  zblog1           54      3       -           0  0x00  all

And clear them again when you are done:

# zinject -c all
removed all registered handlers

# zinject
No handlers registered. Run 'zinject -h' for usage information.

That is it. zinject injects simulated corruption into a live pool. It is a great tool, heavily used in the ZFS test suite.

It is also completely unsatisfying if what you actually want is to understand where the bytes live. For that, we have to do it by hand.

A pool made of files

I do not want to corrupt a real disk. Not for moral reasons. I just don't have one lying around.

Yes, I could use a VM with a virtual drive, but plain files are simply easier for demonstrating the idea. So the first step is to build pools out of plain files under /tmp/zfs-blog-flow. Every "disk" is then a file I can open with dd and a hex editor, which is the entire trick.

$ mkdir -p /tmp/zfs-blog-flow/single-mnt /tmp/zfs-blog-flow/raidz-mnt
$ cd /tmp/zfs-blog-flow
$ truncate -s 512M single.img
$ truncate -s 512M r1.img
$ truncate -s 512M r2.img
$ truncate -s 512M r3.img
$ truncate -s 512M r4.img

From here on I work from inside /tmp/zfs-blog-flow, so the backing files are just single.img, r1.img, and so on.
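As a small aside, the backing files can also be created in a loop. The sketch below does that in a scratch directory from mktemp, so it does not touch /tmp/zfs-blog-flow. The function name make_images is made up for this illustration. It assumes GNU truncate and stat, as on a typical Linux system. On most Linux filesystems truncate produces sparse files, so the reported block count is usually far below 512 MiB worth of sectors, but that depends on the filesystem underneath.

```shell
# Hypothetical helper: create the five 512 MiB backing files in directory $1.
make_images() {
    for name in single r1 r2 r3 r4; do
        truncate -s 512M "$1/$name.img"
    done
}

# Try it in a throwaway directory, print apparent size and allocated sectors.
demo_dir=$(mktemp -d)
make_images "$demo_dir"
stat -c '%n size=%s blocks512=%b' "$demo_dir"/*.img
rm -r "$demo_dir"
```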

I will build two pools, because they fail in different ways. First a single-vdev pool, with no redundancy at all:

# zpool create -f -O atime=off \
    -O mountpoint=/tmp/zfs-blog-flow/single-mnt \
    zblog1 /tmp/zfs-blog-flow/single.img

And then a four-file RAIDZ2 pool, with parity:

# zpool create -f -O atime=off \
    -O mountpoint=/tmp/zfs-blog-flow/raidz-mnt \
    zblogR raidz2 \
    /tmp/zfs-blog-flow/r1.img \
    /tmp/zfs-blog-flow/r2.img \
    /tmp/zfs-blog-flow/r3.img \
    /tmp/zfs-blog-flow/r4.img

Both come up online:

# zpool status zblog1 zblogR
  pool: zblog1
 state: ONLINE
config:

        NAME                             STATE     READ WRITE CKSUM
        zblog1                           ONLINE       0     0     0
          /tmp/zfs-blog-flow/single.img  ONLINE       0     0     0

errors: No known data errors

  pool: zblogR
 state: ONLINE
config:

        NAME                           STATE     READ WRITE CKSUM
        zblogR                         ONLINE       0     0     0
          raidz2-0                     ONLINE       0     0     0
            /tmp/zfs-blog-flow/r1.img  ONLINE       0     0     0
            /tmp/zfs-blog-flow/r2.img  ONLINE       0     0     0
            /tmp/zfs-blog-flow/r3.img  ONLINE       0     0     0
            /tmp/zfs-blog-flow/r4.img  ONLINE       0     0     0

errors: No known data errors

One habit is worth remembering, because zpool import does not look for file-backed pools by default. To let an import find a pool sitting in a plain directory, point it at that directory:

$ zpool import -d .

We will need that later, when the pool is exported and we are corrupting its backing files behind its back.

Following one file down to the hardware

Start with the single-vdev pool. Write a file with an easy-to-recognize pattern, and let us go find it:

$ yes 'SINGLE-ZFS-CORRUPTION-DEMO-BLOCK' | head -c 1M > single-mnt/file.bin
$ sync

Get its inode, size, and block usage:

$ stat -c 'path=%n inode=%i size=%s blocks512=%b' single-mnt/file.bin
path=single-mnt/file.bin inode=2 size=1048576 blocks512=21

A 1 MiB file, 21 sectors on disk. Noted.

On ZFS the inode number that stat reports is also the object number inside the dataset. Now hand that object number to zdb and ask it to describe the object in detail:

# zdb -ddddd zblog1/ 2
Dataset zblog1 [ZPL], ID 54, cr_txg 1, 34K, 7 objects, rootbp DVA[0]=<0:1d400:200> [L0 DMU objset] fletcher4 lz4 ... size=1000L/200P birth=14L/14P fill=7 cksum=...

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K    10K     512     1M  100.00  ZFS plain file
                                               176   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 7
        path    /file.bin
        uid     0
        gid     0
        atime   Thu Jun  4 23:46:26 2026
        mtime   Thu Jun  4 23:46:34 2026
        ctime   Thu Jun  4 23:46:34 2026
        crtime  Thu Jun  4 23:46:26 2026
        gen     12
        mode    100664
        size    1048576
        parent  34
        links   1
        pflags  840800000004
        projid  0
Indirect blocks:
               0 L1  0:1ba00:400 20000L/400P F=8 B=14/14 cksum=...
               0  L0 0:19a00:400 20000L/400P F=1 B=14/14 cksum=...
           20000  L0 0:19e00:400 20000L/400P F=1 B=14/14 cksum=...
           40000  L0 0:1a200:400 20000L/400P F=1 B=14/14 cksum=...

A warning that cost me some confusion: that argument is a dataset, not a pool. If you pass just the pool name - zblog1 - you are inspecting the object inside the pool's top-level object set, not inside your filesystem, and you will happily read the wrong numbers for a while. To look inside the root dataset of the zblog1 pool, use zblog1/.

There is a shortcut that does the dataset bookkeeping for you and looks the file up by path:

$ zdb -O zblog1 file.bin -vvvv

Either way, what we are hunting for is the block pointer, and inside it, the DVA.

What a DVA actually says

DVA stands for Data Virtual Address, and it is ZFS's way of saying "this block lives here". Each DVA carries a vdev ID and an offset into that vdev. The first level-0 block from the dump above is:

0 L0 0:19a00:400 20000L/400P F=1 B=14/14 cksum=...

Decoded, that line says:

0 before L0 - the offset of this block within the file.
L0 - a level-0 block, meaning actual data and not more metadata.
0:19a00:400 - the DVA: vdev 0, byte offset 0x19a00 into that vdev, and 0x400 bytes on disk.
20000L/400P - the logical size, then the physical size (in hex, so 0x20000 = 128 KiB).
F=1 - there is real data here.
B=14 - the transaction group that created it.
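The DVA field from that decoding can be pulled apart with plain shell parameter expansion. This is only a sketch: parse_dva is a name invented here, not a ZFS tool, and it assumes the vdev:offset:size form exactly as zdb printed it above, with the last two fields in hex.

```shell
# Hypothetical helper: split a DVA string "vdev:offset:size" as printed by zdb.
# Offset and size are hex without a 0x prefix; they are printed in decimal.
parse_dva() {
    dva_vdev=${1%%:*}          # text before the first colon
    dva_rest=${1#*:}           # everything after the first colon
    dva_offset=${dva_rest%%:*} # middle field
    dva_size=${dva_rest#*:}    # last field
    echo "vdev=$dva_vdev offset=$((0x$dva_offset)) size=$((0x$dva_size))"
}

parse_dva 0:19a00:400
# prints: vdev=0 offset=104960 size=1024
```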

The offset is the interesting part, and it has a catch. The DVA offset does not count from the very start of the disk. ZFS keeps the first 4 MiB of every disk for itself: two copies of the vdev label and a boot block. The offset is measured after that reserved area, so to find the real byte in the backing file you add 0x400000 and convert to sectors:

physical byte offset = 0x400000 + DVA offset
sector = physical byte offset / 512

For this block:

0x400000 + 0x19a00 = 0x419a00
0x419a00 / 512 = 8397
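The same arithmetic can be left to the shell, which saves converting hex by hand. A minimal sketch follows. The name dva_to_sector is hypothetical, and it assumes the situation in this article: a single-file vdev, 512-byte sectors, and the 4 MiB reserved area described above.

```shell
# Hypothetical helper: turn a DVA offset (hex, no 0x prefix) into a
# 512-byte sector number of the backing file.
# Assumes a single-disk vdev with the 0x400000-byte reserved area in front.
dva_to_sector() {
    echo $(( (0x400000 + 0x$1) / 512 ))
}

dva_to_sector 19a00
# prints: 8397
```

The result can go straight into dd as skip="$(dva_to_sector 19a00)" with bs=512.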

So sector 8397 of single.img should hold the start of my block. Let us check, straight off the backing file:

$ dd if=single.img bs=512 skip=8397 count=2 status=none | hexdump -C | head
00000000  00 00 02 2d ff 12 53 49  4e 47 4c 45 2d 5a 46 53  |...-..SINGLE-ZFS|
00000010  2d 43 4f 52 52 55 50 54  49 4f 4e 2d 44 45 4d 4f  |-CORRUPTION-DEMO|
00000020  2d 42 4c 4f 43 4b 0a 21  00 ff ff ff ff ff ff ff  |-BLOCK.!........|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00000220  ff ff ff ff ff ff ff ff  ff ff c8 50 4d 4f 2d 42  |...........PMO-B|
00000230  4c 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |L...............|
00000240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400

My pattern is in there - SINGLE-ZFS-CORRUPTION-DEMO-BLOCK - so the address arithmetic is right. But it is wrapped in junk, the text breaks off after one line, and the block is 0x400 bytes instead of the clean 128 KiB I was picturing.

The DVA was correct, the sector math was correct, the dd command was correct. The right place, the wrong mental model.

The compression trap

The block count had been trying to tell me this the whole time. A 1 MiB file does not fit in 21 sectors unless something is squeezing it, and zdb had been saying so in plain sight: 20000L/400P means 128 KiB logical, 1 KiB physical. The block is compressed. What is on the disk is the compressed image, so of course it does not look like my repeated string.
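To make the mismatch concrete, here is a small sketch that reads a zdb size token and reports how hard the block was squeezed. The function name zdb_sizes is made up for this example, and it assumes the hexL/hexP form shown in the dumps above.

```shell
# Hypothetical helper: decode a zdb size token such as "20000L/400P"
# (logical/physical, both hex) and print the sizes and their ratio.
zdb_sizes() {
    zs_logical=${1%%L/*}     # hex digits before "L/"
    zs_physical=${1#*/}      # text after the slash, e.g. "400P"
    zs_physical=${zs_physical%P}
    echo "logical=$((0x$zs_logical)) physical=$((0x$zs_physical)) ratio=$((0x$zs_logical / 0x$zs_physical))"
}

zdb_sizes 20000L/400P
# prints: logical=131072 physical=1024 ratio=128

# The 21 sectors from stat, in bytes, next to the 1048576-byte file size:
echo "on-disk bytes: $((21 * 512))"
# prints: on-disk bytes: 10752
```

A ratio of 1 would mean the block is stored as written, which is what the offset arithmetic needs.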

I forgot to turn compression off, and then I blamed ZFS for playing with my data. Compression off, then, before any of this offset arithmetic - and note that changing the property does not rewrite existing blocks, so the file has to be written again afterwards:

# zfs set compression=off zblog1
# zfs get -H -o name,property,value compression zblog1
zblog1  compression  off

Recreate the file:

$ rm single-mnt/file.bin
$ yes 'SINGLE-ZFS-CORRUPTION-DEMO-BLOCK' | head -c 1M > single-mnt/file.bin
$ sync

The path form of zdb is convenient here, because it reports the new object number for us:

# zdb -O zblog1 file.bin -vvvv

    obj=3 dataset=zblog1 path=/file.bin type=19 bonustype=44

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         3    2   128K   128K  1.00M     512     1M  100.00  ZFS plain file
                                               176   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 7
        uid     0
        gid     0
        atime   Thu Jun  4 23:47:09 2026
        mtime   Thu Jun  4 23:47:09 2026
        ctime   Thu Jun  4 23:47:09 2026
        crtime  Thu Jun  4 23:47:09 2026
        gen     22
        mode    100664
        size    1048576
        parent  34
        links   1
        pflags  840800000004
        projid  0
Indirect blocks:
               0 L1  0:469e00:400 20000L/400P F=8 B=22/22 cksum=...
               0  L0 0:369e00:20000 20000L/20000P F=1 B=22/22 cksum=...
           20000  L0 0:389e00:20000 20000L/20000P F=1 B=22/22 cksum=...