
Sep 20 2021

Imaging mounted disk volumes under duress

[Image: drive-jenga]

Backups are critical. If you are lucky and organised, you have a set of useful backup primitives, such as point-in-time snapshots from your Infrastructure as a Service (IaaS) provider, your disk array controller, or your volume manager. However, there always seems to be some critical machine in my life that does not fall into any of these buckets.

Ideally, I prefer to have full disk images to restore from. I would rather boot a system as it was 30 days ago and extract files from it than piece things together from an unbootable copy of all of its files. Bootable systems in my world always win. However, making a bootable image of a running system is only really possible if the storage it is installed on has been set up with something that enables this (for example LVM, DRBD, or MD).

Annoyingly, I often find myself in the position where I know I need to back up a system before its imminent demise, but it doesn’t have any point-in-time snapshot ability. Rebooting it or reconfiguring its storage is usually very undesirable at this point, so migrating it to LVM is not possible, I can’t use MD to add a replica disk to get a block-level copy, nor can I set the disk up to be replicated to another machine using DRBD.

All of these options are amazing if you can get away with making deep changes to the system. But what do you do if your system looks like this?

root@doomed:~# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0     11:0    1 1024M  0 rom  
sda    254:0    0   64G  0 disk 
├─sda1 254:1    0 63.5G  0 part /
├─sda2 254:2    0    1K  0 part 
└─sda5 254:5    0  508M  0 part [SWAP]

We can’t simply read off the block device (sda) here, because the system could be actively making changes to parts of the disk that the imaging program has already read past.

So, while a dd if=/dev/sda of=/somewhere-else might complete, it will not produce a disk image that you could confidently mount without data corruption… the exact thing we are trying to avoid by backing up in the first place.

[Image: dd not seeing dirty blocks]

But what if we could build something like dd that could see the changes happening to the disk in real time? Well, thanks to a 2006 Linux tracing API that looks like it was designed for SAN performance debugging, we can!

Enter blktrace

blktrace is a mix of user space and kernel space code that gives decent insight into block device performance by providing tracing data on what actions are happening to a device. It was made by HP, who I assume were using it to debug their SAN performance at the time. But it seems the blktrace APIs have almost never been touched outside of the included blktrace programs themselves. This is sad, since it’s a genuinely useful tool (the same job could honestly also be done with eBPF, but hey! This API is way simpler).

We can give it a quick go on a modern Debian 11 system to check that everything still works as expected:

root@test-debian11:~# blktrace /dev/vda
^C=== vda ===
  CPU  0:                    1 events,        1 KiB data
  CPU  1:                  102 events,        5 KiB data
  Total:                   103 events (dropped 0),        5 KiB data

In another terminal I ran touch foo to force some disk I/O to happen, then hit ^C on blktrace, and it seems to have recorded some events! Lovely!

Then we can dump out the actions using blkparse:

root@test-debian11:~# blkparse -i vda.blktrace. 
Input file vda.blktrace.0 added
Input file vda.blktrace.1 added
254,0    1        1     0.000000000   181  A  WS 8673512 + 8 <- (254,1) 8671464
… <snip, too much data> ...
254,0    1       84     1.741724200   761  Q  RM 15200 + 8 [touch]
254,0    1       85     1.741724432   761  M  RM 15200 + 8 [touch]
254,0    1       86     1.741725516   761  U   N [touch] 1
254,0    1       87     1.741726831   761  I  RA 15104 + 104 [touch]
254,0    1       88     1.741730665   761  D  RA 15104 + 104 [touch]
254,0    1       89     1.742324123     0  C  RA 15104 + 104 [0]
CPU1 (vda):
 Reads Queued:          14,       56KiB	 Writes Queued:           9,       36KiB
 Read Dispatches:        4,       56KiB	 Write Dispatches:        2,       36KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        4,       56KiB	 Writes Completed:        3,       36KiB
 Read Merges:           12,       48KiB	 Write Merges:            7,       28KiB
 Read depth:             1        	 Write depth:             1
 IO unplugs:             2        	 Timer unplugs:           0

Throughput (R/W): 32KiB/s / 20KiB/s
Events (vda): 89 entries
Skips: 0 forward (0 -   0.0%)

This is great, since it tells us which sectors on the disk are being altered and how much data is being changed: each event line shows the starting sector plus the length in sectors, along with an action code (Q for queued, D for dispatched, C for completed, and so on). We can use this to build a tool like dd that does not have the flaw mentioned above!

Reusing the blktrace API

To get started we need to add the blktrace ioctls and structs to Go’s x/sys/unix package, since it seems this is the first time anyone is using blktrace from Go (my generally preferred language).

This is basically a case of finding the C structs used and telling the go/sys generator to regenerate all the structs used for syscalls.

Admittedly this is not my first time doing this: the Splitting the Ping post used the PPS API for the first time. Thankfully it’s reasonably easy to track down the ioctl identifiers for setup and the structs used for the emitted events, making adding support for this an easy task.

All of that work results in a reasonably small diff to go’s sys/unix package:

diff --git a/sys/unix/linux/types.go b/sys/unix/linux/types.go
index 515e3b6..d5e3cd5 100644
--- a/sys/unix/linux/types.go
+++ b/sys/unix/linux/types.go
@@ -95,6 +95,7 @@ struct termios2 {
 #include <linux/fanotify.h>
 #include <linux/filter.h>
 #include <linux/fs.h>
+#include <linux/blktrace_api.h>
 #include <linux/fsverity.h>
 #include <linux/genetlink.h>
 #include <linux/hdreg.h>
@@ -3212,6 +3213,17 @@ const (
 	PPS_FETCH     = C.PPS_FETCH
 )
 
+// BLKTRACE API
+
+type BLK_user_trace_setup C.struct_blk_user_trace_setup
+type BLK_io_trace C.struct_blk_io_trace
+
+const (
+	BLKTRACESETUP    = C.BLKTRACESETUP
+	BLKTRACESTART    = C.BLKTRACESTART
+	BLKTRACESTOP     = C.BLKTRACESTOP
+	BLKTRACETEARDOWN = C.BLKTRACETEARDOWN
+)
+

Then it is a case of setting the blktrace parameters in a struct, and then ioctl-ing it on a file descriptor for the target (to be traced) device:

	traceOpts := unix.BLK_user_trace_setup{
		Act_mask: 2,     // BLK_TC_WRITE: we only care about write events
		Buf_size: 65536, // size of each relay (trace) buffer
		Buf_nr:   4,     // how many of those buffers to allocate per CPU
	}

	_, _, err = unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.BLKTRACESETUP, uintptr(unsafe.Pointer(&traceOpts)))
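
For completeness, the rest of the ioctl lifecycle looks roughly like this (a sketch with error handling trimmed, not the exact hot-clone code):

	// Start the trace; write events now begin flowing into per-CPU relay
	// files under debugfs.
	if _, _, errno := unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.BLKTRACESTART, 0); errno != 0 {
		log.Fatalf("BLKTRACESTART failed: %v", errno)
	}

	// ... image the device and consume trace events here ...

	// Stop and tear down the session when done, otherwise the kernel keeps
	// the trace buffers and debugfs files around.
	unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.BLKTRACESTOP, 0)
	unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.BLKTRACETEARDOWN, 0)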

After running BLKTRACESTART some handy files appear in debugfs:

root@test-debian11:/sys/kernel/debug/block/sda# ls -alh
total 0
drwxr-xr-x  5 root root 0 Sep 19 15:22 .
drwxr-xr-x  4 root root 0 Sep 19 15:20 ..
-r--r--r--  1 root root 0 Sep 19 15:22 dropped
...
-r--------  1 root root 0 Sep 19 15:22 trace0
-r--------  1 root root 0 Sep 19 15:22 trace1
…

Now the last step is opening the trace(n) files (one per CPU) and reading blk_io_trace structs from them. Combining that with a simple per-sector bitmap lets us track every disk sector that received a write event during our imaging, so we can go back and re-read those sectors after the first pass finishes!
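
To illustrate, a minimal sketch of that bitmap and the loop that feeds it could look something like the following. The names (dirtyMap, readTrace) are hypothetical, and I’m assuming the generated BLK_io_trace struct exposes Sector, Bytes and Pdu_len fields; this is illustrative rather than hot-clone’s actual code:

// dirtyMap tracks which 512-byte sectors were written to while imaging:
// one bit per sector.
type dirtyMap struct {
	bits []byte
}

func newDirtyMap(totalSectors uint64) *dirtyMap {
	return &dirtyMap{bits: make([]byte, (totalSectors+7)/8)}
}

// markWrite flags every sector covered by a write of `bytes` bytes
// starting at `sector`.
func (d *dirtyMap) markWrite(sector uint64, bytes uint32) {
	for s := sector; s < sector+uint64((bytes+511)/512); s++ {
		d.bits[s/8] |= 1 << (s % 8)
	}
}

func (d *dirtyMap) isDirty(sector uint64) bool {
	return d.bits[sector/8]&(1<<(sector%8)) != 0
}

// readTrace consumes raw blk_io_trace records from one per-CPU trace file
// (for example /sys/kernel/debug/block/sda/trace0) and marks the sectors
// they touch as dirty. Since Act_mask only asked for writes, every record
// seen here relates to a write.
func readTrace(f *os.File, dirty *dirtyMap) {
	recSize := int(unsafe.Sizeof(unix.BLK_io_trace{}))
	buf := make([]byte, recSize)
	for {
		if _, err := io.ReadFull(f, buf); err != nil {
			return // trace torn down, or nothing more to read
		}
		ev := *(*unix.BLK_io_trace)(unsafe.Pointer(&buf[0]))
		// Records can carry Pdu_len bytes of extra payload we don't need.
		if ev.Pdu_len > 0 {
			io.CopyN(io.Discard, f, int64(ev.Pdu_len))
		}
		dirty.markWrite(ev.Sector, ev.Bytes)
	}
}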

Since we must track writes at the sector level (512-byte sections of a disk), we use 1 bit per sector, or 1 byte for every 8 sectors. This means that for every TiB being imaged, around 256 MiB of RAM is needed to track dirty sectors for that device.
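
Putting those pieces together, the copy itself can then be structured as two passes over the device, roughly like the sketch below. Again this is hypothetical code: hot-clone itself emits the changed data as sections tagged with their sector offsets (which is why there is a -reassemble step later), but the idea is the same.

// hotCopy does a dd-style first pass over the device, then a second pass
// that re-reads only the sectors that took writes during the first pass.
// A goroutine running readTrace() is assumed to be feeding `dirty`.
func hotCopy(dev, out *os.File, devBytes int64, dirty *dirtyMap) error {
	// Pass 1: sequential copy of the whole device.
	if _, err := io.Copy(out, io.NewSectionReader(dev, 0, devBytes)); err != nil {
		return err
	}

	// Pass 2: patch up anything that changed behind the first pass.
	buf := make([]byte, 512)
	for s := uint64(0); s < uint64(devBytes)/512; s++ {
		if !dirty.isDirty(s) {
			continue
		}
		if _, err := dev.ReadAt(buf, int64(s)*512); err != nil {
			return err
		}
		if _, err := out.WriteAt(buf, int64(s)*512); err != nil {
			return err
		}
	}
	return nil
}

Of course, writes can keep landing while the second pass runs, so for a truly consistent image the writes eventually have to stop, which is exactly what the verification below relies on.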

I ended up writing a simple proof of concept, and then needed a way to test the integrity of the whole thing to make sure it can copy data without missing altered sectors (and thus causing corruption).

Verification

My plan to prove that this worked was to wipe a flash drive, image it with hot-clone while writing data to it, stop those writes (by unmounting it) before the imaging finished, and then compare checksums of the reassembled image and the device itself.

If the system works correctly, we should get back an image that is byte-for-byte identical to what was on the target disk! Annoyingly, all of my computer storage was cursed for this use by being too fast; however, my day was saved by an incredibly sluggish hot pink 4GiB USB flash disk!

[Image: hot pink flash drive]

[16:30:21] ben@metropolis:~$ lsblk | grep sdb
sdb                     8:16   1   3.7G  0 disk  /media/ben/800B-EAB7
[16:30:25] ben@metropolis:~$ sudo umount /dev/sdb
[sudo] password for ben: 
[16:30:37] ben@metropolis:~$ sudo -i 
root@metropolis:~# dd if^C
root@metropolis:~# pv -L 5M /dev/zero > /dev/sdb
 620MiB 0:02:04 [5.02MiB/s] [====>                             ] 16% ETA 0:10:31

Now that the device is wiped and ready, we can begin our test:

[Image: hot-clone imaging a device]

Then we reassemble the output into a contiguous image:

[18:57:32] ben@metropolis:~/tmp$ hot-clone -reassemble sdb.hc -reassemble-output sdb.img
2021/09/19 18:57:35 Restoring section (Sector: 0 (len 3959422976 bytes) (debug: 'S:0	L:3959422976')
...

[18:58:07] ben@metropolis:~/tmp$ sudo md5sum sdb.img /dev/sdb
e6f63ef26853354fa60245ae16fb209b  sdb.img
e6f63ef26853354fa60245ae16fb209b  /dev/sdb

They are the same! Since all of the altered blocks got copied over, and since we unmounted the device before the imaging finished, the device and the image are identical!


It’s worth pointing out that this tool is for the quite narrow situation where you can’t take a point-in-time snapshot at the block level. So please don’t use it when those options are available.

This tool has already saved my ass a few times, and I (personally) would trust it to be correct. However, as with almost all software, it comes with no warranty.

You can find the tool, source code and pre-built binaries over at https://github.com/benjojo/hot-clone


If you want to stay up to date with the blog you can use the RSS feed or you can follow me on Twitter

Until next time!