(...)
102 64 0 0 W 1103045891 131072 sched
102 64 0 0 W 1103046147 131072 sched
102 64 0 0 W 1103046403 131072 sched
102 64 0 0 W 1103046659 131072 sched
102 64 0 0 W 1103046915 131072 sched
102 64 0 0 W 1103047171 131072 sched
102 64 0 0 W 1103047427 131072 sched
(...)
We see that Solaris nicely chunks the data up into 128K blocks, but the I/Os start at odd block numbers, so the EARS drives will indeed have a problem with this (lots of read-modify-write cycles needed; a quick arithmetic check follows the next trace). What's worse, there are also writes smaller than 4096 bytes, probably metadata:
(...)
102 64 0 0 W 268531009 1536 sched
102 64 0 0 W 1226324923 1536 sched
102 64 0 0 W 1226324926 1024 sched
102 64 0 0 W 268531012 1024 sched
102 64 0 0 W 847259994 1024 sched
102 64 0 0 W 1226324928 1024 sched
(...)
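To make the misalignment concrete: the addresses above are in 512-byte blocks, so an I/O starts on a 4096-byte boundary only when its block number is divisible by 8 (4096/512). Here is a minimal standalone check, using block numbers copied from the traces above:
#include <stdio.h>
#include <stdint.h>

/*
 * A 4096-byte physical sector spans eight 512-byte blocks, so an I/O
 * is 4 KiB aligned only if its starting block number is a multiple
 * of 8. The sample block numbers are taken from the traces above.
 */
int main(void) {
	uint64_t blocks[] = { 1103045891ULL, 268531009ULL, 1226324923ULL };
	int i;

	for (i = 0; i < 3; i++)
		printf("block %llu: %s\n", (unsigned long long)blocks[i],
		    blocks[i] % 8 == 0 ? "4 KiB aligned" : "misaligned");
	return 0;
}
All three start blocks are odd, so every one of these writes forces the drive into a read-modify-write cycle.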
Fortunately, ZFS (even as delivered in stock Solaris 10) has a way to enforce proper sector alignment: the "ashift" parameter, which is determined at pool creation time and stored as part of the pool configuration. The "zdb -C" command displays it:
(...)
$ zdb -C be02|tail -6 | sed -e"s/^ *//"
metaslab_array=23
metaslab_shift=32
ashift=9
asize=750134231040
is_log=0
DTL=128
(...)
The "ashift=9" means that blocks will have an alignment of 2**9 = 512 bytes. What we would want is an alignment of 2**12 = 4096 bytes.
So, how to increase this "ashift" value? As far as I know, there is no way to increase it without recreating the pool: if it were possible, existing blocks would need to be re-aligned, and understandably that would be quite hard to do on a live zpool!
Fair enough; so, how do we specify this "ashift" value at pool creation time? Well, in stock Solaris 10 there is no way to do that. The "ashift" value is not among the configuration parameters supplied by zpool to the ZFS_IOC_POOL_CREATE ioctl; it is currently left to the kernel to derive a proper value for "ashift" from the hardware. In normal circumstances this is probably the correct approach; in our case, however, we want control over this value.
So we need to compile our own version of zpool, where we add this configuration parameter. I found out that b116 is the OpenSolaris version whose code is likely to be compatible with Solaris 10 10/09 (s10u8). So I grabbed on-src.tar.bz2 and extracted ./usr/src/cmd/zpool.
Without a proper build environment in place, I tried to compile it on the fly and discovered that I needed a few more source components. The following shows the steps needed to recompile "zpool":
$ cd /tmp
$ tar -b120 -xvf on-src.tar.bz2 \
./usr/src/cmd/stat/common \
./usr/src/common/zfs \
./usr/src/cmd/zpool \
./usr/src/lib/libuutil/common/ \
./usr/src/lib/libdiskmgt/common/
$ cd ./usr/src/cmd/zpool
$ ln -s /usr/lib/libuutil.so.1 libuutil.so
$ gcc -O2 -DTEXT_DOMAIN='"en_US"' \
-I/tmp/usr/src/cmd/stat/common \
-I/tmp/usr/src/common/zfs \
-I/tmp/usr/src/lib/libuutil/common \
-I/tmp/usr/src/lib/libdiskmgt/common \
-c *.c
$ gcc -o zpool *.o -L. -lzfs -lnvpair -ldevid -lefi -ldiskmgt -luutil -lumem
I tested whether this "zpool" was functional, and indeed it worked just like the original.
We now need to extend zpool_vdev.c to append "ashift" to the list of pool properties that "zpool" passes to the ZFS ioctl. The relevant function seems to be "make_leaf_vdev"; a first attempt, where I modified "construct_spec" instead, was unsuccessful.
--- zpool_vdev.c.orig 2009-06-01 06:33:27.000000000 +0200
+++ zpool_vdev.c 2010-08-07 20:07:53.010531000 +0200
@@ -471,6 +471,7 @@
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_PATH, path) == 0);
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_TYPE, type) == 0);
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_IS_LOG, is_log) == 0);
+ verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_ASHIFT, 12) == 0);
if (strcmp(type, VDEV_TYPE_DISK) == 0)
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_WHOLE_DISK,
(uint64_t)wholedisk) == 0);
Please forgive me for hardcoding a value of "12"; this is currently only meant as a proof of concept anyway.
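If the magic number bothers you, the same spot in "make_leaf_vdev" could take the shift from the environment instead. This is only a sketch, untested against the real build, and the ZPOOL_ASHIFT variable name is my own invention, not an existing zpool feature:
/*
 * Hypothetical variant of the patch above: read the shift from a
 * (made-up) ZPOOL_ASHIFT environment variable, e.g. ZPOOL_ASHIFT=12,
 * and leave the kernel default untouched when it is unset.
 * Needs <stdlib.h> for getenv() and strtoull().
 */
const char *s = getenv("ZPOOL_ASHIFT");
if (s != NULL) {
	uint64_t ashift = strtoull(s, NULL, 10);
	verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_ASHIFT, ashift) == 0);
}
For the rest of this post, though, the hardcoded proof-of-concept is what actually gets compiled in.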
Recompile, and we now have a "zpool" command which creates zpools with an "ashift" value of 12. Let's try it out:
$ ./zpool create be02 c0d0s7
$ zdb -C be02|tail -6 | sed -e"s/^ *//"
whole_disk=0
metaslab_array=23
metaslab_shift=29
ashift=12
asize=59430928384
is_log=0
And let's have a look at the I/O pattern on this zpool:
(...)
102 7 0 0 W 163904 131072 sched
102 7 0 0 W 164160 131072 sched
102 7 0 0 W 164416 131072 sched
102 7 0 0 W 164672 131072 sched
102 7 0 0 W 164928 131072 sched
102 7 0 0 W 165184 131072 sched
102 7 0 0 W 680912 77824 sched
102 7 0 0 R 681216 65536 sched
102 7 0 0 W 681536 4096 sched
102 7 0 0 W 22032328 4096 sched
102 7 0 0 W 681544 4096 sched
102 7 0 0 W 22032336 4096 sched
102 7 0 0 W 681552 4096 sched
102 7 0 0 W 22032344 4096 sched
102 7 0 0 W 681560 4096 sched
102 7 0 0 W 22032352 4096 sched
102 7 0 0 W 44049928 4096 sched
102 7 0 0 W 681568 4096 sched
102 7 0 0 W 681576 8192 sched
(...)
All I/Os seem to be aligned to 4096 bytes, and also seem to have a minimum size of 4096 bytes.
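Rather than eyeballing the trace, a whole run can be piped through a small filter that prints only offending lines. A sketch, assuming (as in the listings above) that the direction, starting block (512-byte units) and size (bytes) are the 5th, 6th and 7th whitespace-separated fields of each line:
#include <stdio.h>

/*
 * Reads iosnoop-style lines on stdin and echoes only those whose
 * starting block or size is not 4 KiB aligned. The field positions
 * are an assumption based on the listings in this post.
 */
int main(void) {
	char line[512], f1[64], f2[64], f3[64], f4[64], dir[8];
	unsigned long long blk, size;

	while (fgets(line, sizeof (line), stdin) != NULL) {
		if (sscanf(line, "%63s %63s %63s %63s %7s %llu %llu",
		    f1, f2, f3, f4, dir, &blk, &size) == 7 &&
		    (blk % 8 != 0 || size % 4096 != 0))
			fputs(line, stdout);
	}
	return 0;
}
Run against the trace above, it prints nothing: every start block is a multiple of 8 and every size a multiple of 4096.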
I think I can now go ahead and buy those WD20EARS drives...
Unless somehow an ashift of k*2^n triggers some other code, which then makes ZFS believe it is dealing with a 4KB sectored drive, as per the PSARC you linked in Part 1 of these articles.
Also, does this new minimum block size correlate to a massive increase in the on-disk space used by metadata etc?
I have not noticed any significant increase in the space used by metadata.
Followed the guide to the letter and it seems to be working.
Test rig:
First - OpenSolaris snv_134
tank - 3x1 GB, raidz
Second - OpenSolaris 2009.06
tank2 - 3x1 GB, raidz with ashift=12
1. On Second - created tank2 with the modified ashift
2. Shut down Second, moved the .vmdk and attached it to First
3. Started up First, zpool import -f tank2, zpool upgrade tank2
4. Created /tank2/1.txt, 0 bytes long
5. Ran iosnoop in a console, opened /tank2/1.txt in a text editor, typed a single char and saved.
iosnoop showed 4096 bytes written
6. Did the same with /tank/1.txt; iosnoop showed 2048 bytes written.
Result - both raidz pools are the same size, version, asize, disks etc. The only difference is ashift, and it seems to work as intended.
Will buy three Samsung HD204 drives and see how they perform with random writes.
I have gzip-2 compression and recordsize=8k on a pool with ashift=9 and on a pool with ashift=12.
The compression ratio is almost 3x (database files).
du -s on any file on the first pool gives the on-disk size (different from ls); on the pool with ashift=12 it gives a size that is way bigger than even ls reports:
#ls -la file
8814
#du -k file
16
while on the ashift=9 pool, du -k gives 3.
Other big files (ls size in bytes | du -k with ashift=12 | du -k with ashift=9):
88680521 | 44672 | 12092
250009 | 132 | 79
So, basically it's a good idea for movie/etc. storage and not so good for an Intel X25-M (MLC) SSD storing MySQL databases, though it gives a ~1% read speed increase (and uses twice as much disk bandwidth). Perhaps it may be more useful for Toshiba/JMicron-based SSDs.
With ten 2 TB Samsung HD204UI (F4EG) drives in RAIDZ2, I lose over 600 gigabytes of space using ashift 12 compared to using ashift 9!
Is there a workaround for this? Or maybe I should stay with ashift=9; the throughput hit isn't too bad, right?