Monday, August 9. 2010
Solaris and the new 4K-Sector-Disks (e.g. WDxxEARS) / Part 3
Sunday, August 8. 2010
Solaris and the new 4K-Sector-Disks (e.g. WDxxEARS) / Part 2
Some days ago, I was just about to purchase two of those WD20EARS disks when I stumbled upon reports about their poor performance together with ZFS. I had a look at the I/O pattern produced when writing large chunks of zeroes to one of my current disks, using "iosnoop" from the excellent DTrace Toolkit.
(...)
102 64 0 0 W 1103045891 131072 sched
102 64 0 0 W 1103046147 131072 sched
102 64 0 0 W 1103046403 131072 sched
102 64 0 0 W 1103046659 131072 sched
102 64 0 0 W 1103046915 131072 sched
102 64 0 0 W 1103047171 131072 sched
102 64 0 0 W 1103047427 131072 sched
(...)
We see that Solaris nicely chunks the data up into 128K blocks, but the I/Os start at odd block numbers, so the EARS drives will indeed have a problem with this (lots of read-modify-write cycles needed). What's worse, there are also writes smaller than 4096 bytes, probably metadata:
(...)
102 64 0 0 W 268531009 1536 sched
102 64 0 0 W 1226324923 1536 sched
102 64 0 0 W 1226324926 1024 sched
102 64 0 0 W 268531012 1024 sched
102 64 0 0 W 847259994 1024 sched
102 64 0 0 W 1226324928 1024 sched
(...)
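Whether an I/O is 4K-aligned can be checked directly from the iosnoop output: with 512-byte logical blocks, a request is aligned only if its starting block number is divisible by 8 and its size is a multiple of 4096 bytes. A quick awk filter (field positions assumed from the listings above: $6 = block, $7 = size in bytes; sample lines copied from the trace):

```shell
# Flag I/Os that are not 4096-byte aligned (block numbers are in
# 512-byte units, so 8 blocks = 4096 bytes).
awk '$6 % 8 != 0 || $7 % 4096 != 0 { print "misaligned:", $0 }' <<'EOF'
102 64 0 0 W 1103045891 131072 sched
102 64 0 0 W 268531009 1536 sched
EOF
```

Both sample lines are flagged: the first because it starts at an odd block number, the second additionally because it is smaller than 4096 bytes.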
Fortunately, ZFS (even as delivered in stock Solaris 10) has a way to enforce proper sector alignment: the "ashift" parameter, which is determined at pool creation time and stored as part of the pool configuration. The "zdb -C" command shows it:
(...)
$ zdb -C be02|tail -6 | sed -e"s/^ *//"
metaslab_array=23
metaslab_shift=32
ashift=9
asize=750134231040
is_log=0
DTL=128
(...)
The "ashift=9" means that blocks will have an alignment of 2**9 = 512 bytes. What we want is an alignment of 2**12 = 4096 bytes.
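The relationship is simple shell arithmetic; "ashift" is the base-2 logarithm of the block alignment:

```shell
# ashift=9  -> 2**9  = 512-byte alignment
echo $((1 << 9))
# ashift=12 -> 2**12 = 4096-byte alignment
echo $((1 << 12))
```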
So, how to increase this "ashift" value? As far as I know, there is no way to do so without recreating the pool. If it were possible, existing blocks would need to be re-aligned, and understandably that would be quite hard to do on a live zpool!
Fair enough, so, how to specify this "ashift" value at pool creation time? Well, in stock Solaris 10 there is no way to do that. The "ashift" value is not among the configuration parameters supplied by zpool to the ZFS_IOC_POOL_CREATE ioctl. It is currently left to the kernel to derive a proper value for "ashift" from the hardware. In normal circumstances this is probably the correct way to do it; in our case, however, we want control over this value.
So we need to compile our own version of zpool, where we add this configuration parameter. I found out that b116 is the OpenSolaris version whose code is likely to be compatible with Solaris 10 10/09 (s10u8). So I grabbed on-src.tar.bz2 and extracted ./usr/src/cmd/zpool.
Without a proper build environment in place, I tried to compile it on the fly and discovered that I needed a few more source components. The following shows the steps needed to recompile "zpool":
$ cd /tmp
$ tar -b120 -xvf on-src.tar.bz2 \
./usr/src/cmd/stat/common \
./usr/src/common/zfs \
./usr/src/cmd/zpool \
./usr/src/lib/libuutil/common/ \
./usr/src/lib/libdiskmgt/common/
$ cd ./usr/src/cmd/zpool
$ ln -s /usr/lib/libuutil.so.1 libuutil.so
$ gcc -O2 -DTEXT_DOMAIN='"en_US"' \
-I/tmp/usr/src/cmd/stat/common \
-I/tmp/usr/src/common/zfs \
-I/tmp/usr/src/lib/libuutil/common \
-I/tmp/usr/src/lib/libdiskmgt/common \
-c *.c
$ gcc -o zpool *.o -L. -lzfs -lnvpair -ldevid -lefi -ldiskmgt -luutil -lumem
I tested whether this "zpool" was functional, and it indeed worked just like the original one.
We now need to extend zpool_vdev.c to append "ashift" to the list of pool properties "zpool" passes to the ZFS ioctl. The relevant function seems to be "make_leaf_vdev"; a first attempt, where I modified "construct_spec" instead, was unsuccessful.
--- zpool_vdev.c.orig 2009-06-01 06:33:27.000000000 +0200
+++ zpool_vdev.c 2010-08-07 20:07:53.010531000 +0200
@@ -471,6 +471,7 @@
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_PATH, path) == 0);
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_TYPE, type) == 0);
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_IS_LOG, is_log) == 0);
+ verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_ASHIFT, 12) == 0);
if (strcmp(type, VDEV_TYPE_DISK) == 0)
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_WHOLE_DISK,
(uint64_t)wholedisk) == 0);
Please forgive me for hardcoding a value of "12"; this is currently only meant as a proof of concept anyway.
Recompile, and we now have a "zpool" command which creates zpools with an "ashift" value of 12. Let's try it out:
$ ./zpool create be02 c0d0s7
$ zdb -C be02|tail -6 | sed -e"s/^ *//"
whole_disk=0
metaslab_array=23
metaslab_shift=29
ashift=12
asize=59430928384
is_log=0
And let's have a look at the I/O pattern on this zpool:
(...)
102 7 0 0 W 163904 131072 sched
102 7 0 0 W 164160 131072 sched
102 7 0 0 W 164416 131072 sched
102 7 0 0 W 164672 131072 sched
102 7 0 0 W 164928 131072 sched
102 7 0 0 W 165184 131072 sched
102 7 0 0 W 680912 77824 sched
102 7 0 0 R 681216 65536 sched
102 7 0 0 W 681536 4096 sched
102 7 0 0 W 22032328 4096 sched
102 7 0 0 W 681544 4096 sched
102 7 0 0 W 22032336 4096 sched
102 7 0 0 W 681552 4096 sched
102 7 0 0 W 22032344 4096 sched
102 7 0 0 W 681560 4096 sched
102 7 0 0 W 22032352 4096 sched
102 7 0 0 W 44049928 4096 sched
102 7 0 0 W 681568 4096 sched
102 7 0 0 W 681576 8192 sched
(...)
All I/Os seem to be aligned to 4096 bytes, and also to have a minimum size of 4096 bytes.
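As a spot check, the starting blocks from the listing above are all divisible by 8, i.e. multiples of 4096 bytes when counted in 512-byte blocks:

```shell
# Block numbers taken from the trace above; each must be a multiple
# of 8 (512-byte blocks) to be 4096-byte aligned.
for blk in 163904 680912 681536 22032328; do
  [ $((blk % 8)) -eq 0 ] && echo "$blk is 4096-byte aligned"
done
```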
I think I can now go ahead and buy those WD20EARS drives...
Saturday, August 7. 2010
Solaris and the new 4K-Sector-Disks (e.g. WDxxEARS) / Part 1
As disk capacity keeps growing to 2TB per disk and beyond, the traditional sector size of 512 bytes becomes more and more inefficient. Some vendors have already introduced disks with an internal sector size of 4096 bytes, e.g. Western Digital with their WDxxEARS series.
Some operating systems (e.g. Microsoft Windows) are not able to support sector sizes of more than 512 bytes, so the disk uses 4096 bytes only internally, and behaves to the outside world like a traditional disk.
For Solaris 10 and ZFS this poses a major problem, as ZFS assumes that the sector size the drive reports is also its physical sector size. If the drive reports a sector size of 512 bytes, ZFS will allocate data on 512-byte sector boundaries and will read and write data in chunks of at least 512 bytes. Of course, ZFS will also write data in much bigger chunks, but only to optimize the overall I/O pattern. As a consequence, the newer 4K-sector disks will exhibit terrible performance problems (the disk has to carry out read-modify-write cycles).
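The read-modify-write penalty follows directly from the sector emulation. Assuming the usual mapping with no offset, logical 512-byte block n lives in physical 4K sector n/8, and any write that does not cover whole physical sectors forces the drive to read the affected sector first. A sketch of that arithmetic (example values, not from a real trace):

```shell
# Map a small logical write onto the 4K physical sectors it touches.
lba=1001   # starting logical block (512-byte units), example value
len=3      # length in logical blocks (1536 bytes)
first=$((lba / 8))
last=$(((lba + len - 1) / 8))
echo "touches physical sectors $first..$last"
# A partial physical sector means the drive must read-modify-write it.
if [ $((lba % 8)) -ne 0 ] || [ $((len % 8)) -ne 0 ]; then
  echo "partial sector: read-modify-write"
fi
```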
Oracle has addressed this problem in PSARC 2008/769. This seems to be a fairly generic and flexible solution; however, it will only be available in the next release of Solaris. It might already be integrated into OpenSolaris, but a migration to OpenSolaris is currently not an option for me, even though I'm only using it at home.
So, is there a solution to the problem in Solaris 10? Yes, there is.
Stay tuned.