(...)
102 64 0 0 W 1103045891 131072 sched
102 64 0 0 W 1103046147 131072 sched
102 64 0 0 W 1103046403 131072 sched
102 64 0 0 W 1103046659 131072 sched
102 64 0 0 W 1103046915 131072 sched
102 64 0 0 W 1103047171 131072 sched
102 64 0 0 W 1103047427 131072 sched
(...)
We see that Solaris nicely chunks the data up into 128K blocks, but the I/Os start at odd block numbers, so the EARS drives will indeed have a problem with this (lots of read-modify-write cycles needed; a quick arithmetic check follows the next trace). What's worse, there are also writes smaller than 4096 bytes, probably metadata:
(...)
102 64 0 0 W 268531009 1536 sched
102 64 0 0 W 1226324923 1536 sched
102 64 0 0 W 1226324926 1024 sched
102 64 0 0 W 268531012 1024 sched
102 64 0 0 W 847259994 1024 sched
102 64 0 0 W 1226324928 1024 sched
(...)
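To make the misalignment concrete: the addresses above are in 512-byte blocks, so an I/O starts on a 4096-byte boundary only when its block number is divisible by 8 (4096/512). Here is a minimal standalone check, using block numbers copied from the traces above:
#include <stdio.h>
#include <stdint.h>

/*
 * A 4096-byte physical sector spans eight 512-byte blocks, so an I/O
 * is 4 KiB aligned only if its starting block number is a multiple
 * of 8. The sample block numbers are taken from the traces above.
 */
int main(void) {
	uint64_t blocks[] = { 1103045891ULL, 268531009ULL, 1226324923ULL };
	int i;

	for (i = 0; i < 3; i++)
		printf("block %llu: %s\n", (unsigned long long)blocks[i],
		    blocks[i] % 8 == 0 ? "4 KiB aligned" : "misaligned");
	return 0;
}
All three start blocks are odd, so every one of these writes forces the drive into a read-modify-write cycle.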
Fortunately, ZFS (even as delivered in stock Solaris 10) has a way to enforce proper sector alignment: the "ashift" parameter, which is determined at pool creation time and stored as part of the pool configuration. The "zdb -C" command displays it:
(...)
$ zdb -C be02|tail -6 | sed -e"s/^ *//"
metaslab_array=23
metaslab_shift=32
ashift=9
asize=750134231040
is_log=0
DTL=128
(...)
The "ashift=9" means that blocks will have an alignment of 2**9 = 512 bytes. What we would want is an alignment of 2**12 = 4096 bytes.
So, how to increase this "ashift" value? As far as I know, there is no way to increase it without recreating the pool: if it were possible, existing blocks would need to be re-aligned, and understandably that would be quite hard to do on a live zpool!
Fair enough; so, how do we specify this "ashift" value at pool creation time? Well, in stock Solaris 10 there is no way to do that. The "ashift" value is not among the configuration parameters supplied by zpool to the ZFS_IOC_POOL_CREATE ioctl; it is currently left to the kernel to derive a proper value for "ashift" from the hardware. In normal circumstances this is probably the correct approach; in our case, however, we want control over this value.
So we need to compile our own version of zpool, where we add this configuration parameter. I found out that b116 is the OpenSolaris version whose code is likely to be compatible with Solaris 10 10/09 (s10u8). So I grabbed on-src.tar.bz2 and extracted ./usr/src/cmd/zpool.
Without a proper build environment in place, I tried to compile it on the fly and discovered that I needed a few more source components. The following shows the steps needed to recompile "zpool":
$ cd /tmp
$ tar -b120 -xvf on-src.tar.bz2 \
./usr/src/cmd/stat/common \
./usr/src/common/zfs \
./usr/src/cmd/zpool \
./usr/src/lib/libuutil/common/ \
./usr/src/lib/libdiskmgt/common/
$ cd ./usr/src/cmd/zpool
$ ln -s /usr/lib/libuutil.so.1 libuutil.so
$ gcc -O2 -DTEXT_DOMAIN='"en_US"' \
-I/tmp/usr/src/cmd/stat/common \
-I/tmp/usr/src/common/zfs \
-I/tmp/usr/src/lib/libuutil/common \
-I/tmp/usr/src/lib/libdiskmgt/common \
-c *.c
$ gcc -o zpool *.o -L. -lzfs -lnvpair -ldevid -lefi -ldiskmgt -luutil -lumem
I tested whether this "zpool" was functional, and indeed it worked just like the original.
We now need to extend zpool_vdev.c to append "ashift" to the list of pool properties that "zpool" passes to the ZFS ioctl. The relevant function seems to be "make_leaf_vdev"; a first attempt, where I modified "construct_spec" instead, was unsuccessful.
--- zpool_vdev.c.orig 2009-06-01 06:33:27.000000000 +0200
+++ zpool_vdev.c 2010-08-07 20:07:53.010531000 +0200
@@ -471,6 +471,7 @@
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_PATH, path) == 0);
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_TYPE, type) == 0);
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_IS_LOG, is_log) == 0);
+ verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_ASHIFT, 12) == 0);
if (strcmp(type, VDEV_TYPE_DISK) == 0)
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_WHOLE_DISK,
(uint64_t)wholedisk) == 0);
Please forgive me for hardcoding a value of "12"; this is currently only meant as a proof of concept anyway.
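If the magic number bothers you, the same spot in "make_leaf_vdev" could take the shift from the environment instead. This is only a sketch, untested against the real build, and the ZPOOL_ASHIFT variable name is my own invention, not an existing zpool feature:
/*
 * Hypothetical variant of the patch above: read the shift from a
 * (made-up) ZPOOL_ASHIFT environment variable, e.g. ZPOOL_ASHIFT=12,
 * and leave the kernel default untouched when it is unset.
 * Needs <stdlib.h> for getenv() and strtoull().
 */
const char *s = getenv("ZPOOL_ASHIFT");
if (s != NULL) {
	uint64_t ashift = strtoull(s, NULL, 10);
	verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_ASHIFT, ashift) == 0);
}
For the rest of this post, though, the hardcoded proof-of-concept is what actually gets compiled in.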
Recompile, and we now have a "zpool" command which creates zpools with an "ashift" value of 12. Let's try it out:
$ ./zpool create be02 c0d0s7
$ zdb -C be02|tail -6 | sed -e"s/^ *//"
whole_disk=0
metaslab_array=23
metaslab_shift=29
ashift=12
asize=59430928384
is_log=0
And let's have a look at the I/O pattern on this zpool:
(...)
102 7 0 0 W 163904 131072 sched
102 7 0 0 W 164160 131072 sched
102 7 0 0 W 164416 131072 sched
102 7 0 0 W 164672 131072 sched
102 7 0 0 W 164928 131072 sched
102 7 0 0 W 165184 131072 sched
102 7 0 0 W 680912 77824 sched
102 7 0 0 R 681216 65536 sched
102 7 0 0 W 681536 4096 sched
102 7 0 0 W 22032328 4096 sched
102 7 0 0 W 681544 4096 sched
102 7 0 0 W 22032336 4096 sched
102 7 0 0 W 681552 4096 sched
102 7 0 0 W 22032344 4096 sched
102 7 0 0 W 681560 4096 sched
102 7 0 0 W 22032352 4096 sched
102 7 0 0 W 44049928 4096 sched
102 7 0 0 W 681568 4096 sched
102 7 0 0 W 681576 8192 sched
(...)
All I/Os seem to be aligned to 4096 bytes, and also seem to have a minimum size of 4096 bytes.
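Rather than eyeballing the trace, a whole run can be piped through a small filter that prints only offending lines. A sketch, assuming (as in the listings above) that the direction, starting block (512-byte units) and size (bytes) are the 5th, 6th and 7th whitespace-separated fields of each line:
#include <stdio.h>

/*
 * Reads iosnoop-style lines on stdin and echoes only those whose
 * starting block or size is not 4 KiB aligned. The field positions
 * are an assumption based on the listings in this post.
 */
int main(void) {
	char line[512], f1[64], f2[64], f3[64], f4[64], dir[8];
	unsigned long long blk, size;

	while (fgets(line, sizeof (line), stdin) != NULL) {
		if (sscanf(line, "%63s %63s %63s %63s %7s %llu %llu",
		    f1, f2, f3, f4, dir, &blk, &size) == 7 &&
		    (blk % 8 != 0 || size % 4096 != 0))
			fputs(line, stdout);
	}
	return 0;
}
Run against the trace above, it prints nothing: every start block is a multiple of 8 and every size a multiple of 4096.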
I think I can now go ahead and buy those WD20EARS drives...
Unless somehow an ashift of k*2^n triggers some other code, which then makes ZFS believe it is dealing with a 4KB sectored drive, as per the PSARC you linked in Part 1 of these articles.
Also, does this new minimum block size correlate to a massive increase in the on-disk space used by metadata etc?
I have not noticed any significant increase in the space used by metadata.
Followed the guide to the letter and it seems to be working.
Test rig:
First - OpenSolaris snv_134
tank - 3x1 GB, raidz
Second - OpenSolaris 2009.06
tank2 - 3x1 GB, raidz with ashift=12
1. On Second - created tank2 with the modified ashift
2. Shut down Second, moved the .vmdk and attached it to First
3. Started up First, zpool import -f tank2, zpool upgrade tank2
4. Created /tank2/1.txt, 0 bytes long
5. Ran iosnoop in a console, opened /tank2/1.txt in a text editor, typed a single char and saved.
iosnoop showed 4096 bytes written
6. Did the same with /tank/1.txt; iosnoop showed 2048 bytes written.
Result - both raidz pools are the same size, version, asize, disks etc. The only difference is ashift, and it seems to work as intended.
Will buy three Samsung HD204 drives and see how they perform with random writes.
I have gzip-2 compression and recordsize=8k on a pool with ashift=9 and on a pool with ashift=12.
The compression ratio is almost 3x (database files).
du -s on any file on the first pool gives the on-disk size (different from ls); on the pool with ashift=12 it gives a size that is way bigger than even ls reports:
#ls -la file
8814
#du -k file
16
while on the ashift=9 pool, du -k gives 3.
Other big files (ls size in bytes | du -k with ashift=12 | du -k with ashift=9):
88680521 | 44672 | 12092
250009 | 132 | 79
So, basically it's a good idea for movie/etc. storage and not so good for an Intel X25-M (MLC) SSD storing MySQL databases, though it gives a ~1% read speed increase (and uses twice as much disk bandwidth). Perhaps it may be more useful for Toshiba/JMicron-based SSDs.
With ten 2 TB Samsung HD204UI (F4EG) drives in RAIDZ2, I lose over 600 gigabytes of space using ashift 12 compared to using ashift 9!
Is there a workaround for this? Or maybe I should stay with ashift=9; the throughput hit isn't too bad, right?