20min Hands-on ZFS

ZFS is often called the last word in file systems.
It is an approach to managing large pools of disks, originally developed by Sun.
It was later ported to FreeBSD, Mac OS X (10.5 only) and Linux.

This text shows some of the basic features of ZFS and demonstrates them hands-on by example.

Prerequisites

-> FreeBSD
-> Solaris
-> MacOS (only Userland)

In our example we use

SunOS openindiana 5.11 oi_151a5 i86pc i386 i86pc Solaris.

as an environment.

But most commands also work on the other systems.

Since we do all the work within a VM, our commands have the pattern:

Input VM:

command

Output VM:

result

Pool Creation

The first piece of information we need is the number of disks present in our environment.
There are several ways to get a basic disk listing. Under (Open-)Solaris this can be done with:

Input VM:

format < /dev/null

Output VM:

AVAILABLE DISK SELECTIONS:
0. c4t0d0 
/pci@0,0/pci8086,2829@d/disk@0,0
1. c5t0d0 
/pci@0,0/pci1000,8000@16/sd@0,0
2. c5t1d0 
/pci@0,0/pci1000,8000@16/sd@1,0
3. c5t2d0 
/pci@0,0/pci1000,8000@16/sd@2,0
4. c5t3d0 
/pci@0,0/pci1000,8000@16/sd@3,0
5. c5t4d0 
/pci@0,0/pci1000,8000@16/sd@4,0
6. c5t5d0 
/pci@0,0/pci1000,8000@16/sd@5,0
7. c5t6d0 
/pci@0,0/pci1000,8000@16/sd@6,0
8. c5t7d0 
/pci@0,0/pci1000,8000@16/sd@7,0
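
If the disk names are needed in a script, they can be pulled out of this listing. The following is only a minimal sketch, assuming the output format shown above (an index followed by a dot, then the device name). The function name disk_names is made up for this example and is not part of Solaris.

```shell
# Print only the device names (e.g. c5t0d0) from `format` output on stdin.
# Assumption: each disk line starts with "<index>." followed by the name.
disk_names() {
  awk '$1 ~ /^[0-9]+\.$/ { print $2 }'
}

# Possible use inside the VM (not run here):
#   format < /dev/null | disk_names
```

The device path lines are skipped because their first field is not a number followed by a dot.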

about Pools

With ZFS it is possible to create different kinds of pools on a specific number of disks.
You can also create several pools within one system.

The following Pools are possible and most commonly used:

Type: JBOD
Info: Just a bunch of disks. In theory it is possible to create one pool for each disk in the system, although this is rarely done.
Performance: that of each disk
Capacity: that of each disk
Redundancy: no
Command: zpool create pool1 disk1
         zpool create pool2 disk2

Type: Stripe
Info: This is equivalent to RAID0, the data is distributed over all disks in the pool. If one disk fails, all the data is lost. But you can also stripe several redundant groups (e.g. two raidz groups) to have better redundancy.
Performance: very high
Capacity: N disks
Redundancy: no
Command: zpool create pool1 disk1 disk2

Type: Mirror
Info: This is equivalent to RAID1, the data is written to both disks in the pool. Restoring a pool (resilvering) is less efficient, since the data needs to be copied from the remaining disk.
Performance: normal
Capacity: N-1 disks (for a mirror of two disks, i.e. one disk)
Redundancy: +1
Command: zpool create pool1 mirror disk1 disk2

Type: Raidz
Info: This is equivalent to RAID5. The equivalent of one disk is used for parity data, which is distributed across all disks. Restoring a pool (resilvering) is less efficient, since the data needs to be rebuilt from the remaining disks.
Performance: high
Capacity: N-1 disks
Redundancy: +1
Command: zpool create pool1 raidz disk1 disk2 disk3

Type: Raidz2
Info: This is equivalent to RAID6. The equivalent of two disks is used for parity data. Restoring a pool (resilvering) is less efficient, since the data needs to be rebuilt from the remaining disks and their parity data.
Performance: high
Capacity: N-2 disks
Redundancy: +2
Command: zpool create pool1 raidz2 disk1 disk2 disk3 disk4

Type: Raidz3
Info: There is no real RAID equivalent for this one. The equivalent of three disks is used for parity data.
Performance: high
Capacity: N-3 disks
Redundancy: +3
Command: zpool create pool1 raidz3 disk1 disk2 disk3 disk4 disk5
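
The capacity figures above can be turned into a small calculation. This is only a sketch for a single group of equally sized disks. The function name usable_disks is hypothetical, and real pools lose some additional space to metadata.

```shell
# Usable capacity, counted in disks, for one group of the given layout.
# layout: stripe | mirror | raidz | raidz2 | raidz3
# disks:  number of (equally sized) disks in the group
usable_disks() {
  layout=$1
  disks=$2
  case $layout in
    stripe) echo "$disks" ;;
    mirror) echo 1 ;;                  # every disk holds the same data
    raidz)  echo $((disks - 1)) ;;
    raidz2) echo $((disks - 2)) ;;
    raidz3) echo $((disks - 3)) ;;
    *) echo "unknown layout: $layout" >&2; return 1 ;;
  esac
}

usable_disks raidz 3     # prints 2
usable_disks raidz2 6    # prints 4
```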

You can also add hot-spares for a better fallback behaviour, SSDs for caching reads (cache) and writes (logs).
I also created a benchmark with various combinations.

create a basic Pool (raidz)

Input VM:

zpool create tank raidz c5t0d0 c5t1d0 c5t2d0
...
zpool status

Output VM:

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0

errors: No known data errors

(Raid5)

You can already access the newly created pool:

Input VM:

ls -al /tank

Output VM:

 
...
total 4
drwxr-xr-x  2 root root  2 2012-10-23 22:02 .
drwxr-xr-x 25 root root 28 2012-10-23 22:02 ..

create a basic Pool (raidz) with one spare drive

Input VM:

zpool create tank raidz1 c5t0d0 c5t1d0 c5t2d0 spare c5t3d0
...
zpool status

Output VM:

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
        spares
          c5t3d0    AVAIL   

errors: No known data errors

List the available Layout

Input VM:

zpool list

Output VM:

NAME     SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
tank    1,46G   185K  1,46G         -     0%  1.00x  ONLINE  -

*The 1,46G (roughly 1,5G) shown here is the raw size of the pool including parity, so it does not reflect the space that is really available. If you copy a 1G file to the pool it will use 1,5G (1G + 512M parity).
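
The arithmetic behind this note can be written down as a short sketch. It is a simplification that ignores metadata, padding and compression. The function name raw_usage_mb is made up for this example.

```shell
# Raw pool space (in MB) consumed by data_mb of data on a raidz group
# with `disks` disks, of which `parity` are used for parity (1 for raidz1).
raw_usage_mb() {
  data_mb=$1
  disks=$2
  parity=$3
  echo $(( data_mb * disks / (disks - parity) ))
}

raw_usage_mb 1024 3 1    # prints 1536, i.e. 1G of data + 512M of parity
```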

create a striped pool

Input VM:

zpool create tank raidz1 c5t0d0 c5t1d0 c5t2d0 raidz1 c5t4d0 c5t5d0 c5t6d0
...
zpool status

Output VM:

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0

errors: No known data errors

(Raid50 = Raid5 + Raid5)

deal with disk failures

Input VM:

zpool create tank raidz1 c5t0d0 c5t1d0 c5t2d0 spare c5t3d0

Failure Handling

Input Host:

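# overwrite 1 MB inside the disk image with random data (offset chosen arbitrarily)
dd if=/dev/urandom of=1.vdi bs=1024k seek=10 count=1 conv=notrunc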

Wait for it…
or force a check, Input VM:

zpool scrub tank
...
zpool status

Output VM:

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 66K in 0h0m with 0 errors on Tue Oct 23 22:14:19 2012
config:

        NAME          STATE     READ WRITE CKSUM
        tank          DEGRADED     0     0     0
          raidz1-0    DEGRADED     0     0     0
            spare-0   DEGRADED     0     0     0
              c5t0d0  DEGRADED     0     0    64  too many errors
              c5t3d0  ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     0
        spares
          c5t3d0      INUSE     currently in use

errors: No known data errors
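
For monitoring, the devices that are not ONLINE can be filtered out of this output. The following is a rough sketch that assumes the column layout shown above. The function name unhealthy_devices is hypothetical.

```shell
# Print "<name> <state>" for every pool member that is not ONLINE.
# Reads `zpool status` output from stdin; the spares list is ignored.
unhealthy_devices() {
  awk '
    $1 == "NAME" && $2 == "STATE" { in_config = 1; next }   # table header
    $1 == "errors:" || $1 == "spares" { in_config = 0 }     # end of member list
    in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
  '
}

# Possible use inside the VM (not run here):
#   zpool status tank | unhealthy_devices
```

For the output above this would list tank, raidz1-0, spare-0 and c5t0d0, each with the state DEGRADED.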

Input VM:

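# if the errors were transient: reset the error counters
zpool clear tank
...
# or: drop the faulty disk, the spare c5t3d0 becomes a permanent member
zpool detach tank c5t0d0
# or: swap in a new disk, the spare returns to the spares list after resilvering
zpool replace tank c5t0d0 c5t7d0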

Create File systems

Input VM:

zfs create tank/home
zfs create tank/home/user1
...
chown -R user:staff /tank/home/user1
...
zfs get all tank/home/user1
...
zfs set sharesmb=on tank/home/user1
...
zfs set quota=500M tank/home/user1

Copy File from MacOS into SMB Share.

Snapshot

Input VM:

zfs snapshot tank/home/user1@basic
...
zfs list
...
zfs list -t snapshot

Output VM:

NAME                              USED  AVAIL  REFER  MOUNTPOINT
rpool1/ROOT/openindiana@install  84,0M      -  1,55G  -
tank/home/user1@basic                0      -  42,6K  -

Input VM:

zfs snapshot -r tank/home@backup
...
zfs list -t snapshot

Output VM:

NAME                              USED  AVAIL  REFER  MOUNTPOINT
rpool1/ROOT/openindiana@install  84,0M      -  1,55G  -
tank/home@backup                     0      -  41,3K  -
tank/home/user1@basic                0      -  42,6K  -
tank/home/user1@backup               0      -  42,6K  -

Input VM:

zfs clone tank/home/user1@basic tank/home/user2

Output VM:

tank/home/user2          1,33K   894M  70,3M  /tank/home/user2

Restoring Snapshots

Delete ZIP File in SMB-Share.

Input VM:

ls -al /tank/home/user1
...
zfs rollback tank/home/user1@backup

Output VM:

ls -al /tank/home/user1

Resizing a Pool

Input VM:

zpool list
...
zpool replace tank c5t0d0 c5t4d0
zpool replace tank c5t1d0 c5t5d0
zpool replace tank c5t2d0 c5t6d0
...
zpool scrub tank
...
zpool list

Output VM:

NAME     SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
tank    1,46G   381K  1,46G     1,50G     0%  1.00x  ONLINE  -

Input VM:

zpool set autoexpand=on tank

Using ZFS for Backups

Bash-Script

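# sync the home directory to the rsync module on the NAS (-a already implies -r)
rsync -avz --progress --delete /Users/user root@nas.local::user-backup/
# snapshot the backup file system, named after today's date
backupdate=$(date "+%Y-%m-%d")
ssh root@nas.local zfs snapshot "tank/backup@${backupdate}"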
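
Daily snapshots pile up over time. One way to thin them out is sketched below. It is not part of the original script. The function name snapshots_to_prune and the retention of 7 snapshots are assumptions made for this example.

```shell
# Print the snapshots that fall outside the newest `keep` entries.
# Expects one snapshot name per line on stdin, oldest first.
snapshots_to_prune() {
  keep=$1
  awk -v keep="$keep" '
    { name[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print name[i] }
  '
}

# Possible use on the NAS (not run here; check the list before destroying):
#   zfs list -H -t snapshot -o name -s creation -r tank/backup \
#     | snapshots_to_prune 7 \
#     | while read -r snap; do zfs destroy "$snap"; done
```

Sorting by creation time (-s creation) puts the oldest snapshots first, so the function only has to drop everything but the last few lines.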

Creating encrypted Volumes on ZFS Pools

One of the most anticipated features of ZFS is transparent encryption. But since Oracle decided not to release the updates from Solaris 11 as open source, on-disk encryption is not available on Illumos-based (i.e. open-source) distributions. There are, however, some ways to create transparently encrypted ZPools with the currently available ZFS version using pktool, lofiadm, zfs and zpool.

lofiadm - administer files available as block devices through lofi

http://www.unix.com/man-page/opensolaris/1m/lofiadm

That means you can use normal files as block devices while adding some features to them (e.g. compression and encryption). The goal of this post is to create a transparently encrypted volume that uses a key file for decryption (the key file might be stored on a USB stick or uploaded once via a browser to mount the device). For an easy start, I created a Vagrantfile based on OmniOS here.

If you do not know Vagrant, here is an easy Start for you:

  1. Get yourself a VirtualBox Version matching your Platform: https://www.virtualbox.org/wiki/Downloads
  2. Get yourself a Vagrant Version matching your Platform: http://www.vagrantup.com/downloads.html
  3. Move to the Folder where you have saved your Vagrantfile
  4. Start your box (this will take some time, since the OmniOS box needs to be downloaded first)
    vagrant up
  5. After your box is finished, you can ssh into it with
    vagrant ssh
  6. Have a look around:
    zpool status

    You will find exactly one (Root-) Pool configured in that system:

      pool: rpool
     state: ONLINE
      scan: none requested
    config:
    
            NAME        STATE     READ WRITE CKSUM
            rpool       ONLINE       0     0     0
              c1d0s0    ONLINE       0     0     0

Next we want to create our encrypted device. For that we need some “files” to use with lofiadm. One very handy feature of ZFS is the possibility to also create volumes (ZVols) in your ZPool.
First we need to find out how big our pool is:

zpool list

will give us an overview of the configured pools:

NAME           SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
rpool         39,8G  2,28G  37,5G         -     5%  1.00x  ONLINE  -
vagrant-priv      -      -      -         -      -      -  FAULTED  -

So we have roughly 37G of free space. For this test we would like to create an encrypted volume with 2G of space.
Creating a ZVol is as easy as creating a normal ZFS file system:

sudo zfs create -V 2G rpool/export/home/vagrant-priv

You can now see the new ZVol with the reserved size of 2G:

zfs list
NAME                             USED  AVAIL  REFER  MOUNTPOINT
rpool                           5,34G  33,8G  35,5K  /rpool
rpool/ROOT                      1,74G  33,8G    31K  legacy
rpool/ROOT/omnios               1,74G  33,8G  1,46G  /
rpool/ROOT/omniosvar              31K  33,8G    31K  legacy
rpool/dump                       512M  33,8G   512M  -
rpool/export                    2,06G  33,8G    32K  /export
rpool/export/home               2,06G  33,8G    41K  /export/home
rpool/export/home/vagrant-priv  2,06G  35,9G    16K  -
rpool/swap                      1,03G  34,8G  34,4M  -

Next we need a Key for en-/de-crypting the Device. That can be done with the pktool:

> pktool genkey keystore=file outkey=lofi.key keytype=aes keylen=256 print=y
< Key Value ="93af08fcfa9fc89724b5ee33dc244f219ac6ce75d73df2cb1442dc4cd12ad1c4"

We can now use this key with lofiadm to create an encrypted Device:

> sudo lofiadm -a /dev/zvol/rdsk/rpool/export/home/vagrant-priv -c aes-256-cbc -k lofi.key
< /dev/lofi/1

lofi.key is the File that contains the Key for the Encryption. You can keep it in that folder or move it to another device. If you want to reactivate the device (we will see later how to do this), you will need that key file again.
/dev/lofi/1 is our encrypted Device. We can use that for creating a new (encrypted) ZPool:

sudo zpool create vagrant-priv /dev/lofi/1

You can now use that pool as a normal ZPool (including quotas, compression, etc.)

> zpool status

< pool: vagrant-priv
 state: ONLINE
  scan: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        vagrant-priv   ONLINE       0     0     0
          /dev/lofi/1  ONLINE       0     0     0

errors: No known data errors

You should change the Folder permissions of that mount-point:

sudo chown -R vagrant:other /vagrant-priv

Creating some Test-Files:

cd /vagrant-priv/
mkfile 100m file2
> du -sh *
< 100M   file2

So what happens if we want to deactivate that Pool?

  1. Leave the Mount-Point:
    cd /
  2. Deactivate the Pool:
    sudo zpool export vagrant-priv
  3. Deactivate the Lofi Device:
    sudo lofiadm -d /dev/lofi/1
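
The three steps above could be wrapped in a small helper. This is only a sketch. The names run and lock_pool as well as the DRY_RUN switch are made up for this example, and the pool and device are passed in as arguments.

```shell
# With DRY_RUN=1 a command is only printed, otherwise it is executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# lock_pool <pool> <lofi device>: export the pool, then remove the lofi device.
lock_pool() {
  pool=$1
  lofi_dev=$2
  cd / || return 1                              # step 1: leave the mount point
  run sudo zpool export "$pool" || return 1     # step 2: deactivate the pool
  run sudo lofiadm -d "$lofi_dev"               # step 3: deactivate the lofi device
}

# Example: only show what would be done
#   DRY_RUN=1 lock_pool vagrant-priv /dev/lofi/1
```

The lofi device is only removed if the export succeeded, so a busy pool is not pulled away from under its users.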

That’s all. Now let’s reboot the system and let us see how we can re-attach that Pool again.

Leave the Vagrant Box:

> exit
< logout
< Connection to 127.0.0.1 closed.

Restart the Box:

> vagrant halt
< [default] Attempting graceful shutdown of VM...
> vagrant up
...
< Waiting for machine to boot. This may take a few minutes...
< [default] VM already provisioned. Run `vagrant provision` or use `--provision` to force it

Re-Enter the Box:

vagrant ssh

So where is our Pool?

zpool status

Only gives us the default root-Pool.
First we need to re-create our Lofi-Device:

> sudo lofiadm -a /dev/zvol/rdsk/rpool/export/home/vagrant-priv -c aes-256-cbc -k lofi.key
< /dev/lofi/1

Instead of creating a new ZPool (that would delete our previously created data), we need to import the existing ZPool. That can be done in two steps, using zpool. First we need to find our pool:

sudo zpool import -d /dev/lofi/

That lists all ZPools that are on devices in that directory. We need to find the id of “our” pool (that needs to be done only once, since the id stays the same as long as the pool exists).

...   
   pool: vagrant-priv
     id: 1140053612317909839
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        vagrant-priv   ONLINE
          /dev/lofi/1  ONLINE
...

We can now import that ZPool using the id 1140053612317909839:

sudo zpool import -d /dev/lofi/ 1140053612317909839
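
Looking up the id by hand could also be scripted. The sketch below assumes the listing format shown above. The function name pool_id is hypothetical. As the listing itself states, the pool can be imported by name as well, so the id mainly matters when pool names are ambiguous.

```shell
# Print the numeric id of the pool <name> from `zpool import` output on stdin.
pool_id() {
  awk -v name="$1" '
    $1 == "pool:" { current = $2 }                     # remember the pool name
    $1 == "id:" && current == name { print $2; exit }  # id line of that pool
  '
}

# Possible use inside the box (not run here):
#   id=$(sudo zpool import -d /dev/lofi/ | pool_id vagrant-priv)
#   sudo zpool import -d /dev/lofi/ "$id"
```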

After that we can again access our Pool as usual:

> cd /vagrant-priv/
> du -sh *
< 100M   file2