ZPOOLCONCEPTS(7)   FreeBSD Miscellaneous Information Manual   ZPOOLCONCEPTS(7)

     zpoolconcepts -- overview of ZFS storage pools

   Virtual Devices (vdevs)
     A "virtual	device"	describes a single device or a collection of devices
     organized according to certain performance	and fault characteristics.
     The following virtual devices are supported:

     disk     A	block device, typically	located	under /dev.  ZFS can use indi-
	      vidual slices or partitions, though the recommended mode of op-
	      eration is to use	whole disks.  A	disk can be specified by a
	      full path, or it can be a	shorthand name (the relative portion
	      of the path under	/dev).	A whole	disk can be specified by omit-
	      ting the slice or	partition designation.	For example, sda is
	      equivalent to /dev/sda.  When given a whole disk,	ZFS automati-
	      cally labels the disk, if	necessary.

     file     A	regular	file.  The use of files	as a backing store is strongly
	      discouraged.  It is designed primarily for experimental pur-
	      poses, as	the fault tolerance of a file is only as good as the
	      file system on which it resides.	A file must be specified by a
	      full path.

     mirror   A	mirror of two or more devices.	Data is	replicated in an iden-
	      tical fashion across all components of a mirror.	A mirror with
	      N	disks of size X	can hold X bytes and can withstand N-1 devices
	      failing without losing data.

     raidz, raidz1, raidz2, raidz3
	      A	variation on RAID-5 that allows	for better distribution	of
	      parity and eliminates the	RAID-5 "write hole" (in	which data and
	      parity become inconsistent after a power loss).  Data and	parity
	      is striped across	all disks within a raidz group.

	      A	raidz group can	have single, double, or	triple parity, meaning
	      that the raidz group can sustain one, two, or three failures,
	      respectively, without losing any data.  The raidz1 vdev type
	      specifies	a single-parity	raidz group; the raidz2	vdev type
	      specifies	a double-parity	raidz group; and the raidz3 vdev type
	      specifies	a triple-parity	raidz group.  The raidz	vdev type is
	      an alias for raidz1.

	      A	raidz group with N disks of size X with	P parity disks can
	      hold approximately (N-P)*X bytes and can withstand P devices
	      failing without losing data. The minimum number of devices in a
	      raidz group is one more than the number of parity	disks.	The
              recommended number is between 3 and 9 to help increase
              performance.

     draid, draid1, draid2, draid3
	      A	variant	of raidz that provides integrated distributed hot
	      spares which allows for faster resilvering while retaining the
	      benefits of raidz.  A dRAID vdev is constructed from multiple
	      internal raidz groups, each with D data devices and P parity
	      devices. These groups are	distributed over all of	the children
	      in order to fully	utilize	the available disk performance.

	      Unlike raidz, dRAID uses a fixed stripe width (padding as	neces-
	      sary with	zeros) to allow	fully sequential resilvering.  This
              fixed stripe width significantly affects both usable capacity
	      and IOPS.	 For example, with the default D=8 and 4kB disk
	      sectors the minimum allocation size is 32kB.  If using compres-
	      sion, this relatively large allocation size can reduce the ef-
	      fective compression ratio.  When using ZFS volumes and dRAID,
	      the default of the volblocksize property is increased to account
	      for the allocation size.	If a dRAID pool	will hold a signifi-
	      cant amount of small blocks, it is recommended to	also add a
	      mirrored special vdev to store those blocks.

	      In regards to I/O, performance is	similar	to raidz since for any
	      read all D data disks must be accessed. Delivered	random IOPS
	      can be reasonably	approximated as

              Like raidz, a dRAID can have single-, double-, or triple-parity.
	      The draid1, draid2, and draid3 types can be used to specify the
	      parity level.  The draid vdev type is an alias for draid1.

	      A	dRAID with N disks of size X, D	data disks per redundancy
	      group, P parity level, and S distributed hot spares can hold
	      approximately (N-S)*(D/(D+P))*X bytes and	can withstand P	de-
	      vices failing without losing data.

              A non-default dRAID configuration can be specified by appending
              one or more of the following optional arguments to the draid
              keyword:

              parity    The parity level (1-3).
	      data	The number of data devices per redundancy group.  In
			general, a smaller value of D will increase IOPS,
			improve	the compression	ratio, and speed up resilver-
			ing at the expense of total usable capacity.  Defaults
			to 8, unless N-P-S is less than	8.
              children  The expected number of children.  Useful as a cross-
                        check when listing a large number of devices.  An
                        error is returned when the provided number of children
                        differs.
              spares    The number of distributed hot spares.  Defaults to
                        zero.

     spare    A	pseudo-vdev which keeps	track of available hot spares for a
	      pool.  For more information, see the Hot Spares section.

     log      A	separate intent	log device.  If	more than one log device is
	      specified, then writes are load-balanced between devices.	 Log
	      devices can be mirrored.	However, raidz vdev types are not sup-
	      ported for the intent log.  For more information,	see the	Intent
	      Log section.

     dedup    A	device dedicated solely	for deduplication tables.  The redun-
	      dancy of this device should match	the redundancy of the other
              normal devices in the pool.  If more than one dedup device is
              specified, then allocations are load-balanced between those
              devices.

     special  A	device dedicated solely	for allocating various kinds of	inter-
	      nal metadata, and	optionally small file blocks.  The redundancy
	      of this device should match the redundancy of the	other normal
	      devices in the pool.  If more than one special device is speci-
	      fied, then allocations are load-balanced between those devices.

	      For more information on special allocations, see the Special
	      Allocation Class section.

     cache    A	device used to cache storage pool data.	 A cache device	cannot
	      be configured as a mirror	or raidz group.	 For more information,
	      see the Cache Devices section.
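
     The capacity figures quoted above for mirror, raidz, and draid vdevs can
     be checked with a little shell arithmetic.  The following is only a
     sketch: the function names are hypothetical and not part of ZFS, the
     results are the same approximations the text gives, and the dRAID
     minimum allocation rule (D times the sector size) is inferred from the
     D=8, 4kB example rather than stated as a general formula.

```shell
#!/bin/sh
# Illustrative helpers; the function names are hypothetical, not part of ZFS.
# Sizes use whatever unit X is given in; results are approximations.

# mirror: N disks of size X hold X bytes and survive N-1 failures.
mirror_capacity() {   # usage: mirror_capacity N X
    echo "$2"
}

# raidz: N disks of size X with P parity disks hold about (N-P)*X.
raidz_capacity() {    # usage: raidz_capacity N X P
    echo $(( ($1 - $3) * $2 ))
}

# dRAID: about (N-S)*(D/(D+P))*X.  Multiply before dividing so that
# integer arithmetic does not truncate D/(D+P) to zero.
draid_capacity() {    # usage: draid_capacity N X D P S
    echo $(( ($1 - $5) * $3 * $2 / ($3 + $4) ))
}

# dRAID minimum allocation: D data devices times the sector size
# (assumption inferred from the D=8, 4kB example: 8 * 4 = 32kB).
draid_min_alloc() {   # usage: draid_min_alloc D SECTOR
    echo $(( $1 * $2 ))
}

raidz_capacity 6 4000 2        # 6 disks of 4000 GB as raidz2: prints 16000
draid_capacity 14 6000 4 2 2   # 14 disks, D=4, P=2, S=2: prints 48000
draid_min_alloc 8 4            # default D=8 with 4kB sectors: prints 32
```

     The dRAID example shows the trade-off described for the data argument:
     a smaller D lowers the D/(D+P) ratio and therefore the usable capacity.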

     Virtual devices cannot be nested, so a mirror or raidz virtual device can
     only contain files	or disks.  Mirrors of mirrors (or other	combinations)
     are not allowed.

     A pool can	have any number	of virtual devices at the top of the configu-
     ration (known as "root vdevs").  Data is dynamically distributed across
     all top-level devices to balance data among devices.  As new virtual
     devices are added, ZFS automatically places data on the newly available
     devices.

     Virtual devices are specified one at a time on the	command	line, sepa-
     rated by whitespace.  Keywords like mirror	and raidz are used to distin-
     guish where a group ends and another begins.  For example,	the following
     creates a pool with two root vdevs, each a	mirror of two disks:
	   # zpool create mypool mirror	sda sdb	mirror sdc sdd
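
     The way keywords delimit groups can be sketched in a few lines of shell.
     This is not how zpool parses its arguments internally; it is a
     hypothetical helper (group_vdevs) that only recognizes the mirror and
     raidz keywords and treats any device named before the first keyword as
     its own root vdev.

```shell
#!/bin/sh
# Hypothetical helper: print one line per top-level group in a vdev list.
# Only mirror and raidz keywords are handled in this sketch.
group_vdevs() {
    current=""
    for word in "$@"; do
        case "$word" in
            mirror|raidz|raidz1|raidz2|raidz3)
                # A keyword closes the previous group and opens a new one.
                if [ -n "$current" ]; then
                    printf '%s\n' "$current"
                fi
                current="$word:"
                ;;
            *)
                if [ -z "$current" ]; then
                    # No group is open: a bare device is its own vdev.
                    printf 'disk: %s\n' "$word"
                else
                    current="$current $word"
                fi
                ;;
        esac
    done
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    return 0
}

# Prints two lines, "mirror: sda sdb" and "mirror: sdc sdd".
group_vdevs mirror sda sdb mirror sdc sdd
```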

   Device Failure and Recovery
     ZFS supports a rich set of	mechanisms for handling	device failure and
     data corruption.  All metadata and	data is	checksummed, and ZFS automati-
     cally repairs bad data from a good	copy when corruption is	detected.

     In	order to take advantage	of these features, a pool must make use	of
     some form of redundancy, using either mirrored or raidz groups.  While
     ZFS supports running in a non-redundant configuration, where each root
     vdev is simply a disk or file, this is strongly discouraged.  A single
     case of bit corruption can	render some or all of your data	unavailable.

     A pool's health status is described by one	of three states: online,
     degraded, or faulted.  An online pool has all devices operating normally.
     A degraded	pool is	one in which one or more devices have failed, but the
     data is still available due to a redundant	configuration.	A faulted pool
     has corrupted metadata, or	one or more faulted devices, and insufficient
     replicas to continue functioning.
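
     For monitoring scripts it can be convenient to turn the three pool
     states into an exit status.  The sketch below assumes the health string
     is passed in as an argument; the function name and the choice of exit
     codes are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a pool health string to an exit status.
health_status() {
    case "$1" in
        ONLINE)   return 0 ;;   # all devices operating normally
        DEGRADED) return 1 ;;   # devices failed, data still available
        FAULTED)  return 2 ;;   # insufficient replicas to continue
        *)        return 3 ;;   # any other or unrecognized string
    esac
}
```

     On a live system the argument would typically come from the health
     column of zpool list, for example zpool list -H -o health pool.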

     The health	of the top-level vdev, such as a mirror	or raidz device, is
     potentially impacted by the state of its associated vdevs,	or component
     devices.  A top-level vdev or component device is in one of the following
     states:

     DEGRADED  One or more top-level vdevs is in the degraded state because
	       one or more component devices are offline.  Sufficient replicas
	       exist to	continue functioning.

	       One or more component devices is	in the degraded	or faulted
	       state, but sufficient replicas exist to continue	functioning.
	       The underlying conditions are as	follows:
	       o   The number of checksum errors exceeds acceptable levels and
		   the device is degraded as an	indication that	something may
		   be wrong.  ZFS continues to use the device as necessary.
	       o   The number of I/O errors exceeds acceptable levels.	The
		   device could	not be marked as faulted because there are in-
		   sufficient replicas to continue functioning.

     FAULTED   One or more top-level vdevs is in the faulted state because one
	       or more component devices are offline.  Insufficient replicas
	       exist to	continue functioning.

	       One or more component devices is	in the faulted state, and in-
	       sufficient replicas exist to continue functioning.  The under-
	       lying conditions	are as follows:
	       o   The device could be opened, but the contents	did not	match
		   expected values.
	       o   The number of I/O errors exceeds acceptable levels and the
		   device is faulted to	prevent	further	use of the device.

     OFFLINE   The device was explicitly taken offline by the zpool offline
               command.

     ONLINE    The device is online and	functioning.

     REMOVED   The device was physically removed while the system was running.
	       Device removal detection	is hardware-dependent and may not be
	       supported on all	platforms.

     UNAVAIL   The device could	not be opened.	If a pool is imported when a
	       device was unavailable, then the	device will be identified by a
	       unique identifier instead of its	path since the path was	never
	       correct in the first place.

     Checksum errors represent events where a disk returned data that was ex-
     pected to be correct, but was not.	 In other words, these are instances
     of	silent data corruption.	 The checksum errors are reported in zpool
     status and	zpool events.  When a block is stored redundantly, a damaged
     block may be reconstructed	(e.g. from raidz parity	or a mirrored copy).
     In	this case, ZFS reports the checksum error against the disks that con-
     tained damaged data.  If a	block is unable	to be reconstructed (e.g. due
     to	3 disks	being damaged in a raidz2 group), it is	not possible to	deter-
     mine which	disks were silently corrupted.	In this	case, checksum errors
     are reported for all disks	on which the block is stored.
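
     The reconstruction rule behind this paragraph is simply that a raidz
     block can be rebuilt as long as the number of damaged disks does not
     exceed the parity level.  A minimal sketch, using a hypothetical helper
     name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a block damaged on DAMAGED disks can
# still be rebuilt in a raidz group with PARITY parity disks.
can_reconstruct() {   # usage: can_reconstruct PARITY DAMAGED
    [ "$2" -le "$1" ]
}

# The case from the text: 3 damaged disks in a raidz2 group.
if can_reconstruct 2 3; then
    echo "repairable"
else
    echo "unrecoverable"    # this branch is taken
fi
```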

     If a device is removed and later re-attached to the system, ZFS attempts
     to bring the device online automatically.  Device attachment detection
     is hardware-dependent and might not be supported on all platforms.

   Hot Spares
     ZFS allows	devices	to be associated with pools as "hot spares".  These
     devices are not actively used in the pool,	but when an active device
     fails, it is automatically	replaced by a hot spare.  To create a pool
     with hot spares, specify a spare vdev with any number of devices.  For
     example:
	   # zpool create pool mirror sda sdb spare sdc	sdd

     Spares can	be shared across multiple pools, and can be added with the
     zpool add command and removed with	the zpool remove command.  Once	a
     spare replacement is initiated, a new spare vdev is created within	the
     configuration that	will remain there until	the original device is re-
     placed.  At this point, the hot spare becomes available again if another
     device fails.

     If	a pool has a shared spare that is currently being used,	the pool can
     not be exported since other pools may use this shared spare, which	may
     lead to potential data corruption.

     Shared spares add some risk.  If the pools	are imported on	different
     hosts, and	both pools suffer a device failure at the same time, both
     could attempt to use the spare at the same	time.  This may	not be de-
     tected, resulting in data corruption.

     An	in-progress spare replacement can be cancelled by detaching the	hot
     spare.  If	the original faulted device is detached, then the hot spare
     assumes its place in the configuration, and is removed from the spare
     list of all active	pools.

     The draid vdev type provides distributed hot spares.  These hot spares
     are named after the dRAID vdev they're a part of (draid1-2-3 specifies
     spare 3 of	vdev 2,	which is a single parity dRAID)	and may	only be	used
     by that dRAID vdev.  Otherwise, they behave the same as normal hot
     spares.
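
     The naming scheme for distributed spares can be taken apart with plain
     parameter expansion.  The helper below is hypothetical and assumes the
     name has exactly the form shown in the draid1-2-3 example (parity,
     then vdev, then spare).

```shell
#!/bin/sh
# Hypothetical helper: split a distributed spare name such as draid1-2-3
# into its parity level, vdev number, and spare number.
parse_draid_spare() {
    name=$1
    case "$name" in
        draid[1-3]-*-*) ;;
        *) echo "not a dRAID spare name: $name" >&2; return 1 ;;
    esac
    rest=${name#draid}     # "1-2-3"
    parity=${rest%%-*}     # "1"
    rest=${rest#*-}        # "2-3"
    vdev=${rest%%-*}       # "2"
    spare=${rest#*-}       # "3"
    echo "parity=$parity vdev=$vdev spare=$spare"
}

parse_draid_spare draid1-2-3   # prints "parity=1 vdev=2 spare=3"
```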

     Spares cannot replace log devices.

   Intent Log
     The ZFS Intent Log	(ZIL) satisfies	POSIX requirements for synchronous
     transactions.  For	instance, databases often require their	transactions
     to	be on stable storage devices when returning from a system call.	 NFS
     and other applications can	also use fsync(2) to ensure data stability.
     By	default, the intent log	is allocated from blocks within	the main pool.
     However, it might be possible to get better performance using separate
     intent log	devices	such as	NVRAM or a dedicated disk.  For	example:
	   # zpool create pool sda sdb log sdc

     Multiple log devices can also be specified, and they can be mirrored.
     See the EXAMPLES section for an example of mirroring multiple log
     devices.

     Log devices can be	added, replaced, attached, detached and	removed.  In
     addition, log devices are imported	and exported as	part of	the pool that
     contains them.  Mirrored devices can be removed by	specifying the top-
     level mirror vdev.

   Cache Devices
     Devices can be added to a storage pool as "cache devices".	 These devices
     provide an	additional layer of caching between main memory	and disk.  For
     read-heavy	workloads, where the working set size is much larger than what
     can be cached in main memory, using cache devices allows much more	of
     this working set to be served from	low latency media.  Using cache	de-
     vices provides the	greatest performance improvement for random read-work-
     loads of mostly static content.

     To	create a pool with cache devices, specify a cache vdev with any	number
     of	devices.  For example:
	   # zpool create pool sda sdb cache sdc sdd

     Cache devices cannot be mirrored or part of a raidz configuration.	 If a
     read error	is encountered on a cache device, that read I/O	is reissued to
     the original storage pool device, which might be part of a	mirrored or
     raidz configuration.

     The content of the	cache devices is persistent across reboots and re-
     stored asynchronously when	importing the pool in L2ARC (persistent
     L2ARC).  This can be disabled by setting l2arc_rebuild_enabled=0.	For
     cache devices smaller than	1GB, we	do not write the metadata structures
     required for rebuilding the L2ARC in order	not to waste space.  This can
     be	changed	with l2arc_rebuild_blocks_min_l2size.  The cache device	header
     (512B) is updated even if no metadata structures are written.  Setting
     l2arc_headroom=0 will result in scanning the full-length ARC lists	for
     cacheable content to be written in	L2ARC (persistent ARC).	 If a cache
     device is added with zpool	add its	label and header will be overwritten
     and its contents are not going to be restored in L2ARC, even if the de-
     vice was previously part of the pool.  If a cache device is onlined with
     zpool online its contents will be restored	in L2ARC.  This	is useful in
     case of memory pressure where the contents	of the cache device are	not
     fully restored in L2ARC.  The user	can off- and online the	cache device
     when there	is less	memory pressure	in order to fully restore its contents
     to	L2ARC.
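
     The size rule for persistent L2ARC can be written as a one-line check.
     This is a sketch only: the function name is hypothetical, and the
     default minimum is taken from the 1GB figure above and assumed here to
     mean 1 GiB; a different l2arc_rebuild_blocks_min_l2size value can be
     passed as the second argument.

```shell
#!/bin/sh
# Assumed default minimum, from the 1GB figure in the text (taken as 1 GiB).
L2SIZE_MIN=$((1024 * 1024 * 1024))

# Hypothetical helper: succeed if a cache device of SIZE_BYTES is large
# enough for the rebuild metadata structures to be written.
writes_rebuild_metadata() {   # usage: writes_rebuild_metadata SIZE_BYTES [MIN_BYTES]
    [ "$1" -ge "${2:-$L2SIZE_MIN}" ]
}

# A 512 MiB cache device falls below the assumed default minimum.
if writes_rebuild_metadata $((512 * 1024 * 1024)); then
    echo "rebuild metadata written"
else
    echo "header only"    # this branch is taken
fi
```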

   Pool	checkpoint
     Before starting critical procedures that include destructive actions
     (like zfs destroy), an administrator can checkpoint the pool's state and
     in	the case of a mistake or failure, rewind the entire pool back to the
     checkpoint.  Otherwise, the checkpoint can	be discarded when the proce-
     dure has completed	successfully.

     A pool checkpoint can be thought of as a pool-wide	snapshot and should be
     used with care as it contains every part of the pool's state, from	prop-
     erties to vdev configuration.  Thus, certain operations are not allowed
     while a pool has a checkpoint: specifically, vdev removal/attach/detach,
     mirror splitting, and changing the pool's GUID.  Adding a new vdev is
     supported,	but in the case	of a rewind it will have to be added again.
     Finally, users of this feature should keep	in mind	that scrubs in a pool
     that has a	checkpoint do not repair checkpointed data.

     To	create a checkpoint for	a pool:
	   # zpool checkpoint pool

     To	later rewind to	its checkpointed state,	you need to first export it
     and then rewind it	during import:
	   # zpool export pool
	   # zpool import --rewind-to-checkpoint pool

     To	discard	the checkpoint from a pool:
	   # zpool checkpoint -d pool
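
     The checkpoint workflow above can be wrapped around a risky procedure.
     The following is a sketch under stated assumptions: with_checkpoint is
     a hypothetical helper, and the ZPOOL variable is a hypothetical knob
     that lets the zpool binary be replaced (for example with echo) for a
     dry run.  Rewinding is deliberately left to the administrator, since it
     requires exporting the pool.

```shell
#!/bin/sh
# Hypothetical helper: checkpoint POOL, run a command, and discard the
# checkpoint only if the command succeeds.
# Set ZPOOL=echo (hypothetical knob) to print the zpool calls instead.
with_checkpoint() {   # usage: with_checkpoint POOL COMMAND [ARGS...]
    if [ "$#" -lt 2 ]; then
        echo "usage: with_checkpoint POOL COMMAND [ARGS...]" >&2
        return 2
    fi
    pool=$1
    shift
    "${ZPOOL:-zpool}" checkpoint "$pool" || return 1
    if "$@"; then
        # Procedure completed: the checkpoint is no longer needed.
        "${ZPOOL:-zpool}" checkpoint -d "$pool"
    else
        # Procedure failed: keep the checkpoint and show how to rewind.
        echo "procedure failed; to rewind $pool:" >&2
        echo "  zpool export $pool" >&2
        echo "  zpool import --rewind-to-checkpoint $pool" >&2
        return 1
    fi
}
```

     A call might look like with_checkpoint pool zfs destroy -r pool/old,
     where pool/old stands for a dataset name chosen for illustration.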

     Dataset reservations (controlled by the reservation and refreservation
     properties) may be	unenforceable while a checkpoint exists, because the
     checkpoint	is allowed to consume the dataset's reservation.  Finally,
     data that is part of the checkpoint but has been freed in the current
     state of the pool won't be	scanned	during a scrub.

   Special Allocation Class
     Allocations in the	special	class are dedicated to specific	block types.
     By	default	this includes all metadata, the	indirect blocks	of user	data,
     and any deduplication tables.  The	class can also be provisioned to ac-
     cept small	file blocks.

     A pool must always	have at	least one normal (non-dedup/-special) vdev be-
     fore other	devices	can be assigned	to the special class.  If the special
     class becomes full, then allocations intended for it will spill back into
     the normal	class.

     Deduplication tables can be excluded from the special class by unsetting
     the zfs_ddt_data_is_special ZFS module parameter.

     Inclusion of small	file blocks in the special class is opt-in.  Each
     dataset can control the size of small file	blocks allowed in the special
     class by setting the special_small_blocks property	to nonzero.  See
     zfsprops(7) for more info on this property.
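
     The opt-in rule for small file blocks can be sketched as a threshold
     test.  The helper name is hypothetical, and the comparison used here
     (a block qualifies when its size is less than or equal to
     special_small_blocks) is an assumption that should be checked against
     zfsprops(7).  Metadata placement is not modelled.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a file block of BLOCK_BYTES would be
# placed in the special class for a given special_small_blocks value.
# Assumes "less than or equal"; zero means the feature is not enabled.
goes_to_special() {   # usage: goes_to_special BLOCK_BYTES SPECIAL_SMALL_BLOCKS
    [ "$2" -gt 0 ] && [ "$1" -le "$2" ]
}

# A 16 KiB block with special_small_blocks set to 32 KiB.
if goes_to_special 16384 32768; then
    echo "special class"    # this branch is taken
else
    echo "normal class"
fi
```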

FreeBSD	13.0			 June 2, 2021			  FreeBSD 13.0

