ZPOOLCONCEPTS(7)	Miscellaneous Information Manual      ZPOOLCONCEPTS(7)

NAME
       zpoolconcepts --	overview of ZFS	storage	pools

DESCRIPTION
   Virtual Devices (vdevs)
       A  "virtual  device"  describes	a single device	or a collection	of de-
       vices, organized	according to certain performance and fault  character-
       istics.	The following virtual devices are supported:

       disk	A block	device,	typically located under	/dev.  ZFS can use in-
		dividual  slices or partitions,	though the recommended mode of
		operation is to	use whole disks.  A disk can be	specified by a
		full path, or it can be	a shorthand name (the relative portion
		of the path under /dev).  A whole disk	can  be	 specified  by
		omitting the slice or partition	designation.  For example, sda
		is equivalent to /dev/sda.  When given a whole disk, ZFS auto-
		matically labels the disk, if necessary.
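
		For example, either of the following forms (using the
		placeholder device name sda) creates a pool from a whole
		disk and produces the same result:
		      # zpool create pool sda
		      # zpool create pool /dev/sda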

       file	A  regular  file.   The	 use  of  files	 as a backing store is
		strongly discouraged.  It is designed primarily	for experimen-
		tal purposes, as the fault tolerance of	a file is only as good
		as the file system on which it resides.	 A file	must be	speci-
		fied by	a full path.
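
		For example, a throwaway pool for experimentation might be
		backed by a sparse file (hypothetical path shown):
		      # truncate -s 1G /tmp/zfs-test.img
		      # zpool create testpool /tmp/zfs-test.img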

       mirror	A mirror of two	or more	devices.  Data	is  replicated	in  an
		identical fashion across all components	of a mirror.  A	mirror
		with N disks of size X can hold X bytes and can withstand
		N-1 devices failing without losing data.
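
		For example, the following (with placeholder device names)
		creates a three-way mirror that can survive the loss of any
		two disks:
		      # zpool create pool mirror sda sdb sdc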

       raidz, raidz1, raidz2, raidz3
		A distributed-parity layout, similar to RAID-5/6, with
		improved distribution of parity, and which does not suffer
		from the RAID-5/6 "write hole" (in which data and parity
		become inconsistent after a power loss).  Data and parity
		are striped across all disks within a raidz group, though
		not necessarily in a consistent stripe width.

		A raidz	group can have single, double, or triple parity, mean-
		ing  that the raidz group can sustain one, two,	or three fail-
		ures, respectively, without losing any data.  The raidz1  vdev
		type  specifies	 a  single-parity raidz	group; the raidz2 vdev
		type specifies a double-parity raidz  group;  and  the	raidz3
		vdev  type  specifies  a triple-parity raidz group.  The raidz
		vdev type is an	alias for raidz1.

		A raidz	group with N disks of size X with P parity  disks  can
		hold  approximately  (N-P)*X bytes and can withstand P devices
		failing	without	losing data.  The minimum number of devices in
		a raidz	group is one more than the  number  of	parity	disks.
		The  recommended  number  is  between 3	and 9 to help increase
		performance.
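
		For example, the following (with placeholder device names)
		creates a double-parity raidz group of six disks, which
		holds approximately (6-2)*X = 4*X bytes and can withstand
		any two disks failing:
		      # zpool create pool raidz2 sda sdb sdc sdd sde sdf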

       draid, draid1, draid2, draid3
		A variant of raidz that	provides  integrated  distributed  hot
		spares,	 allowing  for faster resilvering, while retaining the
		benefits of raidz.  A dRAID vdev is constructed	from  multiple
		internal  raidz	 groups, each with D data devices and P	parity
		devices.  These	groups are distributed over all	of  the	 chil-
		dren in	order to fully utilize the available disk performance.

		Unlike raidz, dRAID uses a fixed stripe	width (padding as nec-
		essary	with  zeros)  to  allow	 fully sequential resilvering.
		This fixed stripe width	significantly affects both usable  ca-
		pacity	and IOPS.  For example,	with the default D=8 and 4 KiB
		disk sectors the minimum allocation size is 32 KiB.  If	 using
		compression,  this relatively large allocation size can	reduce
		the effective  compression  ratio.   When  using  ZFS  volumes
		(zvols)	and dRAID, the default of the volblocksize property is
		increased to account for the allocation	size.  If a dRAID pool
		will  hold  a significant amount of small blocks, it is	recom-
		mended to also add a mirrored  special	vdev  to  store	 those
		blocks.

		With regard to I/O, performance is similar to raidz since,
		for any read, all D data disks must be accessed.  Delivered
		random IOPS can be reasonably approximated as
		floor((N-S)/(D+P)) * single_drive_IOPS.

		Like  raidz, a dRAID can have single-, double-,	or triple-par-
		ity.  The draid1, draid2, and draid3  types  can  be  used  to
		specify	the parity level.  The draid vdev type is an alias for
		draid1.

		A  dRAID  with	N disks	of size	X, D data disks	per redundancy
		group, P parity	level, and S distributed hot spares  can  hold
		approximately  (N-S)*(D/(D+P))*X bytes and can withstand P de-
		vices failing without losing data.
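
		For example, assuming hypothetical values of N=11 disks,
		D=8 data disks, P=2 parity, and S=1 distributed spare, the
		dRAID holds approximately (11-1)*(8/(8+2))*X = 8*X bytes
		and delivers roughly floor((11-1)/(8+2)) = 1 disk's worth
		of random read IOPS.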

       draid[parity][:<data>d][:<children>c][:<spares>s]
		A non-default dRAID configuration can be specified by  append-
		ing  one  or  more  of the following optional arguments	to the
		draid keyword:
		parity	  The parity level (1-3).
		data	  The number of	data devices per redundancy group.  In
			  general, a smaller value of D	 will  increase	 IOPS,
			  improve  the	compression ratio, and speed up	resil-
			  vering at the	expense	of total usable	capacity.  De-
			  faults to 8, unless N-P-S is less than 8.
		children  The expected number of children.  Useful as a	cross-
			  check	when listing a large number  of	 devices.   An
			  error	 is returned when the provided number of chil-
			  dren differs.
		spares	  The number of	distributed hot	spares.	  Defaults  to
			  zero.
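
		For example, the following (with placeholder device names)
		creates a single-parity dRAID of six disks with four data
		disks per redundancy group and one distributed spare:
		      # zpool create pool draid1:4d:6c:1s sda sdb sdc sdd sde sdf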

       spare	A  pseudo-vdev which keeps track of available hot spares for a
		pool.  For more	information, see the "Hot Spares" section.

       log	A separate intent log device.  If more than one	log device  is
		specified, then	writes are load-balanced between devices.  Log
		devices	 can  be  mirrored.  However, raidz vdev types are not
		supported for the intent log.  For more	information,  see  the
		"Intent	Log" section.

       dedup	A  device  solely dedicated for	deduplication tables.  The re-
		dundancy of this device	should match  the  redundancy  of  the
		other  normal devices in the pool.  If more than one dedup de-
		vice is	specified, then	allocations are	load-balanced  between
		those devices.
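
		For example, the following (with placeholder device names)
		uses a mirrored dedup vdev whose redundancy matches the
		mirrored normal vdev:
		      # zpool create pool mirror sda sdb dedup mirror sdc sdd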

       special	A  device dedicated solely for allocating various kinds	of in-
		ternal metadata, and optionally	small file blocks.  The	redun-
		dancy of this device should match the redundancy of the	 other
		normal	devices	 in the	pool.  If more than one	special	device
		is specified, then allocations are load-balanced between those
		devices.

		For more information on	special	allocations, see the  "Special
		Allocation Class" section.
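
		For example, the following (with placeholder device names)
		adds a mirrored special vdev to a pool of mirrored normal
		devices:
		      # zpool create pool mirror sda sdb special mirror sdc sdd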

       cache	A device used to cache storage pool data.  A cache device can-
		not be configured as a mirror or raidz group.  For more	infor-
		mation,	see the	"Cache Devices"	section.

       Virtual devices cannot be nested	arbitrarily.  A	mirror,	raidz or draid
       virtual	device	can  only  be created with files or disks.  Mirrors of
       mirrors or other	such combinations are not allowed.

       A pool can have any number of virtual devices at	the top	of the config-
       uration (known as  "root	 vdevs").   Data  is  dynamically  distributed
       across  all  top-level  devices	to balance data	among devices.	As new
       virtual devices are added, ZFS automatically places data	on  the	 newly
       available devices.

       Virtual	devices	are specified one at a time on the command line, sepa-
       rated by	whitespace.  Keywords like mirror and raidz are	used  to  dis-
       tinguish	 where a group ends and	another	begins.	 For example, the fol-
       lowing creates a	pool with two root vdevs, each a mirror	of two disks:
	     # zpool create mypool mirror sda sdb mirror sdc sdd

   Device Failure and Recovery
       ZFS supports a rich set of mechanisms for handling device  failure  and
       data  corruption.   All metadata	and data is checksummed, and ZFS auto-
       matically repairs bad data from a good copy,  when  corruption  is  de-
       tected.

       In  order  to take advantage of these features, a pool must make	use of
       some form of redundancy,	using either mirrored or raidz groups.	 While
       ZFS  supports running in	a non-redundant	configuration, where each root
       vdev is simply a	disk or	file, this is strongly discouraged.  A	single
       case of bit corruption can render some or all of	your data unavailable.

       A  pool's  health  status  is described by one of three states: online,
       degraded, or faulted.  An online	pool has all  devices  operating  nor-
       mally.	A  degraded  pool  is  one  in	which one or more devices have
       failed, but the data is still available due to a	 redundant  configura-
       tion.   A  faulted  pool	has corrupted metadata,	or one or more faulted
       devices,	and insufficient replicas to continue functioning.

       The health of the top-level vdev, such as a mirror or raidz device,  is
       potentially  impacted by	the state of its associated vdevs or component
       devices.	 A top-level vdev or component device is in one	of the follow-
       ing states:

       DEGRADED	 One or	more top-level vdevs is	in the degraded	state  because
		 one or	more component devices are offline.  Sufficient	repli-
		 cas exist to continue functioning.

		 One  or  more component devices is in the degraded or faulted
		 state,	but sufficient replicas	exist to continue functioning.
		 The underlying	conditions are as follows:
		 •  The number of checksum errors or slow I/Os exceeds
		    acceptable levels and the device is degraded as an
		    indication that something may be wrong.  ZFS continues
		    to use the device as necessary.
		 •  The number of I/O errors exceeds acceptable levels.  The
		    device could not be marked as faulted because there are
		    insufficient replicas to continue functioning.

       FAULTED	 One  or  more top-level vdevs is in the faulted state because
		 one or	more  component	 devices  are  offline.	  Insufficient
		 replicas exist	to continue functioning.

		 One  or  more	component devices is in	the faulted state, and
		 insufficient replicas exist to	continue functioning.  The un-
		 derlying conditions are as follows:
		 •  The device could be opened, but the contents did not
		    match expected values.
		 •  The number of I/O errors exceeds acceptable levels and
		    the device is faulted to prevent further use of the
		    device.

       OFFLINE	 The  device was explicitly taken offline by the zpool offline
		 command.

       ONLINE	 The device is online and functioning.

       REMOVED	 The device was	physically removed while the system  was  run-
		 ning.	Device removal detection is hardware-dependent and may
		 not be	supported on all platforms.

       UNAVAIL	 The device could not be opened.  If a pool is imported	when a
		 device	was unavailable, then the device will be identified by
		 a  unique  identifier	instead	of its path since the path was
		 never correct in the first place.

       Checksum	errors represent events	where a	disk returned  data  that  was
       expected	 to  be	 correct,  but was not.	 In other words, these are in-
       stances of silent data corruption.  The checksum	errors are reported in
       zpool status and	zpool events.  When a block is stored  redundantly,  a
       damaged	block  may  be reconstructed (e.g. from	raidz parity or	a mir-
       rored copy).  In	this case, ZFS reports the checksum error against  the
       disks  that  contained damaged data.  If	a block	is unable to be	recon-
       structed	(e.g. due to 3 disks being damaged in a	raidz2 group),	it  is
       not possible to determine which disks were silently corrupted.  In this
       case,  checksum errors are reported for all disks on which the block is
       stored.
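
       For example, per-device checksum error counters and recent error
       events can be inspected with:
	     # zpool status -v pool
	     # zpool events pool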

       If a device is removed and later	re-attached to	the  system,  ZFS  at-
       tempts to bring the device online automatically.	 Device	attachment de-
       tection	is  hardware-dependent and might not be	supported on all plat-
       forms.

   Hot Spares
       ZFS allows devices to be associated with pools as "hot spares".  These
       devices are not actively used in the pool, but when an active device
       fails, it is automatically replaced by a hot spare.  To create a pool
       with  hot spares, specify a spare vdev with any number of devices.  For
       example,
	     # zpool create pool mirror	sda sdb	spare sdc sdd

       Spares can be shared across multiple pools, and can be added  with  the
       zpool  add  command  and	removed	with the zpool remove command.	Once a
       spare replacement is initiated, a new spare vdev	is created within  the
       configuration  that  will remain	there until the	original device	is re-
       placed.	At this	point, the hot spare becomes available again,  if  an-
       other device fails.
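
       For example, a spare can be added to and later removed from an
       existing pool (placeholder device name shown):
	     # zpool add pool spare sde
	     # zpool remove pool sde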

       If  a  pool  has	 a shared spare	that is	currently being	used, the pool
       cannot be exported, since other pools may use this shared spare,	 which
       may lead	to potential data corruption.

       Shared  spares  add  some risk.	If the pools are imported on different
       hosts, and both pools suffer a device failure at	the  same  time,  both
       could  attempt  to use the spare	at the same time.  This	may not	be de-
       tected, resulting in data corruption.

       An in-progress spare replacement	can be cancelled by detaching the  hot
       spare.	If the original	faulted	device is detached, then the hot spare
       assumes its place in the	configuration, and is removed from  the	 spare
       list of all active pools.
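
       For example, if the hot spare sdd is currently replacing a failed
       device, the in-progress replacement can be cancelled with:
	     # zpool detach pool sdd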

       The  draid vdev type provides distributed hot spares.  These hot	spares
       are named after the dRAID vdev they're a	part of	(draid1-2-3  specifies
       spare 3 of vdev 2, which	is a single parity dRAID) and may only be used
       by  that	 dRAID	vdev.	Otherwise,  they behave	the same as normal hot
       spares.

       Spares cannot replace log devices.

   Intent Log
       The ZFS Intent Log (ZIL)	satisfies POSIX	requirements  for  synchronous
       transactions.  For instance, databases often require their transactions
       to be on	stable storage devices when returning from a system call.  NFS
       and  other applications can also	use fsync(2) to	ensure data stability.
       By default, the intent log is allocated from  blocks  within  the  main
       pool.   However,	 it  might be possible to get better performance using
       separate	intent log devices such	as NVRAM or a dedicated	disk.  For ex-
       ample:
	     # zpool create pool sda sdb log sdc

       Multiple	log devices can	also be	specified, and they can	 be  mirrored.
       See the "EXAMPLES" section for an example of mirroring multiple log de-
       vices.
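
       For example, the following (with placeholder device names) creates a
       pool with a mirrored separate intent log:
	     # zpool create pool sda sdb log mirror sdc sdd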

       Log  devices  can  be added, replaced, attached,	detached, and removed.
       In addition, log	devices	are imported and exported as part of the  pool
       that  contains them.  Mirrored devices can be removed by	specifying the
       top-level mirror	vdev.

   Cache Devices
       Devices can be added to a storage pool as "cache	devices".   These  de-
       vices  provide  an  additional layer of caching between main memory and
       disk.  For read-heavy workloads,	where the working  set	size  is  much
       larger  than what can be	cached in main memory, using cache devices al-
       lows much more of this working set to be	served from low	latency	media.
       Using cache devices provides the	greatest performance  improvement  for
       random read-workloads of	mostly static content.

       To create a pool	with cache devices, specify a cache vdev with any num-
       ber of devices.	For example:
	     # zpool create pool sda sdb cache sdc sdd

       Cache  devices cannot be	mirrored or part of a raidz configuration.  If
       a read error is encountered on a	cache device, that read	I/O  is	 reis-
       sued to the original storage pool device, which might be	part of	a mir-
       rored or	raidz configuration.

       The content of the cache devices is persistent across reboots, and is
       restored asynchronously into the L2ARC when the pool is imported
       (persistent L2ARC).  This can be disabled by setting
       l2arc_rebuild_enabled=0.  For
       cache devices smaller than 1 GiB,  ZFS  does  not  write	 the  metadata
       structures  required for	rebuilding the L2ARC, to conserve space.  This
       can be changed with l2arc_rebuild_blocks_min_l2size.  The cache	device
       header  (512  B)	is updated even	if no metadata structures are written.
       Setting l2arc_headroom=0	will result in scanning	 the  full-length  ARC
       lists  for  cacheable  content to be written in L2ARC (persistent ARC).
       If a cache device is added with zpool add, its label and	header will be
       overwritten and its contents will not be	restored in L2ARC, even	if the
       device was previously part of the pool.	If a cache device  is  onlined
       with  zpool  online,  its  contents will	be restored in L2ARC.  This is
       useful in case of memory	pressure, where	the contents of	the cache  de-
       vice are	not fully restored in L2ARC.  The user can off-	and online the
       cache  device  when there is less memory	pressure, to fully restore its
       contents	to L2ARC.
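
       For example, a cache device whose contents were not fully restored can
       be cycled offline and back online once memory pressure subsides
       (placeholder device name shown):
	     # zpool offline pool sdc
	     # zpool online pool sdc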

   Pool	checkpoint
       Before starting critical	procedures that	 include  destructive  actions
       (like  zfs  destroy),  an administrator can checkpoint the pool's state
       and, in the case	of a mistake or	failure, rewind	the entire  pool  back
       to the checkpoint.  Otherwise, the checkpoint can be discarded when the
       procedure has completed successfully.

       A  pool checkpoint can be thought of as a pool-wide snapshot and	should
       be used with care as it contains	every part of the pool's  state,  from
       properties to vdev configuration.  Thus,	certain	operations are not al-
       lowed  while  a	pool has a checkpoint.	Specifically, vdev removal/at-
       tach/detach, mirror splitting, and changing the pool's GUID.  Adding  a
       new  vdev  is supported,	but in the case	of a rewind it will have to be
       added again.  Finally, users of this feature should keep	in  mind  that
       scrubs in a pool	that has a checkpoint do not repair checkpointed data.

       To create a checkpoint for a pool:
	     # zpool checkpoint	pool

       To  later rewind	to its checkpointed state, you need to first export it
       and then	rewind it during import:
	     # zpool export pool
	     # zpool import --rewind-to-checkpoint pool

       To discard the checkpoint from a	pool:
	     # zpool checkpoint	-d pool

       Dataset reservations (controlled	by the reservation and	refreservation
       properties) may be unenforceable	while a	checkpoint exists, because the
       checkpoint  is  allowed to consume the dataset's	reservation.  Finally,
       data that is part of the	checkpoint but has been	freed in  the  current
       state of	the pool won't be scanned during a scrub.

   Special Allocation Class
       Allocations in the special class	are dedicated to specific block	types.
       By  default,  this  includes  all metadata, the indirect	blocks of user
       data, and any deduplication tables.  The	class can also be  provisioned
       to accept small file blocks.

       A  pool	must always have at least one normal (non-dedup/-special) vdev
       before other devices can	be assigned to	the  special  class.   If  the
       special class becomes full, then	allocations intended for it will spill
       back into the normal class.

       Deduplication  tables  can be excluded from the special class by	unset-
       ting the	zfs_ddt_data_is_special	ZFS module parameter.

       Inclusion of small file blocks in the special class  is	opt-in.	  Each
       dataset	can  control the size of small file blocks allowed in the spe-
       cial class by setting the  special_small_blocks	property  to  nonzero.
       See zfsprops(7) for more	info on	this property.
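
       For example, a dataset might be configured to store file blocks of up
       to 32 KiB in the special class (hypothetical dataset name shown):
	     # zfs set special_small_blocks=32K pool/dataset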

FreeBSD	Ports 14.quarterly	 April 7, 2023		      ZPOOLCONCEPTS(7)
