fi_cq(3)		       Libfabric v1.15.1		      fi_cq(3)

NAME
       fi_cq - Completion queue	operations

       fi_cq_open / fi_close
	      Open/close a completion queue

       fi_control
	      Control CQ operation or attributes.

       fi_cq_read / fi_cq_readfrom / fi_cq_readerr
	      Read a completion	from a completion queue

       fi_cq_sread / fi_cq_sreadfrom
	      A	 synchronous (blocking)	read that waits	until a	specified con-
	      dition has been met before reading a completion from  a  comple-
	      tion queue.

       fi_cq_signal
	      Unblock any thread waiting in fi_cq_sread	or fi_cq_sreadfrom.

       fi_cq_strerror
	      Converts	provider  specific  error information into a printable
	      string

SYNOPSIS
	      #include <rdma/fi_domain.h>

	      int fi_cq_open(struct fid_domain *domain,	struct fi_cq_attr *attr,
		  struct fid_cq	**cq, void *context);

	      int fi_close(struct fid *cq);

	      int fi_control(struct fid	*cq, int command, void *arg);

	      ssize_t fi_cq_read(struct	fid_cq *cq, void *buf, size_t count);

	      ssize_t fi_cq_readfrom(struct fid_cq *cq,	void *buf, size_t count,
		  fi_addr_t *src_addr);

	      ssize_t fi_cq_readerr(struct fid_cq *cq, struct fi_cq_err_entry *buf,
		  uint64_t flags);

	      ssize_t fi_cq_sread(struct fid_cq	*cq, void *buf,	size_t count,
		  const	void *cond, int	timeout);

	      ssize_t fi_cq_sreadfrom(struct fid_cq *cq, void *buf, size_t count,
		  fi_addr_t *src_addr, const void *cond, int timeout);

	      int fi_cq_signal(struct fid_cq *cq);

	      const char * fi_cq_strerror(struct fid_cq	*cq, int prov_errno,
		    const void *err_data, char *buf, size_t len);

ARGUMENTS
       domain Open resource domain

       cq     Completion queue

       attr   Completion queue attributes

       context
	      User specified context associated	with the completion queue.

       buf    For read calls, the data buffer to write completions into.   For
	      write  calls,  a completion to insert into the completion	queue.
	      For fi_cq_strerror, an optional buffer that  receives  printable
	      error information.

       count  Number of	CQ entries.

       len    Length of	data buffer

       src_addr
	      Source address of	a completed receive operation

       flags  Additional flags to apply	to the operation

       command
	      Command of control operation to perform on CQ.

       arg    Optional control argument

       cond   Condition	that must be met before	a completion is	generated

       timeout
	      Time  in milliseconds to wait.  A	negative value indicates infi-
	      nite timeout.

       prov_errno
	      Provider specific	error value

       err_data
	      Provider specific	error data related to a	completion

DESCRIPTION
       Completion queues are used to report events associated with data	trans-
       fers.  They are associated with message sends and receives, RMA,	 atom-
       ic, tagged messages, and	triggered events.  Reported events are usually
       associated with a fabric	endpoint, but may also refer to	memory regions
       used as the target of an	RMA or atomic operation.

   fi_cq_open
       fi_cq_open allocates a new completion queue.  Unlike event queues, com-
       pletion	queues	are  associated	 with a	resource domain	and may	be of-
       floaded entirely	in provider hardware.

       The properties and behavior of a	completion queue are defined by	struct
       fi_cq_attr.

	      struct fi_cq_attr	{
		  size_t	       size;	  /* # entries for CQ */
		  uint64_t	       flags;	  /* operation flags */
		  enum fi_cq_format    format;	  /* completion	format */
		  enum fi_wait_obj     wait_obj;  /* requested wait object */
		  int		       signaling_vector; /* interrupt affinity */
		  enum fi_cq_wait_cond wait_cond; /* wait condition format */
		  struct fid_wait     *wait_set;  /* optional wait set */
	      };

       size   Specifies	the minimum size of a completion queue.	 A value of  0
	      indicates	that the provider may choose a default value.

       flags  Flags that control the configuration of the CQ.

       - FI_AFFINITY
	      Indicates	that the signaling_vector field	(see below) is valid.

       format Completion  queues allow the application to select the amount of
	      detail that it must store	and report.  The format	attribute  al-
	      lows  the	 application  to select	one of several completion for-
	      mats, indicating the structure of	the data that  the  completion
              queue should return when read.  Supported formats and the
              structures that correspond to each are listed below.  The
              meanings of the CQ entry fields are defined in the Completion
              Fields section.

       - FI_CQ_FORMAT_UNSPEC
	      If  an  unspecified  format is requested,	then the CQ will use a
	      provider selected	default	format.

       - FI_CQ_FORMAT_CONTEXT
	      Provides only user specified context that	 was  associated  with
	      the completion.

	      struct fi_cq_entry {
		  void	   *op_context;	/* operation context */
	      };

       - FI_CQ_FORMAT_MSG
	      Provides	minimal	data for processing completions, with expanded
	      support for reporting information	about received messages.

	      struct fi_cq_msg_entry {
		  void	   *op_context;	/* operation context */
		  uint64_t flags;	/* completion flags */
		  size_t   len;		/* size	of received data */
	      };

       - FI_CQ_FORMAT_DATA
	      Provides data associated with a  completion.   Includes  support
	      for  received  message length, remote CQ data, and multi-receive
	      buffers.

	      struct fi_cq_data_entry {
		  void	   *op_context;	/* operation context */
		  uint64_t flags;	/* completion flags */
		  size_t   len;		/* size	of received data */
		  void	   *buf;	/* receive data	buffer */
		  uint64_t data;	/* completion data */
	      };

       - FI_CQ_FORMAT_TAGGED
	      Expands completion data to include support for the  tagged  mes-
	      sage interfaces.

	      struct fi_cq_tagged_entry	{
		  void	   *op_context;	/* operation context */
		  uint64_t flags;	/* completion flags */
		  size_t   len;		/* size	of received data */
		  void	   *buf;	/* receive data	buffer */
		  uint64_t data;	/* completion data */
		  uint64_t tag;		/* received tag	*/
	      };

       wait_obj
              CQs may be associated with a specific wait object.  Wait objects
              allow applications to block until the wait object is signaled,
              indicating that a completion is available to be read.
	      Users may	use fi_control to retrieve the underlying wait	object
	      associated  with a CQ, in	order to use it	in other system	calls.
	      The following values may be used to specify the type of wait ob-
	      ject  associated	with  a	 CQ:   FI_WAIT_NONE,   FI_WAIT_UNSPEC,
	      FI_WAIT_SET,  FI_WAIT_FD,	FI_WAIT_MUTEX_COND, and	FI_WAIT_YIELD.
	      The default is FI_WAIT_NONE.

       - FI_WAIT_NONE
	      Used to indicate that the	user will not block (wait) for comple-
	      tions on the CQ.	When FI_WAIT_NONE is specified,	 the  applica-
	      tion may not call	fi_cq_sread or fi_cq_sreadfrom.

       - FI_WAIT_UNSPEC
	      Specifies	 that  the  user will only wait	on the CQ using	fabric
	      interface	calls, such as	fi_cq_sread  or	 fi_cq_sreadfrom.   In
	      this case, the underlying	provider may select the	most appropri-
	      ate  or highest performing wait object available,	including cus-
	      tom wait mechanisms.  Applications  that	select	FI_WAIT_UNSPEC
	      are not guaranteed to retrieve the underlying wait object.

       - FI_WAIT_SET
	      Indicates	that the completion queue should use a wait set	object
	      to  wait for completions.	 If specified, the wait_set field must
	      reference	an existing wait set object.

       - FI_WAIT_FD
	      Indicates	that the CQ should use a file descriptor as  its  wait
	      mechanism.   A file descriptor wait object must be usable	in se-
	      lect, poll, and epoll routines.  However,	a provider may	signal
	      an  FD  wait object by marking it	as readable, writable, or with
	      an error.

       - FI_WAIT_MUTEX_COND
              Specifies that the CQ should use a pthread mutex and condition
              variable as a wait object.

       - FI_WAIT_YIELD
	      Indicates	 that  the  CQ will wait without a wait	object but in-
	      stead yield on every wait.   Allows  usage  of  fi_cq_sread  and
	      fi_cq_sreadfrom through a	spin.

       signaling_vector
	      If  the  FI_AFFINITY flag	is set,	this indicates the logical cpu
	      number (0..max cpu - 1) that interrupts associated with  the  CQ
	      should  target.	This  field should be treated as a hint	to the
	      provider and may be ignored if the provider does not support in-
	      terrupt affinity.

       wait_cond
	      By default, when a completion is inserted	into a	CQ  that  sup-
	      ports  blocking  reads (fi_cq_sread/fi_cq_sreadfrom), the	corre-
	      sponding wait object is signaled.	 Users may specify a condition
	      that must	first be met before the	wait is	satisfied.  This field
	      indicates	how the	provider  should  interpret  the  cond	field,
	      which describes the condition needed to signal the wait object.

       A  wait	condition should be treated as an optimization.	 Providers are
       not required to meet the	requirements of	the condition before signaling
       the wait	object.	 Applications should not rely on the condition	neces-
       sarily being true when a	blocking read call returns.

       If  wait_cond  is set to	FI_CQ_COND_NONE, then no additional conditions
       are applied to the signaling of the CQ wait object, and	the  insertion
       of  any new entry will trigger the wait condition.  If wait_cond	is set
       to FI_CQ_COND_THRESHOLD,	then the cond field is interpreted as a	size_t
       threshold value.  The threshold indicates the number of entries that
       must be queued at the CQ before the wait is satisfied.

       This field is ignored if	wait_obj is set	to FI_WAIT_NONE.

       wait_set
	      If  wait_obj is FI_WAIT_SET, this	field references a wait	object
	      to which the completion queue should attach.  When an  event  is
	      inserted	into  the completion queue, the	corresponding wait set
	      will be signaled if all necessary	conditions are met.   The  use
	      of  a wait_set enables an	optimized method of waiting for	events
	      across multiple event and	completion queues.  This field is  ig-
	      nored if wait_obj	is not FI_WAIT_SET.
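
       The attribute fields described above can be filled in as in the
       following minimal sketch.  The enum and struct declarations are local
       stand-ins copied from this page so that the example builds without
       libfabric; their numeric values are assumptions, and a real program
       takes the definitions from <rdma/fi_domain.h>.  The helper names
       make_blocking_cq_attr and cq_attr_allows_sread are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in declarations (assumed values); real code includes
 * <rdma/fi_domain.h> instead of declaring these. */
enum fi_cq_format { FI_CQ_FORMAT_UNSPEC, FI_CQ_FORMAT_CONTEXT,
                    FI_CQ_FORMAT_MSG, FI_CQ_FORMAT_DATA,
                    FI_CQ_FORMAT_TAGGED };
enum fi_wait_obj { FI_WAIT_NONE, FI_WAIT_UNSPEC, FI_WAIT_SET, FI_WAIT_FD,
                   FI_WAIT_MUTEX_COND, FI_WAIT_YIELD };
enum fi_cq_wait_cond { FI_CQ_COND_NONE, FI_CQ_COND_THRESHOLD };
struct fid_wait;

struct fi_cq_attr {
    size_t               size;
    uint64_t             flags;
    enum fi_cq_format    format;
    enum fi_wait_obj     wait_obj;
    int                  signaling_vector;
    enum fi_cq_wait_cond wait_cond;
    struct fid_wait     *wait_set;
};

/* Hypothetical helper: attributes for a CQ that reports tagged entries and
 * is waited on only through fi_cq_sread/fi_cq_sreadfrom. */
struct fi_cq_attr make_blocking_cq_attr(size_t min_entries)
{
    struct fi_cq_attr attr;

    memset(&attr, 0, sizeof attr);         /* flags = 0, wait_set = NULL */
    attr.size = min_entries;               /* 0 lets the provider choose */
    attr.format = FI_CQ_FORMAT_TAGGED;
    attr.wait_obj = FI_WAIT_UNSPEC;        /* provider picks the mechanism */
    attr.wait_cond = FI_CQ_COND_THRESHOLD; /* cond is read as a size_t */
    return attr;
}

/* Blocking reads are invalid with FI_WAIT_NONE and FI_WAIT_SET. */
int cq_attr_allows_sread(const struct fi_cq_attr *attr)
{
    assert(attr != NULL);
    return attr->wait_obj != FI_WAIT_NONE && attr->wait_obj != FI_WAIT_SET;
}
```

       The resulting structure would be passed as the attr argument of
       fi_cq_open.  Because a wait condition is only an optimization, code
       that uses a threshold should still check how many entries a blocking
       read actually returned.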

   fi_close
       The  fi_close  call releases all	resources associated with a completion
       queue.  Any completions which remain on the CQ when it  is  closed  are
       lost.

       When  closing  the CQ, there must be no opened endpoints, transmit con-
       texts, or receive contexts associated with the CQ.   If	resources  are
       still  associated  with	the CQ when attempting to close, the call will
       return -FI_EBUSY.

   fi_control
       The fi_control call is used to access provider or  implementation  spe-
       cific  details of the completion	queue.	Access to the CQ should	be se-
       rialized	across all calls when fi_control is invoked, as	it  may	 redi-
       rect  the  implementation of CQ operations.  The	following control com-
       mands are usable	with a CQ.

       FI_GETWAIT (void	**)
              This command allows the user to retrieve the low-level wait
              object associated with the CQ.  The format of the wait object is
              specified during CQ creation, through the CQ attributes.  The
              fi_control arg parameter should be an address where a pointer to
              the returned wait object will be written.  See fi_eq.3 for
              additional details on using fi_control with FI_GETWAIT.
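
       As an illustration, assume a CQ opened with FI_WAIT_FD whose file
       descriptor was retrieved with a call of the form
       fi_control(&cq->fid, FI_GETWAIT, &fd).  The descriptor can then be
       handed to poll.  The sketch below covers only the poll step; the
       helper name wait_for_cq_fd is hypothetical, and the caller chooses
       the event mask because a provider may signal the descriptor as
       readable, writable, or with an error.  The fi_trywait step that
       fi_poll(3) describes for native wait objects is omitted here.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical helper: wait until the CQ's wait descriptor is signaled.
 * Returns 1 if signaled, 0 on timeout, -1 if poll itself failed.
 * A negative timeout_ms waits indefinitely, matching poll semantics. */
int wait_for_cq_fd(int fd, short events, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(events != 0);          /* caller must request something */
    pfd.fd = fd;
    pfd.events = events;          /* e.g. POLLIN, or POLLIN | POLLOUT */
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    /* Error conditions (POLLERR, POLLHUP) are reported in revents even
     * when not requested, so any nonzero revents counts as a signal. */
    return (rc > 0 && pfd.revents != 0) ? 1 : 0;
}
```

       After the helper reports a signal, the application would read the CQ
       with fi_cq_read rather than reading from the descriptor itself.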

   fi_cq_read
       The fi_cq_read operation	performs a non-blocking	read of	completion da-
       ta from the CQ.	The format of the completion event is determined using
       the  fi_cq_format  option  that	was  specified when the	CQ was opened.
       Multiple	completions may	be retrieved from a CQ in a single call.   The
       maximum	number	of entries to return is	limited	to the specified count
       parameter, with the number of entries successfully read from the	CQ re-
       turned by the call.  (See return	values section below.) A  count	 value
       of  0 may be used to drive progress on associated endpoints when	manual
       progress	is enabled.

       CQs are optimized to report operations which have completed successful-
       ly.  Operations which fail are reported `out of band'.  Such operations
       are retrieved using the fi_cq_readerr function.	When an	operation that
       has completed with an unexpected	error is encountered, it is placed in-
       to a temporary error queue.  Attempting to read from a CQ while an item
       is in the error queue results in	fi_cq_read failing with	a return  code
       of -FI_EAVAIL.  Applications may	use this return	code to	determine when
       to call fi_cq_readerr.
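
       The polling pattern described above can be sketched as follows.  So
       that the example runs without a provider, struct fid_cq and
       fi_cq_read are replaced here by a toy in-memory stand-in; only
       drain_cq (a hypothetical name) shows the intended application logic.
       The sketch assumes that an empty queue is reported as -FI_EAGAIN, and
       the FI_EAGAIN and FI_EAVAIL values are stand-ins for the definitions
       in <rdma/fi_errno.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Stand-in error codes (assumed values); real code uses <rdma/fi_errno.h>. */
#define FI_EAGAIN EAGAIN
#define FI_EAVAIL 259

struct fi_cq_entry { void *op_context; };

/* Toy CQ, not the libfabric type: a few queued contexts plus a flag
 * saying that an entry is sitting in the error queue. */
struct fid_cq {
    void  *pending[8];
    size_t head, tail;
    int    error_pending;
};

/* Toy stand-in for fi_cq_read with the same calling convention. */
ssize_t fi_cq_read(struct fid_cq *cq, void *buf, size_t count)
{
    struct fi_cq_entry *out = buf;
    size_t n = 0;

    if (cq->error_pending)
        return -FI_EAVAIL;
    while (n < count && cq->head < cq->tail)
        out[n++].op_context = cq->pending[cq->head++];
    return n > 0 ? (ssize_t)n : -FI_EAGAIN;
}

/* Read completions in batches until the CQ is empty.  Returns 0 once the
 * queue is drained, or a negative fabric errno.  On -FI_EAVAIL the caller
 * is expected to call fi_cq_readerr before reading again. */
int drain_cq(struct fid_cq *cq, size_t *handled)
{
    struct fi_cq_entry batch[4];
    ssize_t n;

    assert(cq != NULL && handled != NULL);
    for (;;) {
        n = fi_cq_read(cq, batch, sizeof batch / sizeof batch[0]);
        if (n == -FI_EAGAIN)
            return 0;             /* nothing left to read */
        if (n < 0)
            return (int)n;        /* e.g. -FI_EAVAIL */
        *handled += (size_t)n;    /* batch[0..n-1] hold op_context values */
    }
}
```

       Reading several entries per call, as the batch array does here, takes
       advantage of the fact that multiple completions may be retrieved in a
       single fi_cq_read.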

   fi_cq_readfrom
       The fi_cq_readfrom call behaves identically to fi_cq_read, with the
       exception that it allows the CQ to return source address information to
       the  user for any received data.	 Source	address	data is	only available
       for  those  endpoints  configured  with	 FI_SOURCE   capability.    If
       fi_cq_readfrom is called on an endpoint for which source addressing
       data is not available, the source address will be set to
       FI_ADDR_NOTAVAIL.  The number of input src_addr entries must be the
       same as the count parameter.

       Returned	 source	 addressing  data is converted from the	native address
       used by the underlying fabric into an fi_addr_t,	which may be  used  in
       transmit operations.  Under most circumstances, returning fi_addr_t
       requires that the source address has already been inserted into the
       address vector associated with the receiving endpoint.  This is true for
       address	 vectors  of  type  FI_AV_TABLE.   In  select  providers  when
       FI_AV_MAP is used, source addresses may	be  converted  algorithmically
       into  a	usable	fi_addr_t, even	though the source address has not been
       inserted	into the address vector.  This is permitted by the API,	as  it
       allows the provider to avoid address look-up as part of receive message
       processing.   In	no case	do providers insert addresses into an AV sepa-
       rate from an application	calling	fi_av_insert or	similar	call.

       For endpoints allocated using  the  FI_SOURCE_ERR  capability,  if  the
       source  address	cannot	be  converted  into  a	valid fi_addr_t	value,
       fi_cq_readfrom will return -FI_EAVAIL, even if the data	were  received
       successfully.  The completion will then be reported through
       fi_cq_readerr with error code -FI_EADDRNOTAVAIL.  See fi_cq_readerr for
       details.

       If FI_SOURCE is specified without FI_SOURCE_ERR, source addresses which
       cannot be mapped to a usable fi_addr_t will be reported as
       FI_ADDR_NOTAVAIL.
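
       After a successful fi_cq_readfrom, an application using FI_SOURCE
       without FI_SOURCE_ERR may want to check which returned entries carry
       no usable source address.  A minimal sketch follows; fi_addr_t and
       FI_ADDR_NOTAVAIL are declared locally as stand-ins for the libfabric
       definitions, and count_unresolved_sources is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins (assumed to mirror libfabric); real code gets these from
 * the libfabric headers. */
typedef uint64_t fi_addr_t;
#define FI_ADDR_NOTAVAIL ((fi_addr_t)-1)

/* Count how many of the 'count' source addresses filled in by a read
 * could not be mapped to a usable fi_addr_t. */
size_t count_unresolved_sources(const fi_addr_t *src_addr, size_t count)
{
    size_t i, missing = 0;

    assert(src_addr != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (src_addr[i] == FI_ADDR_NOTAVAIL)
            missing++;
    }
    return missing;
}
```

       The src_addr array passed to fi_cq_readfrom must have as many entries
       as the count argument, so the same count (or the smaller number of
       entries actually returned) bounds the loop.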

   fi_cq_sread / fi_cq_sreadfrom
       The fi_cq_sread and fi_cq_sreadfrom calls are the blocking equivalents
       of fi_cq_read and fi_cq_readfrom.  Their behavior is similar
       to  the	non-blocking calls, with the exception that the	calls will not
       return until either a completion	has been read from the CQ or an	 error
       or timeout occurs.

       Threads blocking	in this	function will return to	the caller if they are
       signaled	by some	external source.  This is true even if the timeout has
       not occurred or was specified as	infinite.

       It  is  invalid	for applications to call these functions if the	CQ has
       been configured with a wait object of FI_WAIT_NONE or FI_WAIT_SET.

   fi_cq_readerr
       The read	error function,	fi_cq_readerr, retrieves information regarding
       any asynchronous	operation which	has completed with an  unexpected  er-
       ror.   fi_cq_readerr  is	 a  non-blocking  call,	 returning immediately
       whether an error	completion was found or	not.

       Error information is reported to the user through struct
       fi_cq_err_entry.  The format of this structure is defined below.

	      struct fi_cq_err_entry {
		  void	   *op_context;	/* operation context */
		  uint64_t flags;	/* completion flags */
		  size_t   len;		/* size	of received data */
		  void	   *buf;	/* receive data	buffer */
		  uint64_t data;	/* completion data */
		  uint64_t tag;		/* message tag */
		  size_t   olen;	/* overflow length */
		  int	   err;		/* positive error code */
		  int	   prov_errno;	/* provider error code */
		  void	  *err_data;	/*  error data */
		  size_t   err_data_size; /* size of err_data */
	      };

       The  general  reason  for  the error is provided	through	the err	field.
       Provider	specific error information may also be available  through  the
       prov_errno  and err_data	fields.	 Users may call	fi_cq_strerror to con-
       vert provider specific error information	into a	printable  string  for
       debugging  purposes.   See  field details below for more	information on
       the use of err_data and err_data_size.

       Note that error completions are generated for all operations, including
       those for which a completion was	not  requested	(e.g. an  endpoint  is
       configured  with	 FI_SELECTIVE_COMPLETION, but the request did not have
       the FI_COMPLETION flag set).  In such cases, providers will return as
       much information about the failure as the underlying software and
       hardware make available; other fields will be set to NULL or 0.  This
       includes the op_context value, which may not have been provided or was
       ignored on input as part of the transfer.

       Notable completion error	codes are given	below.

       FI_EADDRNOTAVAIL
	      This  error code is used by CQs configured with FI_SOURCE_ERR to
	      report completions for which a usable fi_addr_t  source  address
	      could not	be found.  An error code of FI_EADDRNOTAVAIL indicates
	      that  the	data transfer was successfully received	and processed,
	      with the fi_cq_err_entry fields containing information about the
	      completion.  The err_data	field will be set to  the  source  ad-
	      dress  data.   The  source address will be in the	same format as
	      specified	through	the fi_info addr_format	field for  the	opened
	      domain.	This  may be passed directly into an fi_av_insert call
	      to add the source	address	to the address vector.
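
       Since FI_EADDRNOTAVAIL marks data that was received successfully, an
       application will usually handle it differently from other error
       completions.  The following sketch shows one way to make that
       distinction.  The struct is a local copy of the listing above,
       FI_EADDRNOTAVAIL is a stand-in assumed to match the errno value, and
       the enum and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in (assumed value); real code uses <rdma/fi_errno.h>. */
#define FI_EADDRNOTAVAIL EADDRNOTAVAIL

/* Local copy of the structure shown above. */
struct fi_cq_err_entry {
    void    *op_context;
    uint64_t flags;
    size_t   len;
    void    *buf;
    uint64_t data;
    uint64_t tag;
    size_t   olen;
    int      err;
    int      prov_errno;
    void    *err_data;
    size_t   err_data_size;
};

enum cq_err_kind {
    CQ_ERR_OPERATION_FAILED,   /* the transfer itself failed */
    CQ_ERR_SOURCE_UNRESOLVED   /* data arrived, sender not in the AV */
};

/* Classify an entry that was filled in by fi_cq_readerr. */
enum cq_err_kind classify_cq_error(const struct fi_cq_err_entry *entry)
{
    assert(entry != NULL);
    /* err is a positive fabric errno. */
    if (entry->err == FI_EADDRNOTAVAIL)
        return CQ_ERR_SOURCE_UNRESOLVED;
    return CQ_ERR_OPERATION_FAILED;
}
```

       For CQ_ERR_SOURCE_UNRESOLVED, the remaining fields describe the
       received message, and err_data holds the source address that may be
       passed to fi_av_insert.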

   fi_cq_signal
       The fi_cq_signal	call will unblock any thread waiting in	fi_cq_sread or
       fi_cq_sreadfrom.	 This may be used to wake-up a thread that is  blocked
       waiting	to read	a completion operation.	 The fi_cq_signal operation is
       only available if the CQ	was configured with a wait object.

COMPLETION FIELDS
       The CQ entry data structures share many of the same fields.  The	 mean-
       ings of these fields are	the same for all CQ entry structure formats.

       op_context
	      The operation context is the application specified context value
	      that  was	 provided with an asynchronous operation.  The op_con-
	      text field is valid for all completions that are associated with
	      an asynchronous operation.

       For completion events that are not associated with a posted  operation,
       this field will be set to NULL.	This includes completions generated at
       the  target  in	response  to  RMA  write operations that carry CQ data
       (FI_REMOTE_WRITE	| FI_REMOTE_CQ_DATA flags set),	when the FI_RX_CQ_DATA
       mode bit	is not required.

       flags  This specifies flags associated with  the	 completed  operation.
	      The  Completion  Flags  section  below  lists valid flag values.
	      Flags are	set for	all relevant completions.

       len    This len field only  applies  to	completed  receive  operations
              (e.g. fi_recv, fi_trecv, etc.).  It indicates the size of received
              message data, i.e. how many data bytes were placed into the
              associated receive buffer by a corresponding send call (fi_send,
              fi_tsend, etc.).  If an endpoint has been configured
	      with the FI_MSG_PREFIX mode, the len also	reflects the  size  of
	      the prefix buffer.

       buf    The  buf	field  is only valid for completed receive operations,
	      and only applies when the	receive	buffer	was  posted  with  the
	      FI_MULTI_RECV  flag.   In	 this case, buf	points to the starting
	      location where the receive data was placed.

       data   The data field is	only valid if the FI_REMOTE_CQ_DATA completion
              flag is set, and only applies to receive completions.  If
              FI_REMOTE_CQ_DATA is set, this field will contain the completion data
	      provided by the peer as part of  their  transmit	request.   The
	      completion data will be given in host byte order.

       tag    A	 tag  applies  only  to	received messages that occur using the
	      tagged interfaces.  This field contains the tag that was includ-
	      ed with the received message.  The tag will be in	host byte  or-
	      der.

       olen   The  olen	field applies to received messages.  It	is used	to in-
	      dicate that a received message has overrun the available	buffer
	      space  and has been truncated.  The olen specifies the amount of
	      data that	did not	fit into the available receive buffer and  was
	      discarded.

       err    This  err	code is	a positive fabric errno	associated with	a com-
	      pletion.	The err	value indicates	the general reason for an  er-
	      ror, if one occurred.  See fi_errno.3 for	a list of possible er-
	      ror codes.

       prov_errno
	      On  an  error,  prov_errno may contain a provider	specific error
              code.  The use of this field and its meaning are provider
              specific.  It is intended to be used as a debugging aid.  See
	      fi_cq_strerror  for  additional details on converting this error
	      value into a human readable string.

       err_data
	      The err_data field is used to return provider specific  informa-
	      tion,  if	available, about the error.  On	input, err_data	should
	      reference	a data buffer of size err_data_size.  On  output,  the
              provider will fill in this buffer with any provider specific data
              which may help identify the cause of the error.  The contents
              of the err_data field and their meaning are provider specific.  It
	      is intended to be	used as	a debugging aid.   See	fi_cq_strerror
	      for  additional details on converting this error data into a hu-
	      man readable string.  See	the compatibility note	below  on  how
	      this field is used for older libfabric releases.

       err_data_size
	      On  input,  err_data_size	 indicates  the	 size  of the err_data
	      buffer in	bytes.	On output, err_data_size will be  set  to  the
	      number of	bytes copied to	the err_data buffer.  The err_data in-
	      formation	 is  typically used with fi_cq_strerror	to provide de-
	      tails about the type of error that occurred.

       For compatibility purposes, the behavior of the err_data and
       err_data_size fields may differ from that listed above.  If
       err_data_size is 0 on input, or the fabric was opened with release < 1.5,
       then  any  buffer  referenced by	err_data will be ignored on input.  In
       this situation, on output err_data will be set to a data	 buffer	 owned
       by  the provider.  The contents of the buffer will remain valid until a
       subsequent read call against the	CQ.  Applications must	serialize  ac-
       cess  to	the CQ when processing errors to ensure	that the buffer	refer-
       enced by	err_data does not change.
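
       The input side of the err_data contract can be sketched as below,
       assuming the fabric was opened with release 1.5 or later.  The struct
       is a local copy of the fi_cq_err_entry listing on this page, and
       prepare_err_entry is a hypothetical helper, not a libfabric call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the structure shown on this page. */
struct fi_cq_err_entry {
    void    *op_context;
    uint64_t flags;
    size_t   len;
    void    *buf;
    uint64_t data;
    uint64_t tag;
    size_t   olen;
    int      err;
    int      prov_errno;
    void    *err_data;
    size_t   err_data_size;
};

/* Hypothetical helper: clear the entry and attach a caller-owned buffer
 * that the provider may fill with provider specific error data. */
void prepare_err_entry(struct fi_cq_err_entry *entry,
                       void *scratch, size_t scratch_size)
{
    assert(entry != NULL);
    memset(entry, 0, sizeof *entry);
    entry->err_data = scratch;            /* input: where to copy data */
    entry->err_data_size = scratch_size;  /* input: buffer size in bytes */
}
```

       The prepared entry would then be passed to fi_cq_readerr.  On return,
       err_data_size holds the number of bytes copied, and err_data can be
       given to fi_cq_strerror.  Passing a size of 0 selects the
       compatibility behavior described above, in which err_data points at a
       provider-owned buffer.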

COMPLETION FLAGS
       Completion flags	provide	additional details regarding the completed op-
       eration.	 The following completion flags	are defined.

       FI_SEND
	      Indicates	that the completion was	for a  send  operation.	  This
	      flag may be combined with	an FI_MSG or FI_TAGGED flag.

       FI_RECV
	      Indicates	that the completion was	for a receive operation.  This
	      flag may be combined with	an FI_MSG or FI_TAGGED flag.

       FI_RMA Indicates	 that  an  RMA	operation completed.  This flag	may be
              combined with an FI_READ, FI_WRITE, FI_REMOTE_READ, or
              FI_REMOTE_WRITE flag.

       FI_ATOMIC
	      Indicates	 that an atomic	operation completed.  This flag	may be
              combined with an FI_READ, FI_WRITE, FI_REMOTE_READ, or
              FI_REMOTE_WRITE flag.

       FI_MSG Indicates	 that  a message-based operation completed.  This flag
	      may be combined with an FI_SEND or FI_RECV flag.

       FI_TAGGED
	      Indicates	that a tagged message operation	completed.  This  flag
	      may be combined with an FI_SEND or FI_RECV flag.

       FI_MULTICAST
	      Indicates	 that  a multicast operation completed.	 This flag may
	      be combined with FI_MSG and relevant flags.  This	flag  is  only
	      guaranteed to be valid for received messages if the endpoint has
	      been configured with FI_SOURCE.

       FI_READ
	      Indicates	 that a	locally	initiated RMA or atomic	read operation
	      has completed.  This flag	may be	combined  with	an  FI_RMA  or
	      FI_ATOMIC	flag.

       FI_WRITE
	      Indicates	that a locally initiated RMA or	atomic write operation
	      has  completed.	This  flag  may	 be combined with an FI_RMA or
	      FI_ATOMIC	flag.

       FI_REMOTE_READ
	      Indicates	that a remotely	initiated RMA or atomic	read operation
	      has completed.  This flag	may be	combined  with	an  FI_RMA  or
	      FI_ATOMIC	flag.

       FI_REMOTE_WRITE
	      Indicates	 that  a remotely initiated RMA	or atomic write	opera-
	      tion has completed.  This	flag may be combined with an FI_RMA or
	      FI_ATOMIC	flag.

       FI_REMOTE_CQ_DATA
	      This indicates that remote CQ data is available as part  of  the
	      completion.

       FI_MULTI_RECV
	      This  flag  applies to receive buffers that were posted with the
	      FI_MULTI_RECV flag set.  This completion flag indicates that the
	      original receive buffer referenced by the	 completion  has  been
	      consumed	and  was  released by the provider.  Providers may set
	      this flag	on the last message that is received into  the	multi-
	      recv  buffer,  or	 may generate a	separate completion that indi-
	      cates that the buffer has	been released.

       Applications can	distinguish between these two cases by	examining  the
       completion  entry  flags	 field.	 If additional flags, such as FI_RECV,
       are set,	the completion is associated with a received message.  In this
       case, the buf field will	reference the location where the received mes-
       sage was	placed into the	multi-recv buffer.  Other fields in  the  com-
       pletion	entry  will  be	 determined based on the received message.  If
       other flag bits are zero, the provider is reporting that	the multi-recv
       buffer has been released, and the completion entry  is  not  associated
       with a received message.

       FI_MORE
	      See  the	`Buffered  Receives' section in	fi_msg(3) for more de-
	      tails.  This flag	is associated with receive completions on end-
	      points that have FI_BUFFERED_RECV	mode  enabled.	 When  set  to
	      one,  it	indicates that the buffer referenced by	the completion
	      is limited by the	FI_OPT_BUFFERED_LIMIT threshold, and addition-
	      al message data must be retrieved	by the	application  using  an
	      FI_CLAIM operation.

       FI_CLAIM
	      See  the	`Buffered  Receives' section in	fi_msg(3) for more de-
	      tails.  This flag	is set on completions associated with  receive
	      operations  that	claim  buffered	 receive data.	Note that this
	      flag   only   applies   to   endpoints   configured   with   the
	      FI_BUFFERED_RECV mode bit.
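
       The two cases described under FI_MULTI_RECV above can be told apart
       from the flags field alone.  A minimal sketch follows; the bit values
       given for FI_RECV and FI_MULTI_RECV are stand-ins assumed for
       illustration (real code takes them from <rdma/fabric.h>), and the
       enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag bits (assumed values); real code uses <rdma/fabric.h>. */
#define FI_RECV       (1ULL << 10)
#define FI_MULTI_RECV (1ULL << 16)

enum multi_recv_event {
    MR_MESSAGE,              /* a message landed; buffer still in use */
    MR_MESSAGE_AND_RELEASE,  /* last message; buffer was released too */
    MR_RELEASE_ONLY          /* separate "buffer released" completion */
};

/* Interpret the flags of a completion for a multi-receive buffer. */
enum multi_recv_event classify_multi_recv(uint64_t flags)
{
    if (!(flags & FI_MULTI_RECV))
        return MR_MESSAGE;
    /* Any other bit (such as FI_RECV) ties the entry to a message. */
    if (flags & ~FI_MULTI_RECV)
        return MR_MESSAGE_AND_RELEASE;
    return MR_RELEASE_ONLY;
}
```

       Only in the first two cases does the buf field reference received
       data.  In the last two cases the application may repost or free the
       multi-receive buffer.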

COMPLETION EVENT SEMANTICS
       Libfabric defines several completion `levels', identified using
       operational flags.  Each flag indicates the earliest point at which a
       completion event may be generated by a provider, and the assumptions
       that an application may make upon processing a completion.  The
       operational flags are defined below, along with an example of how a
       provider might implement the semantic.  Note that providers are only
       required to meet the semantic, not to follow the example
       implementation.  Providers may implement stronger completion semantics
       than necessary for a given operation, but only the behavior defined by
       the completion level is guaranteed.

       To  help	 understand  the  conceptual differences in completion levels,
       consider	mailing	a letter.  Placing the letter into the	local  mailbox
       for  pick-up is similar to `inject complete'.  Having the letter	picked
       up and dropped off at the destination mailbox is	equivalent to  `trans-
       mit  complete'.	The `delivery complete'	semantic is a stronger guaran-
       tee, with a person at the destination signing for the letter.  However,
       the person who signed for the letter is not  necessarily	 the  intended
       recipient.   The	 `match	 complete'  option is similar to delivery com-
       plete, but requires the intended	recipient to sign for the letter.

       The `commit complete' level has different semantics than	the previously
       mentioned levels.  Commit complete would	be closer to the letter	arriv-
       ing at the destination and being	placed into a fire proof safe.

       The operational flags for the described completion levels  are  defined
       below.

       FI_INJECT_COMPLETE
	      Indicates	 that a	completion should be generated when the	source
	      buffer(s)	may be	reused.	  A  completion	 guarantees  that  the
	      buffers  will not	be read	from again and the application may re-
	      claim them.  No other guarantees are made	with  respect  to  the
	      state of the operation.

       Example:	 A  provider  may generate this	completion event after copying
       the source buffer into a	network	buffer,	either in host	memory	or  on
       the NIC.	 An inject completion does not indicate	that the data has been
       transmitted  onto  the network, and a local error could occur after the
       completion event	has been generated that	could prevent  it  from	 being
       transmitted.

       Inject  complete	 allows	 for  the  fastest  completion reporting (and,
       hence, buffer reuse), but provides the weakest guarantees against  net-
       work errors.

       Note:  This flag	is used	to control when	a completion entry is inserted
       into a completion queue.	 It does not apply to operations that  do  not
       generate	a completion queue entry, such as the fi_inject	operation, and
       is not subject to the inject_size message limit restriction.

       FI_TRANSMIT_COMPLETE
	      Indicates	 that a	completion should be generated when the	trans-
	      mit operation has	completed relative to the local	provider.  The
	      exact behavior is	dependent on the endpoint type.

       For reliable endpoints:

       Indicates that a	completion should be generated when the	operation  has
       been  delivered to the peer endpoint.  A	completion guarantees that the
       operation is no longer dependent	on the fabric or local resources.  The
       state of	the operation at the peer endpoint is not defined.

       Example:	A provider may generate	a transmit complete event upon receiv-
       ing an ack from the peer	endpoint.  The state of	 the  message  at  the
       peer  is	 unknown and may be buffered in	the target NIC at the time the
       ack has been generated.

       For unreliable endpoints:

       Indicates that a	completion should be generated when the	operation  has
       been  delivered to the fabric.  A completion guarantees that the	opera-
       tion is no longer dependent on local resources.	The state of the oper-
       ation within the	fabric is not defined.

       FI_DELIVERY_COMPLETE
	      Indicates	that a completion should not be	generated until	an op-
	      eration has been processed by the	 destination  endpoint(s).   A
	      completion guarantees that the result of the operation is	avail-
	      able; however, additional	steps may need to be taken at the des-
	      tination	to  retrieve the results.  For example,	an application
	      may need to provide receive buffers in order to retrieve messages
	      that were buffered by the provider.

       Delivery	 complete indicates that the message has been processed	by the
       peer.  If an application	buffer was ready to receive the	results	of the
       message when it arrived,	then delivery complete indicates that the data
       was placed into the application's buffer.

       This completion mode applies only to reliable  endpoints.   For	opera-
       tions  that  return  data  to  the initiator, such as RMA read or atom-
       ic-fetch, the source endpoint is	also  considered  a  destination  end-
       point.  This is the default completion mode for such operations.

       FI_MATCH_COMPLETE
	      Indicates	 that  a completion should be generated	only after the
	      operation	has been matched with an application specified buffer.
	      Operations using this completion semantic	are dependent  on  the
	      application at the target	claiming the message or	results.  As a
	      result, match complete may involve additional provider level ac-
	      knowledgements or	lengthy	delays.	 However, this completion mod-
	      el  enables  peer	 applications  to synchronize their execution.
	      Many providers may not support this semantic.

       FI_COMMIT_COMPLETE
	      Indicates that a completion should not be generated (locally or
	      at the peer) until the result of an operation has been made
	      persistent.  A completion guarantees that the result is both
	      available and durable in the event of a power failure.

       This  completion	mode applies only to operations	that target persistent
       memory regions over reliable endpoints.	This completion	mode is	exper-
       imental.

       FI_FENCE
	      This is not a completion level, but plays	a role in the  comple-
	      tion  ordering between operations	that would not normally	be or-
	      dered.  An operation that	is marked with the FI_FENCE  flag  and
	      all  operations  posted  after the fenced	operation are deferred
	      until all	previous operations targeting the same	peer  endpoint
	      have  completed.	Additionally, the completion of	the fenced op-
	      eration indicates	that prior operations have met the  same  com-
	      pletion level as the fenced operation.  For example, if an oper-
	      ation  is	 posted	 as  FI_DELIVERY_COMPLETE | FI_FENCE, then its
	      completion indicates prior operations have met the semantic  re-
	      quired for FI_DELIVERY_COMPLETE.	This is	true even if the prior
	      operation	 was  posted  with  a  lower completion	level, such as
	      FI_TRANSMIT_COMPLETE or FI_INJECT_COMPLETE.

       Note that a completion generated	for an operation posted	prior  to  the
       fenced  operation  only	guarantees  that the completion	level that was
       originally requested has	been met.  It is the completion	of the	fenced
       operation that guarantees that the additional semantics have been met.
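
       The fence rule above can be modeled as a small sketch.  The enum and
       function names below are hypothetical stand-ins defined only for
       illustration; they are not libfabric identifiers or flag values, and
       the model assumes only the three levels that the text explicitly
       orders (inject, transmit, delivery).

```c
/* Local stand-ins for illustration only; these are NOT libfabric's
 * FI_* flag values. Ordered from weakest to strongest guarantee. */
enum comp_level {
    LEVEL_INJECT = 0,   /* models FI_INJECT_COMPLETE */
    LEVEL_TRANSMIT = 1, /* models FI_TRANSMIT_COMPLETE */
    LEVEL_DELIVERY = 2  /* models FI_DELIVERY_COMPLETE */
};

/* Level guaranteed for a prior operation once its OWN completion has
 * been read: only the level that was originally requested. */
enum comp_level level_after_own_completion(enum comp_level requested)
{
    return requested;
}

/* Level guaranteed for a prior operation once the completion of a later
 * FENCED operation has been read: at least the fenced operation's level. */
enum comp_level level_after_fence_completion(enum comp_level prior,
                                             enum comp_level fenced)
{
    return prior > fenced ? prior : fenced;
}
```

       Under this model, a write posted with inject-complete semantics and
       followed by a fenced delivery-complete operation is only known to
       have reached delivery-complete once the fenced operation's completion
       has been read.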

       The above completion semantics are defined with respect to the
       initiator of the operation.  The different semantics are useful for
       describing when the initiator may re-use a data buffer, and for
       guaranteeing what state a transfer must reach before a completion is
       generated.  This allows applications to determine appropriate error
       handling in case of communication failures.

TARGET COMPLETION SEMANTICS
       The completion semantic at the target is	used to	determine when data at
       the target is visible to	the peer  application.	 Visibility  indicates
       that  a	memory	read to	the same address that was the target of	a data
       transfer	will return the	results	of the	transfer.   The	 target	 of  a
       transfer	can be identified by the initiator, as may be the case for RMA
       and atomic operations, or determined by the target, for example by pro-
       viding a	matching receive buffer.  Global visibility indicates that the
       results	are  available regardless of where the memory read originates.
       For example, the	read could come	from a process running on a host  CPU,
       it may be accessed by subsequent	data transfer over the fabric, or read
       from a peer device such as a GPU.

       In terms	of completion semantics, visibility usually indicates that the
       transfer	 meets the FI_DELIVERY_COMPLETE	requirements from the perspec-
       tive of the target.  The	target completion semantic may be, but is  not
       necessarily,  linked with the completion	semantic specified by the ini-
       tiator of the transfer.

       Often, target processes do not explicitly state	a  desired  completion
       semantic	 and instead rely on the default semantic.  The	default	behav-
       ior is based on several factors,	including:

        whether a completion event is generated at the target

        the type of transfer involved (e.g. msg vs RMA)

        endpoint data and message ordering guarantees

        properties of the targeted memory buffer

        the initiator's specified completion semantic

       Broadly,	target completion semantics are	grouped	based  on  whether  or
       not  the	transfer generates a completion	event at the target.  This in-
       cludes writing a	CQ entry or updating a completion counter.  In	common
       use cases, transfers that use a message interface (FI_MSG or FI_TAGGED)
       typically  generate target events, while	transfers involving an RMA in-
       terface (FI_RMA or FI_ATOMIC) often do not.  There  are	exceptions  to
       both  these cases, depending on endpoint	to CQ and counter bindings and
       operational flags.  For example,	RMA writes that	carry remote  CQ  data
       will generate a completion event	at the target, and are frequently used
       to convey visibility to the target application.	The general guidelines
       for  target  side semantics are described below,	followed by exceptions
       that modify that	behavior.

       By default, completions generated  at  the  target  indicate  that  the
       transferred  data  is  immediately available to be read from the	target
       buffer.	That is, the target sees FI_DELIVERY_COMPLETE (or better)  se-
       mantics,	even if	the initiator requested	lower semantics.  For applica-
       tions using only	data buffers allocated from host memory, this is often
       sufficient.

       For  operations	that do	not generate a completion event	at the target,
       the visibility of the data at the target	may need to be inferred	 based
       on subsequent operations	that do	generate target	completions.  Absent a
       target  completion, when	a completion of	an operation is	written	at the
       initiator, the visibility semantic  of  the  operation  at  the	target
       aligns with the initiator completion semantic.  For instance, if	an RMA
       operation  completes  at	 the initiator as either FI_INJECT_COMPLETE or
       FI_TRANSMIT_COMPLETE, the data visibility at the	target is not  guaran-
       teed.

       One or more of the following mechanisms can be used by the target
       process to guarantee that the results of a data transfer that did not
       generate a completion at the target are now visible.  This list is not
       exhaustive, but defines common uses.  In the descriptions below, the
       first transfer does not result in a completion event at the target,
       but is eventually followed by a transfer which does.

        If  the  endpoint  guarantees message ordering	between	two transfers,
	 the target completion of a second transfer will indicate that the da-
	 ta from the first transfer is available.  For example,	 if  the  end-
	 point	supports  send after write ordering (FI_ORDER_SAW), then a re-
	 ceive completion corresponding	to the send  will  indicate  that  the
	 write	data  is available.  This holds	independent of the initiator's
	 completion semantic for either	the write or send.  When  ordering  is
	 guaranteed, the second	transfer can be	queued with the	provider imme-
	 diately after queuing the first.

        If the endpoint does not guarantee message ordering, the initiator
	 must take additional steps to ensure visibility.  If the initiator
	 requests FI_DELIVERY_COMPLETE semantics for the first operation, the
	 initiator can wait for the operation to complete locally.  Once the
	 completion has been read, the target completion of a second transfer
	 will indicate that the first transfer's data is visible.

        Alternatively, if message ordering is not guaranteed by the endpoint,
	 the initiator can use the FI_FENCE and FI_DELIVERY_COMPLETE flags on
	 the second data transfer to force the first transfer to meet the
	 FI_DELIVERY_COMPLETE semantics.  If the second transfer generates a
	 completion at the target, that will indicate that the data is
	 visible.  Otherwise, a target completion for any transfer after the
	 fenced operation will indicate that the data is visible.
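
       The three mechanisms above can be summarized as a decision sketch.
       The structure and function below are hypothetical and exist only to
       restate the rules; they are not part of libfabric.  The sketch assumes
       traditional host memory buffers and that the second transfer (or a
       later one) generates a completion at the target.

```c
#include <stdbool.h>

/* Hypothetical description of two transfers to the same target.  The
 * first generates no target completion; the second one does. */
struct transfer_pair {
    bool ordering_guaranteed;     /* endpoint orders the two transfers,
                                     e.g. send-after-write ordering */
    bool first_delivery_complete; /* first transfer requested
                                     delivery-complete semantics */
    bool waited_for_first;        /* initiator read the first transfer's
                                     completion before posting the second */
    bool second_fenced_delivery;  /* second transfer posted with both the
                                     fence and delivery-complete flags */
};

/* Returns true if the target completion of the second transfer implies
 * that the first transfer's data is visible at the target. */
bool first_visible_on_second_completion(const struct transfer_pair *p)
{
    if (p->ordering_guaranteed)
        return true;              /* ordering alone is sufficient */
    if (p->first_delivery_complete && p->waited_for_first)
        return true;              /* initiator serialized the transfers */
    if (p->second_fenced_delivery)
        return true;              /* fence raises the first transfer */
    return false;                 /* no mechanism applies */
}
```

       Requesting delivery-complete semantics on the first transfer is not
       enough on its own in this model: without ordering or a fence, the
       initiator must also wait for that completion before posting the
       second transfer.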

       The above semantics apply for transfers targeting traditional host
       memory buffers.  However, the behavior may differ when device memory
       and/or persistent memory is involved (FI_HMEM and FI_PMEM capability
       bits).  When heterogeneous memory is involved, the concept of memory
       domains comes into play.  Memory domains identify the physical
       separation of memory, which may or may not be accessible through the
       same virtual address space.  See the fi_mr(3) man page for further
       details on memory domains.

       Completion ordering and data visibility are only well-defined for
       transfers that target the same memory domain.  Applications need to be
       aware of ordering and visibility differences when transfers target
       different memory domains.  Additionally, applications need to be
       concerned with the memory domain to which completions themselves are
       written, and whether it differs from the memory domain targeted by a
       transfer.  In some situations, either the provider or application may
       need to call device-specific APIs to synchronize or flush device
       memory caches in order to achieve the desired data visibility.

       When heterogeneous memory is in use, the default target completion
       semantic for transfers that generate a completion at the target is
       still FI_DELIVERY_COMPLETE.  However, applications should be aware
       that meeting this requirement may have a negative impact on overall
       provider performance.

       For example, a target process may be using a GPU	to accelerate computa-
       tions.	A memory region	mapping	to memory on the GPU may be exposed to
       peers as	either an RMA target or	posted locally as  a  receive  buffer.
       In  this	 case,	the application	is concerned with two memory domains -
       system and GPU memory.  Completions are written to system memory.

       Continuing the example, a peer process sends a tagged message.  That
       message is matched with the receive buffer located in GPU memory.  The
       NIC copies the data from the network into the receive buffer and writes
       an entry into the completion queue.  Note that both memory domains were
       accessed as part of this transfer.  The message data was directed to
       the GPU memory, but the completion went to host memory.  Because
       separate memory domains may not be synchronized with each other, it is
       possible for the host CPU to see and process the completion entry
       before the transfer to the GPU memory is visible to either the host
       CPU or even software running on the GPU.  From the perspective of the
       provider, visibility of the completion does not imply visibility of
       data written to the GPU's memory domain.

       The default completion semantic at the target application for message
       operations is FI_DELIVERY_COMPLETE.  An anticipated provider
       implementation in this situation is for the provider software running
       on the host CPU to intercept the CQ entry, detect that the data landed
       in heterogeneous memory, and perform the necessary device
       synchronization or flush operation before reporting the completion up
       to the application.  This ensures that the data is visible to CPU and
       GPU software prior to the application processing the completion.

       In addition to the cost of provider software intercepting completions
       and checking if a transfer targeted heterogeneous memory, device
       synchronization itself may impact performance.  As a result,
       applications can request a lower completion semantic when posting
       receives.  That indicates to the provider that the application will be
       responsible for handling any device-specific flush operations that
       might be needed.  See fi_msg(3) FLAGS.

       For  data  transfers  that  do not generate a completion	at the target,
       such as RMA or atomics, it is the responsibility	of the application  to
       ensure  that  all target	buffers	meet the necessary visibility require-
       ments of	the application.  The previously  mentioned  bulleted  methods
       for  notifying  the  target  that the data is visible may not be	suffi-
       cient, as the provider software at the target could  lack  the  context
       needed  to  ensure  visibility.	 This implies that the application may
       need to call device synchronization/flush APIs directly.

       For example, a peer application could perform several RMA  writes  that
       target GPU memory buffers.  If the provider offloads RMA	operations in-
       to  the	NIC,  the provider software at the target will be unaware that
       the RMA operations have occurred.  If the peer sends a message  to  the
       target application that indicates that the RMA operations are done, the
       application must	ensure that the	RMA data is visible to the host	CPU or
       GPU prior to executing code that	accesses the data.  The	target comple-
       tion  of	 having	 received  the sent message is not sufficient, even if
       send-after-write	ordering is supported.

       Most target heterogeneous memory completion semantics map to
       FI_TRANSMIT_COMPLETE or FI_DELIVERY_COMPLETE.  Persistent memory
       (FI_PMEM capability), however, is often used with FI_COMMIT_COMPLETE
       semantics.  Heterogeneous completion concepts still apply.

       For transfers flagged by	the initiator with FI_COMMIT_COMPLETE, a  com-
       pletion	at  the	 target	 indicates  that  the  results are visible and
       durable.	 For transfers targeting persistent memory, but	using  a  dif-
       ferent completion semantic at the initiator, the	visibility at the tar-
       get  is similar to that described above.	 Durability is only associated
       with transfers marked with FI_COMMIT_COMPLETE.

       For transfers targeting persistent memory that request
       FI_DELIVERY_COMPLETE, a completion, at either the initiator or target,
       indicates that the data is visible.  Visibility at the target can be
       conveyed using one of the mechanisms described above, such as
       generating a target completion or sending a message from the
       initiator.  Similarly, if the initiator requested FI_TRANSMIT_COMPLETE,
       then additional steps are needed to ensure visibility at the target.
       For example, the transfer can generate a completion at the target,
       which would indicate visibility, but not durability.  The initiator can
       also follow the transfer with another operation that forces
       visibility, such as using FI_FENCE in conjunction with
       FI_DELIVERY_COMPLETE.

NOTES
       A  completion  queue must be bound to at	least one enabled endpoint be-
       fore any	operation such	as  fi_cq_read,	 fi_cq_readfrom,  fi_cq_sread,
       fi_cq_sreadfrom etc.  can be called on it.

       Completion flags	may be suppressed if the FI_NOTIFY_FLAGS_ONLY mode bit
       has been	set.  When enabled, only the following flags are guaranteed to
       be  set	in  completion	data  when  they are valid: FI_REMOTE_READ and
       FI_REMOTE_WRITE (when FI_RMA_EVENT capability bit has been set),	FI_RE-
       MOTE_CQ_DATA, and FI_MULTI_RECV.

       If a completion queue has been overrun,	it  will  be  placed  into  an
       `overrun'  state.   Read	 operations will continue to return any	valid,
       non-corrupted completions, if available.	 After all  valid  completions
       have  been  retrieved, any attempt to read the CQ will result in	it re-
       turning an FI_EOVERRUN error event.  Overrun completion queues are con-
       sidered fatal and may not be used to report additional completions once
       the overrun occurs.

RETURN VALUES
       fi_cq_open / fi_cq_signal
	      Returns 0	on success.  On	error, a negative value	 corresponding
	      to fabric	errno is returned.

       fi_cq_read / fi_cq_readfrom / fi_cq_readerr / fi_cq_sread /
       fi_cq_sreadfrom
	      On success, returns the number of completion events retrieved
	      from the completion queue.  On error, a negative value
	      corresponding to fabric errno is returned.  If no completions
	      are available to return from the CQ, -FI_EAGAIN will be
	      returned.

       fi_cq_sread / fi_cq_sreadfrom
	      On success, returns the number of	 completion  events  retrieved
	      from  the	 completion  queue.  On	error, a negative value	corre-
	      sponding to fabric errno is returned.  If	the timeout expires or
	      the calling thread is signaled and no data is  available	to  be
	      read from	the completion queue, -FI_EAGAIN is returned.

       fi_cq_strerror
	      Returns  a  character string interpretation of the provider spe-
	      cific error returned with	a completion.

       Fabric errno values are defined in rdma/fi_errno.h.
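
       The return convention of the read calls can be sketched as a small
       classifier.  The enum and function below are hypothetical helpers, not
       libfabric API, and the errno value is passed in as a parameter so the
       block compiles without libfabric headers; real code would compare the
       return value against -FI_EAGAIN from rdma/fi_errno.h.

```c
#include <sys/types.h> /* ssize_t */

/* Hypothetical outcome categories for a CQ read call. */
enum cq_read_outcome {
    CQ_GOT_ENTRIES, /* ret entries were written to the caller's buffer */
    CQ_EMPTY,       /* nothing to read; for the blocking calls, the
                       timeout expired or the thread was signaled */
    CQ_FAILED       /* any other negative fabric errno */
};

/* Classify the ssize_t returned by a CQ read call.  'eagain' is the
 * positive errno value that the caller treats as "try again". */
enum cq_read_outcome classify_cq_read(ssize_t ret, int eagain)
{
    if (ret > 0)
        return CQ_GOT_ENTRIES;
    if (ret == 0 || ret == -(ssize_t)eagain)
        return CQ_EMPTY;
    return CQ_FAILED;
}
```

       A polling loop would typically retry on the empty outcome, process
       the returned entries on success, and stop to inspect the error on any
       other negative value.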

SEE ALSO
       fi_getinfo(3),  fi_endpoint(3),	fi_domain(3),  fi_eq(3),   fi_cntr(3),
       fi_poll(3)

AUTHORS
       OpenFabrics.

Libfabric Programmer's Manual	  2022-01-28			      fi_cq(3)
