fi_endpoint(3)		       Libfabric v1.15.1		fi_endpoint(3)

NAME
       fi_endpoint - Fabric endpoint operations

       fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
	      Allocate or close	an endpoint.

       fi_ep_bind
	      Associate	 an  endpoint  with  hardware resources, such as event
	      queues, completion queues, counters, address vectors, or	shared
	      transmit/receive contexts.

       fi_scalable_ep_bind
	      Associate	a scalable endpoint with an address vector

       fi_pep_bind
	      Associate	a passive endpoint with	an event queue

       fi_enable
	      Transitions an active endpoint into an enabled state.

       fi_cancel
	      Cancel a pending asynchronous data transfer

       fi_ep_alias
	      Create an	alias to the endpoint

       fi_control
	      Control endpoint operation.

       fi_getopt / fi_setopt
	      Get or set endpoint options.

       fi_rx_context / fi_tx_context / fi_srx_context /	fi_stx_context
	      Open a transmit or receive context.

       fi_tc_dscp_set /	fi_tc_dscp_get
	      Convert between a	DSCP value and a network traffic class

       fi_rx_size_left / fi_tx_size_left (DEPRECATED)
	      Query the lower bound on how many RX/TX operations may be posted
	      without an operation returning -FI_EAGAIN.  These functions have
	      been deprecated and will be removed in a future version of the
	      library.

SYNOPSIS
	      #include <rdma/fabric.h>

	      #include <rdma/fi_endpoint.h>

	      int fi_endpoint(struct fid_domain	*domain, struct	fi_info	*info,
		  struct fid_ep	**ep, void *context);

	      int fi_scalable_ep(struct	fid_domain *domain, struct fi_info *info,
		  struct fid_ep	**sep, void *context);

	      int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
		  struct fid_pep **pep,	void *context);

	      int fi_tx_context(struct fid_ep *sep, int	index,
		  struct fi_tx_attr *attr, struct fid_ep **tx_ep,
		  void *context);

	      int fi_rx_context(struct fid_ep *sep, int	index,
		  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
		  void *context);

	      int fi_stx_context(struct	fid_domain *domain,
		  struct fi_tx_attr *attr, struct fid_stx **stx,
		  void *context);

	      int fi_srx_context(struct	fid_domain *domain,
		  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
		  void *context);

	      int fi_close(struct fid *ep);

	      int fi_ep_bind(struct fid_ep *ep,	struct fid *fid, uint64_t flags);

	      int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);

	      int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

	      int fi_enable(struct fid_ep *ep);

	      int fi_cancel(struct fid_ep *ep, void *context);

	      int fi_ep_alias(struct fid_ep *ep, struct	fid_ep **alias_ep, uint64_t flags);

	      int fi_control(struct fid	*ep, int command, void *arg);

	      int fi_getopt(struct fid *ep, int	level, int optname,
		  void *optval,	size_t *optlen);

	      int fi_setopt(struct fid *ep, int	level, int optname,
		  const	void *optval, size_t optlen);

	      uint32_t fi_tc_dscp_set(uint8_t dscp);

	      uint8_t fi_tc_dscp_get(uint32_t tclass);

	      DEPRECATED ssize_t fi_rx_size_left(struct	fid_ep *ep);

	      DEPRECATED ssize_t fi_tx_size_left(struct	fid_ep *ep);

ARGUMENTS
       fid    On  creation,  specifies	a  fabric  or access domain.  On bind,
	      identifies the event queue, completion queue,  counter,  or  ad-
	      dress  vector  to	 bind to the endpoint.	In other cases,	it's a
	      fabric identifier	of an associated resource.

       info   Details about the	fabric interface endpoint to  be  opened,  ob-
	      tained from fi_getinfo.

       ep     A	fabric endpoint.

       sep    A	scalable fabric	endpoint.

       pep    A	passive	fabric endpoint.

       context
	      Context associated with the endpoint or asynchronous operation.

       index  Index to retrieve	a specific transmit/receive context.

       attr   Transmit or receive context attributes.

       flags  Additional flags to apply	to the operation.

       command
	      Command of control operation to perform on endpoint.

       arg    Optional control argument.

       level  Protocol level at	which the desired option resides.

       optname
	      The protocol option to read or set.

       optval The option value that was	read or	to set.

       optlen The size of the optval buffer.

DESCRIPTION
       Endpoints  are  transport  level	 communication portals.	 There are two
       types of	endpoints: active and passive.	Passive	endpoints belong to  a
       fabric domain and are most often	used to	listen for incoming connection
       requests.   However, a passive endpoint may be used to reserve a	fabric
       address that can	be granted to an active	 endpoint.   Active  endpoints
       belong to access	domains	and can	perform	data transfers.

       Active  endpoints may be	connection-oriented or connectionless, and may
       provide data reliability.  The  data  transfer  interfaces  -  messages
       (fi_msg),  tagged  messages  (fi_tagged),  RMA  (fi_rma),  and  atomics
       (fi_atomic) - are associated with active	endpoints.  In basic  configu-
       rations,	an active endpoint has transmit	and receive queues.  In	gener-
       al,  operations	that  generate traffic on the fabric are posted	to the
       transmit	queue.	This includes all RMA  and  atomic  operations,	 along
       with  sent  messages  and  sent	tagged messages.  Operations that post
       buffers for receiving incoming data are submitted to the	receive	queue.

       Active endpoints	are created in the disabled state.  They must  transi-
       tion  into  an enabled state before accepting data transfer operations,
       including posting of receive buffers.  The fi_enable call  is  used  to
       transition an active endpoint into an enabled state.  The fi_connect
       and fi_accept calls will also transition an endpoint into the enabled
       state, if it is not already enabled.

       In  order  to  transition an endpoint into an enabled state, it must be
       bound to	one or more fabric resources.  An endpoint that	will  generate
       asynchronous  completions,  either  through data	transfer operations or
       communication establishment events, must	be bound  to  the  appropriate
       completion  queues or event queues, respectively, before	being enabled.
       Additionally, endpoints that use	manual	progress  must	be  associated
       with  relevant  completion  queues  or  event  queues in	order to drive
       progress.  For endpoints	that are only used as the  target  of  RMA  or
       atomic  operations,  this  means	 binding  the endpoint to a completion
       queue associated	with  receive  processing.   Connectionless  endpoints
       must be bound to	an address vector.

       Once  an	 endpoint has been activated, it may be	associated with	an ad-
       dress vector.  Receive buffers may be posted to it  and	calls  may  be
       made  to	 connection  establishment routines.  Connectionless endpoints
       may also	perform	data transfers.
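
       As an illustrative sketch (not part of this page), the allocate, bind,
       and enable sequence described above might look as follows.  The domain,
       info, cq, and av objects are assumed to have been opened earlier with
       fi_domain, fi_getinfo, fi_cq_open, and fi_av_open; error handling is
       abbreviated.

```c
/* Active-endpoint life cycle: allocate, bind, enable. */
struct fid_ep *ep;
int ret;

ret = fi_endpoint(domain, info, &ep, NULL);
if (ret)
        return ret;

/* Direct both transmit and receive completions to one CQ. */
ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
if (ret)
        return ret;

/* Connectionless endpoints must also be bound to an address vector. */
ret = fi_ep_bind(ep, &av->fid, 0);
if (ret)
        return ret;

/* Transition into the enabled state before posting any transfers. */
ret = fi_enable(ep);
```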

       The behavior of an endpoint may be adjusted by setting its control data
       and protocol options.  This allows the underlying provider to  redirect
       function	 calls to implementations optimized to meet the	desired	appli-
       cation behavior.

       If an endpoint experiences a critical error, it	will  transition  back
       into  a disabled	state.	Critical errors	are reported through the event
       queue associated	with the EP.  In certain cases,	 a  disabled  endpoint
       may  be	re-enabled.   The  ability  to transition back into an enabled
       state is	provider specific and depends on the type of  error  that  the
       endpoint	 experienced.	When  an endpoint is disabled as a result of a
       critical	error, all pending operations are discarded.

   fi_endpoint / fi_passive_ep / fi_scalable_ep
       fi_endpoint allocates a new active endpoint.  fi_passive_ep allocates a
       new passive endpoint.  fi_scalable_ep allocates	a  scalable  endpoint.
       The  properties	and  behavior of the endpoint are defined based	on the
       provided	struct fi_info.	 See  fi_getinfo  for  additional  details  on
       fi_info.	  fi_info  flags that control the operation of an endpoint are
       defined below.  See section SCALABLE ENDPOINTS.

       If an active endpoint is	allocated in order to accept a connection  re-
       quest,  the fi_info parameter must be the same as the fi_info structure
       provided	with the connection request (FI_CONNREQ) event.

       An active endpoint may acquire the properties of	a passive endpoint  by
       setting	the  fi_info  handle  field to the passive endpoint fabric de-
       scriptor.  This is useful for applications that	need  to  reserve  the
       fabric  address of an endpoint prior to knowing if the endpoint will be
       used on the active or passive side of a connection.  For	example,  this
       feature is useful for simulating	socket semantics.  Once	an active end-
       point  acquires	the properties of a passive endpoint, the passive end-
       point is	no longer bound	to any fabric resources	and must no longer  be
       used.  The user is expected to close the	passive	endpoint after opening
       the  active  endpoint  in order to free up any lingering	resources that
       had been	used.

   fi_close
       Closes an endpoint and releases all resources associated with it.

       When closing a scalable endpoint, there must be no opened transmit con-
       texts, or receive contexts associated with the scalable	endpoint.   If
       resources are still associated with the scalable	endpoint when attempt-
       ing to close, the call will return -FI_EBUSY.

       Outstanding  operations	posted to the endpoint when fi_close is	called
       will be discarded.  Discarded operations	will silently be dropped, with
       no completions reported.	 Additionally, a provider may  discard	previ-
       ously  completed	 operations  from  the associated completion queue(s).
       The behavior to discard completed operations is provider	specific.

   fi_ep_bind
       fi_ep_bind is used to associate an endpoint with	 other	allocated  re-
       sources,	 such  as  completion queues, counters,	address	vectors, event
       queues, shared contexts,	and memory regions.  The type of objects  that
       must be bound with an endpoint depend on	the endpoint type and its con-
       figuration.

       Passive	endpoints  must	 be  bound with	an EQ that supports connection
       management events.  Connectionless endpoints must be bound to a	single
       address	vector.	  If an	endpoint is using a shared transmit and/or re-
       ceive context, the shared contexts must be bound	to the endpoint.  CQs,
       counters, AV, and shared	contexts must be  bound	 to  endpoints	before
       they are	enabled	either explicitly or implicitly.

       An endpoint must	be bound with CQs capable of reporting completions for
       any  asynchronous operation initiated on	the endpoint.  For example, if
       the endpoint supports any  outbound  transfers  (sends,	RMA,  atomics,
       etc.),  then  it	 must  be  bound to a completion queue that can	report
       transmit	completions.  This is true even	if the endpoint	is  configured
       to  suppress successful completions, in order that operations that com-
       plete in	error may be reported to the user.

       An active endpoint may direct  asynchronous  completions	 to  different
       CQs,  based  on	the  type  of  operation.   This  is  specified	 using
       fi_ep_bind flags.  The following	flags may be OR'ed together when bind-
       ing an endpoint to a completion domain CQ.

       FI_RECV
	      Directs the notification of inbound data transfers to the	speci-
	      fied completion queue.  This includes received  messages.	  This
	      binding automatically includes FI_REMOTE_WRITE, if applicable to
	      the endpoint.

       FI_SELECTIVE_COMPLETION
	      By default, data transfer	operations write CQ completion entries
	      into the associated completion queue after they have successful-
	      ly completed.  Applications can use this bind flag to selective-
	      ly  enable when completions are generated.  If FI_SELECTIVE_COM-
	      PLETION is specified, data transfer operations will not generate
	      CQ entries for successful	completions  unless  FI_COMPLETION  is
	      set  as an operational flag for the given	operation.  Operations
	      that fail	asynchronously will still generate  completions,  even
	      if  a completion is not requested.  FI_SELECTIVE_COMPLETION must
	      be OR'ed with FI_TRANSMIT	and/or FI_RECV flags.

       When FI_SELECTIVE_COMPLETION is set, the	user must determine when a re-
       quest that does NOT have	FI_COMPLETION set  has	completed  indirectly,
       usually	based  on the completion of a subsequent operation or by using
       completion counters.  Use of this flag may improve performance  by  al-
       lowing  the  provider  to avoid writing a CQ completion entry for every
       operation.
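
       A minimal sketch of this binding mode, assuming ep, cq, and a filled-in
       msg structure already exist, with error checks omitted:

```c
/* Suppress successful transmit completions by default; request one
 * explicitly per operation via the FI_COMPLETION operation flag. */
ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);

/* This send generates a CQ entry even on success ... */
ret = fi_sendmsg(ep, &msg, FI_COMPLETION);

/* ... while this one writes a CQ entry only if it fails. */
ret = fi_sendmsg(ep, &msg, 0);
```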

       See Notes section below for additional information on how this flag in-
       teracts with the	FI_CONTEXT and FI_CONTEXT2 mode	bits.

       FI_TRANSMIT
	      Directs the completion of	outbound data transfer requests	to the
	      specified	completion queue.  This	includes  send	message,  RMA,
	      and atomic operations.

       An endpoint may optionally be bound to a	completion counter.  Associat-
       ing  an endpoint	with a counter is in addition to binding the EP	with a
       CQ.  When binding an endpoint to	a counter, the following flags may  be
       specified.

       FI_READ
	      Increments  the  specified  counter whenever an RMA read,	atomic
	      fetch, or	atomic compare operation initiated from	 the  endpoint
	      has completed successfully or in error.

       FI_RECV
	      Increments  the specified	counter	whenever a message is received
	      over the endpoint.  Received messages include  both  tagged  and
	      normal message operations.

       FI_REMOTE_READ
	      Increments  the  specified  counter whenever an RMA read,	atomic
	      fetch, or	atomic compare operation is initiated  from  a	remote
	      endpoint	that targets the given endpoint.  Use of this flag re-
	      quires that the endpoint be created using	FI_RMA_EVENT.

       FI_REMOTE_WRITE
	      Increments the specified counter whenever	an RMA write  or  base
	      atomic  operation	 is initiated from a remote endpoint that tar-
	      gets the given endpoint.	Use of this  flag  requires  that  the
	      endpoint be created using	FI_RMA_EVENT.

       FI_SEND
	      Increments  the  specified  counter  whenever a message transfer
	      initiated	over the endpoint has completed	successfully or	in er-
	      ror.  Sent messages include both tagged and normal message oper-
	      ations.

       FI_WRITE
	      Increments the specified counter whenever	an RMA write  or  base
	      atomic  operation	initiated from the endpoint has	completed suc-
	      cessfully	or in error.

       An endpoint may only be bound to	a single CQ or	counter	 for  a	 given
       type of operation.  For example, an EP may not bind to two counters both
       using  FI_WRITE.	 Furthermore, providers	may limit CQ and counter bind-
       ings to endpoints of the	same endpoint type (DGRAM, MSG,	RDM, etc.).
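
       For example (a sketch; ep and cntr are assumed to exist and error
       handling is omitted), one counter may cover several operation types,
       but no two counters may share a type:

```c
/* Count completed sends and RMA writes with a single counter. */
ret = fi_ep_bind(ep, &cntr->fid, FI_SEND | FI_WRITE);

/* Binding a second counter with FI_WRITE would now be rejected,
 * since FI_WRITE is already mapped to cntr. */
```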

   fi_scalable_ep_bind
       fi_scalable_ep_bind is used to associate	a scalable  endpoint  with  an
       address	vector.	  See  section on SCALABLE ENDPOINTS.  A scalable end-
       point has a single transport level address  and	can  support  multiple
       transmit	and receive contexts.  The transmit and	receive	contexts share
       the  transport-level  address.  Address vectors that are	bound to scal-
       able endpoints are implicitly bound to any transmit or receive contexts
       created using the scalable endpoint.

   fi_enable
       This call transitions the endpoint into an enabled state.  An  endpoint
       must  be	 enabled before	it may be used to perform data transfers.  En-
       abling an endpoint typically results in hardware	 resources  being  as-
       signed  to  it.	 Endpoints  making use of completion queues, counters,
       event queues, and/or address vectors must be bound to them before being
       enabled.

       Calling connect or accept on an endpoint	will implicitly	enable an end-
       point if	it has not already been	enabled.

       fi_enable may also be used to re-enable an endpoint that	has been  dis-
       abled  as  a  result  of	 experiencing  a critical error.  Applications
       should check the	return value from fi_enable to see if a	disabled  end-
       point has successfully been re-enabled.

   fi_cancel
       fi_cancel  attempts  to	cancel	an outstanding asynchronous operation.
       Canceling an operation causes the fabric	provider to search for the op-
       eration and, if it is still pending, complete it	as  having  been  can-
       celed.	An error queue entry will be available in the associated error
       queue with error	code FI_ECANCELED.  On the other hand, if  the	opera-
       tion completed before the call to fi_cancel, then the completion	status
       of that operation will be available in the associated completion	queue.
       No specific entry related to fi_cancel itself will be posted.

       Cancel uses the context parameter associated with an operation to iden-
       tify  the request to cancel.  Operations	posted without a valid context
       parameter - either no context parameter is  specified  or  the  context
       value  was  ignored  by the provider - cannot be	canceled.  If multiple
       outstanding operations match the	context	parameter, only	 one  will  be
       canceled.   In  this  case, the operation which is canceled is provider
       specific.  The cancel operation	is  asynchronous,  but	will  complete
       within a	bounded	period of time.
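
       A sketch of the cancellation flow, following the SYNOPSIS above; ep,
       cq, buf, and len are assumed to exist, and error handling is
       abbreviated:

```c
/* The context pointer passed when posting identifies the request. */
struct fi_context ctx;
struct fi_cq_entry entry;
struct fi_cq_err_entry err;

ret = fi_recv(ep, buf, len, NULL, FI_ADDR_UNSPEC, &ctx);
/* ... later, the application decides to abandon the receive ... */
ret = fi_cancel(ep, &ctx);

/* If the cancel won the race, the request completes on the CQ's
 * error queue with err.err == FI_ECANCELED. */
if (fi_cq_read(cq, &entry, 1) == -FI_EAVAIL)
        fi_cq_readerr(cq, &err, 0);
```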

   fi_ep_alias
       This call creates an alias to the specified endpoint.  Conceptually, an
       endpoint	alias provides an alternate software path from the application
       to the underlying provider hardware.  An	alias EP differs from its par-
       ent  endpoint only by its default data transfer flags.  For example, an
       alias EP	may be configured to use a different completion	mode.  By  de-
       fault,  an alias	EP inherits the	same data transfer flags as the	parent
       endpoint.  An application can use fi_control to modify the alias	EP op-
       erational flags.

       When allocating an alias,  an  application  may	configure  either  the
       transmit	 or receive operational	flags.	This avoids needing a separate
       call to fi_control to set those flags.  The flags passed	to fi_ep_alias
       must include FI_TRANSMIT	or FI_RECV (not	both) with  other  operational
       flags  OR'ed in.	 This will override the	transmit or receive flags, re-
       spectively, for operations posted through the alias endpoint.  All  al-
       located	aliases	 must  be closed for the underlying endpoint to	be re-
       leased.
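
       A sketch of alias creation, with error handling omitted; ep is assumed
       to be an existing active endpoint:

```c
/* Create an alias whose transmit operations default to generating
 * completions, leaving the parent endpoint's flags untouched. */
struct fid_ep *alias_ep;

ret = fi_ep_alias(ep, &alias_ep, FI_TRANSMIT | FI_COMPLETION);

/* ... post operations through alias_ep ... */

/* Every alias must be closed before the parent can be released. */
fi_close(&alias_ep->fid);
```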

   fi_control
       The control operation is	used to	adjust the default behavior of an end-
       point.  It allows the underlying	provider to redirect function calls to
       implementations optimized to meet the desired application behavior.  As
       a result, calls to fi_control must be serialized against all other
       calls to	an endpoint.

       The  base  operation  of	 an endpoint is	selected during	creation using
       struct fi_info.	The following control commands and  arguments  may  be
       assigned	to an endpoint.

       FI_BACKLOG - int *value
	      This  option  only  applies to passive endpoints.	 It is used to
	      set the connection request backlog for listening endpoints.

       FI_GETOPSFLAG - uint64_t *flags
	      Used to retrieve the current value of flags associated with  the
	      data transfer operations initiated on the	endpoint.  The control
	      argument must include FI_TRANSMIT	or FI_RECV (not	both) flags to
	      indicate	the  type  of data transfer flags to be	returned.  See
	      below for	a list of control flags.

       FI_GETWAIT - void **
	      This command allows the user to retrieve the file	descriptor as-
	      sociated with a socket endpoint.	The fi_control	arg  parameter
	      should be an address where a pointer to the returned file
	      descriptor will be written.  See fi_eq(3) for additional
	      details on using fi_control with FI_GETWAIT.  The file
	      descriptor may be used
	      for  notification	 that the endpoint is ready to send or receive
	      data.

       FI_SETOPSFLAG - uint64_t *flags
	      Used to change the data transfer operation flags associated with
	      an endpoint.  The	control	argument must include  FI_TRANSMIT  or
	      FI_RECV  (not  both)  to indicate	the type of data transfer that
	      the flags	should apply to, with other flags OR'ed	in.  The given
	      flags will override the previous transmit	and receive attributes
	      that were	set when the  endpoint	was  created.	Valid  control
	      flags are	defined	below.
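
       A sketch of both commands; ep is assumed to exist and error handling
       is omitted:

```c
/* Retrieve, then replace, the default transmit operation flags.
 * The FI_TRANSMIT bit in the argument selects the direction. */
uint64_t flags = FI_TRANSMIT;

ret = fi_control(&ep->fid, FI_GETOPSFLAG, &flags);

flags = FI_TRANSMIT | FI_COMPLETION;
ret = fi_control(&ep->fid, FI_SETOPSFLAG, &flags);
```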

   fi_getopt / fi_setopt
       Endpoint	 protocol  operations  may be retrieved	using fi_getopt	or set
       using fi_setopt.  Applications specify the level at which a desired
       option exists, identify the option, and provide input/output buffers to
       set  the	 option.   fi_setopt  provides	an application a way to	adjust
       low-level protocol and implementation specific details of an endpoint.

       The following option levels and option names  and  parameters  are  de-
       fined.

       FI_OPT_ENDPOINT

       FI_OPT_BUFFERED_LIMIT - size_t
	      Defines  the maximum size	of a buffered message that will	be re-
	      ported to	users  as  part	 of  a	receive	 completion  when  the
	      FI_BUFFERED_RECV mode is enabled on an endpoint.

       fi_getopt() will return the currently configured threshold, or the
       provider's default threshold if one has not been set by the application.
       fi_setopt()  allows  an application to configure	the threshold.	If the
       provider	cannot support the  requested  threshold,  it  will  fail  the
       fi_setopt()  call  with	FI_EMSGSIZE.   Calling	fi_setopt()  with  the
       threshold set to	SIZE_MAX will set the threshold	to  the	 maximum  sup-
       ported  by  the provider.  fi_getopt() can then be used to retrieve the
       set size.

       In most cases, the sending and receiving	endpoints must	be  configured
       to use the same threshold value,	and the	threshold must be set prior to
       enabling	the endpoint.
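
       A sketch of the get/set pattern for this option; ep is assumed to
       exist, the 4096-byte value is an arbitrary example, and error handling
       is omitted:

```c
/* Query, then adjust, the buffered-receive threshold before
 * enabling the endpoint. */
size_t limit;
size_t len = sizeof(limit);

ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_BUFFERED_LIMIT,
                &limit, &len);

limit = 4096;
ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_BUFFERED_LIMIT,
                &limit, sizeof(limit));  /* -FI_EMSGSIZE if unsupported */
```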

       FI_OPT_BUFFERED_MIN - size_t
	      Defines  the minimum size	of a buffered message that will	be re-
	      ported.  Applications set this to a size large enough that, on
	      receiving a buffered receive completion, they can decide whether
	      to claim or discard the buffered message.  The value is
	      typically used by a provider when sending a rendezvous protocol
	      request, where it sends at least FI_OPT_BUFFERED_MIN bytes of
	      application data along with the request.  A smaller rendezvous
	      protocol message usually results in better latency for the
	      overall transfer of a large message.

       FI_OPT_CM_DATA_SIZE - size_t
	      Defines the size of available space in CM	messages for  user-de-
	      fined  data.  This value limits the amount of data that applica-
	      tions can	exchange between peer endpoints	using the  fi_connect,
	      fi_accept,  and  fi_reject operations.  The size returned	is de-
	      pendent upon the properties of the endpoint, except in the  case
	      of  passive  endpoints,  in  which the size reflects the maximum
	      size of the data that may	be present as part of a	connection re-
	      quest event.  This option	is read	only.

       FI_OPT_MIN_MULTI_RECV - size_t
	      Defines the minimum receive buffer space available when the  re-
	      ceive  buffer  is	 released by the provider (see FI_MULTI_RECV).
	      Modifying	this value is  only  guaranteed	 to  set  the  minimum
	      buffer  space needed on receives posted after the	value has been
	      changed.	It is recommended that applications that want to over-
	      ride the default MIN_MULTI_RECV value set	this option before en-
	      abling the corresponding endpoint.

       FI_OPT_FI_HMEM_P2P - int
	      Defines how the provider should  handle  peer  to	 peer  FI_HMEM
	      transfers	 for  this  endpoint.	By  default, the provider will
	      choose whether to use peer to peer support based on the type of
	      transfer (FI_HMEM_P2P_ENABLED).  Valid values defined in fi_end-
	      point.h are:

	      	FI_HMEM_P2P_ENABLED:  Peer  to peer support may	be used	by the
		provider to handle FI_HMEM transfers, and which	transfers  are
		initiated using	peer to	peer is	subject	to the provider	imple-
		mentation.

	      	FI_HMEM_P2P_REQUIRED: Peer to peer support must be used for
		transfers; transfers that cannot be performed using p2p will
		be reported as failing.

	      	FI_HMEM_P2P_PREFERRED:	Peer to	peer support should be used by
		the provider for all transfers if available, but the  provider
		may  choose  to	copy the data to initiate the transfer if peer
		to peer	support	is unavailable.

	      	FI_HMEM_P2P_DISABLED: Peer to peer support should not be used.
       fi_setopt() will	return -FI_EOPNOTSUPP if the mode requested cannot  be
       supported  by  the provider.  The FI_HMEM_DISABLE_P2P environment vari-
       able discussed in fi_mr(3) takes	precedence over	this setopt option.

       FI_OPT_XPU_TRIGGER - struct fi_trigger_xpu *
	      This option only applies to the fi_getopt() call.	 It is used to
	      query the	maximum	number of variables required  to  support  XPU
	      triggered	operations, along with the size	of each	variable.

       The  user  provides  a  filled out struct fi_trigger_xpu	on input.  The
       iface and device	fields	should	reference  an  HMEM  domain.   If  the
       provider	 does  not support XPU triggered operations from the given de-
       vice, fi_getopt() will return -FI_EOPNOTSUPP.   On  input,  var	should
       reference an array of struct fi_trigger_var data	structures, with count
       set  to the size	of the referenced array.  If count is 0, the var field
       will be ignored,	and the	provider will return the  number  of  fi_trig-
       ger_var	structures  needed.   If  count	 is > 0, the provider will set
       count to	the needed value, and for each fi_trigger_var  available,  set
       the datatype and	count of the variable used for the trigger.
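
       The two-call query pattern described above can be sketched as follows.
       The field names follow the text of this section; the FI_HMEM_ZE iface
       value is an assumption for illustration, and error handling is
       omitted:

```c
/* First call with count == 0 reports how many trigger variables
 * are needed; the second call fills in the allocated array. */
struct fi_trigger_xpu xpu = { 0 };
size_t len = sizeof(xpu);

xpu.iface = FI_HMEM_ZE;              /* HMEM interface of the device */
ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_XPU_TRIGGER,
                &xpu, &len);         /* provider sets xpu.count */

xpu.var = calloc(xpu.count, sizeof(*xpu.var));
ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_XPU_TRIGGER,
                &xpu, &len);         /* fills datatype/count per variable */
```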

   fi_tc_dscp_set
       This  call converts a DSCP defined value	into a libfabric traffic class
       value.  It should be used when assigning	a DSCP value when setting  the
       tclass field in either domain or endpoint attributes.

   fi_tc_dscp_get
       This  call  returns the DSCP value associated with the tclass field for
       the domain or endpoint attributes.

   fi_rx_size_left (DEPRECATED)
       This function has been deprecated and will be removed in	a future  ver-
       sion of the library.  It	may not	be supported by	all providers.

       The fi_rx_size_left call	returns	a lower	bound on the number of receive
       operations that may be posted to	the given endpoint without that	opera-
       tion  returning	-FI_EAGAIN.   Depending	on the specific	details	of the
       subsequently posted receive operations (e.g., number  of	 iov  entries,
       which  receive  function	 is  called, etc.), it may be possible to post
       more receive operations than originally indicated by fi_rx_size_left.

   fi_tx_size_left (DEPRECATED)
       This function has been deprecated and will be removed in	a future  ver-
       sion of the library.  It	may not	be supported by	all providers.

       The  fi_tx_size_left call returns a lower bound on the number of	trans-
       mit operations that may be posted to the	given  endpoint	 without  that
       operation  returning  -FI_EAGAIN.  Depending on the specific details of
       the subsequently	posted transmit	operations (e.g., number  of  iov  en-
       tries,  which transmit function is called, etc.), it may	be possible to
       post  more   transmit   operations   than   originally	indicated   by
       fi_tx_size_left.

ENDPOINT ATTRIBUTES
       The  fi_ep_attr structure defines the set of attributes associated with
       an endpoint.  Endpoint attributes may  be  further  refined  using  the
       transmit	and receive context attributes as shown	below.

	      struct fi_ep_attr	{
		  enum fi_ep_type type;
		  uint32_t	  protocol;
		  uint32_t	  protocol_version;
		  size_t	  max_msg_size;
		  size_t	  msg_prefix_size;
		  size_t	  max_order_raw_size;
		  size_t	  max_order_war_size;
		  size_t	  max_order_waw_size;
		  uint64_t	  mem_tag_format;
		  size_t	  tx_ctx_cnt;
		  size_t	  rx_ctx_cnt;
		  size_t	  auth_key_size;
		  uint8_t	  *auth_key;
	      };

   type	- Endpoint Type
       If  specified, indicates	the type of fabric interface communication de-
       sired.  Supported types are:

       FI_EP_DGRAM
	      Supports connectionless, unreliable datagram communication.
	      Message  boundaries are maintained, but the maximum message size
	      may be limited to	the fabric MTU.	 Flow control is  not  guaran-
	      teed.

       FI_EP_MSG
	      Provides	a  reliable, connection-oriented data transfer service
	      with flow	control	that maintains message boundaries.

       FI_EP_RDM
	      Reliable datagram	message.  Provides a reliable,	connectionless
	      data  transfer  service with flow	control	that maintains message
	      boundaries.

       FI_EP_SOCK_DGRAM
	      A	connectionless,	unreliable datagram endpoint  with  UDP	 sock-
	      et-like semantics.  FI_EP_SOCK_DGRAM is most useful for applica-
	      tions  designed  around  using UDP sockets.  See the SOCKET END-
	      POINT section for	additional details and restrictions that apply
	      to datagram socket endpoints.

       FI_EP_SOCK_STREAM
	      Data streaming endpoint with TCP	socket-like  semantics.	  Pro-
	      vides a reliable,	connection-oriented data transfer service that
	      does not maintain	message	boundaries.  FI_EP_SOCK_STREAM is most
	      useful  for applications designed	around using TCP sockets.  See
	      the SOCKET ENDPOINT section for additional details and  restric-
	      tions that apply to stream endpoints.

       FI_EP_UNSPEC
	      The type of endpoint is not specified.  This is usually provided
	      as  input, with other attributes of the endpoint or the provider
	      selecting	the type.
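
       A sketch of requesting an endpoint type through fi_getinfo hints; the
       version numbers are illustrative, and error handling is abbreviated:

```c
/* Ask providers for a reliable-datagram endpoint. */
struct fi_info *hints, *info;
int ret;

hints = fi_allocinfo();
hints->ep_attr->type = FI_EP_RDM;

ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);

fi_freeinfo(hints);
```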

   Protocol
       Specifies the low-level end to end protocol employed by	the  provider.
       A  matching  protocol must be used by communicating endpoints to	ensure
       interoperability.  The following	protocol values	are defined.  Provider
       specific	protocols are also allowed.  Provider specific protocols  will
       be indicated by having the upper	bit of the protocol value set to one.

       FI_PROTO_GNI
	      Protocol runs over Cray GNI low-level interface.

       FI_PROTO_IB_RDM
	      Reliable-datagram	 protocol  implemented	over  InfiniBand reli-
	      able-connected queue pairs.

       FI_PROTO_IB_UD
	      The protocol runs	 over  Infiniband  unreliable  datagram	 queue
	      pairs.

       FI_PROTO_IWARP
	      The  protocol  runs  over	 the  Internet wide area RDMA protocol
	      transport.

       FI_PROTO_IWARP_RDM
	      Reliable-datagram	protocol implemented over iWarp	 reliable-con-
	      nected queue pairs.

       FI_PROTO_NETWORKDIRECT
	      Protocol runs over the Microsoft NetworkDirect service provider
	      interface.  This adds reliable-datagram semantics over the
	      NetworkDirect connection-oriented endpoint semantics.

       FI_PROTO_PSMX
	      The  protocol is based on	an Intel proprietary protocol known as
	      PSM, performance scaled messaging.  PSMX is an extended  version
	      of the PSM protocol to support the libfabric interfaces.

       FI_PROTO_PSMX2
	      The  protocol is based on	an Intel proprietary protocol known as
	      PSM2, performance	scaled messaging version 2.  PSMX2 is  an  ex-
	      tended version of	the PSM2 protocol to support the libfabric in-
	      terfaces.

       FI_PROTO_PSMX3
	      The  protocol  is	 Intel's  protocol  known as PSM3, performance
	      scaled messaging version 3.  PSMX3 is  implemented  over	RoCEv2
	      and verbs.

       FI_PROTO_RDMA_CM_IB_RC
	      The  protocol  runs  over	 Infiniband  reliable-connected	 queue
	      pairs, using the RDMA CM protocol	for connection establishment.

       FI_PROTO_RXD
	      Reliable-datagram	protocol implemented over datagram  endpoints.
	      RXD  is a	libfabric utility component that adds RDM endpoint se-
	      mantics over DGRAM endpoint semantics.

       FI_PROTO_RXM
	      Reliable-datagram	protocol implemented over  message  endpoints.
	      RXM  is a	libfabric utility component that adds RDM endpoint se-
	      mantics over MSG endpoint	semantics.

       FI_PROTO_SOCK_TCP
	      The protocol is layered over TCP packets.

       FI_PROTO_UDP
	      The protocol sends and receives UDP datagrams.  For example,  an
	      endpoint	using  FI_PROTO_UDP will be able to communicate	with a
	      remote peer that is using	Berkeley SOCK_DGRAM sockets using  IP-
	      PROTO_UDP.

       FI_PROTO_UNSPEC
	      The protocol is not specified.  This is usually provided as
	      input, with other attributes of the endpoint or the provider
	      selecting the actual protocol.

   protocol_version - Protocol Version
       Identifies which version of the protocol is employed by the provider.
       The protocol version allows providers to extend an existing protocol
       in a backward-compatible manner, for example by adding support for
       additional features or functionality.  Providers that support
       different versions of the same protocol should interoperate, but only
       when using the capabilities defined for the lesser version.

   max_msg_size	- Max Message Size
       Defines	the  maximum size for an application data transfer as a	single
       operation.

   msg_prefix_size - Message Prefix Size
       Specifies the size of any required message prefix buffer space.  This
       field will be 0 unless the FI_MSG_PREFIX mode is enabled.  If
       msg_prefix_size is > 0, the specified value will be a multiple of 8
       bytes.

   Max RMA Ordered Size
       The maximum ordered size specifies the delivery order of transport data
       into target memory for RMA and atomic operations.  Data ordering is
       separate from, but dependent on, message ordering (defined below).
       Data ordering is unspecified where message order is not defined.

       Data ordering refers to the access of the same target memory by	subse-
       quent  operations.   When back to back RMA read or write	operations ac-
       cess the	same  registered  memory  location,  data  ordering  indicates
       whether	the  second  operation reads or	writes the target memory after
       the first operation has completed.  For example,	will an	RMA read  that
       follows	an  RMA	write read back	the data that was written?  Similarly,
       will an RMA write that follows an RMA read update the target buffer af-
       ter the read has	transferred the	original data?	Data ordering  answers
       these  questions,  even	in the presence	of errors, such	as the need to
       resend data because of lost or corrupted	network	traffic.

       RMA ordering applies between two	operations, and	not  within  a	single
       data  transfer.	 Therefore,  ordering  is defined per byte-addressable
       memory location.	 I.e.  ordering	specifies whether location  X  is  ac-
       cessed  by  the second operation	after the first	operation.  Nothing is
       implied about the completion of the first operation before  the	second
       operation  is  initiated.   For example,	if the first operation updates
       locations X and Y, but the second operation only	accesses  location  X,
       there  are  no guarantees defined relative to location Y	and the	second
       operation.

       In order	to support large data transfers	 being	broken	into  multiple
       packets and sent	using multiple paths through the fabric, data ordering
       may  be	limited	 to  transfers	of a specific size or less.  Providers
       specify when data ordering is maintained	through	the following  values.
       Note that even if data ordering is not maintained, message ordering may
       be.

       max_order_raw_size
	      Read  after write	size.  If set, an RMA or atomic	read operation
	      issued after an RMA or atomic write operation, both of which are
	      smaller than the size, will be ordered.  Where the target	memory
	      locations	overlap, the RMA or atomic read	operation will see the
	      results of the previous RMA or atomic write.

       max_order_war_size
	      Write after read size.  If set, an RMA or	atomic write operation
	      issued after an RMA or atomic read operation, both of which  are
	      smaller  than the	size, will be ordered.	The RMA	or atomic read
	      operation	will see the initial value of the target memory	 loca-
	      tion before a subsequent RMA or atomic write updates the value.

       max_order_waw_size
	      Write  after  write size.	 If set, an RMA	or atomic write	opera-
	      tion issued after	an RMA or  atomic  write  operation,  both  of
	      which  are  smaller  than	the size, will be ordered.  The	target
	      memory location will reflect the results of the  second  RMA  or
	      atomic write.

       An  order size value of 0 indicates that	ordering is not	guaranteed.  A
       value of	-1 guarantees ordering for any data size.

   mem_tag_format - Memory Tag Format
       The memory tag format is	a bit array  used  to  convey  the  number  of
       tagged  bits  supported by a provider.  Additionally, it	may be used to
       divide the bit array into separate fields.  The mem_tag_format  option-
       ally  begins  with a series of bits set to 0, to	signify	bits which are
       ignored by the provider.	 Following the initial prefix of ignored bits,
       the array will consist of alternating groups of bits set	to all 1's  or
       all 0's.	 Each group of bits corresponds	to a tagged field.  The	impli-
       cation of defining a tagged field is that when a	mask is	applied	to the
       tagged  bit  array, all bits belonging to a single field	will either be
       set to 1	or 0, collectively.

       For example, a mem_tag_format of	0x30FF indicates support for 14	tagged
       bits, separated into 3 fields.  The first field consists	of 2-bits, the
       second field 4-bits, and	the final field	8-bits.	 Valid masks for  such
       a tagged	field would be a bitwise OR'ing	of zero	or more	of the follow-
       ing  values: 0x3000, 0x0F00, and	0x00FF.	 The provider may not validate
       the mask	provided by the	application for	performance reasons.

       By identifying fields within a tag, a provider may be able to  optimize
       their  search  routines.	 An application	which requests tag fields must
       provide tag masks that either set all  mask  bits  corresponding	 to  a
       field  to  all 0	or all 1.  When	negotiating tag	fields,	an application
       can request a specific number of	fields of a given  size.   A  provider
       must  return a tag format that supports the requested number of fields,
       with each field being at	least the size requested, or fail the request.
       A provider may increase the size	of the fields.	When reporting comple-
       tions (see FI_CQ_FORMAT_TAGGED),	it is not guaranteed that the provider
       would clear out any unsupported tag bits	in the tag field of  the  com-
       pletion entry.

       It is recommended that field sizes be ordered from smallest to largest.
       A  generic,  unstructured  tag and mask can be achieved by requesting a
       bit array consisting of alternating 1's and 0's.

   tx_ctx_cnt -	Transmit Context Count
       Number of transmit contexts to associate	with  the  endpoint.   If  not
       specified (0), 1	context	will be	assigned if the	endpoint supports out-
       bound  transfers.   Transmit  contexts  are independent transmit	queues
       that may	be separately configured.  Each	transmit context may be	 bound
       to  a  separate CQ, and no ordering is defined between contexts.	 Addi-
       tionally, no synchronization is needed when accessing contexts in  par-
       allel.

       If  the	count is set to	the value FI_SHARED_CONTEXT, the endpoint will
       be configured to	use a shared transmit context,	if  supported  by  the
       provider.   Providers that do not support shared	transmit contexts will
       fail the	request.

       See the scalable	endpoint and shared contexts sections  for  additional
       details.

   rx_ctx_cnt -	Receive	Context	Count
       Number  of  receive  contexts  to  associate with the endpoint.	If not
       specified, 1 context will be assigned if	the endpoint supports  inbound
       transfers.  Receive contexts are	independent processing queues that may
       be separately configured.  Each receive context may be bound to a sepa-
       rate CQ,	and no ordering	is defined between contexts.  Additionally, no
       synchronization is needed when accessing	contexts in parallel.

       If  the	count is set to	the value FI_SHARED_CONTEXT, the endpoint will
       be configured to	use a shared receive  context,	if  supported  by  the
       provider.   Providers  that do not support shared receive contexts will
       fail the	request.

       See the scalable	endpoint and shared contexts sections  for  additional
       details.

   auth_key_size - Authorization Key Length
       The  length of the authorization	key in bytes.  This field will be 0 if
       authorization keys are not available or used.  This  field  is  ignored
       unless the fabric is opened with	API version 1.5	or greater.

   auth_key - Authorization Key
       If  supported  by the fabric, an	authorization key (a.k.a.  job key) to
       associate with the endpoint.  An	authorization key  is  used  to	 limit
       communication  between  endpoints.   Only  peer endpoints that are pro-
       grammed to use the same authorization key may communicate.   Authoriza-
       tion keys are often used	to implement job keys, to ensure that process-
       es  running  in	different jobs do not accidentally cross traffic.  The
       domain authorization key	will be	used if	auth_key_size  is  set	to  0.
       This  field is ignored unless the fabric	is opened with API version 1.5
       or greater.

TRANSMIT CONTEXT ATTRIBUTES
       Attributes specific to the transmit capabilities	 of  an	 endpoint  are
       specified using struct fi_tx_attr.

	      struct fi_tx_attr	{
		  uint64_t  caps;
		  uint64_t  mode;
		  uint64_t  op_flags;
		  uint64_t  msg_order;
		  uint64_t  comp_order;
		  size_t    inject_size;
		  size_t    size;
		  size_t    iov_limit;
		  size_t    rma_iov_limit;
		  uint32_t  tclass;
	      };

   caps	- Capabilities
       The  requested capabilities of the context.  The	capabilities must be a
       subset of those requested of the	associated endpoint.  See the CAPABIL-
       ITIES section of	fi_getinfo(3) for capability  details.	 If  the  caps
       field  is  0  on	input to fi_getinfo(3),	the applicable capability bits
       from the	fi_info	structure will be used.

       The following capabilities apply	to the	transmit  attributes:  FI_MSG,
       FI_RMA,	FI_TAGGED,  FI_ATOMIC,	FI_READ,  FI_WRITE,  FI_SEND, FI_HMEM,
       FI_TRIGGER,  FI_FENCE,  FI_MULTICAST,   FI_RMA_PMEM,   FI_NAMED_RX_CTX,
       FI_COLLECTIVE, and FI_XPU.

       Many  applications will be able to ignore this field and	rely solely on
       the fi_info::caps field.	 Use of	this field provides fine grained  con-
       trol over the transmit capabilities associated with an endpoint.	 It is
       useful  when  handling  scalable	endpoints, with	multiple transmit con-
       texts, for example, and allows configuring a specific transmit  context
       with  fewer  capabilities  than that supported by the endpoint or other
       transmit	contexts.

   mode
       The operational mode bits of the	context.  The mode bits	will be	a sub-
       set of those associated with the	endpoint.  See	the  MODE  section  of
       fi_getinfo(3)  for details.  A mode value of 0 will be ignored on input
       to fi_getinfo(3), with the mode value of	the fi_info structure used in-
       stead.  On return from fi_getinfo(3), the mode  will  be	 set  only  to
       those constraints specific to transmit operations.

   op_flags - Default transmit operation flags
       Flags  that  control  the operation of operations submitted against the
       context.	 Applicable flags are listed in	the Operation Flags section.

   msg_order - Message Ordering
       Message ordering	refers to the order in which transport	layer  headers
       (as  viewed  by the application)	are identified and processed.  Relaxed
       message order enables data transfers to be sent and received out	of or-
       der, which may improve performance by utilizing multiple	paths  through
       the  fabric from	the initiating endpoint	to a target endpoint.  Message
       order applies only between a single  source  and	 destination  endpoint
       pair.  Ordering between different target	endpoints is not defined.

       Message order is	determined using a set of ordering bits.  Each set bit
       indicates  that	ordering  is  maintained between data transfers	of the
       specified type.	Message	order is defined for [read | write | send] op-
       erations	submitted by an	application after [read	| write	| send]	opera-
       tions.

       Message ordering only applies to the end-to-end transmission of
       transport headers.  Message ordering is necessary for, but does not by
       itself guarantee, the order in which message data is sent or received
       by the transport layer.  Message ordering requires matching ordering
       semantics on the receiving side of a data transfer operation in order
       to guarantee that ordering is met.

       FI_ORDER_ATOMIC_RAR
	      Atomic  read  after  read.   If set, atomic fetch	operations are
	      transmitted in the order	submitted  relative  to	 other	atomic
	      fetch operations.	 If not	set, atomic fetches may	be transmitted
	      out of order from	their submission.

       FI_ORDER_ATOMIC_RAW
	      Atomic  read  after  write.  If set, atomic fetch	operations are
	      transmitted in the order submitted relative to atomic update op-
	      erations.	 If not	set, atomic fetches may	be  transmitted	 ahead
	      of atomic	updates.

       FI_ORDER_ATOMIC_WAR
	      Atomic write after read.  If set, atomic update operations are
	      transmitted in the order submitted relative to atomic fetch
	      operations.  If not set, atomic updates may be transmitted
	      ahead of atomic fetches.

       FI_ORDER_ATOMIC_WAW
	      Atomic write after write.  If set, atomic update operations are
	      transmitted in the order submitted relative to other atomic
	      update operations.  If not set, atomic updates may be
	      transmitted out of order from their submission.

       FI_ORDER_NONE
	      No  ordering  is	specified.  This value may be used as input in
	      order to obtain the  default  message  order  supported  by  the
	      provider.	 FI_ORDER_NONE is an alias for the value 0.

       FI_ORDER_RAR
	      Read  after  read.   If  set, RMA	and atomic read	operations are
	      transmitted in the order submitted relative  to  other  RMA  and
	      atomic read operations.  If not set, RMA and atomic reads	may be
	      transmitted out of order from their submission.

       FI_ORDER_RAS
	      Read  after  send.   If  set, RMA	and atomic read	operations are
	      transmitted in the order submitted relative to message send  op-
	      erations,	 including  tagged  sends.  If not set,	RMA and	atomic
	      reads may	be transmitted ahead of	sends.

       FI_ORDER_RAW
	      Read after write.	 If set, RMA and atomic	 read  operations  are
	      transmitted  in  the  order submitted relative to	RMA and	atomic
	      write operations.	 If not	set,  RMA  and	atomic	reads  may  be
	      transmitted ahead	of RMA and atomic writes.

       FI_ORDER_RMA_RAR
	      RMA  read	after read.  If	set, RMA read operations are transmit-
	      ted in the order submitted relative to  other  RMA  read	opera-
	      tions.   If  not	set, RMA reads may be transmitted out of order
	      from their submission.

       FI_ORDER_RMA_RAW
	      RMA read after write.  If	set, RMA read operations are transmit-
	      ted in the order submitted relative to RMA write operations.  If
	      not set, RMA reads may be	transmitted ahead of RMA writes.

       FI_ORDER_RMA_WAR
	      RMA write	after read.  If	set, RMA write operations  are	trans-
	      mitted  in  the order submitted relative to RMA read operations.
	      If not set, RMA writes may be transmitted	ahead of RMA reads.

       FI_ORDER_RMA_WAW
	      RMA write	after write.  If set, RMA write	operations are	trans-
	      mitted in	the order submitted relative to	other RMA write	opera-
	      tions.   If  not set, RMA	writes may be transmitted out of order
	      from their submission.

       FI_ORDER_SAR
	      Send after read.  If set, message send operations, including
	      tagged sends, are transmitted in the order submitted relative
	      to RMA and atomic read operations.  If not set, message sends
	      may be transmitted ahead of RMA and atomic reads.

       FI_ORDER_SAS
	      Send after send.  If set, message send operations, including
	      tagged sends, are transmitted in the order submitted relative
	      to other message sends.  If not set, message sends may be
	      transmitted out of order from their submission.

       FI_ORDER_SAW
	      Send after write.  If set, message send operations, including
	      tagged sends, are transmitted in the order submitted relative
	      to RMA and atomic write operations.  If not set, message sends
	      may be transmitted ahead of RMA and atomic writes.

       FI_ORDER_WAR
	      Write after read.	 If set, RMA and atomic	write  operations  are
	      transmitted  in  the  order submitted relative to	RMA and	atomic
	      read operations.	If not set,  RMA  and  atomic  writes  may  be
	      transmitted ahead	of RMA and atomic reads.

       FI_ORDER_WAS
	      Write  after  send.  If set, RMA and atomic write	operations are
	      transmitted in the order submitted relative to message send  op-
	      erations,	 including  tagged  sends.  If not set,	RMA and	atomic
	      writes may be transmitted	ahead of sends.

       FI_ORDER_WAW
	      Write after write.  If set, RMA and atomic write operations  are
	      transmitted  in  the  order  submitted relative to other RMA and
	      atomic write operations.	If not set, RMA	and atomic writes  may
	      be transmitted out of order from their submission.

   comp_order -	Completion Ordering
       Completion ordering refers to the order in which	completed requests are
       written	into  the completion queue.  Completion	ordering is similar to
       message order.  Relaxed completion order	may enable faster reporting of
       completed transfers, allow acknowledgments to be	 sent  over  different
       fabric  paths,  and  support more sophisticated retry mechanisms.  This
       can result in lower-latency completions,	particularly when  using  con-
       nectionless  endpoints.	 Strict	 completion  ordering may require that
       providers queue completed operations or limit available optimizations.

       For transmit requests, completion ordering depends on the endpoint com-
       munication type.	 For unreliable	communication, completion ordering ap-
       plies to	all data transfer requests submitted to	an endpoint.  For  re-
       liable communication, completion	ordering only applies to requests that
       target  a single	destination endpoint.  Completion ordering of requests
       that target different endpoints over a reliable transport  is  not  de-
       fined.

       Applications should specify the completion ordering that they support
       or require.  Providers should return the completion order that they
       actually provide, with the constraint that the returned ordering is at
       least as strict as that specified by the application.  Supported
       completion order values are:

       FI_ORDER_NONE
	      No ordering is defined for completed operations.	Requests  sub-
	      mitted to	the transmit context may complete in any order.

       FI_ORDER_STRICT
	      Requests	complete  in  the order	in which they are submitted to
	      the transmit context.

   inject_size
       The requested inject operation size (see	the FI_INJECT flag)  that  the
       context	will support.  This is the maximum size	data transfer that can
       be associated with an inject operation (such as fi_inject)  or  may  be
       used with the FI_INJECT data transfer flag.

   size
       The size	of the transmit	context.  The mapping of the size value	to re-
       sources	is provider specific, but it is	directly related to the	number
       of command entries allocated for	the endpoint.  A  smaller  size	 value
       consumes	fewer hardware and software resources, while a larger size al-
       lows queuing more transmit requests.

       While the size attribute guides the size of the underlying endpoint
       transmit queue, there is not necessarily a one-to-one mapping between
       a transmit operation and a queue entry.  A single transmit operation
       may consume multiple queue entries; for example, one per
       scatter-gather entry.  Additionally, the size field is intended to
       guide the allocation of the endpoint's transmit context.
       Specifically, for connectionless endpoints, there may be lower-level
       queues used to track communication on a per-peer basis.  The sizes of
       any lower-level queues may be significantly smaller than the
       endpoint's transmit size, in order to reduce resource utilization.

   iov_limit
       This is the maximum number of IO	vectors	(scatter-gather	elements) that
       a single	posted operation may reference.

   rma_iov_limit
       This  is	the maximum number of RMA IO vectors (scatter-gather elements)
       that an RMA or atomic operation may reference.  The rma_iov_limit  cor-
       responds	to the rma_iov_count values in RMA and atomic operations.  See
       struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
       for  additional	details.  This limit applies to	both the number	of RMA
       IO vectors that may be specified	when initiating	an operation from  the
       local endpoint, as well as the maximum number of	IO vectors that	may be
       carried in a single request from	a remote endpoint.

   Traffic Class (tclass)
       Traffic classes can be a	differentiated services	code point (DSCP) val-
       ue, one of the following	defined	labels,	or a provider-specific defini-
       tion.  If tclass	is unset or set	to FI_TC_UNSPEC, the endpoint will use
       the default traffic class associated with the domain.

       FI_TC_BEST_EFFORT
	      This  is the default in the absence of any other local or	fabric
	      configuration.  This class carries the traffic for a  number  of
	      applications executing concurrently over the same	network	infra-
	      structure.   Even	 though	it is shared, network capacity and re-
	      source allocation	are distributed	 fairly	 across	 the  applica-
	      tions.

       FI_TC_BULK_DATA
	      This  class is intended for large	data transfers associated with
	      I/O and is present to separate sustained I/O transfers from oth-
	      er application inter-process communications.

       FI_TC_DEDICATED_ACCESS
	      This class operates at the highest priority, except for the
	      network management class.  It carries a high bandwidth
	      allocation, minimum latency targets, and the highest scheduling
	      and arbitration priority.

       FI_TC_LOW_LATENCY
	      This class supports low latency, low jitter data patterns
	      typically generated by transactional data exchanges, barrier
	      synchronizations, and collective operations that are typical of
	      HPC applications.  This class often imposes maximum tolerable
	      latencies that data transfers must meet for correct or
	      performant operation.  Fulfillment of such requests in this
	      class will typically require accompanying bandwidth and message
	      size limitations so as not to consume excessive bandwidth at
	      high priority.

       FI_TC_NETWORK_CTRL
	      This class is intended for traffic directly  related  to	fabric
	      (network)	management, which is critical to the correct operation
	      of  the  network.	 Its use is typically restricted to privileged
	      network management applications.

       FI_TC_SCAVENGER
	      This class is used for data that is desired but  does  not  have
	      strict  delivery requirements, such as in-band network or	appli-
	      cation level monitoring data.  Use of this class indicates  that
	      the  traffic  is considered lower	priority and should not	inter-
	      fere with	higher priority	workflows.

       fi_tc_dscp_set /	fi_tc_dscp_get
	      DSCP values are supported	via the	DSCP get  and  set  functions.
	      The definitions for DSCP values are outside the scope of libfab-
	      ric.  See	the fi_tc_dscp_set and fi_tc_dscp_get function defini-
	      tions for	details	on their use.

RECEIVE	CONTEXT	ATTRIBUTES
       Attributes  specific  to	 the  receive  capabilities of an endpoint are
       specified using struct fi_rx_attr.

	      struct fi_rx_attr	{
		  uint64_t  caps;
		  uint64_t  mode;
		  uint64_t  op_flags;
		  uint64_t  msg_order;
		  uint64_t  comp_order;
		  size_t    total_buffered_recv;
		  size_t    size;
		  size_t    iov_limit;
	      };

   caps	- Capabilities
       The requested capabilities of the context.  The capabilities must be  a
       subset of those requested of the	associated endpoint.  See the CAPABIL-
       ITIES section of fi_getinfo(3) for capability details.  If the caps
       field is	0 on input to fi_getinfo(3), the  applicable  capability  bits
       from the	fi_info	structure will be used.

       The  following  capabilities  apply  to the receive attributes: FI_MSG,
       FI_RMA, FI_TAGGED, FI_ATOMIC, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_RECV,
       FI_HMEM,	FI_TRIGGER,  FI_RMA_PMEM,  FI_DIRECTED_RECV,  FI_VARIABLE_MSG,
       FI_MULTI_RECV,  FI_SOURCE,  FI_RMA_EVENT, FI_SOURCE_ERR,	FI_COLLECTIVE,
       and FI_XPU.

       Many applications will be able to ignore	this field and rely solely  on
       the  fi_info::caps field.  Use of this field provides fine grained con-
       trol over the receive capabilities associated with an endpoint.	It  is
       useful  when  handling  scalable	 endpoints, with multiple receive con-
       texts, for example, and allows configuring a specific  receive  context
       with  fewer  capabilities  than that supported by the endpoint or other
       receive contexts.

   mode
       The operational mode bits of the	context.  The mode bits	will be	a sub-
       set of those associated with the	endpoint.  See	the  MODE  section  of
       fi_getinfo(3)  for details.  A mode value of 0 will be ignored on input
       to fi_getinfo(3), with the mode value of	the fi_info structure used in-
       stead.  On return from fi_getinfo(3), the mode  will  be	 set  only  to
       those constraints specific to receive operations.

   op_flags - Default receive operation	flags
       Flags  that  control  the operation of operations submitted against the
       context.	 Applicable flags are listed in	the Operation Flags section.

   msg_order - Message Ordering
       For a description of message ordering, see the msg_order	field  in  the
       Transmit	 Context  Attribute section.  Receive context message ordering
       defines the order in  which  received  transport	 message  headers  are
       processed when received by an endpoint.	When ordering is set, it indi-
       cates that message headers will be processed in order, based on how the
       transmit	 side has identified the messages.  Typically, this means that
       messages	will be	handled	in order based on  a  message  level  sequence
       number.

       The following ordering flags, as defined for transmit ordering, also
       apply to the processing of received operations: FI_ORDER_NONE,
       FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW,
       FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS,
       FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR,
       FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW,
       FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOMIC_WAW.

   comp_order -	Completion Ordering
       For a description of completion ordering, see the comp_order  field  in
       the Transmit Context Attribute section.

       FI_ORDER_DATA
	      When  set, this bit indicates that received data is written into
	      memory in	order.	Data ordering applies to  memory  accessed  as
	      part of a	single operation and between operations	if message or-
	      dering is	guaranteed.

       FI_ORDER_NONE
	      No ordering is defined for completed operations.	Receive	opera-
	      tions  may complete in any order,	regardless of their submission
	      order.

       FI_ORDER_STRICT
	      Receive operations complete in  the  order  in  which  they  are
	      processed	 by  the  receive  context,  based on the receive side
	      msg_order	attribute.

   total_buffered_recv
       This field is supported for backwards compatibility purposes.  It is  a
       hint to the provider of the total available space that may be needed to
       buffer  messages	 that  are received for	which there is no matching re-
       ceive operation.	 The provider may adjust or ignore  this  value.   The
       allocation of internal network buffering among received messages is
       provider	specific.  For instance, a provider may	limit the size of mes-
       sages which can be buffered or the amount of buffering allocated	 to  a
       single message.

       If receive side buffering is disabled (total_buffered_recv = 0) and a
       message is received by an endpoint, then the behavior is dependent on
       whether resource management has been enabled (i.e. whether
       FI_RM_ENABLED has been set).  See the Resource Management section of
       fi_domain.3 for further clarification.  It is recommended that
       applications enable resource management if they anticipate receiving
       unexpected messages, rather than modifying this value.

   size
       The  size of the	receive	context.  The mapping of the size value	to re-
       sources is provider specific, but it is directly	related	to the	number
       of command entries allocated for	the endpoint.  A smaller size value
       consumes	fewer hardware and software resources, while a larger size
       allows queuing more receive requests.

       While the size attribute	guides the size	of the underlying endpoint
       receive queue, there is not necessarily a one-to-one mapping between a
       receive operation and a queue entry.  A single receive operation	may
       consume multiple	queue entries; for example, one	per scatter-gather
       entry.  Additionally, the size field is intended	to guide the
       allocation of the endpoint's receive context.  Specifically, for
       connectionless endpoints, there may be lower-level queues used to
       track communication on a	per-peer basis.	 The sizes of any lower-level
       queues may be significantly smaller than	the endpoint's receive size,
       in order	to reduce resource utilization.

   iov_limit
       This is the maximum number of IO	vectors	(scatter-gather	elements) that
       a single	posted operation may reference.

SCALABLE ENDPOINTS
       A  scalable  endpoint  is a communication portal	that supports multiple
       transmit	and receive contexts.  Scalable	endpoints are loosely  modeled
       after  the  networking  concept	of transmit/receive side scaling, also
       known as	multi-queue.  Support for scalable endpoints is	domain specif-
       ic.  Scalable endpoints may improve the performance  of	multi-threaded
       and  parallel  applications,  by	allowing threads to access independent
       transmit	and receive queues.  A scalable	endpoint has a	single	trans-
       port  level address, which can reduce the memory	requirements needed to
       store remote addressing data, versus using standard  endpoints.	 Scal-
       able  endpoints	cannot	be used	directly for communication operations,
       and require the application to explicitly create	transmit  and  receive
       contexts	as described below.

   fi_tx_context
       Transmit	 contexts  are independent transmit queues.  Ordering and syn-
       chronization between contexts are not defined.  Conceptually a transmit
       context behaves similar to a send-only endpoint.	  A  transmit  context
       may  be	configured  with fewer capabilities than the base endpoint and
       with different attributes (such as  ordering  requirements  and	inject
       size)  than  other contexts associated with the same scalable endpoint.
       Each transmit context has its own  completion  queue.   The  number  of
       transmit	 contexts associated with an endpoint is specified during end-
       point creation.

       The fi_tx_context call is used to retrieve a specific context,  identi-
       fied  by	 an index (see above for details on transmit context attribut-
       es).  Providers may dynamically allocate	contexts when fi_tx_context is
       called, or may statically create	all contexts when fi_endpoint  is  in-
       voked.	By  default, a transmit	context	inherits the properties	of its
       associated endpoint.  However, applications may request context specif-
       ic attributes through the attr parameter.   Support  for	 per  transmit
       context	attributes is provider specific	and not	guaranteed.  Providers
       will return the actual attributes assigned to the context  through  the
       attr parameter, if provided.
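       As a minimal sketch (assuming a provider	that supports scalable end-
       points, and that	domain and info	were obtained earlier through
       fi_getinfo(3) and fi_domain(3); the helper name and count are hypo-
       thetical), retrieving per-thread	transmit contexts might	look like:

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical helper: open a scalable endpoint with "n" transmit
 * contexts and retrieve each one.  Cleanup on error is abbreviated. */
static int open_tx_contexts(struct fid_domain *domain, struct fi_info *info,
			    int n, struct fid_ep **sep, struct fid_ep **tx)
{
	int i, ret;

	info->ep_attr->tx_ctx_cnt = n;	/* request n transmit contexts */
	ret = fi_scalable_ep(domain, info, sep, NULL);
	if (ret)
		return ret;

	for (i = 0; i < n; i++) {
		/* NULL attr: inherit the endpoint's transmit attributes */
		ret = fi_tx_context(*sep, i, NULL, &tx[i], NULL);
		if (ret)
			return ret;
	}
	return 0;
}
```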

   fi_rx_context
       Receive	contexts are independent receive queues	for receiving incoming
       data.  Ordering and synchronization between contexts  are  not  guaran-
       teed.  Conceptually a receive context behaves similar to	a receive-only
       endpoint.   A receive context may be configured with fewer capabilities
       than the	base endpoint and with different attributes (such as  ordering
       requirements  and  inject size) than other contexts associated with the
       same scalable endpoint.	Each receive context has  its  own  completion
       queue.	The  number of receive contexts	associated with	an endpoint is
       specified during	endpoint creation.

       Receive contexts	are often associated with steering flows that specify
       which incoming packets targeting	a scalable endpoint each context will
       process.	 However, receive contexts may be targeted directly by the
       initiator, if
       supported by the	underlying protocol.  Such contexts are	referred to as
       `named'.	 Support for named contexts must be indicated by setting the
       FI_NAMED_RX_CTX capability bit when the corresponding endpoint is
       created.	 Support for named receive contexts is coordinated with	address
       vectors.	 See fi_av(3) and fi_rx_addr(3).

       The fi_rx_context call is used to retrieve a specific context,  identi-
       fied by an index	(see above for details on receive context attributes).
       Providers  may  dynamically  allocate  contexts	when  fi_rx_context is
       called, or may statically create	all contexts when fi_endpoint  is  in-
       voked.	By  default,  a	receive	context	inherits the properties	of its
       associated endpoint.  However, applications may request context specif-
       ic attributes through the attr parameter.  Support for per receive con-
       text attributes is provider specific  and  not  guaranteed.   Providers
       will  return  the actual	attributes assigned to the context through the
       attr parameter, if provided.
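       A corresponding receive-side sketch follows (the	helper name is hypo-
       thetical	and error cleanup is abbreviated; each context is given	its
       own completion queue, as	described above):

```c
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical helper: retrieve one receive context from a scalable
 * endpoint and bind a dedicated completion queue to it. */
static int open_rx_context(struct fid_domain *domain, struct fid_ep *sep,
			   int index, struct fid_ep **rx, struct fid_cq **cq)
{
	struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_MSG };
	int ret;

	ret = fi_rx_context(sep, index, NULL, rx, NULL);
	if (ret)
		return ret;

	ret = fi_cq_open(domain, &cq_attr, cq, NULL);
	if (ret)
		return ret;

	/* each receive context reports completions to its own CQ */
	ret = fi_ep_bind(*rx, &(*cq)->fid, FI_RECV);
	if (ret)
		return ret;

	return fi_enable(*rx);
}
```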

SHARED CONTEXTS
       Shared contexts are transmit and	 receive  contexts  explicitly	shared
       among one or more endpoints.  A shareable context allows	an application
       to  use	a  single dedicated provider resource among multiple transport
       addressable endpoints.  This can	greatly	reduce the resources needed to
       manage communication over multiple endpoints by	multiplexing  transmit
       and/or  receive	processing, with the potential cost of serializing ac-
       cess across multiple endpoints.	Support	for shareable contexts is  do-
       main specific.

       Conceptually,  shareable	transmit contexts are transmit queues that may
       be accessed by many endpoints.  The use of a shared transmit context is
       mostly opaque to	an application.	 Applications must allocate  and  bind
       shared  transmit	 contexts  to endpoints, but operations	are posted di-
       rectly to the endpoint.	Shared transmit	contexts  are  not  associated
       with completion queues or counters.  Completed operations are posted to
       the CQs bound to	the endpoint.  An endpoint may only be associated with
       a single	shared transmit	context.

       Unlike  shared  transmit	 contexts, applications	interact directly with
       shared receive contexts.	 Users post  receive  buffers  directly	 to  a
       shared  receive	context, with the buffers usable by any	endpoint bound
       to the shared receive context.  Shared receive contexts are not associ-
       ated with completion queues or counters.	 Completed receive  operations
       are  posted  to the CQs bound to	the endpoint.  An endpoint may only be
       associated with a single	receive	context, and all  connectionless  end-
       points  associated  with	 a  shared receive context must	also share the
       same address vector.

       Endpoints associated with a shared transmit context may use dedicated
       receive contexts, and vice versa; alternatively,	an endpoint may	use
       both shared transmit and	receive	contexts.  There is no requirement
       that the	same group of endpoints	sharing	a context of one type also
       share a context of the alternate	type.  Furthermore, an endpoint	may
       use a shared context of one type, but a scalable	set of contexts	of the
       alternate type.

   fi_stx_context
       This  call  is used to open a shareable transmit	context	(see above for
       details on the transmit context attributes).  Endpoints associated with
       a shared	transmit context must use a subset of the  transmit  context's
       attributes.   Note  that	 this  is  the	reverse	of the requirement for
       transmit	contexts for scalable endpoints.

   fi_srx_context
       This allocates a	shareable receive context (see above  for  details  on
       the  receive  context  attributes).  Endpoints associated with a	shared
       receive context must use	a subset of the	receive	context's  attributes.
       Note  that  this	is the reverse of the requirement for receive contexts
       for scalable endpoints.
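       A minimal sketch	of the above, assuming a domain	that supports shared
       contexts	and using default attributes (the helper name is hypothetical
       and error cleanup is abbreviated):

```c
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical sketch: two endpoints multiplexed over one shared
 * transmit context and one shared receive context. */
static int open_shared(struct fid_domain *domain, struct fi_info *info,
		       struct fid_ep *ep[2])
{
	struct fid_stx *stx;
	struct fid_ep *srx;	/* shared RX contexts are returned as fid_ep */
	int i, ret;

	ret = fi_stx_context(domain, NULL, &stx, NULL);	/* NULL: default attrs */
	if (ret)
		return ret;
	ret = fi_srx_context(domain, NULL, &srx, NULL);
	if (ret)
		return ret;

	for (i = 0; i < 2; i++) {
		ret = fi_endpoint(domain, info, &ep[i], NULL);
		if (ret)
			return ret;
		/* at most one shared context of each type per endpoint */
		ret = fi_ep_bind(ep[i], &stx->fid, 0);
		if (ret)
			return ret;
		ret = fi_ep_bind(ep[i], &srx->fid, 0);
		if (ret)
			return ret;
	}
	/* sends are posted to ep[i]; receive buffers go directly to srx */
	return 0;
}
```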

SOCKET ENDPOINTS
       The following feature and description should be	considered  experimen-
       tal.  Until the experimental tag	is removed, the	interfaces, semantics,
       and data	structures associated with socket endpoints may	change between
       library versions.

       This  section  applies  to  endpoints  of  type	FI_EP_SOCK_STREAM  and
       FI_EP_SOCK_DGRAM, commonly referred to as socket	endpoints.

       Socket endpoints	are defined with semantics that	 allow	them  to  more
       easily  be  adopted by developers familiar with the UNIX	socket API, or
       by middleware that exposes the socket API, while	still taking advantage
       of high-performance hardware features.

       The key difference between socket	endpoints and other active endpoints
       is that socket endpoints	use synchronous	data transfers.	 Buffers passed
       into send and receive operations	revert to the control of the  applica-
       tion  upon  returning  from  the	 function  call.  As a result, no data
       transfer	completions are	reported to the	application, and  socket  end-
       points are not associated with completion queues	or counters.

       Socket  endpoints  support  a  subset  of  message operations: fi_send,
       fi_sendv, fi_sendmsg, fi_recv,  fi_recvv,  fi_recvmsg,  and  fi_inject.
       Because data transfers are synchronous, the return value	from send and
       receive operations indicates the	number of bytes	transferred on success,
       or a negative value on error, including -FI_EAGAIN if the endpoint can-
       not send	or receive any data because of full or empty  queues,  respec-
       tively.

       Socket  endpoints are associated	with event queues and address vectors,
       and process connection management  events  asynchronously,  similar  to
       other endpoints.	 Unlike	UNIX sockets, socket endpoints must still be
       declared	as either active or passive.

       Socket endpoints	behave like non-blocking sockets.  In order to support
       select and poll semantics, active socket	endpoints are associated  with
       a  file	descriptor  that is signaled whenever the endpoint is ready to
       send and/or receive data.  The file descriptor may be  retrieved	 using
       fi_control.

OPERATION FLAGS
       Operation  flags	 are  obtained by OR-ing the following flags together.
       Operation flags define the default flags	applied	to an endpoint's  data
       transfer	 operations,  where  a flags parameter is not available.  Data
       transfer	operations that	take flags as input override the op_flags val-
       ue of transmit or receive context attributes of an endpoint.

       FI_COMMIT_COMPLETE
	      Indicates	that a completion should not be	generated (locally  or
	      at  the  peer)  until  the result	of an operation	have been made
	      persistent.  See fi_cq(3)	for additional details	on  completion
	      semantics.

       FI_COMPLETION
	      Indicates	 that  a  completion queue entry should	be written for
	      data transfer operations.	 This flag only	applies	to  operations
	      issued  on an endpoint that was bound to a completion queue with
	      the FI_SELECTIVE_COMPLETION flag set, otherwise, it is  ignored.
	      See the fi_ep_bind section above for more	detail.

       FI_DELIVERY_COMPLETE
	      Indicates	 that a	completion should be generated when the	opera-
	      tion has been processed by  the  destination  endpoint(s).   See
	      fi_cq(3) for additional details on completion semantics.

       FI_INJECT
	      Indicates	 that  all outbound data buffers should	be returned to
	      the user's control immediately after a data  transfer  call  re-
	      turns,  even  if	the operation is handled asynchronously.  This
	      may require that the provider copy the data into a local	buffer
	      and transfer out of that buffer.	A provider can limit the total
	      amount  of  send	data that may be buffered and/or the size of a
	      single send that can use this flag.  This	limit is indicated us-
	      ing inject_size (see inject_size above).

       FI_INJECT_COMPLETE
	      Indicates	that a completion should be generated when the	source
	      buffer(s)	may be reused.	See fi_cq(3) for additional details on
	      completion semantics.

       FI_MULTICAST
	      Indicates	that data transfers will target	multicast addresses by
	      default.	 Any  fi_addr_t	 passed	into a data transfer operation
	      will be treated as a multicast address.

       FI_MULTI_RECV
	      Applies to posted	receive	operations.  This flag allows the user
	      to post a	single buffer that will	receive	multiple incoming mes-
	      sages.  Received messages	will be	packed into the	receive	buffer
	      until the	buffer has been	consumed.  Use of this flag may	 cause
	      a	 single	 posted	receive	operation to generate multiple comple-
	      tions as messages	are placed into	the buffer.  The placement  of
	      received	data into the buffer may be subjected to provider spe-
	      cific alignment restrictions.  The buffer	will  be  released  by
	      the  provider  when  the	available buffer space falls below the
	      specified	minimum	(see FI_OPT_MIN_MULTI_RECV).

       FI_TRANSMIT_COMPLETE
	      Indicates	that a completion should be generated when the	trans-
	      mit operation has	completed relative to the local	provider.  See
	      fi_cq(3) for additional details on completion semantics.
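       For example, FI_MULTI_RECV might	be used	as follows (a sketch: the
       helper name, buffer size, and threshold are illustrative; ep is as-
       sumed to	be an enabled endpoint whose capabilities include
       FI_MULTI_RECV):

```c
#include <sys/uio.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical sketch: one large buffer that absorbs many incoming
 * messages.  The provider releases the buffer once its free space
 * drops below the FI_OPT_MIN_MULTI_RECV threshold. */
static int post_multi_recv(struct fid_ep *ep, void *pool, size_t len)
{
	size_t min_free = 16 * 1024;	/* illustrative threshold */
	struct iovec iov = { .iov_base = pool, .iov_len = len };
	struct fi_msg msg = {
		.msg_iov = &iov,
		.iov_count = 1,
		.addr = FI_ADDR_UNSPEC,
	};
	int ret;

	ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
			&min_free, sizeof min_free);
	if (ret)
		return ret;

	return fi_recvmsg(ep, &msg, FI_MULTI_RECV);
}
```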

NOTES
       Users  should  call  fi_close to	release	all resources allocated	to the
       fabric endpoint.

       Endpoints allocated with	the FI_CONTEXT or FI_CONTEXT2  mode  bits  set
       must typically provide struct fi_context(2) as their per	operation con-
       text parameter.	(See fi_getinfo(3) for details.)  However, when
       FI_SELECTIVE_COMPLETION is enabled to suppress CQ completion entries, and an
       operation is initiated without the FI_COMPLETION	 flag  set,  then  the
       context	parameter is ignored.  An application does not need to pass in
       a valid struct fi_context(2) into such data transfers.

       Operations that complete	in error and are not associated	with a valid
       operational context will	use the	endpoint context in any	error
       reporting structures.

       Although	applications typically associate individual  completions  with
       either  completion  queues  or counters,	an endpoint can	be attached to
       both a counter and completion queue.  When combined with	 using	selec-
       tive  completions,  this	allows an application to use counters to track
       successful completions, with a CQ used to  report  errors.   Operations
       that  complete with an error increment the error	counter	and generate a
       CQ completion event.
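       The counter-plus-CQ arrangement described above might be	set up as in
       the following sketch (the helper	name is	hypothetical; error cleanup
       is abbreviated):

```c
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical sketch: count successful transmits with a counter and
 * reserve the CQ for errors (and explicitly requested completions). */
static int bind_cntr_and_cq(struct fid_domain *domain, struct fid_ep *ep)
{
	struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_MSG };
	struct fi_cntr_attr cntr_attr = { .events = FI_CNTR_EVENTS_COMP };
	struct fid_cq *cq;
	struct fid_cntr *cntr;
	int ret;

	ret = fi_cq_open(domain, &cq_attr, &cq, NULL);
	if (ret)
		return ret;
	ret = fi_cntr_open(domain, &cntr_attr, &cntr, NULL);
	if (ret)
		return ret;

	/* successful sends only bump the counter; CQ entries appear for
	 * errors, or when an operation sets FI_COMPLETION explicitly */
	ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
	if (ret)
		return ret;
	ret = fi_ep_bind(ep, &cntr->fid, FI_SEND);
	if (ret)
		return ret;

	return fi_enable(ep);
}
```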

       As mentioned in fi_getinfo(3), the ep_attr structure  can  be  used  to
       query  providers	 that support various endpoint attributes.  fi_getinfo
       can return provider info	structures that	can support the	minimal	set of
       requirements (such that the application maintains correctness).	Howev-
       er, it can also return provider info structures that exceed application
       requirements.   As  an  example,	 consider  an  application  requesting
       msg_order  as  FI_ORDER_NONE.  The resulting output from	fi_getinfo may
       have all	the ordering bits set.	The application	can reset the ordering
       bits it does not	require	before creating	the endpoint.  The provider is
       free to implement a stricter ordering than is required by the  applica-
       tion.
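       The msg_order example above might look like the following sketch	(the
       helper name is hypothetical; hints construction and domain setup	are
       abbreviated):

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical sketch: the provider may return stricter ordering than
 * requested; reset the ordering bits before creating the endpoint. */
static int open_unordered_ep(struct fid_domain *domain, struct fi_info *hints,
			     struct fid_ep **ep)
{
	struct fi_info *info;
	int ret;

	ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
	if (ret)
		return ret;

	/* the application only needs FI_ORDER_NONE; clear anything extra */
	info->tx_attr->msg_order = FI_ORDER_NONE;
	info->rx_attr->msg_order = FI_ORDER_NONE;

	ret = fi_endpoint(domain, info, ep, NULL);
	fi_freeinfo(info);
	return ret;
}
```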

RETURN VALUES
       Returns 0 on success.  On error,	a negative value corresponding to fab-
       ric  errno  is  returned.  For fi_cancel, a return value	of 0 indicates
       that the	cancel request was submitted for processing.

       Fabric errno values are defined in rdma/fi_errno.h.

ERRORS
       -FI_EDOMAIN
	      A	resource domain	was not	bound to the endpoint  or  an  attempt
	      was made to bind multiple	domains.

       -FI_ENOCQ
	      The endpoint has not been	configured with	a necessary event queue.

       -FI_EOPBADSTATE
	      The endpoint's state does	not permit the requested operation.

SEE ALSO
       fi_getinfo(3),	fi_domain(3),	fi_cq(3),   fi_msg(3),	 fi_tagged(3),
       fi_rma(3)

AUTHORS
       OpenFabrics.

Libfabric Programmer's Manual	  2021-11-20			fi_endpoint(3)
