LLVMOPENMP(1)                      LLVM/OpenMP                      LLVMOPENMP(1)

NAME
llvmopenmp - LLVM/OpenMP

NOTE:
This document is a work in progress and most of the expected content is not yet available. While you can expect changes, we always welcome feedback and additions. Please reach out, e.g., through openmp-dev@lists.llvm.org.

OpenMP impacts various parts of the LLVM project, from the frontends (Clang and Flang), through middle-end optimizations, up to the multitude of available OpenMP runtimes. A high-level overview of OpenMP in LLVM can be found here.

OPENMP IN LLVM --- DESIGN OVERVIEW

Resources
• OpenMP Booth @ SC19: "OpenMP clang and flang Development" https://youtu.be/6yOa-hRi63M

LLVM/OpenMP Runtimes
There are four distinct types of LLVM/OpenMP runtimes:

LLVM/OpenMP Host Runtime (libomp)
An early (2015) design document for the LLVM/OpenMP host runtime, a.k.a. libomp.so, is available as a pdf.

LLVM/OpenMP Target Host Runtime (libomptarget)

Environment Variables
libomptarget uses environment variables to control different features of the library at runtime. This allows the user to obtain useful runtime information as well as enable or disable certain features. A full list of supported environment variables is defined below.

• LIBOMPTARGET_DEBUG=<Num>
• LIBOMPTARGET_PROFILE=<Filename>
• LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=<Num>
• LIBOMPTARGET_INFO=<Num>
• LIBOMPTARGET_HEAP_SIZE=<Num>
• LIBOMPTARGET_STACK_SIZE=<Num>

LIBOMPTARGET_DEBUG
LIBOMPTARGET_DEBUG controls whether or not debugging information will be displayed. This feature is only available if libomptarget was built with -DOMPTARGET_DEBUG. The debugging output provided is intended for use by libomptarget developers. More user-friendly output is presented when using LIBOMPTARGET_INFO.

LIBOMPTARGET_PROFILE
LIBOMPTARGET_PROFILE allows libomptarget to generate time profile output similar to Clang's -ftime-trace option. This generates a JSON file based on Chrome Tracing that can be viewed with chrome://tracing or the Speedscope App. Building this feature depends on the LLVM Support Library for time trace output. Using this library is enabled by default when building using the CMake option OPENMP_ENABLE_LIBOMPTARGET_PROFILING. The output will be saved to the filename specified by the environment variable. For multi-threaded applications, profiling in libomp is also needed. Set the CMake option OPENMP_ENABLE_LIBOMP_PROFILING=ON to enable the feature. Note that this will turn libomp into a C++ library.

LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD
LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD sets the threshold size for which the libomptarget memory manager will handle the allocation. Any allocations larger than this threshold will not use the memory manager and will be freed after the device kernel exits. The default threshold value is 8KB. If LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD is set to 0 the memory manager will be completely disabled.
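For illustration, the threshold can be raised, or the memory manager disabled entirely, when launching the application (the binary name below is a placeholder):

   $ env LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=16384 ./your-application
   $ env LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=0 ./your-application    # disable the manager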
LIBOMPTARGET_INFO
LIBOMPTARGET_INFO allows the user to request different types of runtime information from libomptarget. LIBOMPTARGET_INFO uses a 32-bit field to enable or disable different types of information. This includes information about data-mappings and kernel execution. It is recommended to build your application with debugging information enabled; this will enable filenames and variable declarations in the information messages. OpenMP debugging information is enabled at any level of debugging, so a full debug runtime is not required. For minimal debugging information compile with -gline-tables-only, or compile with -g for full debug information. A full list of flags supported by LIBOMPTARGET_INFO is given below.

• Print all data arguments upon entering an OpenMP device kernel: 0x01
• Indicate when a mapped address already exists in the device mapping table: 0x02
• Dump the contents of the device pointer map at kernel exit: 0x04
• Indicate when an entry is changed in the device mapping table: 0x08
• Print OpenMP kernel information from device plugins: 0x10
• Indicate when data is copied to and from the device: 0x20

Any combination of these flags can be used by setting the appropriate bits. For example, to enable printing all data active in an OpenMP target region along with CUDA information, run the following bash command.

   $ env LIBOMPTARGET_INFO=$((0x1 | 0x10)) ./your-application

Or, to enable every flag, run with every bit set.

   $ env LIBOMPTARGET_INFO=-1 ./your-application

For example, given a small application implementing the ZAXPY BLAS routine, libomptarget can provide useful information about data mappings and thread usage.

   #include <complex>

   using complex = std::complex<double>;

   void zaxpy(complex *X, complex *Y, complex D, std::size_t N) {
   #pragma omp target teams distribute parallel for
     for (std::size_t i = 0; i < N; ++i)
       Y[i] = D * X[i] + Y[i];
   }

   int main() {
     const std::size_t N = 1024;
     complex X[N], Y[N], D;
   #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N])
     zaxpy(X, Y, D, N);
   }

Compiling this code targeting nvptx64 with all information enabled will provide the following output from the runtime library.

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only zaxpy.cpp -o zaxpy
   $ env LIBOMPTARGET_INFO=-1 ./zaxpy

   Info: Entering OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
   Info: to(X[0:N])[16384]
   Info: tofrom(Y[0:N])[16384]
   Info: Creating new map entry with HstPtrBegin=0x00007fff0d259a40, TgtPtrBegin=0x00007fdba5800000, Size=16384, RefCount=1, Name=X[0:N]
   Info: Copying data from host to device, HstPtr=0x00007fff0d259a40, TgtPtr=0x00007fdba5800000, Size=16384, Name=X[0:N]
   Info: Creating new map entry with HstPtrBegin=0x00007fff0d255a40, TgtPtrBegin=0x00007fdba5804000, Size=16384, RefCount=1, Name=Y[0:N]
   Info: Copying data from host to device, HstPtr=0x00007fff0d255a40, TgtPtr=0x00007fdba5804000, Size=16384, Name=Y[0:N]
   Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:14:1:
   Info: Host Ptr           Target Ptr         Size (B) RefCount Declaration
   Info: 0x00007fff0d255a40 0x00007fdba5804000 16384    1        Y[0:N] at zaxpy.cpp:13:17
   Info: 0x00007fff0d259a40 0x00007fdba5800000 16384    1        X[0:N] at zaxpy.cpp:13:11
   Info: Entering OpenMP kernel at zaxpy.cpp:6:1 with 4 arguments:
   Info: firstprivate(N)[8] (implicit)
   Info: use_address(Y)[0] (implicit)
   Info: tofrom(D)[16] (implicit)
   Info: use_address(X)[0] (implicit)
   Info: Mapping exists (implicit) with HstPtrBegin=0x00007fff0d255a40, TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=2 (incremented), Name=Y
   Info: Creating new map entry with HstPtrBegin=0x00007fff0d2559f0, TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1, Name=D
   Info: Copying data from host to device, HstPtr=0x00007fff0d2559f0, TgtPtr=0x00007fdba5808000, Size=16, Name=D
   Info: Mapping exists (implicit) with HstPtrBegin=0x00007fff0d259a40, TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=2 (incremented), Name=X
   Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40, TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=2 (update suppressed)
   Info: Mapping exists with HstPtrBegin=0x00007fff0d2559f0, TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1 (update suppressed)
   Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40, TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=2 (update suppressed)
   Info: Launching kernel __omp_offloading_10305_c08c86__Z5zaxpyPSt7complexIdES1_S0_m_l6 with 8 blocks and 128 threads in SPMD mode
   Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40, TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=1 (decremented)
   Info: Mapping exists with HstPtrBegin=0x00007fff0d2559f0, TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1 (deferred final decrement)
   Info: Copying data from device to host, TgtPtr=0x00007fdba5808000, HstPtr=0x00007fff0d2559f0, Size=16, Name=D
   Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40, TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=1 (decremented)
   Info: Removing map entry with HstPtrBegin=0x00007fff0d2559f0, TgtPtrBegin=0x00007fdba5808000, Size=16, Name=D
   Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:6:1:
   Info: Host Ptr           Target Ptr         Size (B) RefCount Declaration
   Info: 0x00007fff0d255a40 0x00007fdba5804000 16384    1        Y[0:N] at zaxpy.cpp:13:17
   Info: 0x00007fff0d259a40 0x00007fdba5800000 16384    1        X[0:N] at zaxpy.cpp:13:11
   Info: Exiting OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
   Info: to(X[0:N])[16384]
   Info: tofrom(Y[0:N])[16384]
   Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40, TgtPtrBegin=0x00007fdba5804000, Size=16384, RefCount=1 (deferred final decrement)
   Info: Copying data from device to host, TgtPtr=0x00007fdba5804000, HstPtr=0x00007fff0d255a40, Size=16384, Name=Y[0:N]
   Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40, TgtPtrBegin=0x00007fdba5800000, Size=16384, RefCount=1 (deferred final decrement)
   Info: Removing map entry with HstPtrBegin=0x00007fff0d255a40, TgtPtrBegin=0x00007fdba5804000, Size=16384, Name=Y[0:N]
   Info: Removing map entry with HstPtrBegin=0x00007fff0d259a40, TgtPtrBegin=0x00007fdba5800000, Size=16384, Name=X[0:N]

From this information, we can see the OpenMP kernel being launched on the CUDA device with enough threads and blocks for all 1024 iterations of the loop in simplified SPMD mode. The information from the OpenMP data region shows the two arrays X and Y being copied from the host to the device. This creates an entry in the host-device mapping table associating the host pointers with the newly created device data. The data mappings in the OpenMP device kernel show the default mappings being used for all the variables used implicitly on the device. Because X and Y are already mapped in the device's table, no new entries are created. Additionally, the default mapping shows that D will be copied back from the device once the OpenMP device kernel region ends even though it isn't written to. Finally, at the end of the OpenMP data region the entries for X and Y are removed from the table.

The information level can be controlled at runtime using an internal libomptarget library call, __tgt_set_info_flag. This allows different levels of information to be enabled or disabled for certain regions of code. Using this requires declaring the function signature as an external function so it can be linked with the runtime library.

   #include <cstdint>

   extern "C" void __tgt_set_info_flag(uint32_t);

   extern void foo();

   int main() {
     __tgt_set_info_flag(0x10);
   #pragma omp target
     foo();
   }
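As a hypothetical extension of this snippet, the flag can be cleared again to restrict verbose output to a single region of interest (compute() is a placeholder for offloaded work):

   #include <cstdint>

   extern "C" void __tgt_set_info_flag(uint32_t);

   extern void compute();

   int main() {
     __tgt_set_info_flag(0x10); // print kernel information from the device plugins
   #pragma omp target
     compute();                 // this launch is reported

     __tgt_set_info_flag(0x0);  // silence info output again
   #pragma omp target
     compute();                 // this launch is not reported
   }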
Errors
libomptarget provides error messages when the program fails inside the OpenMP target region. Common causes of failure are an invalid pointer access, running out of device memory, or trying to offload when the device is busy. If the application was built with debugging symbols the error messages will additionally provide the source location of the OpenMP target region.

For example, consider the following code that implements a simple parallel reduction on the GPU. This code has a bug that causes it to fail in the offloading region.

   #include <cstdio>

   double sum(double *A, std::size_t N) {
     double sum = 0.0;
   #pragma omp target teams distribute parallel for reduction(+:sum)
     for (int i = 0; i < N; ++i)
       sum += A[i];

     return sum;
   }

   int main() {
     const int N = 1024;
     double A[N];
     sum(A, N);
   }

If this code is compiled and run, there will be an error message indicating what is going wrong.

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
   $ ./sum

   CUDA error: an illegal memory access was encountered
   Libomptarget error: Copying data from device failed.
   Libomptarget error: Call to targetDataEnd failed, abort target.
   Libomptarget error: Failed to process data after launching the kernel.
   Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
   sum.cpp:5:1: Libomptarget error 1: failure of target construct while offloading is mandatory

This shows that an illegal memory access occurred inside the OpenMP target region once execution moved to the CUDA device, suggesting a segmentation fault. This then causes a chain reaction of failures in libomptarget. Another message suggests using the LIBOMPTARGET_INFO environment variable as described in Environment Variables. Doing so prints the state of the host-target pointer mappings at the time of failure.

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
   $ env LIBOMPTARGET_INFO=4 ./sum

   info: OpenMP Host-Device pointer mappings after block at sum.cpp:5:1:
   info: Host Ptr           Target Ptr         Size (B) RefCount Declaration
   info: 0x00007ffc058280f8 0x00007f4186600000 8        1        sum at sum.cpp:4:10

This tells us that the only data mapped between the host and the device is the sum variable that will be copied back from the device once the reduction has ended. There is no entry mapping the host array A to the device. In this situation, the compiler cannot determine the size of the array at compile time, so it will simply assume that the pointer is already mapped on the device by default. The solution is to add an explicit map clause in the target region.

   double sum(double *A, std::size_t N) {
     double sum = 0.0;
   #pragma omp target teams distribute parallel for reduction(+:sum) map(to:A[0 : N])
     for (int i = 0; i < N; ++i)
       sum += A[i];

     return sum;
   }

LIBOMPTARGET_STACK_SIZE
This environment variable sets the stack size in bytes for the CUDA plugin. It can be used to increase or decrease the standard amount of memory reserved for each thread's stack.

LIBOMPTARGET_HEAP_SIZE
This environment variable sets the amount of memory in bytes that can be allocated using malloc and free for the CUDA plugin. This is necessary for some applications that allocate too much memory, either directly by the user or indirectly through globalization.
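As an illustrative sketch (CUDA plugin only; the file name, sizes, and kernel body are placeholders), a target region that calls malloc on the device can be given a larger device heap at launch time:

   #include <cstdlib>

   int main() {
   #pragma omp target
     {
       // Device-side malloc draws from the heap whose size
       // LIBOMPTARGET_HEAP_SIZE controls.
       double *scratch = static_cast<double *>(malloc(1024 * sizeof(double)));
       if (scratch)
         scratch[0] = 42.0;
       free(scratch);
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 heap.cpp -o heap
   $ env LIBOMPTARGET_HEAP_SIZE=268435456 ./heap    # reserve 256 MiB for device malloc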
OpenMP in LLVM --- Offloading Design

OpenMP Target Offloading --- SPMD Mode

OpenMP Target Offloading --- Generic Mode

LLVM/OpenMP Target Host Runtime Plugins (libomptarget.rtl.XXXX)

Remote Offloading Plugin:
The remote offloading plugin permits the execution of OpenMP target regions on devices in remote hosts in addition to the devices connected to the local host. All target devices on the remote host will be exposed to the application as if they were local devices, that is, the remote host CPU or its GPUs can be offloaded to with the appropriate device number. If the server is running on the same host, each device may be identified twice: once through the local device plugins and once through the device plugins that the server application has access to.

This plugin consists of libomptarget.rtl.rpc.so and openmp-offloading-server, which should be running on the (remote) host. The server application does not have to be running on a remote host and can instead be used on the same host in order to debug memory mapping during offloading. These are implemented via gRPC/protobuf, so these libraries are required to build and use this plugin. The server must also have access to the necessary target-specific plugins in order to perform the offloading.

Due to the experimental nature of this plugin, the CMake variable LIBOMPTARGET_ENABLE_EXPERIMENTAL_REMOTE_PLUGIN must be set in order to build it. Note that the rpc plugin is not designed to be thread-safe: the server cannot handle offloading from multiple applications concurrently (it is synchronous) and will terminate after a single execution. Note also that openmp-offloading-server is unable to remote offload onto a remote host itself and will error out if this is attempted.

Remote offloading is configured via environment variables at runtime of the OpenMP application:

• LIBOMPTARGET_RPC_ADDRESS=<Address>:<Port>
• LIBOMPTARGET_RPC_ALLOCATOR_MAX=<NumBytes>
• LIBOMPTARGET_BLOCK_SIZE=<NumBytes>
• LIBOMPTARGET_RPC_LATENCY=<Seconds>

LIBOMPTARGET_RPC_ADDRESS
The address and port at which the server is running. This needs to be set for the server and the application; the default is 0.0.0.0:50051. A single OpenMP executable can offload onto multiple remote hosts by setting this to comma-separated values of the addresses.

LIBOMPTARGET_RPC_ALLOCATOR_MAX
After allocating this size, the protobuf allocator will clear. This can be set for both endpoints.

LIBOMPTARGET_BLOCK_SIZE
This is the maximum size of a single message while streaming data transfers between the two endpoints and can be set for both endpoints.

LIBOMPTARGET_RPC_LATENCY
This is the maximum amount of time the client will wait for a response from the server.
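A hypothetical session (the address and application name are placeholders) could look as follows:

   # On the remote host, with the required target-specific plugins available:
   $ openmp-offloading-server

   # On the local host, point the rpc plugin at the server:
   $ env LIBOMPTARGET_RPC_ADDRESS=203.0.113.5:50051 ./your-application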
LLVM/OpenMP Target Device Runtime (libomptarget-ARCH-SUBARCH.bc)

OPENMP OPTIMIZATIONS IN LLVM

LLVM, since version 11 (12 Oct 2020), has an OpenMP-Aware optimization pass as well as the ability to perform "scalar optimizations" across OpenMP region boundaries. In-depth discussion of the topic can be found here.

OpenMP-Aware Optimizations
LLVM, since version 11 (12 Oct 2020), supports an OpenMP-Aware optimization pass. This optimization pass will attempt to optimize the module with OpenMP-specific domain-knowledge. This pass is enabled by default at high optimization levels (O2 / O3) if compiling with OpenMP support enabled.

OpenMPOpt
• OpenMP Runtime Call Deduplication
• Globalization

OpenMPOpt contains several OpenMP-Aware optimizations. This pass is run early on the entire Module, and later on the entire call graph. Most optimizations done by OpenMPOpt support remarks. Optimization remarks can be enabled by compiling with the following flags.

   $ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt

OpenMP Runtime Call Deduplication
The OpenMP runtime library contains several functions used to implement features of the OpenMP standard. Several of the runtime calls are constant within a parallel region. A common optimization is to replace invariant code with a single reference, but in this case the compiler will only see an opaque call into the runtime library. To get around this, OpenMPOpt maintains a list of OpenMP runtime functions that are constant and will manually deduplicate them.

Globalization
The OpenMP standard requires that data can be shared between different threads. This requirement poses a unique challenge when offloading to GPU accelerators. Data cannot be shared between the threads in a GPU by default; in order to do this it must be placed in either global or shared memory. This needs to be done every time a variable may potentially be shared in order to create correct OpenMP programs. Unfortunately, this has significant performance implications and is not needed in the majority of cases. For example, when Clang is generating code for this offloading region, it will see that the variable x escapes and is potentially shared. This requires globalizing the variable, which means it cannot reside in the registers on the device.

   void use(void *) { }

   void foo() {
     int x;
     use(&x);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }

In many cases, this transformation is not actually necessary but still carries a significant performance penalty. Because of this, OpenMPOpt can perform an interprocedural optimization that scans each known usage of the globalized variable and determines if it is potentially captured and shared by another thread. If it is not actually captured, it can safely be moved back to fast register memory.

Another case is memory that is intentionally shared between the threads, but is shared from one thread to all the others. Such variables can be moved to shared memory at compile time without needing to go through the runtime library. This allows users to confidently declare shared memory on the device without needing to use custom OpenMP allocators or rely on the runtime.

   static void share(void *);

   static void foo() {
     int x[64];
   #pragma omp parallel
     share(x);
   }

   int main() {
   #pragma omp target
     foo();
   }

These optimizations can have very large performance implications. Both of them rely heavily on interprocedural analysis. Because of this, offloading applications should ideally be contained in a single translation unit and functions should not be externally visible unless needed. OpenMPOpt will inform the user if any globalization calls remain when remarks are enabled. This should be treated as a defect in the program.
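For instance, compiling the example above with remarks enabled should report whether the x[64] array was moved to shared memory. Assuming the file is named shared.cpp, the output might resemble the OMP111 remark described later (the exact byte count and source location depend on the code):

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt shared.cpp
   shared.cpp:4:7: remark: Replaced globalized variable with 256 bytes of shared memory. [OMP111]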
Resources
• 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
• 2020 LLVM Developers Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
• 2019 EuroLLVM Developers Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30

OpenMP-Unaware Optimizations

Resources
• 2018 LLVM Developers Meeting: "Optimizing Indirections, using abstractions without remorse" https://youtu.be/zfiHaPaoQPc
• 2019 LLVM Developers Meeting: "The Attributor: A Versatile Inter-procedural Fixpoint Iteration Framework" https://youtu.be/CzWkc_JcfS0

LLVM has an elaborate ecosystem around analysis and optimization remarks issued during compilation. The remarks can be enabled from the clang frontend [1] [2] in various formats [3] [4] to be used by tools, i.a., opt-viewer or llvm-opt-report (dated).

The OpenMP optimizations in LLVM have been developed with remark support as a priority. For a list of OpenMP specific remarks and more information on them, please refer to OpenMP Optimization Remarks.

• [1] https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports
• [2] https://clang.llvm.org/docs/ClangCommandLineReference.html#diagnostic-flags
• [3] https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-foptimization-record-file
• [4] https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang1-fsave-optimization-record

OPENMP OPTIMIZATION REMARKS

The OpenMP-Aware optimization pass is able to generate compiler remarks for performed and missed optimizations. To emit them, pass -Rpass=openmp-opt, -Rpass-analysis=openmp-opt, and -Rpass-missed=openmp-opt to the Clang invocation. For more information and features of the remark system, consult the clang documentation:

• Clang options to emit optimization reports
• Clang diagnostic and remark flags
• The -foptimization-record-file flag and the -fsave-optimization-record flag

OpenMP Remarks

Potentially unknown OpenMP target region caller [OMP100]
A function remark that indicates the function, when compiled for a GPU, is potentially called from outside the translation unit. Note that a remark is only issued if we tried to perform an optimization which would require us to know all callers on the GPU.

To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through which the code that makes up the body of a parallel region is shared with the threads in the team. Generally we use the address of the outlined parallel region to identify the code that needs to be executed. If we know all target regions that reach the parallel region we can avoid this function pointer passing scheme and often improve the register usage on the GPU. However, if a parallel region on the GPU is in a function with external linkage we may not know all callers statically. If there are outside callers within target regions, this remark is to be ignored. If there are no such callers, users can modify the linkage and thereby help optimization with a static or __attribute__((internal)) function annotation. If changing the linkage is impossible, e.g., because there are outside callers on the host, one can split the function into an externally visible interface which is not compiled for the target and an internal implementation which is compiled for the target and should be called from within the target region.
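As a minimal sketch of the suggested fix (the function names are illustrative, not taken from any remark), giving the function internal linkage lets the compiler see all GPU callers of the parallel region:

   // Internal linkage: every caller of the parallel region is visible
   // within this translation unit, so the function pointer passing
   // scheme described above can be avoided.
   static void worker() {
   #pragma omp parallel
     { /* parallel work */ }
   }

   void entry() {
   #pragma omp target
     worker();
   }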
Parallel region is used in unknown / unexpected ways. Will not attempt to rewrite the state machine. [OMP101]
An analysis remark that indicates that a parallel region has unknown calls.

Parallel region is not called from a unique kernel. Will not attempt to rewrite the state machine. [OMP102]
This analysis remark indicates that a given parallel region is called by multiple kernels. This prevents the compiler from optimizing it to a single kernel and rewriting the state machine.

Moving globalized variable to the stack. [OMP110]
This optimization remark indicates that a globalized variable was moved back to thread-local stack memory on the device. This occurs when the optimization pass can determine that a globalized variable cannot possibly be shared between threads and globalization was ultimately unnecessary. Using stack memory is the best-case scenario for data globalization as the variable can now be stored in fast register files on the device. This optimization requires full visibility of each variable.

Globalization typically occurs when a pointer to a thread-local variable escapes the current scope. The compiler needs to be pessimistic and assume that the pointer could be shared between multiple threads according to the OpenMP standard. This is expensive on target offloading devices that do not allow threads to share data by default. Instead, this data must be moved to memory that can be shared, such as shared or global memory. This optimization moves the data back from shared or global memory to thread-local stack memory if the data is not actually shared between the threads.

Examples
A trivial example of globalization can be seen below. The compiler sees that a pointer to the thread-local variable x escapes the current scope and must globalize it even though it is not actually necessary. Fortunately, this optimization can undo the globalization by looking at the variable's usage.

   void use(int *x) { }

   void foo() {
     int x;
     use(&x);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
   omp110.cpp:6:7: remark: Moving globalized variable to the stack. [OMP110]
   int x;
       ^

A less trivial example can be seen using C++'s complex numbers. In this case the overloaded arithmetic operators cause pointers to the complex numbers to escape the current scope, but they can again be removed once the usage is visible.

   #include <complex>

   using complex = std::complex<double>;

   void zaxpy(complex *X, complex *Y, const complex D, int N) {
   #pragma omp target teams distribute parallel for firstprivate(D)
     for (int i = 0; i < N; ++i)
       Y[i] = D * X[i] + Y[i];
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
   In file included from omp110.cpp:1:
   In file included from /usr/bin/clang/lib/clang/13.0.0/include/openmp_wrappers/complex:27:
   /usr/include/c++/8/complex:328:20: remark: Moving globalized variable to the stack. [OMP110]
   complex<_Tp> __r = __x;
                ^
   /usr/include/c++/8/complex:388:20: remark: Moving globalized variable to the stack. [OMP110]
   complex<_Tp> __r = __x;
                ^

Diagnostic Scope
OpenMP target offloading optimization remark.

Replaced globalized variable with X bytes of shared memory. [OMP111]
This optimization occurs when a globalized variable's data is shared between multiple threads, but requires a constant amount of memory that can be determined at compile time. This is the case when only a single thread creates the memory that is then shared between every thread. The memory can then be pushed to a static buffer of shared memory on the device.
This optimization allows users to declare shared memory on the device without using OpenMP's custom allocators.

Globalization occurs when a pointer to a thread-local variable escapes the current scope. If a single thread is known to be responsible for creating and sharing the data, it can instead be mapped directly to the device's shared memory. Checking if only a single thread can execute an instruction requires that the parent functions have internal linkage. Otherwise, an external caller could invalidate this analysis by having multiple threads call that function. The optimization pass will make internal copies of each function to use for this reason, but it is still recommended to mark them as internal using keywords like static whenever possible.

Example
This optimization should apply to any variable declared in an OpenMP target region that is then shared with every thread in a parallel region. This allows the user to declare shared memory without using custom allocators. A simple stencil calculation shows how this can be used. (MC, NC, and dX must be compile-time constants for the shared buffer to have a known size; definitions such as the following are assumed.)

   constexpr int MC = 32, NC = 32;   // tile size; 32 * 32 doubles = 8192 bytes
   constexpr double dX = 1.0;        // grid spacing

   void stencil(int M, int N, double *X, double *Y) {
   #pragma omp target teams distribute collapse(2) \
       map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
     for (int i0 = 0; i0 < M; i0 += MC) {
       for (int j0 = 0; j0 < N; j0 += NC) {
         double sX[MC][NC];
   #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
         for (int i1 = 0; i1 < MC; ++i1)
           for (int j1 = 0; j1 < NC; ++j1)
             sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];
   #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
         for (int i1 = 1; i1 < MC - 1; ++i1)
           for (int j1 = 1; j1 < NC - 1; ++j1)
             Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
                                             sX[i1][j1 + 1] + sX[i1][j1 - 1] +
                                             -4.0 * sX[i1][j1]) /
                                            (dX * dX);
       }
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
   omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
   double sX[MC][NC];
          ^

The default mapping for variables captured in an OpenMP parallel region is shared. This means a pointer is taken to the object, which ultimately results in globalization that will be mapped to shared memory when it could have been placed in registers. To avoid this, make sure each variable that can be copied into the region is marked firstprivate, either explicitly or by using the OpenMP 5.1 feature default(firstprivate).

Diagnostic Scope
OpenMP target offloading optimization remark.

Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112]
This missed remark indicates that a globalized value was found on the target device that was neither replaced with stack memory by OMP110 nor with shared memory by OMP111. Globalization that has not been removed will need to be handled by the runtime and will significantly impact performance.

The OpenMP standard requires that threads are able to share their data with each other. However, this is not true by default when offloading to a target device such as a GPU. Threads on a GPU cannot share their data unless it is first placed in global or shared memory. In order to create standards-compliant code, the Clang compiler will globalize any variables that could potentially be shared between the threads. In the majority of cases, globalized variables can either be returned to a thread-local stack or pushed to shared memory. However, in a few cases globalization is necessary and will cause a performance penalty.

Examples
This example shows legitimate data sharing on the device.
It is a convoluted example, but it is fully compliant with the OpenMP standard. If globalization were not added, this code would produce different results on different target devices.

   #include <omp.h>
   #include <cstdio>

   #pragma omp declare target
   static int *p;
   #pragma omp end declare target

   void foo() {
     int x = omp_get_thread_num();
     if (omp_get_thread_num() == 1)
       p = &x;

   #pragma omp barrier

     printf("Thread %d: %d\n", omp_get_thread_num(), *p);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
   omp112.cpp:9:7: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
   int x = omp_get_thread_num();
       ^

A less convoluted example of globalization that cannot be removed occurs when calling functions that aren't visible from the current translation unit.

   extern void use(int *x);

   void foo() {
     int x;
     use(&x);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
   omp112.cpp:4:7: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
   int x;
       ^

Diagnostic Scope
OpenMP target offloading missed remark.

Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as __attribute__((noescape)) to override. [OMP113]
This missed remark indicates that a globalized value could not be moved to the stack because it is potentially captured by a call to a function we cannot analyze. In order for a globalized variable to be moved to the stack, copies of its pointer cannot be stored. Otherwise it is considered captured and could potentially be shared between the threads. This can be overridden using a parameter-level attribute as suggested in the remark text.

Globalization will occur when a pointer to a thread-local variable escapes the current scope. In most cases it can be determined that the variable cannot be shared if a copy of its pointer is never made. However, this remark indicates that a copy of the pointer is present or that sharing is possible because the variable is used outside the current translation unit.

Examples
If a pointer to a thread-local variable is passed to a function not visible in the current translation unit, we need to assume a copy is made of it that can be shared between the threads. This prevents OMP110 from triggering, which will result in a performance penalty when executing on the target device.

   extern void use(int *x);

   void foo() {
     int x;
     use(&x);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-missed=openmp-opt omp113.cpp
   omp113.cpp:4:7: remark: Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as `__attribute__((noescape))` to override. [OMP113]
   int x;
       ^

As the remark suggests, this behaviour can be overridden using the noescape attribute. This tells the compiler that no reference to the object the pointer points to, derived from the parameter value, will survive after the function returns. The user is responsible for verifying that this assertion is correct.
   extern void use(__attribute__((noescape)) int *x);

   void foo() {
     int x;
     use(&x);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp113.cpp
   omp113.cpp:4:7: remark: Moving globalized variable to the stack. [OMP110]
   int x;
       ^

Diagnostic Scope
OpenMP target offloading missed remark.

Transformed generic-mode kernel to SPMD-mode [OMP120]
This optimization remark indicates that the execution strategy for the OpenMP target offloading kernel was changed. Generic-mode kernels are executed by a single thread that schedules parallel worker threads using a state machine. This code transformation can move a kernel that was initially generated in generic mode to SPMD-mode, where all threads are active at the same time with no state machine. This execution strategy is closer to how the threads are actually executed on a GPU target. It is only possible if the instructions previously executed by a single thread have no side-effects or can be guarded. If the instructions have no side-effects, they are simply recomputed by each thread.

Generic-mode is often considerably slower than SPMD-mode because of the extra overhead required to separately schedule worker threads and pass data between them. This optimization allows users to use generic-mode semantics while achieving the performance of SPMD-mode. This can be helpful when defining shared memory between the threads using OMP111.

Examples
Normally, any kernel that contains split OpenMP target and parallel regions will be executed in generic-mode. Sometimes it is easier to use generic-mode semantics to define shared memory, or to more tightly control the distribution of the threads. This shows a naive matrix-matrix multiplication that contains code that will need to be guarded.

   void matmul(int M, int N, int K, double *A, double *B, double *C) {
   #pragma omp target teams distribute collapse(2) \
       map(to:A[0 : M*K]) map(to:B[0 : K*N]) map(tofrom:C[0 : M*N])
     for (int i = 0; i < M; i++) {
       for (int j = 0; j < N; j++) {
         double sum = 0.0;
   #pragma omp parallel for reduction(+:sum) default(firstprivate)
         for (int k = 0; k < K; k++)
           sum += A[i*K + k] * B[k*N + j];

         C[i*N + j] = sum;
       }
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
   omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
   double sum = 0.0;
          ^
   omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
   #pragma omp target teams distribute collapse(2) \
   ^

This requires guarding the store to the shared variable sum and the store to the matrix C. It can be thought of as generating the code below.

   void matmul(int M, int N, int K, double *A, double *B, double *C) {
   #pragma omp target teams distribute collapse(2) \
       map(to:A[0 : M*K]) map(to:B[0 : K*N]) map(tofrom:C[0 : M*N])
     for (int i = 0; i < M; i++) {
       for (int j = 0; j < N; j++) {
         double sum;
   #pragma omp parallel default(firstprivate) shared(sum)
         {
   #pragma omp barrier
           if (omp_get_thread_num() == 0)
             sum = 0.0;
   #pragma omp barrier

   #pragma omp for reduction(+:sum)
           for (int k = 0; k < K; k++)
             sum += A[i*K + k] * B[k*N + j];

   #pragma omp barrier
           if (omp_get_thread_num() == 0)
             C[i*N + j] = sum;
   #pragma omp barrier
         }
       }
     }
   }

Diagnostic Scope
OpenMP target offloading optimization remark.

Value has potential side effects preventing SPMD-mode execution. Add __attribute__((assume("ompx_spmd_amenable"))) to the called function to override.
[OMP121]
This analysis remark indicates that a potential side-effect that cannot be guarded is preventing the target region from executing in SPMD-mode. SPMD-mode requires that each thread is active inside the region. Any instruction that can be neither recomputed by each thread independently nor guarded and executed by a single thread prevents the region from executing in SPMD-mode.

This remark will attempt to print out the instructions preventing the region from being executed in SPMD-mode. Calls to functions outside the current translation unit will prevent this transformation from occurring as well, but they can be overridden using an assumption stating that the function contains no calls that prevent SPMD execution.

Examples
Calls to functions outside the current translation unit may contain instructions or operations that cannot be executed in SPMD-mode.

   extern int work();
   void use(int x);

   void foo() {
   #pragma omp target teams
     {
       int x = work();
   #pragma omp parallel
       use(x);
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp121.cpp
   omp121.cpp:8:13: remark: Value has potential side effects preventing SPMD-mode execution. Add `__attribute__((assume("ompx_spmd_amenable")))` to the called function to override. [OMP121]
   int x = work();
           ^

As the remark suggests, the problem is caused by the unknown call to the external function work. This can be overridden by asserting that it does not contain any code that prevents SPMD-mode execution.

   __attribute__((assume("ompx_spmd_amenable"))) extern int work();
   void use(int x);

   void foo() {
   #pragma omp target teams
     {
       int x = work();
   #pragma omp parallel
       use(x);
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp121.cpp
   omp121.cpp:6:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
   #pragma omp target teams
   ^

Diagnostic Scope
OpenMP target offloading analysis remark.

Removing unused state machine from generic-mode kernel. [OMP130]
This optimization remark indicates that an unused state machine was removed from a target region. This occurs when there are no parallel regions inside of a target construct. Normally, a state machine is required to schedule the threads inside of a parallel region. If there are no parallel regions, the state machine is unnecessary because there is only a single thread active at any time.

Examples
This optimization should occur on any target region that does not contain any parallel work.

   void copy(int N, double *X, double *Y) {
   #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N])
     for (int i = 0; i < N; ++i)
       Y[i] = X[i];
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp130.cpp
   omp130.cpp:2:1: remark: Removing unused state machine from generic-mode kernel. [OMP130]
   #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N])
   ^

Diagnostic Scope
OpenMP target offloading optimization remark.

Rewriting generic-mode kernel with a customized state machine. [OMP131]
This optimization remark indicates that a generic-mode kernel on the device was specialized for the given target region. When offloading in generic-mode, a state machine is required to schedule the work between the parallel worker threads. This optimization specializes the state machine in cases where there is a known number of parallel regions inside the kernel. A much simpler state machine can be used if it is known that there is no nested parallelism and the number of regions to schedule is a static amount.
Examples
This optimization should occur on any generic-mode kernel that has visibility of all parallel regions, but cannot be moved to SPMD-mode and has no nested parallelism.

   #include <omp.h>

   extern void work();

   #pragma omp declare target
   int TID;
   #pragma omp end declare target

   void foo() {
   #pragma omp target
     {
       TID = omp_get_thread_num();
   #pragma omp parallel
       {
         work();
       }
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp131.cpp
   omp131.cpp:8:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131]
   #pragma omp target
   ^

Diagnostic Scope
OpenMP target offloading optimization remark.

Generic-mode kernel is executed with a customized state machine that requires a fallback. [OMP132]
This analysis remark indicates that a state machine rewrite occurred, but could not be done fully because of unknown calls to functions that may contain parallel regions. The state machine handles scheduling work between parallel worker threads on the device when operating in generic-mode. If there are unknown parallel regions, they prevent the optimization from fully rewriting the state machine.

Examples
This will occur for any generic-mode kernel that may contain unknown parallel regions. This is typically coupled with the OMP133 remark.

   extern void setup();
   extern void work();

   void foo() {
   #pragma omp target
     {
       setup();
   #pragma omp parallel
       {
         work();
       }
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp132.cpp
   omp132.cpp:4:1: remark: Generic-mode kernel is executed with a customized state machine that requires a fallback. [OMP132]
   #pragma omp target
   ^

Diagnostic Scope
OpenMP target offloading analysis remark.

Call may contain unknown parallel regions. Use __attribute__((assume("omp_no_parallelism"))) to override. [OMP133]
This analysis remark identifies calls that prevented OMP131 from providing the generic-mode kernel with a fully specialized state machine. It will identify each call that may contain unknown parallel regions and that caused the kernel to require a fallback.

Examples
This will occur for any generic-mode kernel that may contain unknown parallel regions. This is typically coupled with the OMP132 remark.

   extern void setup();
   extern void work();

   void foo() {
   #pragma omp target
     {
       setup();
   #pragma omp parallel
       {
         work();
       }
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp133.cpp
   omp133.cpp:6:5: remark: Call may contain unknown parallel regions. Use `__attribute__((assume("omp_no_parallelism")))` to override. [OMP133]
   setup();
   ^

The remark suggests marking the function with the assumption that it contains no parallel regions. If this is done, the kernel will be rewritten with a fully specialized state machine.

   __attribute__((assume("omp_no_parallelism"))) extern void setup();
   extern void work();

   void foo() {
   #pragma omp target
     {
       setup();
   #pragma omp parallel
       {
         work();
       }
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp133.cpp
   omp133.cpp:4:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131]
   #pragma omp target
   ^

Diagnostic Scope
OpenMP target offloading analysis remark.

Could not internalize function. Some optimizations may not be possible. [OMP140]
This analysis remark indicates that function internalization failed for the given function. Internalization occurs when a call to a function that ordinarily has external visibility is replaced with a call to a copy of that function with only internal visibility. This allows the compiler to make strong static assertions about the context a function is called in.
Without internalization this analysis would always be invalidated by the possibility of the function being called in a different context outside of the current translation unit. This is necessary for optimizations like OMP111 and OMP120. If a function failed to be internalized, it most likely has linkage that cannot be copied. Internalization is currently only enabled by default for OpenMP target offloading.

Examples
This will occur for any function declaration that has incompatible linkage.

   __attribute__((weak)) void setup();
   extern void work();

   void foo() {
   #pragma omp target
     {
       setup();
   #pragma omp parallel
       {
         work();
       }
     }
   }

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-analysis=openmp-opt omp140.cpp
   omp140.cpp:1:1: remark: Could not internalize function. Some optimizations may not be possible. [OMP140]
   __attribute__((weak)) void setup() {
   ^

Diagnostic Scope
OpenMP analysis remark.

Parallel region merged with parallel region at <location>. [OMP150]
This optimization remark indicates that a parallel region was merged with others into a single parallel region. Parallel region merging fuses consecutive parallel regions to reduce the team activation overhead of forking and increases the scope of possible OpenMP-specific optimizations within merged parallel regions. This optimization can also guard sequential code between two parallel regions if applicable.

Example
This optimization should apply to any compatible and consecutive parallel regions. In this case the sequential region between the parallel regions will be guarded so it is only executed by a single thread in the new merged region.

   extern void parallel_work();
   extern void sequential_work();

   void foo() {
   #pragma omp parallel
     parallel_work();

     sequential_work();

   #pragma omp parallel
     parallel_work();
   }

   $ clang++ -fopenmp -O2 -Rpass=openmp-opt -mllvm -openmp-opt-enable-merging omp150.cpp
   omp150.cpp:2:1: remark: Parallel region merged with parallel region at omp150.cpp:7:1. [OMP150]
   #pragma omp parallel
   ^

Diagnostic Scope
OpenMP optimization remark.

Removing parallel region with no side-effects. [OMP160]
This optimization remark indicates that a parallel region was deleted because it was not found to have any side-effects. This can occur if the region does not write any of its results to memory visible outside the region. This optimization is necessary because the barrier between sequential and parallel code typically prevents dead code elimination from completely removing the region. Otherwise there would still be overhead to fork and merge the threads even though no work is done.

Example
This optimization occurs whenever a parallel region was not found to have any side-effects. This can occur if the parallel region only reads memory or is simply empty.

   void foo() {
   #pragma omp parallel
     { }
   #pragma omp parallel
     { int x = 1; }
   }

   $ clang++ -fopenmp -O2 -Rpass=openmp-opt omp160.cpp
   omp160.cpp:2:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
   #pragma omp parallel
   ^
   omp160.cpp:4:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
   #pragma omp parallel
   ^

Diagnostic Scope
OpenMP optimization remark.

OpenMP runtime call <call> deduplicated. [OMP170]
This optimization remark indicates that a call to an OpenMP runtime call was replaced with the result of an existing one. This occurs when the compiler knows that the result of a runtime call is immutable. Removing duplicate calls is done by replacing all calls to that function with the result of the first call.
This cannot be done automatically by the compiler because the implementations of the OpenMP runtime calls live in a separate library the compiler cannot see.

Example
This optimization will trigger for known OpenMP runtime calls whose return value will not change.

   #include <omp.h>
   #include <stdlib.h>

   extern void work(double *A);

   void foo(int N) {
     double *A = malloc(N * omp_get_thread_limit() * sizeof(double));
     double *B = malloc(N * omp_get_thread_limit() * sizeof(double));

   #pragma omp parallel
     work(&A[omp_get_thread_num() * N]);
   #pragma omp parallel
     work(&B[omp_get_thread_num() * N]);
   }

   $ clang -fopenmp -O2 -Rpass=openmp-opt omp170.c
   omp170.c:2:26: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170]
   double *A = malloc(N * omp_get_thread_limit() * sizeof(double));
                          ^

Diagnostic Scope
OpenMP optimization remark.

+--------------------+------------------+----------------------------------------+
| Diagnostics Number | Diagnostics Kind | Diagnostics Description                |
+--------------------+------------------+----------------------------------------+
| OMP100             | Analysis         | Potentially unknown OpenMP target      |
|                    |                  | region caller.                         |
+--------------------+------------------+----------------------------------------+
| OMP101             | Analysis         | Parallel region is used in unknown /   |
|                    |                  | unexpected ways. Will not attempt to   |
|                    |                  | rewrite the state machine.             |
+--------------------+------------------+----------------------------------------+
| OMP102             | Analysis         | Parallel region is not called from a   |
|                    |                  | unique kernel. Will not attempt to     |
|                    |                  | rewrite the state machine.             |
+--------------------+------------------+----------------------------------------+
| OMP110             | Optimization     | Moving globalized variable to the      |
|                    |                  | stack.                                 |
+--------------------+------------------+----------------------------------------+
| OMP111             | Optimization     | Replaced globalized variable with X    |
|                    |                  | bytes of shared memory.                |
+--------------------+------------------+----------------------------------------+
| OMP112             | Missed           | Found thread data sharing on the GPU.  |
|                    |                  | Expect degraded performance due to     |
|                    |                  | data globalization.                    |
+--------------------+------------------+----------------------------------------+
| OMP113             | Missed           | Could not move globalized variable to  |
|                    |                  | the stack. Variable is potentially     |
|                    |                  | captured in call. Mark parameter as    |
|                    |                  | __attribute__((noescape)) to           |
|                    |                  | override.                              |
+--------------------+------------------+----------------------------------------+
| OMP120             | Optimization     | Transformed generic-mode kernel to     |
|                    |                  | SPMD-mode.                             |
+--------------------+------------------+----------------------------------------+
| OMP121             | Analysis         | Value has potential side effects       |
|                    |                  | preventing SPMD-mode execution. Add    |
|                    |                  | __attribute__((assume(                 |
|                    |                  | "ompx_spmd_amenable"))) to the called  |
|                    |                  | function to override.                  |
+--------------------+------------------+----------------------------------------+
| OMP130             | Optimization     | Removing unused state machine from     |
|                    |                  | generic-mode kernel.                   |
+--------------------+------------------+----------------------------------------+
| OMP131             | Optimization     | Rewriting generic-mode kernel with a   |
|                    |                  | customized state machine.              |
+--------------------+------------------+----------------------------------------+
| OMP132             | Analysis         | Generic-mode kernel is executed with   |
|                    |                  | a customized state machine that        |
|                    |                  | requires a fallback.                   |
+--------------------+------------------+----------------------------------------+
| OMP133             | Analysis         | Call may contain unknown parallel      |
|                    |                  | regions. Use __attribute__((assume(    |
|                    |                  | "omp_no_parallelism"))) to override.   |
+--------------------+------------------+----------------------------------------+
| OMP140             | Analysis         | Could not internalize function. Some   |
|                    |                  | optimizations may not be possible.     |
+--------------------+------------------+----------------------------------------+
| OMP150             | Optimization     | Parallel region merged with parallel   |
|                    |                  | region at <location>.                  |
+--------------------+------------------+----------------------------------------+
| OMP160             | Optimization     | Removing parallel region with no       |
|                    |                  | side-effects.                          |
+--------------------+------------------+----------------------------------------+
| OMP170             | Optimization     | OpenMP runtime call <call>             |
|                    |                  | deduplicated.                          |
+--------------------+------------------+----------------------------------------+

Dealing with OpenMP can be complicated. For help with the setup of an OpenMP (offload) capable compiler toolchain, its usage, and common problems, consult the Support and FAQ page.

We also encourage everyone interested in OpenMP in LLVM to get involved.

SUPPORT, GETTING INVOLVED, AND FAQ

Please do not hesitate to reach out to us via openmp-dev@lists.llvm.org or join one of our regular calls. Some common questions are answered in the FAQ.

Calls

OpenMP in LLVM Technical Call
• Development updates on OpenMP (and OpenACC) in the LLVM Project, including Clang, optimization, and runtime work.
• Join OpenMP in LLVM Technical Call.
• Time: Weekly call every Wednesday, 7:00 AM Pacific time.
• Meeting minutes are here.
• Status tracking page.

OpenMP in Flang Technical Call
• Development updates on OpenMP and OpenACC in the Flang Project.
• Join OpenMP in Flang Technical Call.
• Time: Weekly call every Thursday, 8:00 AM Pacific time.
• Meeting minutes are here.
• Status tracking page.

FAQ

NOTE:
The FAQ is a work in progress and most of the expected content is not yet available. While you can expect changes, we always welcome feedback and additions. Please reach out, e.g., through openmp-dev@lists.llvm.org.

Q: How to contribute a patch to the webpage or any other part?
All patches go through the regular LLVM review process.

Q: How to build an OpenMP GPU offload capable compiler?
To build an effective OpenMP offload capable compiler, only one extra CMake option, LLVM_ENABLE_RUNTIMES="openmp", is needed when building LLVM (generic information about building LLVM is available here). Make sure all backends that are targeted by OpenMP are enabled; by default, Clang will be built with all backends enabled. When building with LLVM_ENABLE_RUNTIMES="openmp", OpenMP should not additionally be enabled in LLVM_ENABLE_PROJECTS, because it is enabled by default.
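A minimal configure step might look like the following sketch (the generator, build type, and install prefix are placeholders; see the LLVM build documentation for details):

   $ cmake -G Ninja ../llvm \
       -DLLVM_ENABLE_PROJECTS="clang" \
       -DLLVM_ENABLE_RUNTIMES="openmp" \
       -DCMAKE_BUILD_TYPE=Release \
       -DCMAKE_INSTALL_PREFIX=$HOME/llvm
   $ ninja install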
For Nvidia offload, please see Q: How to build an OpenMP NVidia offload capable compiler?. For AMDGPU offload, please see Q: How to build an OpenMP AMDGPU offload capable compiler?.

NOTE:
The compiler that generates the offload code should be the same (version) as the compiler that builds the OpenMP device runtimes. The OpenMP host runtime can be built by a different compiler.

Q: How to build an OpenMP NVidia offload capable compiler?
The CUDA SDK is required on the machine that will execute the openmp application. If your build machine is not the target machine or automatic detection of the available GPUs failed, you should also set:

• CLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_XX, where XX is the architecture of your GPU, e.g., 80.
• LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=YY, where YY is the numeric compute capability of your GPU, e.g., 75.

Q: How to build an OpenMP AMDGPU offload capable compiler?
A subset of the ROCm toolchain is required to build the LLVM toolchain and to execute the openmp application. Either install ROCm somewhere that cmake's find_package can locate it, or build the required subcomponents ROCt and ROCr from source.

The two components used are ROCT-Thunk-Interface, roct, and ROCR-Runtime, rocr. Roct is the userspace part of the Linux driver. It calls into the driver which ships with the Linux kernel. It is an implementation detail of Rocr from OpenMP's perspective. Rocr is an implementation of HSA.

   SOURCE_DIR=same-as-llvm-source # e.g. the checkout of llvm-project, next to openmp
   BUILD_DIR=somewhere
   INSTALL_PREFIX=same-as-llvm-install

   cd $SOURCE_DIR
   git clone git@github.com:RadeonOpenCompute/ROCT-Thunk-Interface.git -b roc-4.1.x \
     --single-branch
   git clone git@github.com:RadeonOpenCompute/ROCR-Runtime.git -b rocm-4.1.x \
     --single-branch

   cd $BUILD_DIR && mkdir roct && cd roct
   cmake $SOURCE_DIR/ROCT-Thunk-Interface/ -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
     -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
   make && make install

   cd $BUILD_DIR && mkdir rocr && cd rocr
   cmake $SOURCE_DIR/ROCR-Runtime/src -DIMAGE_SUPPORT=OFF \
     -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DCMAKE_BUILD_TYPE=Release \
     -DBUILD_SHARED_LIBS=ON
   make && make install

IMAGE_SUPPORT requires building rocr with clang and is not used by openmp.

Provided cmake's find_package can find the ROCR-Runtime package, LLVM will build a tool bin/amdgpu-arch which will print a string like gfx906 when run if it recognises a GPU on the local system. LLVM will also build a shared library, libomptarget.rtl.amdgpu.so, which is linked against rocr. With those libraries installed, and LLVM built and installed, try:

   clang -O2 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa example.c -o example && ./example

Q: What are the known limitations of OpenMP AMDGPU offload?
LD_LIBRARY_PATH is presently required to find the openmp libraries.

There is no libc. That is, malloc and printf do not exist. There is also no libm, so functions like cos(double) will not work from target regions.

Cards from the gfx10 line, 'navi', that use wave32 are not yet implemented.

Some versions of the driver for the Radeon VII (gfx906) will error unless the environment variable HSA_IGNORE_SRAMECC_MISREPORT=1 is exported.

The AMDGPU plugin is a recent addition to LLVM and its implementation differs from that which has been shipping in ROCm and AOMP for some time. Early adopters will encounter bugs.

Q: Does OpenMP offloading support work in pre-packaged LLVM releases?
For now, the answer is most likely no. Please see Q: How to build an OpenMP GPU offload capable compiler?.

Q: Does OpenMP offloading support work in packages distributed as part of my OS?
For now, the answer is most likely no. Please see Q: How to build an OpenMP GPU offload capable compiler?.

Q: Does Clang support <math.h> and <complex.h> operations in OpenMP target on GPUs?
Yes, LLVM/Clang allows math functions and complex arithmetic inside of OpenMP target regions that are compiled for GPUs.
Clang provides a set of wrapper headers that are found first when
math.h and complex.h (for C), cmath and complex (for C++), or similar
headers are included by the application. These wrappers eventually
include the system version of the corresponding header file after
setting up a target-device-specific environment. Including the system
header is important because system headers differ based on the
architecture and operating system, and may contain preprocessor,
variable, and function definitions that need to be available in the
target region regardless of the targeted device architecture. However,
various functions may require specialized device versions, e.g., sin,
and others are only available on certain devices, e.g., __umul64hi. To
provide "native" support for math and complex operations on the
respective architecture, Clang wraps the "native" math functions, e.g.,
as provided by the device vendor, in an OpenMP begin/end declare
variant. These functions are then picked up instead of the host
versions, while host-only variables and function definitions remain
available. Complex arithmetic and functions are supported through a
similar mechanism. It is worth noting that this support requires
extensions to the OpenMP begin/end declare variant context selector
that are exposed through LLVM/Clang to the user as well.

Q: What is a way to debug errors from mapping memory to a target
device?

An experimental way to debug these errors is to use remote process
offloading. By using libomptarget.rtl.rpc.so and
openmp-offloading-server, it is possible to explicitly perform memory
transfers between processes on the host CPU and run sanitizers while
doing so in order to catch these errors.

Q: Why does my application say "Named symbol not found" and abort when
I run it?

This is most likely caused by trying to use OpenMP offloading with
static libraries. Static libraries do not contain any device code, so
when the runtime attempts to execute the target region it will not be
found and you will get an error like this:

   CUDA error: Loading '__omp_offloading_fd02_3231c15__Z3foov_l2' Failed
   CUDA error: named symbol not found
   Libomptarget error: Unable to generate entries table for device id 0.

Currently, the only solution is to change how the application is built
and avoid the use of static libraries.

Q: Can I use dynamically linked libraries with OpenMP offloading?

Dynamically linked libraries can only be used if there is no device
code split between the library and the application. Anything declared
on the device inside the shared library will not be visible to the
application when it is linked.

Q: How to build an OpenMP offload capable compiler with an outdated
host compiler?

Enabling the OpenMP runtime will perform a two-stage build for you. If
your host compiler is different from your system-wide compiler, you may
need to set the CMake variable GCC_INSTALL_PREFIX so clang will be able
to find the correct GCC toolchain in the second stage of the build. For
example, if your system-wide GCC installation is too old to build LLVM
and you would like to use a newer GCC, set GCC_INSTALL_PREFIX to inform
clang of the GCC installation you would like to use in the second
stage.

Q: How can I include OpenMP offloading support in my CMake project?

Currently, there is an experimental CMake find module for OpenMP target
offloading provided by LLVM. It will attempt to find OpenMP target
offloading support for your compiler.
The flags necessary for OpenMP target offloading will be loaded into
the OpenMPTarget::OpenMPTarget_<device> target or the
OpenMPTarget_<device>_FLAGS variable if successful. Currently supported
devices are AMDGPU and NVPTX.

To use this module, simply add the path to CMake's current module path
and call find_package. The module will be installed with your OpenMP
installation by default. Including OpenMP offloading support in an
application should now only require a few additions.

   cmake_minimum_required(VERSION 3.13.4)
   project(offloadTest VERSION 1.0 LANGUAGES CXX)

   list(APPEND CMAKE_MODULE_PATH "${PATH_TO_OPENMP_INSTALL}/lib/cmake/openmp")

   find_package(OpenMPTarget REQUIRED NVPTX)

   add_executable(offload)
   target_link_libraries(offload PRIVATE OpenMPTarget::OpenMPTarget_NVPTX)
   target_sources(offload PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src/Main.cpp)

Using this module requires at least CMake version 3.13.4. Supported
languages are C and C++, with Fortran support planned for the future.
Compiler support is best for Clang, but this module should also work
for other compiler vendors such as IBM and GNU.

The current (in-progress) release notes can be found here, while
release notes for releases, starting with LLVM 12, will be available on
the Download Page.

OPENMP 12.0.0 RELEASE NOTES

WARNING:
   These are in-progress notes for the upcoming LLVM 12.0.0 release.
   Release notes for previous releases can be found on the Download
   Page.

Introduction

This document contains the release notes for the OpenMP runtime,
release 12.0.0. Here we describe the status of OpenMP, including major
improvements from the previous release. All OpenMP releases may be
downloaded from the LLVM releases web site.

Non-comprehensive list of changes in this release

• Extended the libomptarget API functions to include source location
  information and OpenMP target mapper support. This allows
  libomptarget to know the source location of the OpenMP region it is
  executing, as well as the names and declarations of all the variables
  used inside the region. Each generated function now uses its mapper
  variant. The old API calls now call into the new API functions with
  nullptr arguments for backwards compatibility with old binaries.
  Source location information for libomptarget is now generated by
  Clang at any level of debugging information.

• Added improved error messages for libomptarget and CUDA plugins.
  Error messages are now presented without requiring a debug build of
  libomptarget. The newly added source location information can also be
  used to identify which OpenMP target region the failure occurred in.
  More information can be found here.

• Added additional environment variables to control output from the
  libomptarget runtime library. LIBOMPTARGET_PROFILE generates time
  profile output similar to Clang's -ftime-trace option.
  LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD sets the threshold size for
  which the libomptarget memory manager will handle the allocation.
  LIBOMPTARGET_INFO allows the user to request certain information from
  the libomptarget runtime using a 32-bit field. A full description of
  each environment variable is given here.

• target nowait is now supported via hidden helper tasks, which are
  tasks not bound to any parallel region. A hidden helper team with a
  number of threads is created when the first hidden helper task is
  encountered. The number of threads can be configured via the
  environment variable LIBOMP_NUM_HIDDEN_HELPER_THREADS; by default it
  is 8.
  If LIBOMP_NUM_HIDDEN_HELPER_THREADS is set to 0, hidden helper tasks
  are disabled and fall back to regular OpenMP tasks. They can also be
  disabled by setting the environment variable
  LIBOMP_USE_HIDDEN_HELPER_TASK=OFF.

• The deviceRTLs for the NVPTX platform are now CUDA-free and consist
  largely of plain OpenMP code; target-dependent parts are implemented
  with Clang/LLVM/NVVM intrinsics. The CUDA SDK has also been dropped
  as a dependency for building the device runtime, which means the
  device runtime can be built on a CUDA-free system. However, this is
  disabled by default; set the CMake variable
  LIBOMPTARGET_BUILD_NVPTX_BCLIB=ON to enable building the NVPTX device
  runtime on a CUDA-free system. gcc-multilib and g++-multilib are
  required. If CUDA is found, the device runtime will be built by
  default.

• The static NVPTX device runtime library (libomptarget-nvptx.a) was
  dropped. A bitcode library is now required to build an OpenMP
  program. If the library is not found in the default path or any of
  the paths defined by LIBRARY_PATH, an error is raised. Users can also
  specify the path to the bitcode device library via
  --libomptarget-nvptx-bc-path=.

AUTHOR

   unknown

COPYRIGHT

   2013-2025, LLVM/OpenMP

Apr 17, 2025                                                  LLVMOPENMP(1)