LLVMOPENMP(1)                    LLVM/OpenMP                    LLVMOPENMP(1)

NAME
llvmopenmp - LLVM/OpenMP

NOTE: This document is a work in progress and most of the expected content
is not yet available. While you can expect changes, we always welcome
feedback and additions. Please contact us, e.g., through
openmp-dev@lists.llvm.org.

OpenMP impacts various parts of the LLVM project, from the frontends (Clang
and Flang), through middle-end optimizations, up to the multitude of
available OpenMP runtimes. A high-level overview of OpenMP in LLVM can be
found here.

OPENMP IN LLVM --- DESIGN OVERVIEW

Resources

   • OpenMP Booth @ SC19: "OpenMP clang and flang Development"
     https://youtu.be/6yOa-hRi63M

LLVM/OpenMP Runtimes

There are four distinct types of LLVM/OpenMP runtimes.

LLVM/OpenMP Host Runtime (libomp)

An early (2015) design document for the LLVM/OpenMP host runtime, a.k.a.
libomp.so, is available as a PDF.

LLVM/OpenMP Target Host Runtime (libomptarget)

Environment Variables

libomptarget uses environment variables to control different features of
the library at runtime. This allows the user to obtain useful runtime
information as well as enable or disable certain features. A full list of
supported environment variables is given below.

   • LIBOMPTARGET_DEBUG=<Num>
   • LIBOMPTARGET_PROFILE=<Filename>
   • LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=<Num>
   • LIBOMPTARGET_INFO=<Num>

LIBOMPTARGET_DEBUG

LIBOMPTARGET_DEBUG controls whether or not debugging information is
displayed. This feature is only available if libomptarget was built with
-DOMPTARGET_DEBUG. The debugging output is intended for use by libomptarget
developers; more user-friendly output is presented when using
LIBOMPTARGET_INFO.

LIBOMPTARGET_PROFILE

LIBOMPTARGET_PROFILE allows libomptarget to generate time profile output
similar to Clang's -ftime-trace option. This generates a JSON file based on
Chrome Tracing that can be viewed with chrome://tracing or the Speedscope
App. Building this feature depends on the LLVM Support Library for time
trace output; using that library is enabled by default when building with
the CMake option OPENMP_ENABLE_LIBOMPTARGET_PROFILING. The output is saved
to the filename specified by the environment variable. For multi-threaded
applications, profiling in libomp is also needed; set the CMake option
OPENMP_ENABLE_LIBOMP_PROFILING=ON to enable it. Note that this turns libomp
into a C++ library.
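For instance, a profiling run might be invoked as follows (a sketch: the
filename profile.json is an arbitrary choice, and the build is assumed to
have profiling enabled):

      $ env LIBOMPTARGET_PROFILE=profile.json ./your-application
      $ # open profile.json in chrome://tracing or the Speedscope App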
LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD

LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD sets the threshold size up to which
the libomptarget memory manager handles an allocation. Any allocation
larger than this threshold bypasses the memory manager and is freed after
the device kernel exits. The default threshold value is 8KB. If
LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD is set to 0, the memory manager is
disabled completely.

LIBOMPTARGET_INFO

LIBOMPTARGET_INFO allows the user to request different types of runtime
information from libomptarget. It uses a 32-bit field to enable or disable
different types of information, including information about data mappings
and kernel execution. It is recommended to build your application with
debugging information enabled; this enables filenames and variable
declarations in the information messages. OpenMP debugging information is
enabled at any level of debugging, so a full debug runtime is not required.
For minimal debugging information compile with -gline-tables-only, or
compile with -g for full debug information.

A full list of flags supported by LIBOMPTARGET_INFO is given below.

   • Print all data arguments upon entering an OpenMP device kernel: 0x01
   • Indicate when a mapped address already exists in the device mapping
     table: 0x02
   • Dump the contents of the device pointer map at kernel exit: 0x04
   • Print OpenMP kernel information from device plugins: 0x10

Any combination of these flags can be used by setting the appropriate bits.
For example, to enable printing all data active in an OpenMP target region
along with CUDA information, run the following bash command.

      $ env LIBOMPTARGET_INFO=$((0x01 | 0x10)) ./your-application

Or, to enable every flag, run with every bit set.

      $ env LIBOMPTARGET_INFO=-1 ./your-application

For example, given a small application implementing the ZAXPY BLAS routine,
libomptarget can provide useful information about data mappings and thread
usage.

      #include <complex>
      #include <cstddef>

      using complex = std::complex<double>;

      void zaxpy(complex *X, complex *Y, complex D, std::size_t N) {
      #pragma omp target teams distribute parallel for
        for (std::size_t i = 0; i < N; ++i)
          Y[i] = D * X[i] + Y[i];
      }

      int main() {
        const std::size_t N = 1024;
        complex X[N], Y[N], D;
      #pragma omp target data map(to : X[0:N]) map(tofrom : Y[0:N])
        zaxpy(X, Y, D, N);
      }

Compiling this code targeting nvptx64 with all information enabled will
provide the following output from the runtime library.

      $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only zaxpy.cpp -o zaxpy
      $ env LIBOMPTARGET_INFO=-1 ./zaxpy

      Info: Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32
      Info: Entering OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
      Info: to(X[0:N])[16384]
      Info: tofrom(Y[0:N])[16384]
      Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:14:1:
      Info: Host Ptr           Target Ptr         Size (B)  RefCount  Declaration
      Info: 0x00007fff963f4000 0x00007fd225004000 16384     1         Y[0:N] at zaxpy.cpp:13:17
      Info: 0x00007fff963f8000 0x00007fd225000000 16384     1         X[0:N] at zaxpy.cpp:13:11
      Info: Entering OpenMP kernel at zaxpy.cpp:6:1 with 4 arguments:
      Info: firstprivate(N)[8] (implicit)
      Info: use_address(Y)[0] (implicit)
      Info: tofrom(D)[16] (implicit)
      Info: use_address(X)[0] (implicit)
      Info: Mapping exists (implicit) with HstPtrBegin=0x00007ffe37d8be80, TgtPtrBegin=0x00007f90ff004000, Size=0, updated RefCount=2, Name=Y
      Info: Mapping exists (implicit) with HstPtrBegin=0x00007ffe37d8fe80, TgtPtrBegin=0x00007f90ff000000, Size=0, updated RefCount=2, Name=X
      Info: Launching kernel __omp_offloading_fd02_c2c4ac1a__Z5daxpyPNSt3__17complexIdEES2_S1_m_l6 with 8 blocks and 128 threads in SPMD mode
      Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:6:1:
      Info: Host Ptr           Target Ptr         Size (B)  RefCount  Declaration
      Info: 0x00007fff963f4000 0x00007fd225004000 16384     1         Y[0:N] at zaxpy.cpp:13:17
      Info: 0x00007fff963f8000 0x00007fd225000000 16384     1         X[0:N] at zaxpy.cpp:13:11
      Info: Exiting OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
      Info: to(X[0:N])[16384]
      Info: tofrom(Y[0:N])[16384]

From this information, we can see the OpenMP kernel being launched on the
CUDA device with enough threads and blocks for all 1024 iterations of the
loop in simplified SPMD mode. The information from the OpenMP data region
shows the two arrays X and Y being copied from the host to the device. This
creates entries in the host-device mapping table associating the host
pointers with the newly created device data. The data mappings in the
OpenMP device kernel show the default mappings being used for all the
variables used implicitly on the device. Because X and Y are already mapped
in the device's table, no new entries are created. Additionally, the
default mapping shows that D will be copied back from the device once the
OpenMP device kernel region ends, even though it is never written to.
Finally, at the end of the OpenMP data region the entries for X and Y are
removed from the table.
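The same mapping table can also be queried programmatically. The following
minimal sketch, which is not part of the original example, uses the
standard OpenMP API routines omp_target_is_present and
omp_get_default_device to check whether a host pointer currently has a
device entry:

      #include <cstdio>
      #include <omp.h>

      int main() {
        static double A[64];
      #pragma omp target enter data map(to : A[0:64])
        // Nonzero if the host address currently has a device mapping entry.
        std::printf("A present on device: %d\n",
                    omp_target_is_present(A, omp_get_default_device()));
      #pragma omp target exit data map(delete : A[0:64])
      }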
Errors:

libomptarget provides error messages when the program fails inside an
OpenMP target region. Common causes of failure are an invalid pointer
access, running out of device memory, or trying to offload while the device
is busy. If the application was built with debugging symbols, the error
messages additionally provide the source location of the OpenMP target
region.

For example, consider the following code that implements a simple parallel
reduction on the GPU. This code has a bug that causes it to fail in the
offloading region.

      #include <cstdio>

      double sum(double *A, std::size_t N) {
        double sum = 0.0;
      #pragma omp target teams distribute parallel for reduction(+ : sum)
        for (int i = 0; i < N; ++i)
          sum += A[i];
        return sum;
      }

      int main() {
        const int N = 1024;
        double A[N];
        sum(A, N);
      }

If this code is compiled and run, there will be an error message indicating
what is going wrong.

      $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
      $ ./sum

      CUDA error: Error when copying data from device to host.
      CUDA error: an illegal memory access was encountered
      Libomptarget error: Copying data from device failed.
      Libomptarget error: Call to targetDataEnd failed, abort target.
      Libomptarget error: Failed to process data after launching the kernel.
      Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
      sum.cpp:5:1: Libomptarget error 1: failure of target construct while offloading is mandatory

This shows that an illegal memory access occurred inside the OpenMP target
region once execution moved to the CUDA device, suggesting a segmentation
fault. This then causes a chain reaction of failures in libomptarget.
Another message suggests using the LIBOMPTARGET_INFO environment variable
as described in Environment Variables. Doing so prints the state of the
host-target pointer mappings at the time of failure.

      $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
      $ env LIBOMPTARGET_INFO=4 ./sum

      info: OpenMP Host-Device pointer mappings after block at sum.cpp:5:1:
      info: Host Ptr           Target Ptr         Size (B)  RefCount  Declaration
      info: 0x00007ffc058280f8 0x00007f4186600000 8         1         sum at sum.cpp:4:10

This tells us that the only data mapped between the host and the device is
the sum variable, which will be copied back from the device once the
reduction has ended. There is no entry mapping the host array A to the
device. In this situation, the compiler cannot determine the size of the
array at compile time, so it simply assumes that the pointer is already
mapped on the device by default. The solution is to add an explicit map
clause in the target region.

      double sum(double *A, std::size_t N) {
        double sum = 0.0;
      #pragma omp target teams distribute parallel for reduction(+ : sum) \
          map(to : A[0:N])
        for (int i = 0; i < N; ++i)
          sum += A[i];
        return sum;
      }
OpenMP in LLVM --- Offloading Design

OpenMP Target Offloading --- SPMD Mode

OpenMP Target Offloading --- Generic Mode

LLVM/OpenMP Target Host Runtime Plugins (libomptarget.rtl.XXXX)

Remote Offloading Plugin:

The remote offloading plugin permits the execution of OpenMP target regions
on devices on remote hosts in addition to the devices connected to the
local host. All target devices on the remote host are exposed to the
application as if they were local devices; that is, the remote host CPU or
its GPUs can be offloaded to with the appropriate device number. If the
server is running on the same host, each device may be identified twice:
once through the local device plugins and once through the device plugins
that the server application has access to.

This plugin consists of libomptarget.rtl.rpc.so and
openmp-offloading-server, which should be running on the (remote) host. The
server application does not have to run on a remote host; it can instead be
used on the same host in order to debug memory mapping during offloading.
Both are implemented via gRPC/protobuf, so these libraries are required to
build and use the plugin. The server must also have access to the necessary
target-specific plugins in order to perform the offloading.

Due to the experimental nature of this plugin, the CMake variable
LIBOMPTARGET_ENABLE_EXPERIMENTAL_REMOTE_PLUGIN must be set in order to
build it. In particular, the rpc plugin is not designed to be thread-safe:
the server cannot handle offloading from multiple applications concurrently
(it is synchronous) and will terminate after a single execution. Note that
openmp-offloading-server is unable to remote offload onto a remote host
itself and will error out if this is attempted.

Remote offloading is configured via environment variables at runtime of the
OpenMP application:

   • LIBOMPTARGET_RPC_ADDRESS=<Address>:<Port>
   • LIBOMPTARGET_RPC_ALLOCATOR_MAX=<NumBytes>
   • LIBOMPTARGET_BLOCK_SIZE=<NumBytes>
   • LIBOMPTARGET_RPC_LATENCY=<Seconds>

LIBOMPTARGET_RPC_ADDRESS

The address and port at which the server is running. This needs to be set
for both the server and the application; the default is 0.0.0.0:50051. A
single OpenMP executable can offload onto multiple remote hosts by setting
this to comma-separated address values, as shown in the example below.

LIBOMPTARGET_RPC_ALLOCATOR_MAX

After allocating this size, the protobuf allocator will clear. This can be
set for both endpoints.

LIBOMPTARGET_BLOCK_SIZE

This is the maximum size of a single message while streaming data transfers
between the two endpoints and can be set for both endpoints.

LIBOMPTARGET_RPC_LATENCY

This is the maximum amount of time the client will wait for a response from
the server.
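For example, a single application could offload onto two remote hosts as
follows (a sketch: the addresses are hypothetical, and the server is
assumed to already be running on each host):

      $ env LIBOMPTARGET_RPC_ADDRESS=192.168.0.10:50051,192.168.0.11:50051 \
            ./your-application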
LLVM/OpenMP Target Device Runtime (libomptarget-ARCH-SUBARCH.bc)

OPENMP OPTIMIZATIONS IN LLVM

LLVM, since version 11 (12 Oct 2020), has an OpenMP-Aware optimization pass
as well as the ability to perform "scalar optimizations" across OpenMP
region boundaries. In-depth discussion of the topic can be found here.

OpenMP-Aware Optimizations

OpenMPOpt

Resources

   • 2020 LLVM Developers Meeting: "(OpenMP) Parallelism-Aware
     Optimizations" https://youtu.be/gtxWkeLCxmU
   • 2019 EuroLLVM Developers Meeting: "Compiler Optimizations for (OpenMP)
     Target Offloading to GPUs" https://youtu.be/3AbS82C3X30

OpenMP-Unaware Optimizations

Resources

   • 2018 LLVM Developers Meeting: "Optimizing Indirections, using
     abstractions without remorse" https://youtu.be/zfiHaPaoQPc
   • 2019 LLVM Developers Meeting: "The Attributor: A Versatile
     Inter-procedural Fixpoint Iteration Framework"
     https://youtu.be/CzWkc_JcfS0

LLVM has an elaborate ecosystem around analysis and optimization remarks
issued during compilation. The remarks can be enabled from the Clang
frontend [1] [2] in various formats [3] [4] to be used by tools, e.g.,
opt-viewer or llvm-opt-report (dated).

The OpenMP optimizations in LLVM have been developed with remark support as
a priority. For a list of OpenMP-specific remarks and more information on
them, please refer to OpenMP Optimization Remarks.

   • [1] https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports
   • [2] https://clang.llvm.org/docs/ClangCommandLineReference.html#diagnostic-flags
   • [3] https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-foptimization-record-file
   • [4] https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang1-fsave-optimization-record

OPENMP OPTIMIZATION REMARKS

The OpenMP-Aware optimization pass is able to generate compiler remarks for
performed and missed optimizations. To emit them, pass -Rpass=openmp-opt,
-Rpass-analysis=openmp-opt, and -Rpass-missed=openmp-opt to the Clang
invocation. For more information on the features of the remark system,
consult the Clang documentation:

   • Clang options to emit optimization reports
   • Clang diagnostic and remark flags
   • The -foptimization-record-file flag and the -fsave-optimization-record
     flag

Some OpenMP remarks start with a "tag", like [OMP100], which indicates that
there is further information about them on this page. To jump directly to
the respective entry, navigate to
https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX where
XXX is the three-digit code shown in the tag.
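For example, remarks for the earlier zaxpy.cpp example could be requested
by extending the compile line as follows (a sketch; which remarks are
printed depends on the optimizations attempted):

      $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 \
            -Rpass=openmp-opt -Rpass-missed=openmp-opt \
            -Rpass-analysis=openmp-opt zaxpy.cpp -o zaxpy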
----

[OMP100] Potentially unknown OpenMP target region caller

A function remark that indicates that the function, when compiled for a
GPU, is potentially called from outside the translation unit. Note that a
remark is only issued if we tried to perform an optimization which would
require us to know all callers on the GPU.

To facilitate OpenMP semantics on GPUs we provide a runtime mechanism
through which the code that makes up the body of a parallel region is
shared with the threads in the team. Generally we use the address of the
outlined parallel region to identify the code that needs to be executed. If
we know all target regions that reach the parallel region we can avoid this
function pointer passing scheme and often improve the register usage on the
GPU. However, if a parallel region on the GPU is in a function with
external linkage we may not know all callers statically. If there are
outside callers within target regions, this remark is to be ignored. If
there are no such callers, users can modify the linkage and thereby help
optimization with a static or __attribute__((internal)) function
annotation. If changing the linkage is impossible, e.g., because there are
outside callers on the host, one can split the function into an externally
visible interface, which is not compiled for the target, and an internal
implementation, which is compiled for the target and should be called from
within the target region.

Dealing with OpenMP can be complicated. For help with the setup of an
OpenMP (offload) capable compiler toolchain, its usage, and common
problems, consult the Support and FAQ page.

We also encourage everyone interested in OpenMP in LLVM to get involved.

SUPPORT, GETTING INVOLVED, AND FAQ

Please do not hesitate to reach out to us via openmp-dev@lists.llvm.org or
join one of our regular calls. Some common questions are answered in the
FAQ.

Calls

OpenMP in LLVM Technical Call

   • Development updates on OpenMP (and OpenACC) in the LLVM Project,
     including Clang, optimization, and runtime work.
   • Join the OpenMP in LLVM Technical Call.
   • Time: Weekly call every Wednesday, 7:00 AM Pacific time.
   • Meeting minutes are here.
   • Status tracking page.

OpenMP in Flang Technical Call

   • Development updates on OpenMP and OpenACC in the Flang Project.
   • Join the OpenMP in Flang Technical Call.
   • Time: Weekly call every Thursday, 8:00 AM Pacific time.
   • Meeting minutes are here.
   • Status tracking page.

FAQ

NOTE: The FAQ is a work in progress and most of the expected content is not
yet available. While you can expect changes, we always welcome feedback and
additions. Please contact us, e.g., through openmp-dev@lists.llvm.org.

Q: How to contribute a patch to the webpage or any other part?

All patches go through the regular LLVM review process.

Q: How to build an OpenMP offload capable compiler?

To build an effective OpenMP offload capable compiler, only one extra CMake
option, LLVM_ENABLE_RUNTIMES="openmp", is needed when building LLVM
(generic information about building LLVM is available here). Make sure all
backends that are targeted by OpenMP are enabled. By default, Clang will be
built with all backends enabled.

If your build machine is not the target machine or automatic detection of
the available GPUs failed, you should also set:

   • CLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_XX, where XX is the architecture of
     your GPU, e.g., 80.
   • LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=YY, where YY is the numeric
     compute capability of your GPU, e.g., 75.

A complete example configuration is sketched below.

NOTE: The compiler that generates the offload code should be the same
(version) as the compiler that builds the OpenMP device runtimes. The
OpenMP host runtime can be built by a different compiler.
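The following sketch shows one possible invocation; the source path, the
Ninja generator, the LLVM_ENABLE_PROJECTS value, and the sm_80/80 values
are illustrative assumptions, not requirements:

      $ cmake -G Ninja ../llvm \
            -DCMAKE_BUILD_TYPE=Release \
            -DLLVM_ENABLE_PROJECTS="clang" \
            -DLLVM_ENABLE_RUNTIMES="openmp" \
            -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_80 \
            -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=80
      $ ninja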
Q: Does OpenMP offloading support work in pre-packaged LLVM releases?

For now, the answer is most likely no. Please see Q: How to build an OpenMP
offload capable compiler?.

Q: Does OpenMP offloading support work in packages distributed as part of
my OS?

For now, the answer is most likely no. Please see Q: How to build an OpenMP
offload capable compiler?.

Q: Does Clang support <math.h> and <complex.h> operations in OpenMP target
on GPUs?

Yes, LLVM/Clang allows math functions and complex arithmetic inside OpenMP
target regions that are compiled for GPUs. Clang provides a set of wrapper
headers that are found first when math.h and complex.h, for C, or cmath and
complex, for C++, or similar headers are included by the application. These
wrappers eventually include the system version of the corresponding header
file after setting up a target-device-specific environment.

The fact that the system header is included is important because system
headers differ based on the architecture and operating system and may
contain preprocessor, variable, and function definitions that need to be
available in the target region regardless of the targeted device
architecture. However, various functions may require specialized device
versions, e.g., sin, and others are only available on certain devices,
e.g., __umul64hi. To provide "native" support for math and complex
operations on the respective architecture, Clang wraps the "native" math
functions, e.g., as provided by the device vendor, in an OpenMP begin/end
declare variant. These functions are then picked up instead of the host
versions, while host-only variables and function definitions remain
available. Complex arithmetic and functions are supported through a similar
mechanism. It is worth noting that this support requires extensions to the
OpenMP begin/end declare variant context selector that are exposed through
LLVM/Clang to the user as well.
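As a minimal sketch (not taken from the original text) of what this
enables, the following function is valid when compiled with offloading, and
the trigonometric calls resolve to device-native implementations inside the
target region:

      #include <cmath>

      double device_trig(double x) {
        double r;
        // x is implicitly firstprivate; r is copied back from the device.
      #pragma omp target map(from : r)
        r = std::sin(x) + std::cos(x);
        return r;
      }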
Q: What is a way to debug errors from mapping memory to a target device?

An experimental way to debug these errors is to use remote process
offloading. By using libomptarget.rtl.rpc.so and openmp-offloading-server,
it is possible to explicitly perform memory transfers between processes on
the host CPU and run sanitizers while doing so in order to catch these
errors.

The current (in-progress) release notes can be found here, while release
notes for releases, starting with LLVM 12, will be available on the
Download Page.

OPENMP 12.0.0 RELEASE NOTES

WARNING: These are in-progress notes for the upcoming LLVM 12.0.0 release.
Release notes for previous releases can be found on the Download Page.

Introduction

This document contains the release notes for the OpenMP runtime, release
12.0.0. Here we describe the status of OpenMP, including major improvements
from the previous release. All OpenMP releases may be downloaded from the
LLVM releases web site.

Non-comprehensive list of changes in this release

   • Extended the libomptarget API functions to include source location
     information and OpenMP target mapper support. This allows libomptarget
     to know the source location of the OpenMP region it is executing, as
     well as the names and declarations of all the variables used inside
     the region. Each generated function now uses its mapper variant. The
     old API calls now call into the new API functions with nullptr
     arguments for backwards compatibility with old binaries. Source
     location information for libomptarget is now generated by Clang at any
     level of debugging information.
   • Added improved error messages for libomptarget and the CUDA plugins.
     Error messages are now presented without requiring a debug build of
     libomptarget. The newly added source location information can also be
     used to identify in which OpenMP target region the failure occurred.
     More information can be found here.
   • Added additional environment variables to control output from the
     libomptarget runtime library. LIBOMPTARGET_PROFILE generates time
     profile output similar to Clang's -ftime-trace option.
     LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD sets the threshold size for
     which the libomptarget memory manager will handle the allocation.
     LIBOMPTARGET_INFO allows the user to request certain information from
     the libomptarget runtime using a 32-bit field. A full description of
     each environment variable is given here.
   • target nowait is now supported via hidden helper tasks, which are
     tasks not bound to any parallel region. A hidden helper team with a
     number of threads is created when the first hidden helper task is
     encountered. The number of threads can be configured via the
     environment variable LIBOMP_NUM_HIDDEN_HELPER_THREADS; by default it
     is 8. If LIBOMP_NUM_HIDDEN_HELPER_THREADS=0, hidden helper tasks are
     disabled and fall back to regular OpenMP tasks. They can also be
     disabled by setting the environment variable
     LIBOMP_USE_HIDDEN_HELPER_TASK=OFF.
   • The deviceRTL for the NVPTX platform is now CUDA-free. It is generally
     OpenMP code; target-dependent parts are implemented with
     Clang/LLVM/NVVM intrinsics. The CUDA SDK is also dropped as a build
     dependency of the device runtime, which means the device runtime can
     also be built on a CUDA-free system. However, this is disabled by
     default; set the CMake variable LIBOMPTARGET_BUILD_NVPTX_BCLIB=ON to
     enable building the NVPTX device runtime on a CUDA-free system
     (gcc-multilib and g++-multilib are required). If CUDA is found, the
     device runtime will be built by default.
   • The static NVPTX device runtime library (libomptarget-nvptx.a) was
     dropped. A bitcode library is now required to build an OpenMP program.
     If the library is not found in the default path or any of the paths
     defined by LIBRARY_PATH, an error will be raised. Users can also
     specify the path to the bitcode device library via
     --libomptarget-nvptx-bc-path=.

AUTHOR
unknown

COPYRIGHT
2013-2025, LLVM/OpenMP

                                 Apr 17, 2025                   LLVMOPENMP(1)
