NVVM IR Specification
Reference guide to the NVVM compiler IR (intermediate representation), which is based on the LLVM IR.
1. Introduction
NVVM IR is a compiler IR (intermediate representation) based on the LLVM IR. The NVVM IR is designed to represent GPU compute kernels (for example, CUDA kernels). High-level language front-ends, like the CUDA C compiler front-end, can generate NVVM IR. The NVVM compiler (which is based on LLVM) generates PTX code from NVVM IR.
NVVM IR and NVVM compilers are mostly agnostic about the source language being used. The PTX codegen part of an NVVM compiler may need to know the source language because of differences in the DCI (driver/compiler interface).
NVVM IR has a binary form and is based on a subset of the LLVM IR bitcode format. This document uses only the human-readable form to describe NVVM IR.
Technically speaking, NVVM IR is LLVM IR with a set of rules, restrictions, and conventions, plus a set of supported intrinsic functions. A program specified in NVVM IR is always a legal LLVM program. A legal LLVM program may not be a legal NVVM program.
There are three levels of support for NVVM IR features:
Supported: The feature is fully supported. Most IR features should fall into this category.
Accepted and ignored: The NVVM compiler will accept this IR feature, but will ignore the required semantics. This applies to some IR features that do not have meaningful semantics on GPUs and that can be ignored. Calling convention markings are an example.
Illegal, not supported: The specified semantics are not supported, such as a fence instruction. Future versions of NVVM may either support or accept and ignore IR that is illegal in the current version.
This document describes version 2.0 of the NVVM IR and version 3.1 of the NVVM debug metadata (see Source Level Debugging Support). The 2.0 version of NVVM IR is incompatible with the previous version 1.11. Linking NVVM IR version 1.11 with version 2.0 will result in a compiler error.
The current NVVM IR is based on LLVM 7.0.1. For the complete semantics of the IR, readers of this document should check the official LLVM Language Reference Manual (https://releases.llvm.org/7.0.1/docs/LangRef.html).
2. Identifiers
The name of a named global identifier must have the form:
@[a-zA-Z$_][a-zA-Z$_0-9]*
Note that it cannot contain the . character.
[@%]llvm.nvvm.*
and [@%]nvvm.*
are reserved words.
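The identifier rule above can be checked mechanically; the following sketch (names are illustrative, not part of NVVM) validates names against the stated pattern:

```python
import re

# A named global identifier: '@' followed by a letter, '$' or '_',
# then any run of letters, digits, '$' or '_'.  The '.' character is
# not allowed, which is why [@%]llvm.nvvm.* and [@%]nvvm.* (which
# contain '.') are reserved for the compiler rather than user code.
NAME_RE = re.compile(r"@[a-zA-Z$_][a-zA-Z$_0-9]*\Z")

print(bool(NAME_RE.match("@my_kernel$1")))  # True
print(bool(NAME_RE.match("@2kernel")))      # False: starts with a digit
print(bool(NAME_RE.match("@my.kernel")))    # False: contains '.'
```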
3. High Level Structure
3.1. Linkage Types
Supported:
private
internal
available_externally
linkonce
weak
common
linkonce_odr
weak_odr
external
Not supported:
appending
extern_weak
See NVVM ABI for PTX for details on how linkage types are translated to PTX.
3.2. Calling Conventions
All LLVM calling convention markings are accepted and ignored. Functions and calls are generated according to the PTX calling convention.
3.2.1. Rules and Restrictions
- When an argument with width less than 32 bits is passed, the zeroext/signext parameter attribute should be set. zeroext will be assumed if not set.
- When a value with width less than 32 bits is returned, the zeroext/signext parameter attribute should be set. zeroext will be assumed if not set.
- Arguments of aggregate or vector type that are passed by value can be passed by pointer with the byval attribute set (referred to as the by-pointer-byval case below). The align attribute must be set if the type requires a non-natural alignment (natural alignment is the alignment inferred for the aggregate type according to the Data Layout section).
- If a function has an argument of aggregate or vector type that is passed by value directly and the type has a non-natural alignment requirement, the alignment must be annotated by the global property annotation <align, alignment>, where alignment is a 32-bit integer whose upper 16 bits represent the argument position (starting from 1) and the lower 16 bits represent the alignment.
- If the return type of a function is an aggregate or a vector that has a non-natural alignment, then the alignment requirement must be annotated by the global property annotation <align, alignment>, where the upper 16 bits are 0, and the lower 16 bits represent the alignment.
- It is not required to annotate a function with <align, alignment> otherwise. If annotated, the alignment must match the natural alignment or the align attribute in the by-pointer-byval case.
- For an indirect call instruction of a function that has a non-natural alignment for its return value or one of its arguments that is not expressed with an align attribute in the by-pointer-byval case, the call instruction must have an attached metadata of kind callalign. The metadata contains a sequence of i32 fields, each of which represents a non-natural alignment requirement. The upper 16 bits of an i32 field represent the argument position (0 for the return value, 1 for the first argument, and so on) and the lower 16 bits represent the alignment. The i32 fields must be sorted in increasing order. For example,
%call = call %struct.S %fp1(%struct.S* byval align 8 %arg1p, %struct.S %arg2), !callalign !10
!10 = !{i32 8, i32 520}
- It is not required to have an i32 metadata field for the other arguments or the return value otherwise. If present, the alignment must match the natural alignment or the align attribute in the by-pointer-byval case.
- It is not required to have a callalign metadata attached to a direct call instruction. If attached, the alignment must match the natural alignment or the align attribute in the by-pointer-byval case.
- The absence of the metadata in an indirect call instruction means using the natural alignment or the align attribute in the by-pointer-byval case.
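The <align, alignment> encoding used by the rules above packs an argument position and an alignment into one i32. As a minimal sketch of that packing (the helper names are illustrative, not part of any NVVM API):

```python
def pack_align(position: int, alignment: int) -> int:
    # Upper 16 bits: argument position (0 = return value, 1 = first
    # argument, and so on); lower 16 bits: the alignment in bytes.
    assert 0 <= position < (1 << 16) and 0 <= alignment < (1 << 16)
    return (position << 16) | alignment

def unpack_align(value: int):
    return value >> 16, value & 0xFFFF

# Return value (position 0) with 8-byte alignment:
print(pack_align(0, 8))     # 8
# First argument with 8-byte alignment:
print(pack_align(1, 8))     # 65544
print(unpack_align(65544))  # (1, 8)
```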
3.3. Visibility Styles
All styles—default, hidden, and protected—are accepted and ignored.
3.4. DLL Storage Classes
Not supported.
3.5. Thread Local Storage Models
Not supported.
3.6. Runtime Preemption Specifiers
Not supported.
3.7. Structure Types
Fully supported.
3.8. Non-Integral Pointer Types
Not supported.
3.9. Comdats
Not supported.
3.10. source_filename
Accepted and ignored.
3.11. Global Variables
A global variable that is not an intrinsic global variable may be optionally declared to reside in one of the following address spaces:
global
shared
constant
If no address space is explicitly specified, the global variable is assumed to reside in the global address space with a generic address value. See Address Spaces for details.
thread_local variables are not supported.
No explicit section (except for the metadata section) is allowed.
Initialization of shared variables is not supported. Use undef initialization.
3.12. Functions
The following are not supported for functions:
Alignment
Explicit section
Garbage collector name
Prefix data
Prologue data
Personality
3.13. Aliases
Supported only as aliases of non-kernel functions.
3.14. Ifuncs
Not supported.
3.15. Named Metadata
Accepted and ignored, except for the following:
!nvvm.annotations: see Global Property Annotation
!nvvmir.version
!llvm.dbg.cu
!llvm.module.flags
The NVVM IR version is specified using a named metadata called !nvvmir.version. The !nvvmir.version named metadata may have one metadata node that contains the NVVM IR version for the module. If multiple such modules are linked together, the named metadata in the linked module may have more than one metadata node, with each node containing a version. A metadata node with the NVVM IR version takes either of the following forms:
-
It may consist of two i32 values—the first denotes the NVVM IR major version number and the second denotes the minor version number. If absent, the version number is assumed to be 1.0, which can be specified as:
!nvvmir.version = !{!0} !0 = !{i32 1, i32 0}
-
It may consist of four i32 values—the first two denote the NVVM IR major and minor versions respectively. The third value denotes the NVVM IR debug metadata major version number, and the fourth value denotes the corresponding minor version number. If absent, the version number is assumed to be 1.0, which may be specified as:
!nvvmir.version = !{!0} !0 = !{i32 1, i32 0, i32 1, i32 0}
The version of NVVM IR described in this document is 2.0. The version of NVVM IR debug metadata described in this document is 3.1.
3.16. Parameter Attributes
Fully supported, except for the following:
Accepted and ignored:
inreg
nest
Not supported:
inalloca
swiftself
swifterror
See Calling Conventions for the use of these attributes.
3.17. Garbage Collector Strategy Names
Not supported.
3.18. Prefix Data
Not supported.
3.19. Prologue Data
Not supported.
3.20. Attribute Groups
Fully supported. The set of supported attributes is equal to the set of attributes accepted where the attribute group is used.
3.21. Function Attributes
Supported:
allocsize
alwaysinline
cold
convergent
inaccessiblememonly
inaccessiblemem_or_argmemonly
inlinehint
minsize
no-jump-tables
noduplicate
noinline
noreturn
norecurse
nounwind
"null-pointer-is-valid"
optforfuzzing
optnone
optsize
readnone
readonly
writeonly
argmemonly
speculatable
strictfp
Not Supported:
alignstack
builtin
nonlazybind
naked
nobuiltin
noimplicitfloat
noredzone
"patchable-function"
probe-stack
returns_twice
sanitize_address
sanitize_memory
sanitize_thread
sanitize_hwaddress
ssp
sspreq
sspstrong
"stack-probe-size"
"no-stack-arg-probe"
uwtable
jumptable
safestack
"thunk"
nocf_check
shadowcallstack
3.22. Global Attributes
Not supported.
3.23. Operand Bundles
Not supported.
3.24. Module-Level Inline Assembly
Supported.
3.25. Data Layout
Only the following data layout is supported:
-
64-bit
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
The following data layouts are deprecated and will be removed in a future release.
-
32-bit
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
-
64-bit
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
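The fields of the data layout string can be decoded mechanically. The following is a rough sketch of such a decoder (not an official parser; it only handles the forms appearing in the layouts above):

```python
def parse_data_layout(layout: str) -> dict:
    """Split an LLVM data layout string into its specifications.

    Maps each type spec (e.g. 'i32') to its bit values from the string:
    'e' marks little-endian and 'n...' lists native integer widths.
    """
    out = {}
    for spec in layout.split("-"):
        if spec == "e":
            out["endianness"] = "little"
        elif spec.startswith("n"):
            out["native_widths"] = [int(w) for w in spec[1:].split(":")]
        else:
            # e.g. 'i32:32:32' (size:abi:pref) or 'p:64:64:64'
            head, *nums = spec.split(":")
            out[head] = tuple(int(n) for n in nums)
    return out

layout = parse_data_layout(
    "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-"
    "i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-"
    "v64:64:64-v128:128:128-n16:32:64"
)
print(layout["p"])    # (64, 64, 64): 64-bit pointers, 64-bit aligned
print(layout["i32"])  # (32, 32)
```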
3.26. Target Triple
Only the following target triple is supported, where * can be any name:
64-bit:
nvptx64-*-cuda
The following target triple is deprecated, and will be removed in a future release:
32-bit:
nvptx-*-cuda
3.27. Pointer Aliasing Rules
Fully supported.
3.28. Volatile Memory Accesses
Fully supported. Note that for code generation, ld.volatile and st.volatile will be generated.
3.29. Memory Model for Concurrent Operations
Not applicable. Threads in an NVVM IR program must use atomic operations or barrier synchronization to communicate.
3.30. Atomic Memory Ordering Constraints
Atomic loads and stores are not supported. Other atomic operations on operands other than 32-bit or 64-bit are not supported.
3.31. Fast-Math Flags
Supported.
3.32. Use-list Order Directives
Not supported.
4. Type System
Fully supported, except for the following:
- The floating point types half, fp128, x86_fp80 and ppc_fp128 are not supported.
- The x86_mmx type is not supported.
- The token type is not supported.
- The non-integral pointer type is not supported.
5. Constants
Fully supported, except for the following:
- Token constants are not supported.
- blockaddress(@function, %block) is not supported.
- For a constant expression that is used as the initializer of a global variable @g1, if the constant expression contains a global identifier @g2, then the constant expression is supported if it can be reduced to the form bitcast+offset, where offset is an integer number (including 0).
6. Other Values
6.1. Inline Assembler Expressions
Inline assembly of PTX instructions is supported, with the following supported constraints:
Constraint |
Type |
---|---|
c |
i8 |
h |
i16 |
r |
i32 |
l |
i64 |
f |
f32 |
d |
f64 |
The inline asm metadata !srcloc is accepted and ignored.
The inline asm dialect inteldialect is not supported.
7. Metadata
7.1. Metadata Nodes and Metadata Strings
Fully supported.
The following metadata are understood by the NVVM compiler:
Specialized Metadata Nodes
llvm.loop.unroll.count
llvm.loop.unroll.disable
llvm.loop.unroll.full
callalign
(see Rules and Restrictions for Calling Conventions)
Module flags metadata (llvm.module.flags) is supported and verified, but the metadata values will be ignored.
All other metadata is accepted and ignored.
8. ThinLTO Summary
Not supported.
9. Intrinsic Global Variables
The llvm.used global variable is supported.
The llvm.compiler.used global variable is supported.
The llvm.global_ctors global variable is not supported.
The llvm.global_dtors global variable is not supported.
10. Instructions
10.1. Terminator Instructions
Supported:
ret
br
switch
unreachable
Unsupported:
indirectbr
invoke
resume
catchswitch
catchret
cleanupret
10.2. Binary Operations
Supported:
add
fadd
sub
fsub
mul
fmul
udiv
sdiv
fdiv
urem
srem
frem
10.3. Bitwise Binary Operations
Supported:
shl
lshr
ashr
and
or
xor
10.4. Vector Operations
Supported:
extractelement
insertelement
shufflevector
10.5. Aggregate Operations
Supported:
extractvalue
insertvalue
10.6. Memory Access and Addressing Operations
10.6.1. alloca Instruction
The alloca instruction returns a generic pointer to the local address space. The inalloca attribute is not supported. The maximum alignment supported is 2^23. The addrspace(<num>) specifier is supported only if num is 0.
10.6.2. load Instruction
load atomic is not supported.
10.6.3. store Instruction
store atomic is not supported.
10.6.4. fence Instruction
Not supported. Use NVVM intrinsic functions instead.
10.6.5. cmpxchg Instruction
Supported for i32, i64, and i128 types, with the following restrictions:
- The pointer must be either a global pointer, a shared pointer, or a generic pointer that points to either the global address space or the shared address space.
- The weak marker and the failure ordering are accepted and ignored.
- The i128 type is only supported on compute_90 and above.
10.6.6. atomicrmw Instruction
nand is not supported. The other operations are supported for i32, i64, and i128 types, with the following restrictions:
- The pointer must be either a global pointer, a shared pointer, or a generic pointer that points to either the global address space or the shared address space.
- With i128, only xchg is supported, and only on compute_90 and above.
10.6.7. getelementptr Instruction
Fully supported.
10.7. Conversion Operations
Supported:
trunc .. to
zext .. to
sext .. to
fptrunc .. to
fpext .. to
fptoui .. to
fptosi .. to
uitofp .. to
sitofp .. to
ptrtoint .. to
inttoptr .. to
addrspacecast .. to
-
bitcast .. to
See Conversion for a special use case of bitcast.
10.8. Other Operations
Supported:
icmp
fcmp
phi
select
va_arg
call
(See Calling Conventions for further rules and restrictions.)
Unsupported:
landingpad
catchpad
cleanuppad
11. Intrinsic Functions
11.1. Variable Argument Handling Intrinsics
-
llvm.va_start
Supported.
-
llvm.va_end
Supported.
-
llvm.va_copy
Supported.
11.2. Accurate Garbage Collection Intrinsics
Not supported.
11.3. Code Generator Intrinsics
Not supported.
11.4. Standard C Library Intrinsics
-
llvm.memcpy
Supported. Note that the constant address space cannot be used as the destination since it is read-only.
-
llvm.memmove
Supported. Note that the constant address space cannot be used since it is read-only.
-
llvm.memset
Supported. Note that the constant address space cannot be used since it is read-only.
-
llvm.sqrt
Supported for float/double and vector of float/double. Mapped to PTX sqrt.rn.f32 and sqrt.rn.f64.
-
llvm.powi
Not supported.
-
llvm.sin
Not supported.
-
llvm.cos
Not supported.
-
llvm.pow
Not supported.
-
llvm.exp
Not supported.
-
llvm.exp2
Not supported.
-
llvm.log
Not supported.
-
llvm.log10
Not supported.
-
llvm.log2
Not supported.
-
llvm.fma
Supported for float/double and vector of float/double. Mapped to PTX fma.rn.f32 and fma.rn.f64.
-
llvm.fabs
Not supported.
-
llvm.copysign
Not supported.
-
llvm.floor
Not supported.
-
llvm.ceil
Not supported.
-
llvm.trunc
Not supported.
-
llvm.rint
Not supported.
-
llvm.nearbyint
Not supported.
-
llvm.round
Not supported.
-
llvm.minnum
Not supported.
-
llvm.maxnum
Not supported.
11.5. Bit Manipulation Intrinsics
-
llvm.bitreverse
Supported for i8, i16, i32, and i64.
-
llvm.bswap
Supported for i16, i32, and i64.
-
llvm.ctpop
Supported for i8, i16, i32, i64, and vectors of these types.
-
llvm.ctlz
Supported for i8, i16, i32, i64, and vectors of these types.
-
llvm.cttz
Supported for i8, i16, i32, i64, and vectors of these types.
-
llvm.fshl
Supported for i8, i16, i32, and i64.
-
llvm.fshr
Supported for i8, i16, i32, and i64.
11.6. Specialized Arithmetic Intrinsics
-
llvm.fmuladd
Supported.
-
llvm.canonicalize
Not supported.
11.7. Arithmetic with Overflow Intrinsics
Supported for i16, i32, and i64.
11.8. Half Precision Floating Point Intrinsics
Supported: llvm.convert.to.fp16, llvm.convert.from.fp16
11.9. Debugger Intrinsics
-
llvm.dbg.addr
Supported.
-
llvm.dbg.declare
Supported.
-
llvm.dbg.value
Supported.
11.10. Exception Handling Intrinsics
Not supported.
11.11. Trampoline Intrinsics
Not supported.
11.12. Masked Vector Load and Store Intrinsics
Not supported.
11.13. Masked Vector Expanding Load and Compressing Store Intrinsics
Not supported.
11.14. Experimental Vector Reduction Intrinsics
Not supported.
11.15. Constrained Floating Point Intrinsics
Not supported.
11.16. Constrained libm-equivalent Intrinsics
Not supported.
11.17. Masked Vector Gather and Scatter Intrinsics
Not supported.
11.18. Memory Use Markers
Supported: llvm.lifetime.start, llvm.lifetime.end, llvm.invariant.start, and llvm.invariant.end.
Not supported: llvm.launder.invariant.group, llvm.strip.invariant.group.
11.19. General Intrinsics
-
llvm.var.annotation
Accepted and ignored.
-
llvm.ptr.annotation
Accepted and ignored.
-
llvm.annotation
Accepted and ignored.
-
llvm.codeview.annotation
Not supported.
-
llvm.trap
Supported.
-
llvm.debugtrap
Not supported.
-
llvm.stackguard
Not supported.
-
llvm.stackprotector
Not supported.
-
llvm.objectsize
Not supported.
-
llvm.expect
Supported.
-
llvm.assume
Supported.
-
llvm.ssa_copy
Not supported.
-
llvm.type.test
Not supported.
-
llvm.type.checked.load
Not supported.
-
llvm.donothing
Supported.
-
llvm.experimental.deoptimize
Not supported.
-
llvm.experimental.guard
Not supported.
-
llvm.load.relative
Not supported.
-
llvm.sideeffect
Supported.
11.20. Element Wise Atomic Memory Intrinsics
Not supported.
11.21. Stack Map Intrinsics
Not supported.
12. Address Space
12.1. Address Spaces
NVVM IR has a set of predefined memory address spaces, whose semantics are similar to those defined in CUDA C/C++, OpenCL C and PTX. Any address space not listed below is not supported.
Name |
Address Space Number |
Semantics/Example |
---|---|---|
code |
0 |
functions, code
|
generic |
0 |
Can only be used to qualify the pointee of a pointer
|
global |
1 |
|
shared |
3 |
|
constant |
4 |
|
local |
5 |
|
<reserved> |
2, 101 and above |
Each global variable, which is not an intrinsic global variable, can be declared to reside in a specific non-zero address space, which can only be one of the following: global, shared or constant.
If a non-intrinsic global variable is declared without any address space number or with the address space number 0, then this global variable resides in the address space global and the pointer of this global variable holds a generic pointer value.
The predefined NVVM address spaces are needed for the language front-ends to model the memory spaces of the source languages. For example,
// CUDA C/C++
__constant__ int c;
__device__ int g;
; NVVM IR
@c = addrspace(4) global i32 0, align 4
@g = addrspace(1) global [2 x i32] zeroinitializer, align 4
Address space numbers 2 and 101 or higher are reserved for NVVM compiler internal use only. No language front-end should generate code that uses these address spaces directly.
12.2. Generic Pointers and Non-Generic Pointers
12.2.1. Generic Pointers vs. Non-Generic Pointers
There are generic pointers and non-generic pointers in NVVM IR. A generic pointer is a pointer that may point to memory in any address space. A non-generic pointer points to memory in a specific address space.
In NVVM IR, a generic pointer has a pointer type with the address space generic, while a non-generic pointer has a pointer type with a non-generic address space.
Note that the address space number for the generic address space is 0—the default in both NVVM IR and LLVM IR. The address space number for the code address space is also 0. Function pointers are qualified by the address space code (addrspace(0)).
Loads/stores via generic pointers are supported, as well as loads/stores via non-generic pointers. Loads/stores via function pointers are not supported.
@a = addrspace(1) global i32 0, align 4 ; 'global' addrspace, @a holds a specific value
@b = global i32 0, align 4 ; 'global' addrspace, @b holds a generic value
@c = addrspace(4) global i32 0, align 4 ; 'constant' addrspace, @c holds a specific value
... = load i32 addrspace(1)* @a, align 4 ; Correct
... = load i32* @a, align 4 ; Wrong
... = load i32* @b, align 4 ; Correct
... = load i32 addrspace(1)* @b, align 4 ; Wrong
... = load i32 addrspace(4)* @c, align 4 ; Correct
... = load i32* @c, align 4 ; Wrong
12.2.2. Conversion
The bit value of a generic pointer that points to a specific object may be different from the bit value of a specific pointer that points to the same object.
The addrspacecast IR instruction should be used to perform pointer casts across address spaces (generic to non-generic or non-generic to generic). Casting a non-generic pointer to a different non-generic pointer is not supported. Casting a generic pointer to a non-generic pointer is undefined if the generic pointer does not point to an object in the target non-generic address space.
inttoptr and ptrtoint are supported. inttoptr and ptrtoint are value-preserving instructions when the two operands are of the same size. In general, using ptrtoint and inttoptr to implement an address space cast is undefined.
The following intrinsic can be used to query if the input pointer was derived from the address of a kernel function parameter that has the grid_constant property:
i1 @llvm.nvvm.isspacep.grid_const(i8*)
The following intrinsic can be used to query if the input generic pointer was derived from the address of a variable allocated in the shared address space, in a CTA that is part of the same cluster as the parent CTA of the invoking thread. This intrinsic is only supported on Hopper+.
i1 @llvm.nvvm.isspacep.cluster_shared(i8*)
The following intrinsics can be used to query if a generic pointer can be safely cast to a specific non-generic address space:
i1 @llvm.nvvm.isspacep.const(i8*)
i1 @llvm.nvvm.isspacep.global(i8*)
i1 @llvm.nvvm.isspacep.local(i8*)
i1 @llvm.nvvm.isspacep.shared(i8*)
bitcast on pointers is supported, though LLVM IR forbids bitcast from being used to change the address space of a pointer.
12.2.3. No Aliasing between Two Different Specific Address Spaces
Two different specific address spaces do not overlap. The NVVM compiler assumes that two memory accesses via non-generic pointers that point to different address spaces are not aliased.
12.3. The alloca Instruction
The alloca instruction returns a generic pointer that points to the local address space.
13. Global Property Annotation
13.1. Overview
NVVM uses Named Metadata to annotate IR objects with properties that are otherwise not representable in the IR. The NVVM IR producers can use the Named Metadata to annotate the IR with properties, which the NVVM compiler can process.
13.2. Representation of Properties
For each translation unit (that is, per bitcode file), there is a named metadata called nvvm.annotations.
This named metadata contains a list of MDNodes.
The first operand of each MDNode is the entity that the node is annotating using the remaining operands.
Multiple MDNodes may provide annotations for the same entity, in which case their first operands will be the same.
The remaining operands of the MDNode are organized in pairs as <property-name, value>.
The property-name operand is an MDString, while the value is an i32.
Starting with the operand after the annotated entity, every alternate operand defines a property.
The operand after a property is its value.
The following is an example.
!nvvm.annotations = !{!12, !13}
!12 = !{void (i32, i32)* @_Z6kernelii, !"kernel", i32 1}
!13 = !{void ()* @_Z7kernel2v, !"kernel", i32 1, !"maxntidx", i32 16}
If two bitcode files are being linked and both have a named metadata nvvm.annotations, the linked file will have a single merged named metadata. If both files specify properties for the same entity bar, the linked file will have two MDNodes defining properties for bar. It is illegal for the files to have conflicting properties for the same entity.
13.3. Supported Properties
Property Name |
Annotated In |
Description |
---|---|---|
maxntid{x, y, z} |
kernel function |
Maximum expected CTA size from any launch. |
reqntid{x, y, z} |
kernel function |
Minimum expected CTA size from any launch. |
cluster_dim_{x, y, z} |
kernel function |
Hint for cluster dimensions for Hopper+. If any dimension is specified as 0, then all dimensions must be specified as 0. |
cluster_max_blocks |
kernel function |
Maximum number of blocks per cluster. Must be non-zero. Only supported for Hopper+. |
minctasm |
kernel function |
Hint/directive to the compiler/driver, asking it to put at least these many CTAs on an SM. |
grid_constant |
kernel function |
The argument is a metadata node, which contains a list of integers, where each integer n denotes that the nth parameter has the grid_constant annotation (numbering from 1). The parameter must be of pointer type with the byval attribute set. It is undefined behavior to write to memory pointed to by the parameter. This property is only supported for Volta+. |
maxnreg |
function |
Maximum number of registers for the function. |
kernel |
function |
Signifies that this function is a kernel function. |
align |
function |
Signifies that the low 16 bits of the 32-bit value contain the alignment of the nth parameter type if its alignment is not the natural alignment. n is specified by the high 16 bits of the value. For the return type, n is 0. |
texture |
global variable |
Signifies that the variable is a texture. |
surface |
global variable |
Signifies that the variable is a surface. |
managed |
global variable |
Signifies that the variable is a UVM managed variable. |
14. Texture and Surface
14.1. Texture Variable and Surface Variable
A texture or a surface variable can be declared/defined as a global variable of i64 type with the annotation texture or surface in the global address space.
A texture or surface variable must have a name, which must follow the identifier naming conventions.
It is illegal to store to or load from the address of a texture or surface variable. A texture or a surface variable may only have the following uses:
In a metadata node
As an intrinsic function argument as shown below
In the llvm.used global variable
14.2. Accessing Texture Memory or Surface Memory
Texture memory and surface memory can be accessed using texture or surface handles. NVVM provides the following intrinsic function to get a texture or surface handle from a texture or surface variable.
declare i64 @llvm.nvvm.texsurf.handle.p1i64(metadata, i64 addrspace(1)*)
The first argument to the intrinsic is a metadata holding the texture or surface variable. Such a metadata may hold only one texture or one surface variable. The second argument to the intrinsic is the texture or surface variable itself. The intrinsic returns a handle of i64 type.
The returned handle value from the intrinsic call can be used as an operand (with the constraint l) in a PTX inline asm to access the texture or surface memory.
15. NVVM Specific Intrinsic Functions
15.1. Atomic
Besides the atomic instructions, the following extra atomic intrinsic functions are supported.
declare float @llvm.nvvm.atomic.load.add.f32.p0f32(float* address, float val)
declare float @llvm.nvvm.atomic.load.add.f32.p1f32(float addrspace(1)* address, float val)
declare float @llvm.nvvm.atomic.load.add.f32.p3f32(float addrspace(3)* address, float val)
declare double @llvm.nvvm.atomic.load.add.f64.p0f64(double* address, double val)
declare double @llvm.nvvm.atomic.load.add.f64.p1f64(double addrspace(1)* address, double val)
declare double @llvm.nvvm.atomic.load.add.f64.p3f64(double addrspace(3)* address, double val)
reads the single/double precision floating point value old located at the address address, computes old+val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
declare i32 @llvm.nvvm.atomic.load.inc.32.p0i32(i32* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.inc.32.p1i32(i32 addrspace(1)* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.inc.32.p3i32(i32 addrspace(3)* address, i32 val)
reads the 32-bit word old located at the address address, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
declare i32 @llvm.nvvm.atomic.load.dec.32.p0i32(i32* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.dec.32.p1i32(i32 addrspace(1)* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.dec.32.p3i32(i32 addrspace(3)* address, i32 val)
reads the 32-bit word old located at the address address, computes (((old == 0) | (old > val)) ? val : (old-1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
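The update rules of the inc/dec intrinsics above can be modeled directly. This is a plain, non-atomic sketch of the computed value only (the real intrinsics perform the read-modify-write as one atomic transaction):

```python
def atomic_inc_value(old: int, val: int) -> int:
    # llvm.nvvm.atomic.load.inc: ((old >= val) ? 0 : (old+1))
    return 0 if old >= val else old + 1

def atomic_dec_value(old: int, val: int) -> int:
    # llvm.nvvm.atomic.load.dec: (((old == 0) | (old > val)) ? val : (old-1))
    return val if (old == 0 or old > val) else old - 1

# Repeated inc counts modulo (val + 1): 0, 1, 2, 3, 0, 1, ...
seq, x = [], 0
for _ in range(6):
    seq.append(x)
    x = atomic_inc_value(x, 3)
print(seq)  # [0, 1, 2, 3, 0, 1]
```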
15.2. Barrier and Memory Fence
declare void @llvm.nvvm.barrier0()
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to llvm.nvvm.barrier0() are visible to all threads in the block.
declare i32 @llvm.nvvm.barrier0.popc(i32)
is identical to llvm.nvvm.barrier0() with the additional feature that it evaluates the predicate for all threads of the block and returns the number of threads for which the predicate evaluates to non-zero.
declare i32 @llvm.nvvm.barrier0.and(i32)
is identical to llvm.nvvm.barrier0() with the additional feature that it evaluates the predicate for all threads of the block and returns non-zero if and only if the predicate evaluates to non-zero for all of them.
declare i32 @llvm.nvvm.barrier0.or(i32)
is identical to llvm.nvvm.barrier0() with the additional feature that it evaluates the predicate for all threads of the block and returns non-zero if and only if the predicate evaluates to non-zero for any of them.
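The reductions performed by the three predicate-evaluating barriers can be modeled in Python over the per-thread predicate values (a sketch of the result computation only; the synchronization itself is not modeled, and the function names are illustrative):

```python
def barrier0_popc(predicates):
    # number of threads whose predicate evaluated to non-zero
    return sum(1 for p in predicates if p != 0)

def barrier0_and(predicates):
    # non-zero if and only if the predicate is non-zero for all threads
    return int(all(p != 0 for p in predicates))

def barrier0_or(predicates):
    # non-zero if and only if the predicate is non-zero for any thread
    return int(any(p != 0 for p in predicates))
```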
declare void @llvm.nvvm.cluster.barrier(i32 %flags)
Synchronize and communicate among threads in the same cluster. This intrinsic is only supported on Hopper+. The %flags argument is encoded according to the following table:

%flags bits | Meaning
---|---
31-8 | Reserved
7-4 | Memory ordering (See Cluster Barrier Memory Ordering Encoding below)
3-0 | Operation mode (See Cluster Barrier Operation Mode Encoding below)
Cluster Barrier Operation Mode Encoding
Encoding | Mode | Description
---|---|---
0 | Arrive | Arrive at cluster barrier
1 | Wait | Wait at cluster barrier
2-15 | RESERVED | RESERVED
Cluster Barrier Memory Ordering Encoding

Encoding | Mode | Description
---|---|---
0 | Default | All prior memory accesses requested by the executing thread prior to arrive are performed and are visible to all the threads in the cluster after wait.
1 | Relaxed | All previously fenced memory accesses requested by the executing thread prior to arrive are performed and are visible to all the threads in the cluster after wait. This ordering is only supported when the operation mode is Arrive.
2-15 | RESERVED | RESERVED
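Under the bit layout above, a %flags value is assembled by placing the memory ordering in bits 7-4 and the operation mode in bits 3-0. A small Python sketch (the helper name is illustrative, not part of NVVM):

```python
# Operation mode (bits 3-0) and memory ordering (bits 7-4) encodings
ARRIVE, WAIT = 0, 1
DEFAULT, RELAXED = 0, 1

def cluster_barrier_flags(mode, ordering=DEFAULT):
    # bits 31-8 are reserved and left zero
    assert 0 <= mode <= 15 and 0 <= ordering <= 15
    return (ordering << 4) | mode
```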
declare void @llvm.nvvm.membar.cta()
is a memory fence at the thread block level. This intrinsic is deprecated. Please use nvvm.membar with flags as argument instead.
declare void @llvm.nvvm.membar.gl()
is a memory fence at the device level. This intrinsic is deprecated. Please use nvvm.membar with flags as argument instead.
declare void @llvm.nvvm.membar.sys()
is a memory fence at the system level. This intrinsic is deprecated. Please use nvvm.membar with flags as argument instead.
declare void @llvm.nvvm.membar(i32 %flags)
Waits until all prior memory accesses requested by this thread are performed at the membar level specified by the membar mode below. The memory barrier enforces ordering only. It makes no guarantees of execution synchronization with other threads. For execution synchronization, a barrier should be used instead of, or in addition to, membar.
The %flags argument is encoded according to the following table:
%flags bits | Meaning
---|---
31-4 | Reserved
3-0 | Membar mode (See Membar Mode Encoding below)
Membar Mode Encoding

Encoding | Mode | Description
---|---|---
0 | GLOBAL | Membar at the global level
1 | CTA | Membar at the CTA level
2 | SYSTEM | Membar at the system level
3 | RESERVED | RESERVED
4 | CLUSTER | Membar at the cluster level, only on Hopper+
5-15 | RESERVED | RESERVED
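As an illustration of the encoding table, the deprecated level-specific fences correspond to @llvm.nvvm.membar calls with the matching mode in bits 3-0 (a sketch based on the table above):

```llvm
call void @llvm.nvvm.membar(i32 0)   ; global level, replaces @llvm.nvvm.membar.gl()
call void @llvm.nvvm.membar(i32 1)   ; CTA level, replaces @llvm.nvvm.membar.cta()
call void @llvm.nvvm.membar(i32 2)   ; system level, replaces @llvm.nvvm.membar.sys()
```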
15.3. Address Space Conversion
Note
Please use the addrspacecast IR instruction for address space conversion.
15.4. Special Registers
The following intrinsic functions are provided to support reading special PTX registers:
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
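These registers are typically combined to compute a thread's global position. The following sketch computes the global x index of the calling thread (the value names are illustrative):

```llvm
; gidx = ctaid.x * ntid.x + tid.x
%tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%tmp   = mul i32 %ctaid, %ntid
%gidx  = add i32 %tmp, %tid
```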
15.5. Texture/Surface Access
The following intrinsic function is provided to convert a global texture/surface variable into a texture/surface handle.
declare i64 @llvm.nvvm.texsurf.handle.p1i64(metadata, i64 addrspace(1)*)
See Accessing Texture Memory or Surface Memory for details.
The following IR definitions apply to all intrinsics in this section:
%float4 = type { float, float, float, float }
%long2 = type { i64, i64 }
%int4 = type { i32, i32, i32, i32 }
%int2 = type { i32, i32 }
%short4 = type { i16, i16, i16, i16 }
%short2 = type { i16, i16 }
15.5.1. Texture Reads
Sampling a 1D texture:
%float4 @llvm.nvvm.tex.unified.1d.v4f32.s32(i64 %tex, i32 %x)
%float4 @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %tex, float %x)
%float4 @llvm.nvvm.tex.unified.1d.level.v4f32.f32(i64 %tex, float %x,
float %level)
%float4 @llvm.nvvm.tex.unified.1d.grad.v4f32.f32(i64 %tex, float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.v4s32.s32(i64 %tex, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.v4s32.f32(i64 %tex, float %x)
%int4 @llvm.nvvm.tex.unified.1d.level.v4s32.f32(i64 %tex, float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.grad.v4s32.f32(i64 %tex, float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.v4u32.s32(i64 %tex, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.v4u32.f32(i64 %tex, float %x)
%int4 @llvm.nvvm.tex.unified.1d.level.v4u32.f32(i64 %tex, float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.grad.v4u32.f32(i64 %tex, float %x,
float %dPdx,
float %dPdy)
Sampling a 1D texture array:
%float4 @llvm.nvvm.tex.unified.1d.array.v4f32.s32(i64 %tex, i32 %idx, i32 %x)
%float4 @llvm.nvvm.tex.unified.1d.array.v4f32.f32(i64 %tex, i32 %idx, float %x)
%float4 @llvm.nvvm.tex.unified.1d.array.level.v4f32.f32(i64 %tex, i32 %idx,
float %x,
float %level)
%float4 @llvm.nvvm.tex.unified.1d.array.grad.v4f32.f32(i64 %tex, i32 %idx,
float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.array.v4s32.s32(i64 %tex, i32 %idx, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.array.v4s32.f32(i64 %tex, i32 %idx, float %x)
%int4 @llvm.nvvm.tex.unified.1d.array.level.v4s32.f32(i64 %tex, i32 %idx,
float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.array.grad.v4s32.f32(i64 %tex, i32 %idx,
float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.array.v4u32.s32(i64 %tex, i32 %idx, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.array.v4u32.f32(i64 %tex, i32 %idx, float %x)
%int4 @llvm.nvvm.tex.unified.1d.array.level.v4u32.f32(i64 %tex, i32 %idx,
float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.array.grad.v4u32.f32(i64 %tex, i32 %idx,
float %x,
float %dPdx,
float %dPdy)
Sampling a 2D texture:
%float4 @llvm.nvvm.tex.unified.2d.v4f32.s32(i64 %tex, i32 %x, i32 %y)
%float4 @llvm.nvvm.tex.unified.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tex.unified.2d.level.v4f32.f32(i64 %tex, float %x, float %y,
float %level)
%float4 @llvm.nvvm.tex.unified.2d.grad.v4f32.f32(i64 %tex, float %x, float %y,
float %dPdx_x, float %dPdx_y,
float %dPdy_x, float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.v4s32.s32(i64 %tex, i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.level.v4s32.f32(i64 %tex, float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.grad.v4s32.f32(i64 %tex, float %x, float %y,
float %dPdx_x, float %dPdx_y,
float %dPdy_x, float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.v4u32.s32(i64 %tex, i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.level.v4u32.f32(i64 %tex, float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.grad.v4u32.f32(i64 %tex, float %x, float %y,
float %dPdx_x, float %dPdx_y,
float %dPdy_x, float %dPdy_y)
Sampling a 2D texture array:
%float4 @llvm.nvvm.tex.unified.2d.array.v4f32.s32(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%float4 @llvm.nvvm.tex.unified.2d.array.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y)
%float4 @llvm.nvvm.tex.unified.2d.array.level.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %level)
%float4 @llvm.nvvm.tex.unified.2d.array.grad.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %dPdx_x,
float %dPdx_y,
float %dPdy_x,
float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4s32.s32(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.array.level.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.array.grad.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %dPdx_x,
float %dPdx_y,
float %dPdy_x,
float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4u32.s32(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.array.level.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.array.grad.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %dPdx_x,
float %dPdx_y,
float %dPdy_x,
float %dPdy_y)
Sampling a 3D texture:
%float4 @llvm.nvvm.tex.unified.3d.v4f32.s32(i64 %tex, i32 %x, i32 %y, i32 %z)
%float4 @llvm.nvvm.tex.unified.3d.v4f32.f32(i64 %tex, float %x, float %y,
float %z)
%float4 @llvm.nvvm.tex.unified.3d.level.v4f32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%float4 @llvm.nvvm.tex.unified.3d.grad.v4f32.f32(i64 %tex, float %x, float %y,
float %z, float %dPdx_x,
float %dPdx_y, float %dPdx_z,
float %dPdy_x, float %dPdy_y,
float %dPdy_z)
%int4 @llvm.nvvm.tex.unified.3d.v4s32.s32(i64 %tex, i32 %x, i32 %y, i32 %z)
%int4 @llvm.nvvm.tex.unified.3d.v4s32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.3d.level.v4s32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.3d.grad.v4s32.f32(i64 %tex, float %x, float %y,
float %z, float %dPdx_x,
float %dPdx_y, float %dPdx_z,
float %dPdy_x, float %dPdy_y,
float %dPdy_z)
%int4 @llvm.nvvm.tex.unified.3d.v4u32.s32(i64 %tex, i32 %x, i32 %y, i32 %z)
%int4 @llvm.nvvm.tex.unified.3d.v4u32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.3d.level.v4u32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.3d.grad.v4u32.f32(i64 %tex, float %x, float %y,
float %z, float %dPdx_x,
float %dPdx_y, float %dPdx_z,
float %dPdy_x, float %dPdy_y,
float %dPdy_z)
Sampling a cube texture:
%float4 @llvm.nvvm.tex.unified.cube.v4f32.f32(i64 %tex, float %x, float %y,
float %z)
%float4 @llvm.nvvm.tex.unified.cube.level.v4f32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.cube.v4s32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.cube.level.v4s32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.cube.v4u32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.cube.level.v4u32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
Sampling a cube texture array:
%float4 @llvm.nvvm.tex.unified.cube.array.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z)
%float4 @llvm.nvvm.tex.unified.cube.array.level.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z,
float %level)
%int4 @llvm.nvvm.tex.unified.cube.array.v4s32.f32(i64 %tex, i32 %idx, float %x,
float %y, float %z)
%int4 @llvm.nvvm.tex.unified.cube.array.level.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.cube.array.v4u32.f32(i64 %tex, i32 %idx, float %x,
float %y, float %z)
%int4 @llvm.nvvm.tex.unified.cube.array.level.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z, float %level)
Fetching a four-texel bilerp footprint:
%float4 @llvm.nvvm.tld4.unified.r.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tld4.unified.g.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tld4.unified.b.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tld4.unified.a.2d.v4f32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.r.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.g.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.b.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.a.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.r.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.g.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.b.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.a.2d.v4u32.f32(i64 %tex, float %x, float %y)
15.5.2. Surface Loads
In the following intrinsics, <clamp> represents the surface clamp mode and can be one of the following: clamp, trap, or zero.
For surface load instructions that operate on 8-bit data channels, the output operands are of type i16. The high-order eight bits are undefined.
Reading a 1D surface:
i16 @llvm.nvvm.suld.1d.i8.<clamp>(i64 %tex, i32 %x)
i16 @llvm.nvvm.suld.1d.i16.<clamp>(i64 %tex, i32 %x)
i32 @llvm.nvvm.suld.1d.i32.<clamp>(i64 %tex, i32 %x)
i64 @llvm.nvvm.suld.1d.i64.<clamp>(i64 %tex, i32 %x)
%short2 @llvm.nvvm.suld.1d.v2i8.<clamp>(i64 %tex, i32 %x)
%short2 @llvm.nvvm.suld.1d.v2i16.<clamp>(i64 %tex, i32 %x)
%int2 @llvm.nvvm.suld.1d.v2i32.<clamp>(i64 %tex, i32 %x)
%long2 @llvm.nvvm.suld.1d.v2i64.<clamp>(i64 %tex, i32 %x)
%short4 @llvm.nvvm.suld.1d.v4i8.<clamp>(i64 %tex, i32 %x)
%short4 @llvm.nvvm.suld.1d.v4i16.<clamp>(i64 %tex, i32 %x)
%int4 @llvm.nvvm.suld.1d.v4i32.<clamp>(i64 %tex, i32 %x)
Reading a 1D surface array:
i16 @llvm.nvvm.suld.1d.array.i8.<clamp>(i64 %tex, i32 %idx, i32 %x)
i16 @llvm.nvvm.suld.1d.array.i16.<clamp>(i64 %tex, i32 %idx, i32 %x)
i32 @llvm.nvvm.suld.1d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x)
i64 @llvm.nvvm.suld.1d.array.i64.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short2 @llvm.nvvm.suld.1d.array.v2i8.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short2 @llvm.nvvm.suld.1d.array.v2i16.<clamp>(i64 %tex, i32 %idx, i32 %x)
%int2 @llvm.nvvm.suld.1d.array.v2i32.<clamp>(i64 %tex, i32 %idx, i32 %x)
%long2 @llvm.nvvm.suld.1d.array.v2i64.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short4 @llvm.nvvm.suld.1d.array.v4i8.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short4 @llvm.nvvm.suld.1d.array.v4i16.<clamp>(i64 %tex, i32 %idx, i32 %x)
%int4 @llvm.nvvm.suld.1d.array.v4i32.<clamp>(i64 %tex, i32 %idx, i32 %x)
Reading a 2D surface:
i16 @llvm.nvvm.suld.2d.i8.<clamp>(i64 %tex, i32 %x, i32 %y)
i16 @llvm.nvvm.suld.2d.i16.<clamp>(i64 %tex, i32 %x, i32 %y)
i32 @llvm.nvvm.suld.2d.i32.<clamp>(i64 %tex, i32 %x, i32 %y)
i64 @llvm.nvvm.suld.2d.i64.<clamp>(i64 %tex, i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y)
%int2 @llvm.nvvm.suld.2d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y)
%long2 @llvm.nvvm.suld.2d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y)
%int4 @llvm.nvvm.suld.2d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y)
Reading a 2D surface array:
i16 @llvm.nvvm.suld.2d.array.i8.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
i16 @llvm.nvvm.suld.2d.array.i16.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
i32 @llvm.nvvm.suld.2d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
i64 @llvm.nvvm.suld.2d.array.i64.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.array.v2i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.array.v2i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int2 @llvm.nvvm.suld.2d.array.v2i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%long2 @llvm.nvvm.suld.2d.array.v2i64.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.array.v4i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.array.v4i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int4 @llvm.nvvm.suld.2d.array.v4i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
Reading a 3D surface:
i16 @llvm.nvvm.suld.3d.i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
i16 @llvm.nvvm.suld.3d.i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
i32 @llvm.nvvm.suld.3d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
i64 @llvm.nvvm.suld.3d.i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%short2 @llvm.nvvm.suld.3d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%short2 @llvm.nvvm.suld.3d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%int2 @llvm.nvvm.suld.3d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%long2 @llvm.nvvm.suld.3d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%short4 @llvm.nvvm.suld.3d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %z)
%short4 @llvm.nvvm.suld.3d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %z)
%int4 @llvm.nvvm.suld.3d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %z)
15.5.3. Surface Stores
In the following intrinsics, <clamp> represents the surface clamp mode. It is trap for the formatted stores, and can be one of the following for unformatted stores: clamp, trap, or zero.
For surface store instructions that operate on 8-bit data channels, the input operands are of type i16. The high-order eight bits are ignored.
Writing a 1D surface:
;; Unformatted
void @llvm.nvvm.sust.b.1d.i8.<clamp>(i64 %tex, i32 %x, i16 %r)
void @llvm.nvvm.sust.b.1d.i16.<clamp>(i64 %tex, i32 %x, i16 %r)
void @llvm.nvvm.sust.b.1d.i32.<clamp>(i64 %tex, i32 %x, i32 %r)
void @llvm.nvvm.sust.b.1d.i64.<clamp>(i64 %tex, i32 %x, i64 %r)
void @llvm.nvvm.sust.b.1d.v2i8.<clamp>(i64 %tex, i32 %x, i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.v2i16.<clamp>(i64 %tex, i32 %x, i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %r, i32 %g)
void @llvm.nvvm.sust.b.1d.v2i64.<clamp>(i64 %tex, i32 %x, i64 %r, i64 %g)
void @llvm.nvvm.sust.b.1d.v4i8.<clamp>(i64 %tex, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.v4i16.<clamp>(i64 %tex, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.v4i32.<clamp>(i64 %tex, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.1d.i32.<clamp>(i64 %tex, i32 %x, i32 %r)
void @llvm.nvvm.sust.p.1d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %r, i32 %g)
void @llvm.nvvm.sust.p.1d.v4i32.<clamp>(i64 %tex, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 1D surface array:
;; Unformatted
void @llvm.nvvm.sust.b.1d.array.i8.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r)
void @llvm.nvvm.sust.b.1d.array.i16.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r)
void @llvm.nvvm.sust.b.1d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r)
void @llvm.nvvm.sust.b.1d.array.i64.<clamp>(i64 %tex, i32 %idx, i32 %x,
i64 %r)
void @llvm.nvvm.sust.b.1d.array.v2i8.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.array.v2i16.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.array.v2i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.1d.array.v2i64.<clamp>(i64 %tex, i32 %idx, i32 %x,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.1d.array.v4i8.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.array.v4i16.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.array.v4i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.1d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r)
void @llvm.nvvm.sust.p.1d.array.v2i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.1d.array.v4i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 2D surface:
;; Unformatted
void @llvm.nvvm.sust.b.2d.i8.<clamp>(i64 %tex, i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.i16.<clamp>(i64 %tex, i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.b.2d.i64.<clamp>(i64 %tex, i32 %x, i32 %y, i64 %r)
void @llvm.nvvm.sust.b.2d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.2d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.2d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.2d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.p.2d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.2d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 2D surface array:
;; Unformatted
void @llvm.nvvm.sust.b.2d.array.i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.array.i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.array.i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.b.2d.array.i64.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i64 %r)
void @llvm.nvvm.sust.b.2d.array.v2i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.array.v2i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.array.v2i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.2d.array.v2i64.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.2d.array.v4i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.array.v4i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.array.v4i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.2d.array.i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.p.2d.array.v2i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.2d.array.v4i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 3D surface:
;; Unformatted
void @llvm.nvvm.sust.b.3d.i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i16 %r)
void @llvm.nvvm.sust.b.3d.i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i16 %r)
void @llvm.nvvm.sust.b.3d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i32 %r)
void @llvm.nvvm.sust.b.3d.i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i64 %r)
void @llvm.nvvm.sust.b.3d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.3d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.3d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.3d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.3d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.3d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.3d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.3d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i32 %r)
void @llvm.nvvm.sust.p.3d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.3d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g, i32 %b, i32 %a)
15.6. Warp-level Operations
15.6.1. Barrier Synchronization
The following intrinsic performs a barrier synchronization among a subset of threads in a warp.
declare void @llvm.nvvm.bar.warp.sync(i32 %membermask)
This intrinsic causes the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before resuming execution.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
The behavior of this intrinsic is undefined if the executing thread is not in the %membermask.
For compute_62 or below, all threads in %membermask must call the same @llvm.nvvm.bar.warp.sync() in convergence, and only threads belonging to the %membermask can be active when the intrinsic is called. Otherwise, the behavior is undefined.
15.6.2. Data Movement
The following intrinsic synchronizes a subset of threads in a warp and then performs data movement among these threads.
declare {i32, i1} @llvm.nvvm.shfl.sync.i32(i32 %membermask, i32 %mode, i32 %a, i32 %b, i32 %c)
This intrinsic causes the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before reading data from other threads in the same warp.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
Each thread in the currently executing warp will compute a source lane index j based on input arguments %b, %c, and %mode. If the computed source lane index j is in range, the returned i32 value will be the value of %a from lane j; otherwise, it will be the value of %a from the current thread. If the thread corresponding to lane j is inactive, then the returned i32 value is undefined. The returned i1 value is set to 1 if the source lane j is in range, and otherwise set to 0.
The argument %mode must be a constant and its encoding is specified in the following table.
Encoding | Meaning
---|---
0 | IDX
1 | UP
2 | DOWN
3 | BFLY
Argument %b specifies a source lane or source lane offset, depending on %mode.
Argument %c contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.
The following pseudo code illustrates the semantics of this intrinsic.
wait until all threads in %membermask have arrived;
%lane[4:0] = current_lane_id; // position of thread in warp
%bval[4:0] = %b[4:0]; // source lane or lane offset (0..31)
%cval[4:0] = %c[4:0]; // clamp value
%mask[4:0] = %c[12:8];
%maxLane = (%lane[4:0] & %mask[4:0]) | (%cval[4:0] & ~%mask[4:0]);
%minLane = (%lane[4:0] & %mask[4:0]);
switch (%mode) {
case UP: %j = %lane - %bval; %pval = (%j >= %maxLane); break;
case DOWN: %j = %lane + %bval; %pval = (%j <= %maxLane); break;
case BFLY: %j = %lane ^ %bval; %pval = (%j <= %maxLane); break;
case IDX: %j = %minLane | (%bval[4:0] & ~%mask[4:0]); %pval = (%j <= %maxLane); break;
}
if (!%pval) %j = %lane; // copy from own lane
if (thread at lane %j is active)
%d = %a from lane %j
else
%d = undef
return {%d, %pval}
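The lane-index computation in the pseudo code above can be transcribed into Python for experimentation (a single-thread model; the names are illustrative and the actual data movement is not modeled):

```python
IDX, UP, DOWN, BFLY = 0, 1, 2, 3  # %mode encodings

def shfl_source_lane(mode, lane, b, c):
    # Returns (j, pval): the computed source lane index and the in-range flag.
    bval = b & 0x1f
    cval = c & 0x1f           # clamp value, %c[4:0]
    mask = (c >> 8) & 0x1f    # segment mask, %c[12:8]
    max_lane = (lane & mask) | (cval & ~mask & 0x1f)
    min_lane = lane & mask
    if mode == UP:
        j, pval = lane - bval, lane - bval >= max_lane
    elif mode == DOWN:
        j, pval = lane + bval, lane + bval <= max_lane
    elif mode == BFLY:
        j, pval = lane ^ bval, lane ^ bval <= max_lane
    else:  # IDX
        j = min_lane | (bval & ~mask & 0x1f)
        pval = j <= max_lane
    if not pval:
        j = lane              # out of range: copy from own lane
    return j, pval
```

For example, with %c = 0x1f (clamp 31, segment mask 0) a BFLY shuffle with %b = 1 exchanges values between adjacent lanes.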
Note that the return values are undefined if the thread at the source lane is not in %membermask.
The behavior of this intrinsic is undefined if the executing thread is not in the %membermask.
For compute_62 or below, all threads in %membermask must call the same @llvm.nvvm.shfl.sync.i32() in convergence, and only threads belonging to the %membermask can be active when the intrinsic is called. Otherwise, the behavior is undefined.
15.6.3. Vote
The following intrinsic synchronizes a subset of threads in a warp and then performs a reduce-and-broadcast of a predicate over all threads in the subset.
declare {i32, i1} @llvm.nvvm.vote.sync(i32 %membermask, i32 %mode, i1 %predicate)
This intrinsic causes the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before performing a reduce-and-broadcast of a predicate over all threads in the subset.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
@llvm.nvvm.vote.sync() performs a reduction of the source %predicate across all threads in %membermask after the synchronization. The return value is the same across all threads in the %membermask. The field in the returned aggregate that holds the return value depends on %mode.
The argument %mode must be a constant and its encoding is specified in the following table.
Encoding | Meaning | Return value
---|---|---
0 | ALL | The i1 value is 1 if %predicate is non-zero for all threads in %membermask
1 | ANY | The i1 value is 1 if %predicate is non-zero for any thread in %membermask
2 | EQ | The i1 value is 1 if %predicate is the same for all threads in %membermask
3 | BALLOT | The i32 value holds the ballot data
For the BALLOT mode, the i32 value represents the ballot data, which contains the %predicate value from each thread in %membermask in the bit position corresponding to the thread's lane id. The bit value corresponding to a thread not in %membermask is 0.
Note that the returned values are undefined if the thread at the source lane is not in %membermask.
The behavior of this intrinsic is undefined if the executing thread is not in the %membermask.
For compute_62 or below, all threads in %membermask must call the same @llvm.nvvm.vote.sync() in convergence, and only threads belonging to the %membermask can be active when the intrinsic is called. Otherwise, the behavior is undefined.
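A Python model of the four vote modes over a set of per-lane predicates (a sketch; the unused field of the returned pair is reported as 0 here, the names are illustrative, and synchronization is not modeled):

```python
ALL, ANY, EQ, BALLOT = 0, 1, 2, 3  # %mode encodings

def vote_sync(membermask, mode, predicates):
    # predicates: dict mapping lane id -> i1 predicate value
    lanes = [l for l in range(32) if (membermask >> l) & 1]
    vals = [bool(predicates[l]) for l in lanes]
    if mode == ALL:
        return 0, int(all(vals))
    if mode == ANY:
        return 0, int(any(vals))
    if mode == EQ:
        # 1 if the predicate is the same across all lanes in the subset
        return 0, int(all(vals) or not any(vals))
    # BALLOT: each lane's predicate lands in its own bit position
    ballot = 0
    for l in lanes:
        if predicates[l]:
            ballot |= 1 << l
    return ballot, 0
```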
15.6.4. Match
The following intrinsics synchronize a subset of threads in a warp and then broadcast and compare a value across threads in the subset.
declare i32 @llvm.nvvm.match.any.sync.i32(i32 %membermask, i32 %value)
declare i32 @llvm.nvvm.match.any.sync.i64(i32 %membermask, i64 %value)
declare {i32, i1} @llvm.nvvm.match.all.sync.i32(i32 %membermask, i32 %value)
declare {i32, i1} @llvm.nvvm.match.all.sync.i64(i32 %membermask, i64 %value)
These intrinsics cause the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before performing a broadcast and compare of operand %value across all threads in the subset.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
The i32 return value is a 32-bit mask where each bit position in the mask corresponds to a thread's lane id.
In the any version, the i32 return value is set to the mask of active threads in %membermask that have the same value as operand %value.
In the all version, if all active threads in %membermask have the same value as operand %value, the i32 return value is set to %membermask, and the i1 value is set to 1. Otherwise, the i32 return value is set to 0 and the i1 return value is also set to 0.
The behavior of these intrinsics is undefined if the executing thread is not in the %membermask.
These intrinsics are only available on compute_70
or higher.
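The any/all behaviors can be modeled in Python over the values held by the active lanes (a sketch; the names are illustrative and synchronization is not modeled):

```python
def match_any_sync(values):
    # values: dict mapping active lane id -> value held by that lane.
    # Returns, per lane, the mask of active lanes holding the same value.
    return {lane: sum(1 << other for other, w in values.items() if w == v)
            for lane, v in values.items()}

def match_all_sync(membermask, values):
    # Returns (i32, i1): (%membermask, 1) if all active lanes agree, else (0, 0)
    if len(set(values.values())) == 1:
        return membermask, 1
    return 0, 0
```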
15.6.5. Matrix Operations
THIS IS A PREVIEW FEATURE. SUPPORT MAY BE REMOVED IN FUTURE RELEASES.
NVVM provides warp-level intrinsics for matrix multiply operations. The core operation is a matrix multiply and accumulate of the form:
D = A*B + C, or
C = A*B + C
somewhere A
is einer MxK
mould, B
exists a KxN
matrix, while C
or D
are MxN
matrices. C
and D
are including called accumulators. The element type of the A
and B
datasets lives 16-bit floating point. The element style of the authorities can be either 32-bit floating indent or 16-bit floating point.
All threads in a warp will collectively hold the contents of each matrix A
, B
, C
and D
. Each thread will hold only a fragment of matrix A
, a fragment of matrix B
, a fragment of matrix C
, and a fragment of the result matrix D
. How the elements of a matrix are distributed among the fragments is opaque to the user and is different for matrix A
, B
and the accumulator.
A fragment is represented by a sequence of element values. For fp32 matrices, the element type is float
. For fp16 matrices, the element type is i32
(each i32
value holds two fp16 values). The number of elements varies with the shape of the matrix.
15.6.5.1. Load Fragments
The following intrinsics synchronize all threads in a warp and then load a fragment of a matrix for each thread.
; load fragment A
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.ld.a.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.ld.a.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.ld.a.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
; load fragment B
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.ld.b.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.ld.b.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.ld.b.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
; load fragment C
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m16n16k16.ld.c.f32.p<n>f32(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m32n8k16.ld.c.f32.p<n>f32(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m8n32k16.ld.c.f32.p<n>f32(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
; load fragment C
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.ld.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.ld.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.ld.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
These intrinsics load and return a matrix fragment from memory at location %ptr
. The matrix in memory must be in a canonical matrix layout with leading dimension %ldm
. %rowcol
specifies whether the matrix in memory is row-major (0) or column-major (1). %rowcol
must be a constant value.
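The addressing convention implied by %ldm and %rowcol can be sketched as follows; the helper is illustrative only, and offsets are expressed in elements rather than bytes.

```python
# Illustrative only: locating element (row, col) of a matrix stored in the
# canonical layout with leading dimension ldm, as described above.

def element_offset(row, col, ldm, rowcol):
    """Linear offset (in elements) of element (row, col) from the base
    pointer. rowcol == 0 selects row-major, rowcol == 1 column-major."""
    if rowcol == 0:
        return row * ldm + col   # rows are contiguous runs of ldm elements
    else:
        return col * ldm + row   # columns are contiguous runs of ldm elements
```

For instance, in a row-major tile embedded in a buffer with leading dimension 64, element (2, 3) lies at offset 2*64 + 3 = 131 elements from the base pointer.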
The returned sequence of values represents the fragment held by the calling thread. How the elements of a matrix are distributed among the fragments is opaque to the user and is different for matrix A
, B
and the accumulator. Therefore, three variants (i.e. ld.a
, ld.b
, and ld.c
) are provided.
These intrinsics are overloaded based on the address spaces. The address space number <n>
must be either 0 (generic), 1 (global) or 3 (shared).
The behavior of this intrinsic is undefined if any thread in the warp has exited.
These intrinsics are only available on compute_70
or higher.
15.6.5.2. Store Fragments
The following intrinsics synchronize all threads in a warp and then store a fragment of a matrix for each thread.
; The last 8 arguments are the elements of the C fragment
declare void @llvm.nvvm.hmma.m16n16k16.st.c.f32.p<n>float(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, float, float, float, float, float, float, float, float);
declare void @llvm.nvvm.hmma.m32n8k16.st.c.f32.p<n>float(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, float, float, float, float, float, float, float, float);
declare void @llvm.nvvm.hmma.m8n32k16.st.c.f32.p<n>float(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, float, float, float, float, float, float, float, float);
; The last 4 arguments are the elements of the C fragment
declare void @llvm.nvvm.hmma.m16n16k16.st.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, i32, i32, i32, i32);
declare void @llvm.nvvm.hmma.m32n8k16.st.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, i32, i32, i32, i32);
declare void @llvm.nvvm.hmma.m8n32k16.st.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, i32, i32, i32, i32);
These intrinsics store an accumulator fragment to memory at location %ptr
. The matrix in memory must be in a canonical matrix layout with leading dimension %ldm
. %rowcol
specifies whether the matrix in memory is row-major (0) or column-major (1). %rowcol
must be a constant value.
These intrinsics are overloaded based on the address spaces. The address space number <n>
must be either 0 (generic), 1 (global) or 3 (shared).
The behavior of this intrinsic is undefined if any thread in the warp has exited.
These intrinsics are only available on compute_70
or higher.
15.6.5.3. Matrix Multiply-and-Accumulate
The following intrinsics synchronize all threads in a warp and then perform a matrix multiply-and-accumulate operation.
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.mma.f16.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.mma.f16.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.mma.f16.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m16n16k16.mma.f32.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m32n8k16.mma.f32.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m8n32k16.mma.f32.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m16n16k16.mma.f32.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m32n8k16.mma.f32.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m8n32k16.mma.f32.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.mma.f16.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.mma.f16.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.mma.f16.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
These intrinsics perform a matrix multiply-and-accumulate operation. %rowcol
specifies the layout of the A
and B
fragments. It must be a constant value, and may have the following values and semantics.
Encoding | Meaning |
---|---|
0 | A fragment is row-major, B fragment is row-major |
1 | A fragment is row-major, B fragment is column-major |
2 | A fragment is column-major, B fragment is row-major |
3 | A fragment is column-major, B fragment is column-major |
Support for %satf
has been removed and this operand must be a constant zero.
The behavior of these intrinsics is undefined if any thread in the warp has exited.
These intrinsics are only available on compute_70
or higher.
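The warp-wide result D = A*B + C can be checked against a scalar reference model. The sketch below is illustrative only: it operates on whole flat matrices rather than per-thread fragments (fragment distribution is opaque), and it decodes %rowcol per the encoding table above (encodings 2 and 3 make A column-major; encodings 1 and 3 make B column-major).

```python
# Illustrative scalar reference for D = A*B + C with A: m*k, B: k*n,
# and C/D: m*n (row-major) flat lists. rowcol follows the table above.

def mma(a, b, c, m, n, k, rowcol):
    a_col = bool(rowcol & 2)   # A fragment column-major for encodings 2, 3
    b_col = bool(rowcol & 1)   # B fragment column-major for encodings 1, 3
    A = lambda i, kk: a[kk * m + i] if a_col else a[i * k + kk]
    B = lambda kk, j: b[j * k + kk] if b_col else b[kk * n + j]
    d = list(c)                # accumulate on top of C
    for i in range(m):
        for j in range(n):
            d[i * n + j] += sum(A(i, kk) * B(kk, j) for kk in range(k))
    return d
```

For a 2x2x2 example with A = [[1,2],[3,4]] and B = [[5,6],[7,8]] and C = 0, the result is [[19,22],[43,50]] regardless of which %rowcol encoding is used, provided the operands are stored in the matching layouts.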
16. Source Level Debugging Support
To enable source level debugging of an IR module, NVVM IR supports debug intrinsics and debug information descriptors to express the debugging information. Debug information descriptors are represented using specialized metadata nodes. The current NVVM IR debugging metadata version is 3.1.
The current NVVM IR debugging support is based on that in LLVM 7.0.1. For the complete semantics of the IR, readers of this chapter should check the official LLVM IR specialized metadata nodes documentation (https://releases.llvm.org/7.0.1/docs/LangRef.html#specialized-metadata-nodes) and the Source Level Debugging with LLVM guide (https://releases.llvm.org/7.0.1/docs/SourceLevelDebugging.html).
The following metadata nodes need to be present in the module when debugging support is requested:
Named metadata node
!llvm.dbg.cu
Module flags metadata for the
"Debug Info Version"
flag: The behavior should be Error
. The value of the flag should be DEBUG_METADATA_VERSION
in LLVM 7.0.1, which is 3. Named metadata
!nvvmir.version
contains a metadata node with the NVVM IR major and minor version values followed by the NVVM IR debug metadata major and minor version values. The current NVVM IR debug metadata version is 3.1.
The debug level (e.g., full debug, line info only) is controlled by the DICompileUnit's
emissionKind
field: FullDebug (value: 1)
: Generate symbolic debug and line information. This requires the libNVVM -g
option to be specified at compile time. DebugDirectivesOnly (value: 3)
: Generate line information.
Source level debugging is supported only for a single debug compile unit. If there are multiple input NVVM IR modules, at most one module may have a single debug compile unit.
17. NVVM ABI for PTX
17.1. Linkage Types
The following table provides the mapping of NVVM IR linkage types associated with functions and global variables to PTX linker directives.
LLVM Linkage Type | PTX Linker Directive |
---|---|
private, internal | This is the default linkage type and does not require a linker directive. |
external (function with definition) | .visible |
external (global variable with initialization) | .visible |
external (function without definition) | .extern |
external (global variable without initialization) | .extern |
common | .common |
All other linkage types | Not supported. |
17.2. Parameter Passing and Return
The table below shows the mapping of function argument and return types in NVVM IR to PTX types.
Source Type | Size in Bits | PTX Type |
---|---|---|
Integer types | <= 32 | .u32 |
Integer types | 64 | .u64 |
Pointer types (without byval attribute) | 32 | .u32 |
Pointer types (without byval attribute) | 64 | .u64 |
Floating-point types | 32 | .f32 |
Floating-point types | 64 | .f64 |
Aggregate types | Any size | .align align .b8 name[size], where align is the overall aggregate or vector alignment in bytes, name is the variable name associated with the aggregate or vector, and size is the aggregate or vector size in bytes. |
Pointer types to aggregate with byval attribute | 32 or 64 | Passed as an aggregate (see above). |
Vector types | Any size | Passed as an aggregate (see above). |
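The aggregate row above can be made concrete with a small formatting sketch. The .param state space prefix and the helper name are illustrative assumptions; the .align/.b8/name[size] shape and the byte units come from the table's own description.

```python
# Illustrative only: forming the PTX parameter declaration for an
# aggregate, per the aggregate row above (align and size in bytes).

def aggregate_param(name, size, align):
    """Render a PTX byte-array parameter declaration for an aggregate."""
    return f".param .align {align} .b8 {name}[{size}]"

print(aggregate_param("myStruct", 16, 8))  # .param .align 8 .b8 myStruct[16]
```

So a 16-byte struct with 8-byte alignment would appear in the PTX signature as a 16-element .b8 array aligned to 8 bytes.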
18. Revision History
Version 1.0
Initial release.
Version 1.1
Added support for UVM managed variables in global property annotation. See Supported Properties.
Version 1.2
Update to LLVM 3.4 for CUDA 7.0.
Remove address space intrinsics in favor of
addrspacecast
. Add information about source level debugging support.
Version 1.3
Add support for LLVM 3.8 for CUDA 8.0.
Version 1.4
Add support for warp-level intrinsics.
Version 1.5
Add support for LLVM 5.0 for CUDA 9.2.
Version 1.6
Update to LLVM 7.0.1 for CUDA 11.2.
Version 1.7
Add support for alloca with dynamic size.
Version 1.8
Add support for i128 in data layout.
Version 1.9
Modified text about ignoring shared variable initializations.
Version 1.10
Added support for grid_constant kernel parameters for CUDA 11.7.
Version 1.11
Added support for Hopper+ cluster intrinsics and max_blocks_per_cluster kernel property in CUDA 11.8.
Deprecated support for 32-bit compilation.
Version 2.0
Updated the NVVM IR to version 2.0, which is incompatible with NVVM IR version 1.x.
Removed address space conversion intrinsics. The IR verifier on 2.0 IR will give an error when these intrinsics are present. Clients of libNVVM are advised to use the addrspacecast instruction instead.
Stricter error checking on the supported data layouts.
Older style loop unroll pragma metadata on loop backedges is no longer supported. Clients are advised to use the new loop pragma metadata defined by the LLVM community.
Shared variable initialization with non-undef values is no longer supported. In 1.x versions these initializers were silently ignored. This change makes the 2.0 version incompatible with 1.x versions.
19. Notices
19.1. Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer shall obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
19.2. OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
19.3. Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.