NVVM IR Specification
Reference guide to the NVVM compiler IR (intermediate representation), which is based on the LLVM IR.
1. Introduction
NVVM IR is a compiler IR (intermediate representation) based on the LLVM IR. The NVVM IR is designed to represent GPU compute kernels (for example, CUDA kernels). High-level language front-ends, like the CUDA C compiler front-end, can generate NVVM IR. The NVVM compiler (which is based on LLVM) generates PTX code from NVVM IR.
NVVM IR and NVVM compilers are mostly agnostic about the source language being used. The PTX codegen part of an NVVM compiler may need to know the source language because of differences in the DCI (driver/compiler interface).
NVVM IR has a binary form and is based on a subset of the LLVM IR bitcode format. This document uses only the human-readable form to describe NVVM IR.
Technically speaking, NVVM IR is LLVM IR with a set of rules, restrictions, and conventions, plus a set of supported intrinsic functions. A program specified in NVVM IR is always a legal LLVM program. A legal LLVM program may not be a legal NVVM program.
There are three levels of support for NVVM IR features:
Supported: The feature is fully supported. Most IR features should fall into this category.
Accepted and ignored: The NVVM compiler will accept this IR feature, but will ignore the required semantics. This applies to some IR features that do not have meaningful semantics on GPUs and that can be ignored. Calling convention markings are an example.
Illegal, not supported: The specified semantics are not supported, such as a fence instruction. Future versions of NVVM may either support or accept and ignore IR that is illegal in the current version.
This document describes version 2.0 of the NVVM IR and version 3.1 of the NVVM debug metadata (see Source Level Debugging Support). The 2.0 version of NVVM IR is incompatible with the previous version 1.11. Linking NVVM IR version 1.11 with version 2.0 will result in a compiler error.
The current NVVM IR is based on LLVM 7.0.1. For the complete semantics of the IR, readers of this document should check the official LLVM Language Reference Manual (https://releases.llvm.org/7.0.1/docs/LangRef.html).
2. Identifiers
The name of a named global identifier must have the form:
@[a-zA-Z$_][a-zA-Z$_0-9]*
Note that it cannot contain the . character.
[@%]llvm.nvvm.*
and [@%]nvvm.*
are reserved words.
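The identifier rule above can be checked mechanically; the following sketch (names are illustrative, not part of NVVM) validates names against the stated pattern:

```python
import re

# A named global identifier: '@' followed by a letter, '$' or '_',
# then any run of letters, digits, '$' or '_'.  The '.' character is
# not allowed, which is why [@%]llvm.nvvm.* and [@%]nvvm.* (which
# contain '.') are reserved for the compiler rather than user code.
NAME_RE = re.compile(r"@[a-zA-Z$_][a-zA-Z$_0-9]*\Z")

print(bool(NAME_RE.match("@my_kernel$1")))  # True
print(bool(NAME_RE.match("@2kernel")))      # False: starts with a digit
print(bool(NAME_RE.match("@my.kernel")))    # False: contains '.'
```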
3. High Level Structure
3.1. Linkage Types
Supported:
private
internal
available_externally
linkonce
weak
common
linkonce_odr
weak_odr
external
Not supported:
appending
extern_weak
See NVVM ABI for PTX for details on how linkage types are translated to PTX.
3.2. Calling Conventions
All LLVM calling convention markings are accepted and ignored. Functions and calls are generated according to the PTX calling convention.
3.2.1. Rules and Restrictions
- When an argument with width less than 32 bits is passed, the zeroext/signext parameter attribute should be set. zeroext will be assumed if not set.
- When a value with width less than 32 bits is returned, the zeroext/signext parameter attribute should be set. zeroext will be assumed if not set.
- Arguments of aggregate or vector type that are passed by value can be passed by pointer with the byval attribute set (referred to as the by-pointer-byval case below). The align attribute must be set if the type requires a non-natural alignment (natural alignment is the alignment inferred for the aggregate type according to the Data Layout section).
- If a function has an argument of aggregate or vector type that is passed by value directly and the type has a non-natural alignment requirement, the alignment must be annotated by the global property annotation <align, alignment>, where alignment is a 32-bit integer whose upper 16 bits represent the argument position (starting from 1) and the lower 16 bits represent the alignment.
- If the return type of a function is an aggregate or a vector that has a non-natural alignment, then the alignment requirement must be annotated by the global property annotation <align, alignment>, where the upper 16 bits are 0, and the lower 16 bits represent the alignment.
- It is not required to annotate a function with <align, alignment> otherwise. If annotated, the alignment must match the natural alignment or the align attribute in the by-pointer-byval case.
- For an indirect call instruction of a function that has a non-natural alignment for its return value or one of its arguments that is not expressed with an align attribute in the by-pointer-byval case, the call instruction must have an attached metadata of kind callalign. The metadata contains a sequence of i32 fields, each of which represents a non-natural alignment requirement. The upper 16 bits of an i32 field represent the argument position (0 for the return value, 1 for the first argument, and so on) and the lower 16 bits represent the alignment. The i32 fields must be sorted in increasing order. For example,
%call = call %struct.S %fp1(%struct.S* byval align 8 %arg1p, %struct.S %arg2), !callalign !10
!10 = !{i32 8, i32 520}
- It is not required to have an i32 metadata field for the other arguments or the return value otherwise. If present, the alignment must match the natural alignment or the align attribute in the by-pointer-byval case.
- It is not required to have a callalign metadata attached to a direct call instruction. If attached, the alignment must match the natural alignment or the align attribute in the by-pointer-byval case.
- The absence of the metadata in an indirect call instruction means using the natural alignment or the align attribute in the by-pointer-byval case.
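The <align, alignment> encoding used by the rules above packs an argument position and an alignment into one i32. As a minimal sketch of that packing (the helper names are illustrative, not part of any NVVM API):

```python
def pack_align(position: int, alignment: int) -> int:
    # Upper 16 bits: argument position (0 = return value, 1 = first
    # argument, and so on); lower 16 bits: the alignment in bytes.
    assert 0 <= position < (1 << 16) and 0 <= alignment < (1 << 16)
    return (position << 16) | alignment

def unpack_align(value: int):
    return value >> 16, value & 0xFFFF

# Return value (position 0) with 8-byte alignment:
print(pack_align(0, 8))     # 8
# First argument with 8-byte alignment:
print(pack_align(1, 8))     # 65544
print(unpack_align(65544))  # (1, 8)
```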
3.3. Visibility Styles
All styles—default, hidden, and protected—are accepted and ignored.
3.4. DLL Storage Classes
Not supported.
3.5. Thread Local Storage Models
Not supported.
3.6. Runtime Preemption Specifiers
Not supported.
3.7. Structure Types
Fully supported.
3.8. Non-Integral Pointer Types
Not supported.
3.9. Comdats
Not supported.
3.10. source_filename
Accepted and ignored.
3.11. Global Variables
A global variable that is not an intrinsic global variable may be optionally declared to reside in one of the following address spaces:
global
shared
constant
If no address space is explicitly specified, the global variable is assumed to reside in the global address space with a generic address value. See Address Spaces for details.
thread_local variables are not supported.
No explicit section (except for the metadata section) is allowed.
Initialization of shared variables is not supported. Use undef initialization.
3.12. Functions
The following are not supported for functions:
Alignment
Explicit section
Garbage collector name
Prefix data
Prologue data
Personality
3.13. Aliases
Supported only as aliases of non-kernel functions.
3.14. Ifuncs
Not supported.
3.15. Named Metadata
Accepted and ignored, except for the following:
!nvvm.annotations: see Global Property Annotation
!nvvmir.version
!llvm.dbg.cu
!llvm.module.flags
The NVVM IR version is specified using a named metadata called !nvvmir.version. The !nvvmir.version named metadata may have one metadata node that contains the NVVM IR version for the module. If multiple such modules are linked together, the named metadata in the linked module may have more than one metadata node, with each node containing a version. A metadata node with the NVVM IR version takes either of the following forms:
-
It may consist of two i32 values—the first denotes the NVVM IR major version number and the second denotes the minor version number. If absent, the version number is assumed to be 1.0, which can be specified as:
!nvvmir.version = !{!0} !0 = !{i32 1, i32 0}
-
It may consist of four i32 values—the first two denote the NVVM IR major and minor versions respectively. The third value denotes the NVVM IR debug metadata major version number, and the fourth value denotes the corresponding minor version number. If absent, the version number is assumed to be 1.0, which may be specified as:
!nvvmir.version = !{!0} !0 = !{i32 1, i32 0, i32 1, i32 0}
The version of NVVM IR described in this document is 2.0. The version of NVVM IR debug metadata described in this document is 3.1.
3.16. Parameter Attributes
Fully supported, except for the following:
Accepted and ignored:
inreg
nest
Not supported:
inalloca
swiftself
swifterror
See Calling Conventions for the use of these attributes.
3.17. Garbage Collector Strategy Names
Not supported.
3.18. Prefix Data
Not supported.
3.19. Prologue Data
Not supported.
3.20. Attribute Groups
Fully supported. The set of supported attributes is equal to the set of attributes accepted where the attribute group is used.
3.21. Function Attributes
Supported:
allocsize
alwaysinline
cold
convergent
inaccessiblememonly
inaccessiblemem_or_argmemonly
inlinehint
minsize
no-jump-tables
noduplicate
noinline
noreturn
norecurse
nounwind
"null-pointer-is-valid"
optforfuzzing
optnone
optsize
readnone
readonly
writeonly
argmemonly
speculatable
strictfp
Not Supported:
alignstack
builtin
nonlazybind
naked
nobuiltin
noimplicitfloat
noredzone
"patchable-function"
probe-stack
returns_twice
sanitize_address
sanitize_memory
sanitize_thread
sanitize_hwaddress
ssp
sspreq
sspstrong
"stack-probe-size"
"no-stack-arg-probe"
uwtable
jumptable
safestack
"thunk"
nocf_check
shadowcallstack
3.22. Global Attributes
Not supported.
3.23. Operand Bundles
Not supported.
3.24. Module-Level Inline Assembly
Supported.
3.25. Data Layout
Only the following data layout is supported:
-
64-bit
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
The following data layouts are deprecated and will be removed in a future release.
-
32-bit
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
-
64-bit
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
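The fields of the data layout string can be decoded mechanically. The following is a rough sketch of such a decoder (not an official parser; it only handles the forms appearing in the layouts above):

```python
def parse_data_layout(layout: str) -> dict:
    """Split an LLVM data layout string into its specifications.

    Maps each type spec (e.g. 'i32') to its bit values from the string:
    'e' marks little-endian and 'n...' lists native integer widths.
    """
    out = {}
    for spec in layout.split("-"):
        if spec == "e":
            out["endianness"] = "little"
        elif spec.startswith("n"):
            out["native_widths"] = [int(w) for w in spec[1:].split(":")]
        else:
            # e.g. 'i32:32:32' (size:abi:pref) or 'p:64:64:64'
            head, *nums = spec.split(":")
            out[head] = tuple(int(n) for n in nums)
    return out

layout = parse_data_layout(
    "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-"
    "i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-"
    "v64:64:64-v128:128:128-n16:32:64"
)
print(layout["p"])    # (64, 64, 64): 64-bit pointers, 64-bit aligned
print(layout["i32"])  # (32, 32)
```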
3.26. Target Triple
Only the following target triple is supported, where * can be any name:
64-bit:
nvptx64-*-cuda
The following target triple is deprecated, and will be removed in a future release:
32-bit:
nvptx-*-cuda
3.27. Pointer Aliasing Rules
Fully supported.
3.28. Volatile Memory Accesses
Fully supported. Note that for code generation, ld.volatile and st.volatile will be generated.
3.29. Memory Model for Concurrent Operations
Not applicable. Threads in an NVVM IR program must use atomic operations or barrier synchronization to communicate.
3.30. Atomic Memory Ordering Constraints
Atomic loads and stores are not supported. Other atomic operations on operands other than 32-bit or 64-bit are not supported.
3.31. Fast-Math Flags
Supported.
3.32. Use-list Order Directives
Not supported.
4. Type System
Fully supported, except for the following:
- The floating point types half, fp128, x86_fp80 and ppc_fp128 are not supported.
- The x86_mmx type is not supported.
- The token type is not supported.
- The non-integral pointer type is not supported.
5. Constants
Fully supported, except for the following:
- Token constants are not supported.
- blockaddress(@function, %block) is not supported.
- For a constant expression that is used as the initializer of a global variable @g1, if the constant expression contains a global identifier @g2, then the constant expression is supported if it can be reduced to the form bitcast+offset, where offset is an integer number (including 0).
6. Other Values
6.1. Inline Assembler Expressions
Inline assembly of PTX instructions is supported, with the following supported constraints:
Constraint |
Type |
---|---|
c |
i8 |
h |
i16 |
r |
i32 |
l |
i64 |
f |
f32 |
d |
f64 |
The inline asm metadata !srcloc is accepted and ignored.
The inline asm dialect inteldialect is not supported.
7. Metadata
7.1. Metadata Nodes and Metadata Strings
Fully supported.
The following metadata are understood by the NVVM compiler:
Specialized Metadata Nodes
llvm.loop.unroll.count
llvm.loop.unroll.disable
llvm.loop.unroll.full
callalign
(see Rules and Restrictions for Calling Conventions)
Module flags metadata (llvm.module.flags) is supported and verified, but the metadata values will be ignored.
All other metadata is accepted and ignored.
8. ThinLTO Summary
Not supported.
9. Intrinsic Global Variables
The llvm.used global variable is supported.
The llvm.compiler.used global variable is supported.
The llvm.global_ctors global variable is not supported.
The llvm.global_dtors global variable is not supported.
10. Instructions
10.1. Terminator Instructions
Supported:
ret
br
switch
unreachable
Unsupported:
indirectbr
invoke
resume
catchswitch
catchret
cleanupret
10.2. Binary Operations
Supported:
add
fadd
sub
fsub
mul
fmul
udiv
sdiv
fdiv
urem
srem
frem
10.3. Bitwise Binary Operations
Supported:
shl
lshr
ashr
and
or
xor
10.4. Vector Operations
Supported:
extractelement
insertelement
shufflevector
10.5. Aggregate Operations
Supported:
extractvalue
insertvalue
10.6. Memory Access and Addressing Operations
10.6.1. alloca Instruction
The alloca instruction returns a generic pointer to the local address space. The inalloca attribute is not supported. The maximum alignment supported is 2^23. The addrspace(<num>) specifier is supported only if num is 0.
10.6.2. load Instruction
load atomic is not supported.
10.6.3. store Instruction
store atomic is not supported.
10.6.4. fence Instruction
Not supported. Use NVVM intrinsic functions instead.
10.6.5. cmpxchg Instruction
Supported for i32, i64, and i128 types, with the following restrictions:
- The pointer must be either a global pointer, a shared pointer, or a generic pointer that points to either the global address space or the shared address space.
- The weak marker and the failure ordering are accepted and ignored.
- The i128 type is only supported on compute_90 and above.
10.6.6. atomicrmw Instruction
nand is not supported. The other operations are supported for i32, i64, and i128 types, with the following restrictions:
- The pointer must be either a global pointer, a shared pointer, or a generic pointer that points to either the global address space or the shared address space.
- With i128, only xchg is supported, and only on compute_90 and above.
10.6.7. getelementptr Instruction
Fully supported.
10.7. Conversion Operations
Supported:
trunc .. to
zext .. to
sext .. to
fptrunc .. to
fpext .. to
fptoui .. to
fptosi .. to
uitofp .. to
sitofp .. to
ptrtoint .. to
inttoptr .. to
addrspacecast .. to
-
bitcast .. to
See Conversion for a special use case of bitcast.
10.8. Other Operations
Supported:
icmp
fcmp
phi
select
va_arg
call
(See Calling Conventions for further rules and restrictions.)
Unsupported:
landingpad
catchpad
cleanuppad
11. Intrinsic Functions
11.1. Variable Argument Handling Intrinsics
-
llvm.va_start
Supported.
-
llvm.va_end
Supported.
-
llvm.va_copy
Supported.
11.2. Accurate Garbage Collection Intrinsics
Not supported.
11.3. Code Generator Intrinsics
Not supported.
11.4. Standard C Library Intrinsics
-
llvm.memcpy
Supported. Note that the constant address space cannot be used as the destination since it is read-only.
-
llvm.memmove
Supported. Note that the constant address space cannot be used since it is read-only.
-
llvm.memset
Supported. Note that the constant address space cannot be used since it is read-only.
-
llvm.sqrt
Supported for float/double and vector of float/double. Mapped to PTX sqrt.rn.f32 and sqrt.rn.f64.
-
llvm.powi
Not supported.
-
llvm.sin
Not supported.
-
llvm.cos
Not supported.
-
llvm.pow
Not supported.
-
llvm.exp
Not supported.
-
llvm.exp2
Not supported.
-
llvm.log
Not supported.
-
llvm.log10
Not supported.
-
llvm.log2
Not supported.
-
llvm.fma
Supported for float/double and vector of float/double. Mapped to PTX fma.rn.f32 and fma.rn.f64.
-
llvm.fabs
Not supported.
-
llvm.copysign
Not supported.
-
llvm.floor
Not supported.
-
llvm.ceil
Not supported.
-
llvm.trunc
Not supported.
-
llvm.rint
Not supported.
-
llvm.nearbyint
Not supported.
-
llvm.round
Not supported.
-
llvm.minnum
Not supported.
-
llvm.maxnum
Not supported.
11.5. Bit Manipulation Intrinsics
-
llvm.bitreverse
Supported for i8, i16, i32, and i64.
-
llvm.bswap
Supported for i16, i32, and i64.
-
llvm.ctpop
Supported for i8, i16, i32, i64, and vectors of these types.
-
llvm.ctlz
Supported for i8, i16, i32, i64, and vectors of these types.
-
llvm.cttz
Supported for i8, i16, i32, i64, and vectors of these types.
-
llvm.fshl
Supported for i8, i16, i32, and i64.
-
llvm.fshr
Supported for i8, i16, i32, and i64.
11.6. Specialized Arithmetic Intrinsics
-
llvm.fmuladd
Supported.
-
llvm.canonicalize
Not supported.
11.7. Arithmetic with Overflow Intrinsics
Supported for i16, i32, and i64.
11.8. Half Precision Floating Point Intrinsics
Supported: llvm.convert.to.fp16, llvm.convert.from.fp16
11.9. Debugger Intrinsics
-
llvm.dbg.addr
Supported.
-
llvm.dbg.declare
Supported.
-
llvm.dbg.value
Supported.
11.10. Exception Handling Intrinsics
Not supported.
11.11. Trampoline Intrinsics
Not supported.
11.12. Masked Vector Load and Store Intrinsics
Not supported.
11.13. Masked Vector Expanding Load and Compressing Store Intrinsics
Not supported.
11.14. Experimental Vector Reduction Intrinsics
Not supported.
11.15. Constrained Floating Point Intrinsics
Not supported.
11.16. Constrained libm-equivalent Intrinsics
Not supported.
11.17. Masked Vector Gather and Scatter Intrinsics
Not supported.
11.18. Memory Use Markers
Supported: llvm.lifetime.start, llvm.lifetime.end, llvm.invariant.start, and llvm.invariant.end.
Not supported: llvm.launder.invariant.group, llvm.strip.invariant.group.
11.19. General Intrinsics
-
llvm.var.annotation
Accepted and ignored.
-
llvm.ptr.annotation
Accepted and ignored.
-
llvm.annotation
Accepted and ignored.
-
llvm.codeview.annotation
Not supported.
-
llvm.trap
Supported.
-
llvm.debugtrap
Not supported.
-
llvm.stackguard
Not supported.
-
llvm.stackprotector
Not supported.
-
llvm.objectsize
Not supported.
-
llvm.expect
Supported.
-
llvm.assume
Supported.
-
llvm.ssa_copy
Not supported.
-
llvm.type.test
Not supported.
-
llvm.type.checked.load
Not supported.
-
llvm.donothing
Supported.
-
llvm.experimental.deoptimize
Not supported.
-
llvm.experimental.guard
Not supported.
-
llvm.load.relative
Not supported.
-
llvm.sideeffect
Supported.
11.20. Element Wise Atomic Memory Intrinsics
Not supported.
11.21. Stack Map Intrinsics
Not supported.
12. Address Space
12.1. Address Spaces
NVVM IR has a set of predefined memory address spaces, whose semantics are similar to those defined in CUDA C/C++, OpenCL C and PTX. Any address space not listed below is not supported.
Name |
Address Space Number |
Semantics/Example |
---|---|---|
code |
0 |
functions, code
|
generic |
0 |
Can only be used to qualify the pointee of a pointer
|
global |
1 |
|
shared |
3 |
|
constant |
4 |
|
local |
5 |
|
<reserved> |
2, 101 and above |
Each global variable, which is not an intrinsic global variable, can be declared to reside in a specific non-zero address space, which can only be one of the following: global, shared or constant.
If a non-intrinsic global variable is declared without any address space number or with the address space number 0, then this global variable resides in the address space global and the pointer of this global variable holds a generic pointer value.
The predefined NVVM address spaces are needed for the language front-ends to model the memory spaces of the source languages. For example,
// CUDA C/C++
__constant__ int c;
__device__ int g;
; NVVM IR
@c = addrspace(4) global i32 0, align 4
@g = addrspace(1) global [2 x i32] zeroinitializer, align 4
Address space numbers 2 and 101 or higher are reserved for NVVM compiler internal use only. No language front-end should generate code that uses these address spaces directly.
12.2. Generic Pointers and Non-Generic Pointers
12.2.1. Generic Pointers vs. Non-Generic Pointers
There are generic pointers and non-generic pointers in NVVM IR. A generic pointer is a pointer that may point to memory in any address space. A non-generic pointer points to memory in a specific address space.
In NVVM IR, a generic pointer has a pointer type with the address space generic, while a non-generic pointer has a pointer type with a non-generic address space.
Note that the address space number for the generic address space is 0—the default in both NVVM IR and LLVM IR. The address space number for the code address space is also 0. Function pointers are qualified by the address space code (addrspace(0)).
Loads/stores via generic pointers are supported, as well as loads/stores via non-generic pointers. Loads/stores via function pointers are not supported.
@a = addrspace(1) global i32 0, align 4 ; 'global' addrspace, @a holds a specific value
@b = global i32 0, align 4 ; 'global' addrspace, @b holds a generic value
@c = addrspace(4) global i32 0, align 4 ; 'constant' addrspace, @c holds a specific value
... = load i32 addrspace(1)* @a, align 4 ; Correct
... = load i32* @a, align 4 ; Wrong
... = load i32* @b, align 4 ; Correct
... = load i32 addrspace(1)* @b, align 4 ; Wrong
... = load i32 addrspace(4)* @c, align 4 ; Correct
... = load i32* @c, align 4 ; Wrong
12.2.2. Conversion
The bit value of a generic pointer that points to a specific object may be different from the bit value of a specific pointer that points to the same object.
The addrspacecast IR instruction should be used to perform pointer casts across address spaces (generic to non-generic or non-generic to generic). Casting a non-generic pointer to a different non-generic pointer is not supported. Casting a generic pointer to a non-generic pointer is undefined if the generic pointer does not point to an object in the target non-generic address space.
inttoptr and ptrtoint are supported. inttoptr and ptrtoint are value-preserving instructions when the two operands are of the same size. In general, using ptrtoint and inttoptr to implement an address space cast is undefined.
The following intrinsic can be used to query if the input pointer was derived from the address of a kernel function parameter that has the grid_constant property:
i1 @llvm.nvvm.isspacep.grid_const(i8*)
The following intrinsic can be used to query if the input generic pointer was derived from the address of a variable allocated in the shared address space, in a CTA that is part of the same cluster as the parent CTA of the invoking thread. This intrinsic is only supported on Hopper+.
i1 @llvm.nvvm.isspacep.cluster_shared(i8*)
The following intrinsics can be used to query if a generic pointer can be safely cast to a specific non-generic address space:
i1 @llvm.nvvm.isspacep.const(i8*)
i1 @llvm.nvvm.isspacep.global(i8*)
i1 @llvm.nvvm.isspacep.local(i8*)
i1 @llvm.nvvm.isspacep.shared(i8*)
bitcast on pointers is supported, though LLVM IR forbids bitcast from being used to change the address space of a pointer.
12.2.3. No Aliasing between Two Different Specific Address Spaces
Two different specific address spaces do not overlap. The NVVM compiler assumes that two memory accesses via non-generic pointers that point to different address spaces are not aliased.
12.3. The alloca Instruction
The alloca instruction returns a generic pointer that points to the local address space.
13. Global Property Annotation
13.1. Overview
NVVM uses Named Metadata to annotate IR objects with properties that are otherwise not representable in the IR. The NVVM IR producers can use the Named Metadata to annotate the IR with properties, which the NVVM compiler can process.
13.2. Representation of Properties
For each translation unit (that is, per bitcode file), there is a named metadata called nvvm.annotations.
This named metadata contains a list of MDNodes.
The first operand of each MDNode is the entity that the node is annotating using the remaining operands.
Multiple MDNodes may provide annotations for the same entity, in which case their first operands will be the same.
The remaining operands of the MDNode are organized in pairs as <property-name, value>.
The property-name operand is an MDString, while the value is an i32.
Starting with the operand after the annotated entity, every alternate operand defines a property.
The operand after a property is its value.
The following is an example.
!nvvm.annotations = !{!12, !13}
!12 = !{void (i32, i32)* @_Z6kernelii, !"kernel", i32 1}
!13 = !{void ()* @_Z7kernel2v, !"kernel", i32 1, !"maxntidx", i32 16}
If two bitcode files are being linked and both have a named metadata nvvm.annotations, the linked file will have a single merged named metadata. If both files specify properties for the same entity bar, the linked file will have two MDNodes defining properties for bar. It is illegal for the files to have conflicting properties for the same entity.
13.3. Supported Properties
Property Name |
Annotated In |
Description |
---|---|---|
maxntid{x, y, z} |
kernel function |
Maximum expected CTA size from any launch. |
reqntid{x, y, z} |
kernel function |
Minimum expected CTA size from any launch. |
cluster_dim_{x, y, z} |
kernel function |
Hint for cluster dimensions for Hopper+. If any dimension is specified as 0, then all dimensions must be specified as 0. |
cluster_max_blocks |
kernel function |
Maximum number of blocks per cluster. Must be non-zero. Only supported for Hopper+. |
minctasm |
kernel function |
Hint/directive to the compiler/driver, asking it to put at least these many CTAs on an SM. |
grid_constant |
kernel function |
The argument is a metadata node, which contains a list of integers, where each integer n denotes that the nth parameter has the grid_constant annotation (numbering from 1). The parameter must be of pointer type with the byval attribute set. It is undefined behavior to write to memory pointed to by the parameter. This property is only supported for Volta+. |
maxnreg |
function |
Maximum number of registers for the function. |
kernel |
function |
Signifies that this function is a kernel function. |
align |
function |
Signifies that the low 16 bits of the 32-bit value contain the alignment of the nth parameter type if its alignment is not the natural alignment. n is specified by the high 16 bits of the value. For the return type, n is 0. |
texture |
global variable |
Signifies that the variable is a texture. |
surface |
global variable |
Signifies that the variable is a surface. |
managed |
global variable |
Signifies that the variable is a UVM managed variable. |
14. Texture and Surface
14.1. Texture Variable and Surface Variable
A texture or a surface variable can be declared/defined as a global variable of i64 type with the annotation texture or surface in the global address space.
A texture or surface variable must have a name, which must follow the identifier naming conventions.
It is illegal to store to or load from the address of a texture or surface variable. A texture or a surface variable may only have the following uses:
In a metadata node
As an intrinsic function argument as shown below
In the llvm.used global variable
14.2. Accessing Texture Memory or Surface Memory
Texture memory and surface memory can be accessed using texture or surface handles. NVVM provides the following intrinsic function to get a texture or surface handle from a texture or surface variable.
declare i64 @llvm.nvvm.texsurf.handle.p1i64(metadata, i64 addrspace(1)*)
The first argument to the intrinsic is a metadata holding the texture or surface variable. Such a metadata may hold only one texture or one surface variable. The second argument to the intrinsic is the texture or surface variable itself. The intrinsic returns a handle of i64 type.
The returned handle value from the intrinsic call can be used as an operand (with the constraint l) in a PTX inline asm to access the texture or surface memory.
15. NVVM Specific Intrinsic Functions
15.1. Atomic
Besides the atomic instructions, the following extra atomic intrinsic functions are supported.
declare float @llvm.nvvm.atomic.load.add.f32.p0f32(float* address, float val)
declare float @llvm.nvvm.atomic.load.add.f32.p1f32(float addrspace(1)* address, float val)
declare float @llvm.nvvm.atomic.load.add.f32.p3f32(float addrspace(3)* address, float val)
declare double @llvm.nvvm.atomic.load.add.f64.p0f64(double* address, double val)
declare double @llvm.nvvm.atomic.load.add.f64.p1f64(double addrspace(1)* address, double val)
declare double @llvm.nvvm.atomic.load.add.f64.p3f64(double addrspace(3)* address, double val)
reads the single/double precision floating point value old located at the address address, computes old+val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
declare i32 @llvm.nvvm.atomic.load.inc.32.p0i32(i32* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.inc.32.p1i32(i32 addrspace(1)* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.inc.32.p3i32(i32 addrspace(3)* address, i32 val)
reads the 32-bit word old located at the address address, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
declare i32 @llvm.nvvm.atomic.load.dec.32.p0i32(i32* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.dec.32.p1i32(i32 addrspace(1)* address, i32 val)
declare i32 @llvm.nvvm.atomic.load.dec.32.p3i32(i32 addrspace(3)* address, i32 val)
reads the 32-bit word old located at the address address, computes (((old == 0) | (old > val)) ? val : (old-1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
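The update rules of the inc/dec intrinsics above can be modeled directly. This is a plain, non-atomic sketch of the computed value only (the real intrinsics perform the read-modify-write as one atomic transaction):

```python
def atomic_inc_value(old: int, val: int) -> int:
    # llvm.nvvm.atomic.load.inc: ((old >= val) ? 0 : (old+1))
    return 0 if old >= val else old + 1

def atomic_dec_value(old: int, val: int) -> int:
    # llvm.nvvm.atomic.load.dec: (((old == 0) | (old > val)) ? val : (old-1))
    return val if (old == 0 or old > val) else old - 1

# Repeated inc counts modulo (val + 1): 0, 1, 2, 3, 0, 1, ...
seq, x = [], 0
for _ in range(6):
    seq.append(x)
    x = atomic_inc_value(x, 3)
print(seq)  # [0, 1, 2, 3, 0, 1]
```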
15.2. Barrier and Memory Fence
declare void @llvm.nvvm.barrier0()
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to llvm.nvvm.barrier0() are visible to all threads in the block.
declare i32 @llvm.nvvm.barrier0.popc(i32)
is identical to llvm.nvvm.barrier0() with the additional feature that it evaluates the predicate for all threads of the block and returns the number of threads for which the predicate evaluates to non-zero.
declare i32 @llvm.nvvm.barrier0.and(i32)
is identical to llvm.nvvm.barrier0() with the additional feature that it evaluates the predicate for all threads of the block and returns non-zero if and only if the predicate evaluates to non-zero for all of them.
declare i32 @llvm.nvvm.barrier0.or(i32)
is identical to llvm.nvvm.barrier0() with the additional feature that it evaluates the predicate for all threads of the block and returns non-zero if and only if the predicate evaluates to non-zero for any of them.
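The reductions performed by the three predicate-evaluating barriers can be modeled in Python over the per-thread predicate values (a sketch of the result computation only; the synchronization itself is not modeled, and the function names are illustrative):

```python
def barrier0_popc(predicates):
    # number of threads whose predicate evaluated to non-zero
    return sum(1 for p in predicates if p != 0)

def barrier0_and(predicates):
    # non-zero if and only if the predicate is non-zero for all threads
    return int(all(p != 0 for p in predicates))

def barrier0_or(predicates):
    # non-zero if and only if the predicate is non-zero for any thread
    return int(any(p != 0 for p in predicates))
```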
declare void @llvm.nvvm.cluster.barrier(i32 %flags)
Synchronize and communicate among threads in the same cluster. This intrinsic is only supported on Hopper+. The %flags argument is encoded according to the following table:

%flags bits | Meaning
---|---
31-8 | Reserved
7-4 | Memory ordering (See Cluster Barrier Memory Ordering Encoding below)
3-0 | Operation mode (See Cluster Barrier Operation Mode Encoding below)
Cluster Barrier Operation Mode Encoding
Encoding | Mode | Description
---|---|---
0 | Arrive | Arrive at cluster barrier
1 | Wait | Wait at cluster barrier
2-15 | RESERVED | RESERVED
Cluster Barrier Memory Ordering Encoding

Encoding | Mode | Description
---|---|---
0 | Default | All prior memory accesses requested by the executing thread prior to arrive are performed and are visible to all the threads in the cluster after wait.
1 | Relaxed | All previously fenced memory accesses requested by the executing thread prior to arrive are performed and are visible to all the threads in the cluster after wait. This ordering is only supported when the operation mode is Arrive.
2-15 | RESERVED | RESERVED
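Under the bit layout above, a %flags value is assembled by placing the memory ordering in bits 7-4 and the operation mode in bits 3-0. A small Python sketch (the helper name is illustrative, not part of NVVM):

```python
# Operation mode (bits 3-0) and memory ordering (bits 7-4) encodings
ARRIVE, WAIT = 0, 1
DEFAULT, RELAXED = 0, 1

def cluster_barrier_flags(mode, ordering=DEFAULT):
    # bits 31-8 are reserved and left zero
    assert 0 <= mode <= 15 and 0 <= ordering <= 15
    return (ordering << 4) | mode
```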
declare void @llvm.nvvm.membar.cta()
is a memory fence at the thread block level. This intrinsic is deprecated. Please use nvvm.membar with flags as argument instead.
declare void @llvm.nvvm.membar.gl()
is a memory fence at the device level. This intrinsic is deprecated. Please use nvvm.membar with flags as argument instead.
declare void @llvm.nvvm.membar.sys()
is a memory fence at the system level. This intrinsic is deprecated. Please use nvvm.membar with flags as argument instead.
declare void @llvm.nvvm.membar(i32 %flags)
Waits until all prior memory accesses requested by this thread are performed at the membar level specified by the membar mode below. The memory barrier enforces ordering only. It makes no guarantees of execution synchronization with other threads. For execution synchronization, a barrier should be used instead of, or in addition to, membar.
The %flags argument is encoded according to the following table:
%flags bits | Meaning
---|---
31-4 | Reserved
3-0 | Membar mode (See Membar Mode Encoding below)
Membar Mode Encoding

Encoding | Mode | Description
---|---|---
0 | GLOBAL | Membar at the global level
1 | CTA | Membar at the CTA level
2 | SYSTEM | Membar at the system level
3 | RESERVED | RESERVED
4 | CLUSTER | Membar at the cluster level, only on Hopper+
5-15 | RESERVED | RESERVED
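As an illustration of the encoding table, the deprecated level-specific fences correspond to @llvm.nvvm.membar calls with the matching mode in bits 3-0 (a sketch based on the table above):

```llvm
call void @llvm.nvvm.membar(i32 0)   ; global level, replaces @llvm.nvvm.membar.gl()
call void @llvm.nvvm.membar(i32 1)   ; CTA level, replaces @llvm.nvvm.membar.cta()
call void @llvm.nvvm.membar(i32 2)   ; system level, replaces @llvm.nvvm.membar.sys()
```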
15.3. Address Space Conversion
Note
Please use the addrspacecast IR instruction for address space conversion.
15.4. Special Registers
The following intrinsic functions are provided to support reading special PTX registers:
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
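These registers are typically combined to compute a thread's global position. The following sketch computes the global x index of the calling thread (the value names are illustrative):

```llvm
; gidx = ctaid.x * ntid.x + tid.x
%tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%tmp   = mul i32 %ctaid, %ntid
%gidx  = add i32 %tmp, %tid
```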
15.5. Texture/Surface Access
The following intrinsic function is provided to convert a global texture/surface variable into a texture/surface handle.
declare i64 @llvm.nvvm.texsurf.handle.p1i64(metadata, i64 addrspace(1)*)
See Accessing Texture Memory or Surface Memory for details.
The following IR definitions apply to all intrinsics in this section:
%float4 = type { float, float, float, float }
%long2 = type { i64, i64 }
%int4 = type { i32, i32, i32, i32 }
%int2 = type { i32, i32 }
%short4 = type { i16, i16, i16, i16 }
%short2 = type { i16, i16 }
15.5.1. Texture Reads
Sampling a 1D texture:
%float4 @llvm.nvvm.tex.unified.1d.v4f32.s32(i64 %tex, i32 %x)
%float4 @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %tex, float %x)
%float4 @llvm.nvvm.tex.unified.1d.level.v4f32.f32(i64 %tex, float %x,
float %level)
%float4 @llvm.nvvm.tex.unified.1d.grad.v4f32.f32(i64 %tex, float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.v4s32.s32(i64 %tex, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.v4s32.f32(i64 %tex, float %x)
%int4 @llvm.nvvm.tex.unified.1d.level.v4s32.f32(i64 %tex, float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.grad.v4s32.f32(i64 %tex, float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.v4u32.s32(i64 %tex, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.v4u32.f32(i64 %tex, float %x)
%int4 @llvm.nvvm.tex.unified.1d.level.v4u32.f32(i64 %tex, float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.grad.v4u32.f32(i64 %tex, float %x,
float %dPdx,
float %dPdy)
Sampling a 1D texture array:
%float4 @llvm.nvvm.tex.unified.1d.array.v4f32.s32(i64 %tex, i32 %idx, i32 %x)
%float4 @llvm.nvvm.tex.unified.1d.array.v4f32.f32(i64 %tex, i32 %idx, float %x)
%float4 @llvm.nvvm.tex.unified.1d.array.level.v4f32.f32(i64 %tex, i32 %idx,
float %x,
float %level)
%float4 @llvm.nvvm.tex.unified.1d.array.grad.v4f32.f32(i64 %tex, i32 %idx,
float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.array.v4s32.s32(i64 %tex, i32 %idx, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.array.v4s32.f32(i64 %tex, i32 %idx, float %x)
%int4 @llvm.nvvm.tex.unified.1d.array.level.v4s32.f32(i64 %tex, i32 %idx,
float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.array.grad.v4s32.f32(i64 %tex, i32 %idx,
float %x,
float %dPdx,
float %dPdy)
%int4 @llvm.nvvm.tex.unified.1d.array.v4u32.s32(i64 %tex, i32 %idx, i32 %x)
%int4 @llvm.nvvm.tex.unified.1d.array.v4u32.f32(i64 %tex, i32 %idx, float %x)
%int4 @llvm.nvvm.tex.unified.1d.array.level.v4u32.f32(i64 %tex, i32 %idx,
float %x,
float %level)
%int4 @llvm.nvvm.tex.unified.1d.array.grad.v4u32.f32(i64 %tex, i32 %idx,
float %x,
float %dPdx,
float %dPdy)
Sampling a 2D texture:
%float4 @llvm.nvvm.tex.unified.2d.v4f32.s32(i64 %tex, i32 %x, i32 %y)
%float4 @llvm.nvvm.tex.unified.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tex.unified.2d.level.v4f32.f32(i64 %tex, float %x, float %y,
float %level)
%float4 @llvm.nvvm.tex.unified.2d.grad.v4f32.f32(i64 %tex, float %x, float %y,
float %dPdx_x, float %dPdx_y,
float %dPdy_x, float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.v4s32.s32(i64 %tex, i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.level.v4s32.f32(i64 %tex, float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.grad.v4s32.f32(i64 %tex, float %x, float %y,
float %dPdx_x, float %dPdx_y,
float %dPdy_x, float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.v4u32.s32(i64 %tex, i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.level.v4u32.f32(i64 %tex, float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.grad.v4u32.f32(i64 %tex, float %x, float %y,
float %dPdx_x, float %dPdx_y,
float %dPdy_x, float %dPdy_y)
Sampling a 2D texture array:
%float4 @llvm.nvvm.tex.unified.2d.array.v4f32.s32(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%float4 @llvm.nvvm.tex.unified.2d.array.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y)
%float4 @llvm.nvvm.tex.unified.2d.array.level.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %level)
%float4 @llvm.nvvm.tex.unified.2d.array.grad.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %dPdx_x,
float %dPdx_y,
float %dPdy_x,
float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4s32.s32(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.array.level.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.array.grad.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %dPdx_x,
float %dPdx_y,
float %dPdy_x,
float %dPdy_y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4u32.s32(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int4 @llvm.nvvm.tex.unified.2d.array.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y)
%int4 @llvm.nvvm.tex.unified.2d.array.level.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %level)
%int4 @llvm.nvvm.tex.unified.2d.array.grad.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %dPdx_x,
float %dPdx_y,
float %dPdy_x,
float %dPdy_y)
Sampling a 3D texture:
%float4 @llvm.nvvm.tex.unified.3d.v4f32.s32(i64 %tex, i32 %x, i32 %y, i32 %z)
%float4 @llvm.nvvm.tex.unified.3d.v4f32.f32(i64 %tex, float %x, float %y,
float %z)
%float4 @llvm.nvvm.tex.unified.3d.level.v4f32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%float4 @llvm.nvvm.tex.unified.3d.grad.v4f32.f32(i64 %tex, float %x, float %y,
float %z, float %dPdx_x,
float %dPdx_y, float %dPdx_z,
float %dPdy_x, float %dPdy_y,
float %dPdy_z)
%int4 @llvm.nvvm.tex.unified.3d.v4s32.s32(i64 %tex, i32 %x, i32 %y, i32 %z)
%int4 @llvm.nvvm.tex.unified.3d.v4s32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.3d.level.v4s32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.3d.grad.v4s32.f32(i64 %tex, float %x, float %y,
float %z, float %dPdx_x,
float %dPdx_y, float %dPdx_z,
float %dPdy_x, float %dPdy_y,
float %dPdy_z)
%int4 @llvm.nvvm.tex.unified.3d.v4u32.s32(i64 %tex, i32 %x, i32 %y, i32 %z)
%int4 @llvm.nvvm.tex.unified.3d.v4u32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.3d.level.v4u32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.3d.grad.v4u32.f32(i64 %tex, float %x, float %y,
float %z, float %dPdx_x,
float %dPdx_y, float %dPdx_z,
float %dPdy_x, float %dPdy_y,
float %dPdy_z)
Sampling a cube texture:
%float4 @llvm.nvvm.tex.unified.cube.v4f32.f32(i64 %tex, float %x, float %y,
float %z)
%float4 @llvm.nvvm.tex.unified.cube.level.v4f32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.cube.v4s32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.cube.level.v4s32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.cube.v4u32.f32(i64 %tex, float %x, float %y,
float %z)
%int4 @llvm.nvvm.tex.unified.cube.level.v4u32.f32(i64 %tex, float %x, float %y,
float %z, float %level)
Sampling a cube texture array:
%float4 @llvm.nvvm.tex.unified.cube.array.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z)
%float4 @llvm.nvvm.tex.unified.cube.array.level.v4f32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z,
float %level)
%int4 @llvm.nvvm.tex.unified.cube.array.v4s32.f32(i64 %tex, i32 %idx, float %x,
float %y, float %z)
%int4 @llvm.nvvm.tex.unified.cube.array.level.v4s32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z, float %level)
%int4 @llvm.nvvm.tex.unified.cube.array.v4u32.f32(i64 %tex, i32 %idx, float %x,
float %y, float %z)
%int4 @llvm.nvvm.tex.unified.cube.array.level.v4u32.f32(i64 %tex, i32 %idx,
float %x, float %y,
float %z, float %level)
Fetching a four-texel bilerp footprint:
%float4 @llvm.nvvm.tld4.unified.r.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tld4.unified.g.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tld4.unified.b.2d.v4f32.f32(i64 %tex, float %x, float %y)
%float4 @llvm.nvvm.tld4.unified.a.2d.v4f32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.r.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.g.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.b.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.a.2d.v4s32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.r.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.g.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.b.2d.v4u32.f32(i64 %tex, float %x, float %y)
%int4 @llvm.nvvm.tld4.unified.a.2d.v4u32.f32(i64 %tex, float %x, float %y)
15.5.2. Surface Loads
In the following intrinsics, <clamp> represents the surface clamp mode and can be one of the following: clamp, trap, or zero.
For surface load instructions that operate on 8-bit data channels, the output operands are of type i16. The high-order eight bits are undefined.
Reading a 1D surface:
i16 @llvm.nvvm.suld.1d.i8.<clamp>(i64 %tex, i32 %x)
i16 @llvm.nvvm.suld.1d.i16.<clamp>(i64 %tex, i32 %x)
i32 @llvm.nvvm.suld.1d.i32.<clamp>(i64 %tex, i32 %x)
i64 @llvm.nvvm.suld.1d.i64.<clamp>(i64 %tex, i32 %x)
%short2 @llvm.nvvm.suld.1d.v2i8.<clamp>(i64 %tex, i32 %x)
%short2 @llvm.nvvm.suld.1d.v2i16.<clamp>(i64 %tex, i32 %x)
%int2 @llvm.nvvm.suld.1d.v2i32.<clamp>(i64 %tex, i32 %x)
%long2 @llvm.nvvm.suld.1d.v2i64.<clamp>(i64 %tex, i32 %x)
%short4 @llvm.nvvm.suld.1d.v4i8.<clamp>(i64 %tex, i32 %x)
%short4 @llvm.nvvm.suld.1d.v4i16.<clamp>(i64 %tex, i32 %x)
%int4 @llvm.nvvm.suld.1d.v4i32.<clamp>(i64 %tex, i32 %x)
Reading a 1D surface array:
i16 @llvm.nvvm.suld.1d.array.i8.<clamp>(i64 %tex, i32 %idx, i32 %x)
i16 @llvm.nvvm.suld.1d.array.i16.<clamp>(i64 %tex, i32 %idx, i32 %x)
i32 @llvm.nvvm.suld.1d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x)
i64 @llvm.nvvm.suld.1d.array.i64.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short2 @llvm.nvvm.suld.1d.array.v2i8.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short2 @llvm.nvvm.suld.1d.array.v2i16.<clamp>(i64 %tex, i32 %idx, i32 %x)
%int2 @llvm.nvvm.suld.1d.array.v2i32.<clamp>(i64 %tex, i32 %idx, i32 %x)
%long2 @llvm.nvvm.suld.1d.array.v2i64.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short4 @llvm.nvvm.suld.1d.array.v4i8.<clamp>(i64 %tex, i32 %idx, i32 %x)
%short4 @llvm.nvvm.suld.1d.array.v4i16.<clamp>(i64 %tex, i32 %idx, i32 %x)
%int4 @llvm.nvvm.suld.1d.array.v4i32.<clamp>(i64 %tex, i32 %idx, i32 %x)
Reading a 2D surface:
i16 @llvm.nvvm.suld.2d.i8.<clamp>(i64 %tex, i32 %x, i32 %y)
i16 @llvm.nvvm.suld.2d.i16.<clamp>(i64 %tex, i32 %x, i32 %y)
i32 @llvm.nvvm.suld.2d.i32.<clamp>(i64 %tex, i32 %x, i32 %y)
i64 @llvm.nvvm.suld.2d.i64.<clamp>(i64 %tex, i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y)
%int2 @llvm.nvvm.suld.2d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y)
%long2 @llvm.nvvm.suld.2d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y)
%int4 @llvm.nvvm.suld.2d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y)
Reading a 2D surface array:
i16 @llvm.nvvm.suld.2d.array.i8.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
i16 @llvm.nvvm.suld.2d.array.i16.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
i32 @llvm.nvvm.suld.2d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
i64 @llvm.nvvm.suld.2d.array.i64.<clamp>(i64 %tex, i32 %idx, i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.array.v2i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%short2 @llvm.nvvm.suld.2d.array.v2i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int2 @llvm.nvvm.suld.2d.array.v2i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%long2 @llvm.nvvm.suld.2d.array.v2i64.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.array.v4i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%short4 @llvm.nvvm.suld.2d.array.v4i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
%int4 @llvm.nvvm.suld.2d.array.v4i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y)
Reading a 3D surface:
i16 @llvm.nvvm.suld.3d.i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
i16 @llvm.nvvm.suld.3d.i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
i32 @llvm.nvvm.suld.3d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
i64 @llvm.nvvm.suld.3d.i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%short2 @llvm.nvvm.suld.3d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%short2 @llvm.nvvm.suld.3d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%int2 @llvm.nvvm.suld.3d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%long2 @llvm.nvvm.suld.3d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z)
%short4 @llvm.nvvm.suld.3d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %z)
%short4 @llvm.nvvm.suld.3d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %z)
%int4 @llvm.nvvm.suld.3d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %z)
15.5.3. Surface Stores
In the following intrinsics, <clamp> represents the surface clamp mode. It is trap for the formatted stores, and can be one of the following for unformatted stores: clamp, trap, or zero.
For surface store instructions that operate on 8-bit data channels, the input operands are of type i16. The high-order eight bits are ignored.
Writing a 1D surface:
;; Unformatted
void @llvm.nvvm.sust.b.1d.i8.<clamp>(i64 %tex, i32 %x, i16 %r)
void @llvm.nvvm.sust.b.1d.i16.<clamp>(i64 %tex, i32 %x, i16 %r)
void @llvm.nvvm.sust.b.1d.i32.<clamp>(i64 %tex, i32 %x, i32 %r)
void @llvm.nvvm.sust.b.1d.i64.<clamp>(i64 %tex, i32 %x, i64 %r)
void @llvm.nvvm.sust.b.1d.v2i8.<clamp>(i64 %tex, i32 %x, i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.v2i16.<clamp>(i64 %tex, i32 %x, i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %r, i32 %g)
void @llvm.nvvm.sust.b.1d.v2i64.<clamp>(i64 %tex, i32 %x, i64 %r, i64 %g)
void @llvm.nvvm.sust.b.1d.v4i8.<clamp>(i64 %tex, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.v4i16.<clamp>(i64 %tex, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.v4i32.<clamp>(i64 %tex, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.1d.i32.<clamp>(i64 %tex, i32 %x, i32 %r)
void @llvm.nvvm.sust.p.1d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %r, i32 %g)
void @llvm.nvvm.sust.p.1d.v4i32.<clamp>(i64 %tex, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 1D surface array:
;; Unformatted
void @llvm.nvvm.sust.b.1d.array.i8.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r)
void @llvm.nvvm.sust.b.1d.array.i16.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r)
void @llvm.nvvm.sust.b.1d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r)
void @llvm.nvvm.sust.b.1d.array.i64.<clamp>(i64 %tex, i32 %idx, i32 %x,
i64 %r)
void @llvm.nvvm.sust.b.1d.array.v2i8.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.array.v2i16.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.1d.array.v2i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.1d.array.v2i64.<clamp>(i64 %tex, i32 %idx, i32 %x,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.1d.array.v4i8.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.array.v4i16.<clamp>(i64 %tex, i32 %idx, i32 %x,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.1d.array.v4i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.1d.array.i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r)
void @llvm.nvvm.sust.p.1d.array.v2i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.1d.array.v4i32.<clamp>(i64 %tex, i32 %idx, i32 %x,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 2D surface:
;; Unformatted
void @llvm.nvvm.sust.b.2d.i8.<clamp>(i64 %tex, i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.i16.<clamp>(i64 %tex, i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.b.2d.i64.<clamp>(i64 %tex, i32 %x, i32 %y, i64 %r)
void @llvm.nvvm.sust.b.2d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.2d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.2d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.2d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.p.2d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.2d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 2D surface array:
;; Unformatted
void @llvm.nvvm.sust.b.2d.array.i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.array.i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i16 %r)
void @llvm.nvvm.sust.b.2d.array.i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.b.2d.array.i64.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i64 %r)
void @llvm.nvvm.sust.b.2d.array.v2i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.array.v2i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.2d.array.v2i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.2d.array.v2i64.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.2d.array.v4i8.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.array.v4i16.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.2d.array.v4i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.2d.array.i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y, i32 %r)
void @llvm.nvvm.sust.p.2d.array.v2i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.2d.array.v4i32.<clamp>(i64 %tex, i32 %idx,
i32 %x, i32 %y,
i32 %r, i32 %g, i32 %b, i32 %a)
Writing a 3D surface:
;; Unformatted
void @llvm.nvvm.sust.b.3d.i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i16 %r)
void @llvm.nvvm.sust.b.3d.i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i16 %r)
void @llvm.nvvm.sust.b.3d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i32 %r)
void @llvm.nvvm.sust.b.3d.i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i64 %r)
void @llvm.nvvm.sust.b.3d.v2i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.3d.v2i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g)
void @llvm.nvvm.sust.b.3d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g)
void @llvm.nvvm.sust.b.3d.v2i64.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i64 %r, i64 %g)
void @llvm.nvvm.sust.b.3d.v4i8.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.3d.v4i16.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i16 %r, i16 %g, i16 %b, i16 %a)
void @llvm.nvvm.sust.b.3d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g, i32 %b, i32 %a)
;; Formatted
void @llvm.nvvm.sust.p.3d.i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z, i32 %r)
void @llvm.nvvm.sust.p.3d.v2i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g)
void @llvm.nvvm.sust.p.3d.v4i32.<clamp>(i64 %tex, i32 %x, i32 %y, i32 %z,
i32 %r, i32 %g, i32 %b, i32 %a)
15.6. Warp-level Operations
15.6.1. Barrier Synchronization
The following intrinsic performs a barrier synchronization among a subset of threads in a warp.
declare void @llvm.nvvm.bar.warp.sync(i32 %membermask)
This intrinsic causes the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before resuming execution.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
The behavior of this intrinsic is undefined if the executing thread is not in the %membermask.
For compute_62 or below, all threads in %membermask must call the same @llvm.nvvm.bar.warp.sync() in convergence, and only threads belonging to the %membermask can be active when the intrinsic is called. Otherwise, the behavior is undefined.
15.6.2. Data Movement
The following intrinsic synchronizes a subset of threads in a warp and then performs data movement among these threads.
declare {i32, i1} @llvm.nvvm.shfl.sync.i32(i32 %membermask, i32 %mode, i32 %a, i32 %b, i32 %c)
This intrinsic causes the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before reading data from other threads in the same warp.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
Each thread in the currently executing warp will compute a source lane index j based on input arguments %b, %c, and %mode. If the computed source lane index j is in range, the returned i32 value will be the value of %a from lane j; otherwise, it will be the value of %a from the current thread. If the thread corresponding to lane j is inactive, then the returned i32 value is undefined. The returned i1 value is set to 1 if the source lane j is in range, and otherwise set to 0.
The argument %mode must be a constant and its encoding is specified in the following table.
Encoding | Meaning
---|---
0 | IDX
1 | UP
2 | DOWN
3 | BFLY
Argument %b specifies a source lane or source lane offset, depending on %mode.
Argument %c contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.
The following pseudo code illustrates the semantics of this intrinsic.
wait until all threads in %membermask have arrived;
%lane[4:0] = current_lane_id; // position of thread in warp
%bval[4:0] = %b[4:0]; // source lane or lane offset (0..31)
%cval[4:0] = %c[4:0]; // clamp value
%mask[4:0] = %c[12:8];
%maxLane = (%lane[4:0] & %mask[4:0]) | (%cval[4:0] & ~%mask[4:0]);
%minLane = (%lane[4:0] & %mask[4:0]);
switch (%mode) {
case UP: %j = %lane - %bval; %pval = (%j >= %maxLane); break;
case DOWN: %j = %lane + %bval; %pval = (%j <= %maxLane); break;
case BFLY: %j = %lane ^ %bval; %pval = (%j <= %maxLane); break;
case IDX: %j = %minLane | (%bval[4:0] & ~%mask[4:0]); %pval = (%j <= %maxLane); break;
}
if (!%pval) %j = %lane; // copy from own lane
if (thread at lane %j is active)
%d = %a from lane %j
else
%d = undef
return {%d, %pval}
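The lane-index computation in the pseudo code above can be transcribed into Python for experimentation (a single-thread model; the names are illustrative and the actual data movement is not modeled):

```python
IDX, UP, DOWN, BFLY = 0, 1, 2, 3  # %mode encodings

def shfl_source_lane(mode, lane, b, c):
    # Returns (j, pval): the computed source lane index and the in-range flag.
    bval = b & 0x1f
    cval = c & 0x1f           # clamp value, %c[4:0]
    mask = (c >> 8) & 0x1f    # segment mask, %c[12:8]
    max_lane = (lane & mask) | (cval & ~mask & 0x1f)
    min_lane = lane & mask
    if mode == UP:
        j, pval = lane - bval, lane - bval >= max_lane
    elif mode == DOWN:
        j, pval = lane + bval, lane + bval <= max_lane
    elif mode == BFLY:
        j, pval = lane ^ bval, lane ^ bval <= max_lane
    else:  # IDX
        j = min_lane | (bval & ~mask & 0x1f)
        pval = j <= max_lane
    if not pval:
        j = lane              # out of range: copy from own lane
    return j, pval
```

For example, with %c = 0x1f (clamp 31, segment mask 0) a BFLY shuffle with %b = 1 exchanges values between adjacent lanes.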
Note that the return values are undefined if the thread at the source lane is not in %membermask.
The behavior of this intrinsic is undefined if the executing thread is not in the %membermask.
For compute_62 or below, all threads in %membermask must call the same @llvm.nvvm.shfl.sync.i32() in convergence, and only threads belonging to the %membermask can be active when the intrinsic is called. Otherwise, the behavior is undefined.
15.6.3. Vote
The following intrinsic synchronizes a subset of threads in a warp and then performs a reduce-and-broadcast of a predicate over all threads in the subset.
declare {i32, i1} @llvm.nvvm.vote.sync(i32 %membermask, i32 %mode, i1 %predicate)
This intrinsic causes the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before performing a reduce-and-broadcast of a predicate over all threads in the subset.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
@llvm.nvvm.vote.sync() performs a reduction of the source %predicate across all threads in %membermask after the synchronization. The return value is the same across all threads in the %membermask. The field in the returned aggregate that holds the return value depends on %mode.
The argument %mode must be a constant and its encoding is specified in the following table.
Encoding | Meaning | Return value
---|---|---
0 | ALL | The i1 value is 1 if %predicate is non-zero for all threads in %membermask
1 | ANY | The i1 value is 1 if %predicate is non-zero for any thread in %membermask
2 | EQ | The i1 value is 1 if %predicate is the same for all threads in %membermask
3 | BALLOT | The i32 value holds the ballot data
For the BALLOT mode, the i32 value represents the ballot data, which contains the %predicate value from each thread in %membermask in the bit position corresponding to the thread's lane id. The bit value corresponding to a thread not in %membermask is 0.
Note that the returned values are undefined if the thread at the source lane is not in %membermask.
The behavior of this intrinsic is undefined if the executing thread is not in the %membermask.
For compute_62 or below, all threads in %membermask must call the same @llvm.nvvm.vote.sync() in convergence, and only threads belonging to the %membermask can be active when the intrinsic is called. Otherwise, the behavior is undefined.
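A Python model of the four vote modes over a set of per-lane predicates (a sketch; the unused field of the returned pair is reported as 0 here, the names are illustrative, and synchronization is not modeled):

```python
ALL, ANY, EQ, BALLOT = 0, 1, 2, 3  # %mode encodings

def vote_sync(membermask, mode, predicates):
    # predicates: dict mapping lane id -> i1 predicate value
    lanes = [l for l in range(32) if (membermask >> l) & 1]
    vals = [bool(predicates[l]) for l in lanes]
    if mode == ALL:
        return 0, int(all(vals))
    if mode == ANY:
        return 0, int(any(vals))
    if mode == EQ:
        # 1 if the predicate is the same across all lanes in the subset
        return 0, int(all(vals) or not any(vals))
    # BALLOT: each lane's predicate lands in its own bit position
    ballot = 0
    for l in lanes:
        if predicates[l]:
            ballot |= 1 << l
    return ballot, 0
```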
15.6.4. Match
The following intrinsics synchronize a subset of threads in a warp and then broadcast and compare a value across threads in the subset.
declare i32 @llvm.nvvm.match.any.sync.i32(i32 %membermask, i32 %value)
declare i32 @llvm.nvvm.match.any.sync.i64(i32 %membermask, i64 %value)
declare {i32, i1} @llvm.nvvm.match.all.sync.i32(i32 %membermask, i32 %value)
declare {i32, i1} @llvm.nvvm.match.all.sync.i64(i32 %membermask, i64 %value)
These intrinsics cause the executing thread to wait until all threads corresponding to %membermask have executed the same intrinsic with the same %membermask value before performing a broadcast and compare of operand %value across all threads in the subset.
The argument %membermask is a 32-bit mask, with each bit corresponding to a lane in the warp. 1 means the thread is in the subset.
The i32 return value is a 32-bit mask where each bit position in the mask corresponds to a thread's lane id.
In the any version, the i32 return value is set to the mask of active threads in %membermask that have the same value as operand %value.
In the all version, if all active threads in %membermask have the same value as operand %value, the i32 return value is set to %membermask, and the i1 value is set to 1. Otherwise, the i32 return value is set to 0 and the i1 return value is also set to 0.
The behavior of these intrinsics is undefined if the executing thread is not in the %membermask.
These intrinsics are only available on compute_70
or higher.
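The any/all behaviors can be modeled in Python over the values held by the active lanes (a sketch; the names are illustrative and synchronization is not modeled):

```python
def match_any_sync(values):
    # values: dict mapping active lane id -> value held by that lane.
    # Returns, per lane, the mask of active lanes holding the same value.
    return {lane: sum(1 << other for other, w in values.items() if w == v)
            for lane, v in values.items()}

def match_all_sync(membermask, values):
    # Returns (i32, i1): (%membermask, 1) if all active lanes agree, else (0, 0)
    if len(set(values.values())) == 1:
        return membermask, 1
    return 0, 0
```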
15.6.5. Matrix Operations
THIS IS A PREVIEW FEATURE. SUPPORT MAY BE REMOVED IN FUTURE RELEASES.
NVVM provides warp-level intrinsics for matrix multiply operations. The core operation is a matrix multiply and accumulate of the form:
D = A*B + C, or
C = A*B + C
somewhere A
is einer MxK
mould, B
exists a KxN
matrix, while C
or D
are MxN
matrices. C
and D
are including called accumulators. The element type of the A
and B
datasets lives 16-bit floating point. The element style of the authorities can be either 32-bit floating indent or 16-bit floating point.
All threads in a warp will collectively hold the contents of each matrix A
, B
, C
and D
. Each thread will hold only a fragment of matrix A
, a fragment of matrix B
, a fragment of matrix C
, and a fragment of the result matrix D
. How the elements of a matrix are distributed among the fragments is opaque to the user and is different for matrix A
, B
and the accumulator.
A fragment is represented by a sequence of element values. For fp32 matrices, the element type is float
. For fp16 matrices, the element type is i32
(each i32
value holds two fp16 values). The number of elements varies with the shape of the matrix.
15.6.5.1. Load Fragments
The following intrinsics synchronize all threads in a warp and then load a fragment of a matrix for each thread.
; load fragment A
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.ld.a.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.ld.a.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.ld.a.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
; load fragment B
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.ld.b.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.ld.b.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32, i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.ld.b.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
; load fragment C
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m16n16k16.ld.c.f32.p<n>f32(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m32n8k16.ld.c.f32.p<n>f32(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m8n32k16.ld.c.f32.p<n>f32(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
; load fragment C
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.ld.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.ld.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.ld.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol);
These intrinsics load and return a matrix fragment from memory at location %ptr
. The matrix in memory must be in a canonical matrix layout with leading dimension %ldm
. %rowcol
specifies whether the matrix in memory is row-major (0) or column-major (1). %rowcol
must be a constant value.
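The addressing convention implied by %ldm and %rowcol can be sketched as follows; the helper is illustrative only, and offsets are expressed in elements rather than bytes.

```python
# Illustrative only: locating element (row, col) of a matrix stored in the
# canonical layout with leading dimension ldm, as described above.

def element_offset(row, col, ldm, rowcol):
    """Linear offset (in elements) of element (row, col) from the base
    pointer. rowcol == 0 selects row-major, rowcol == 1 column-major."""
    if rowcol == 0:
        return row * ldm + col   # rows are contiguous runs of ldm elements
    else:
        return col * ldm + row   # columns are contiguous runs of ldm elements
```

For instance, in a row-major tile embedded in a buffer with leading dimension 64, element (2, 3) lies at offset 2*64 + 3 = 131 elements from the base pointer.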
The returned sequence of values represents the fragment held by the calling thread. How the elements of a matrix are distributed among the fragments is opaque to the user and is different for matrix A
, B
and the accumulator. Therefore, three variants (i.e. ld.a
, ld.b
, and ld.c
) are provided.
These intrinsics are overloaded based on the address spaces. The address space number <n>
must be either 0 (generic), 1 (global) or 3 (shared).
The behavior of this intrinsic is undefined if any thread in the warp has exited.
These intrinsics are only available on compute_70
or higher.
15.6.5.2. Store Fragments
The following intrinsics synchronize all threads in a warp and then store a fragment of a matrix for each thread.
; The last 8 arguments are the elements of the C fragment
declare void @llvm.nvvm.hmma.m16n16k16.st.c.f32.p<n>float(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, float, float, float, float, float, float, float, float);
declare void @llvm.nvvm.hmma.m32n8k16.st.c.f32.p<n>float(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, float, float, float, float, float, float, float, float);
declare void @llvm.nvvm.hmma.m8n32k16.st.c.f32.p<n>float(float addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, float, float, float, float, float, float, float, float);
; The last 4 arguments are the elements of the C fragment
declare void @llvm.nvvm.hmma.m16n16k16.st.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, i32, i32, i32, i32);
declare void @llvm.nvvm.hmma.m32n8k16.st.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, i32, i32, i32, i32);
declare void @llvm.nvvm.hmma.m8n32k16.st.c.f16.p<n>i32(i32 addrspace(<n>)* %ptr, i32 %ldm, i32 %rowcol, i32, i32, i32, i32);
These intrinsics store an accumulator fragment to memory at location %ptr
. The matrix in memory must be in a canonical matrix layout with leading dimension %ldm
. %rowcol
specifies whether the matrix in memory is row-major (0) or column-major (1). %rowcol
must be a constant value.
These intrinsics are overloaded based on the address spaces. The address space number <n>
must be either 0 (generic), 1 (global) or 3 (shared).
The behavior of this intrinsic is undefined if any thread in the warp has exited.
These intrinsics are only available on compute_70
or higher.
15.6.5.3. Matrix Multiply-and-Accumulate
The following intrinsics synchronize all threads in a warp and then perform a matrix multiply-and-accumulate operation.
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.mma.f16.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.mma.f16.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.mma.f16.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m16n16k16.mma.f32.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m32n8k16.mma.f32.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m8n32k16.mma.f32.f16(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, i32 %c0, i32 %c1, i32 %c2, i32 %c3);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m16n16k16.mma.f32.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m32n8k16.mma.f32.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {float, float, float, float, float, float, float, float} @llvm.nvvm.hmma.m8n32k16.mma.f32.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m16n16k16.mma.f16.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m32n8k16.mma.f16.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
declare {i32, i32, i32, i32} @llvm.nvvm.hmma.m8n32k16.mma.f16.f32(i32 %rowcol, i32 %satf, i32 %a0, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7, float %c0, float %c1, float %c2, float %c3, float %c4, float %c5, float %c6, float %c7);
These intrinsics perform a matrix multiply-and-accumulate operation. %rowcol
specifies the layout of the A
and B
fragments. It must be a constant value, and may have the following values and semantics.
Encoding | Meaning |
---|---|
0 | A fragment is row-major, B fragment is row-major |
1 | A fragment is row-major, B fragment is column-major |
2 | A fragment is column-major, B fragment is row-major |
3 | A fragment is column-major, B fragment is column-major |
Support for %satf
has been removed and this operand must be a constant zero.
The behavior of these intrinsics is undefined if any thread in the warp has exited.
These intrinsics are only available on compute_70
or higher.
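The warp-wide result D = A*B + C can be checked against a scalar reference model. The sketch below is illustrative only: it operates on whole flat matrices rather than per-thread fragments (fragment distribution is opaque), and it decodes %rowcol per the encoding table above (encodings 2 and 3 make A column-major; encodings 1 and 3 make B column-major).

```python
# Illustrative scalar reference for D = A*B + C with A: m*k, B: k*n,
# and C/D: m*n (row-major) flat lists. rowcol follows the table above.

def mma(a, b, c, m, n, k, rowcol):
    a_col = bool(rowcol & 2)   # A fragment column-major for encodings 2, 3
    b_col = bool(rowcol & 1)   # B fragment column-major for encodings 1, 3
    A = lambda i, kk: a[kk * m + i] if a_col else a[i * k + kk]
    B = lambda kk, j: b[j * k + kk] if b_col else b[kk * n + j]
    d = list(c)                # accumulate on top of C
    for i in range(m):
        for j in range(n):
            d[i * n + j] += sum(A(i, kk) * B(kk, j) for kk in range(k))
    return d
```

For a 2x2x2 example with A = [[1,2],[3,4]] and B = [[5,6],[7,8]] and C = 0, the result is [[19,22],[43,50]] regardless of which %rowcol encoding is used, provided the operands are stored in the matching layouts.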
16. Source Level Debugging Support
To enable source level debugging of an IR module, NVVM IR supports debug intrinsics and debug information descriptors to express the debugging information. Debug information descriptors are represented using specialized metadata nodes. The current NVVM IR debugging metadata version is 3.1.
The current NVVM IR debugging support is based on that in LLVM 7.0.1. For the complete semantics of the IR, readers of this chapter should check the official LLVM IR specialized metadata nodes documentation (https://releases.llvm.org/7.0.1/docs/LangRef.html#specialized-metadata-nodes) and the Source Level Debugging with LLVM guide (https://releases.llvm.org/7.0.1/docs/SourceLevelDebugging.html).
The following metadata nodes need to be present in the module when debugging support is requested:
Named metadata node
!llvm.dbg.cu
Module flags metadata for the
"Debug Info Version"
flag: The behavior should be Error
. The value of the flag should be DEBUG_METADATA_VERSION
in LLVM 7.0.1, which is 3. Named metadata
!nvvmir.version
contains a metadata node with the NVVM IR major and minor version values followed by the NVVM IR debug metadata major and minor version values. The current NVVM IR debug metadata version is 3.1.
The debug level (e.g., full debug, line info only) is controlled by the DICompileUnit's
emissionKind
field: FullDebug (value: 1)
: Generate symbolic debug and line information. This requires the libNVVM -g
option to be specified at compile time. DebugDirectivesOnly (value: 3)
: Generate line information.
Source level debugging is supported only for a single debug compile unit. If there are multiple input NVVM IR modules, at most one module may have a single debug compile unit.
17. NVVM ABI for PTX
17.1. Linkage Types
The following table provides the mapping of NVVM IR linkage types associated with functions and global variables to PTX linker directives.
LLVM Linkage Type | PTX Linker Directive |
---|---|
private, internal | This is the default linkage type and does not require a linker directive. |
external (function with definition) | .visible |
external (global variable with initialization) | .visible |
external (function without definition) | .extern |
external (global variable without initialization) | .extern |
common | .common |
All other linkage types | Not supported. |
17.2. Parameter Passing and Return
The table below shows the mapping of function argument and return types in NVVM IR to PTX types.
Source Type | Size in Bits | PTX Type |
---|---|---|
Integer types | <= 32 | .u32 |
Integer types | 64 | .u64 |
Pointer types (without byval attribute) | 32 | .u32 |
Pointer types (without byval attribute) | 64 | .u64 |
Floating-point types | 32 | .f32 |
Floating-point types | 64 | .f64 |
Aggregate types | Any size | .align align .b8 name[size], where align is the overall aggregate or vector alignment in bytes, name is the variable name associated with the aggregate or vector, and size is the aggregate or vector size in bytes. |
Pointer types to aggregate with byval attribute | 32 or 64 | Passed as an aggregate (see above). |
Vector types | Any size | Passed as an aggregate (see above). |
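The aggregate row above can be made concrete with a small formatting sketch. The .param state space prefix and the helper name are illustrative assumptions; the .align/.b8/name[size] shape and the byte units come from the table's own description.

```python
# Illustrative only: forming the PTX parameter declaration for an
# aggregate, per the aggregate row above (align and size in bytes).

def aggregate_param(name, size, align):
    """Render a PTX byte-array parameter declaration for an aggregate."""
    return f".param .align {align} .b8 {name}[{size}]"

print(aggregate_param("myStruct", 16, 8))  # .param .align 8 .b8 myStruct[16]
```

So a 16-byte struct with 8-byte alignment would appear in the PTX signature as a 16-element .b8 array aligned to 8 bytes.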
18. Revision History
Version 1.0
Initial release.
Version 1.1
Added support for UVM managed variables in global property annotation. See Supported Properties.
Version 1.2
Update to LLVM 3.4 for CUDA 7.0.
Remove address space intrinsics in favor of
addrspacecast
. Add information about source level debugging support.
Version 1.3
Add support for LLVM 3.8 for CUDA 8.0.
Version 1.4
Add support for warp-level intrinsics.
Version 1.5
Add support for LLVM 5.0 for CUDA 9.2.
Version 1.6
Update to LLVM 7.0.1 for CUDA 11.2.
Version 1.7
Add support for alloca with dynamic size.
Version 1.8
Add support for i128 in data layout.
Version 1.9
Modified text about ignoring shared variable initializations.
Version 1.10
Added support for grid_constant kernel parameters for CUDA 11.7.
Version 1.11
Added support for Hopper+ cluster intrinsics and max_blocks_per_cluster kernel property in CUDA 11.8.
Deprecated support for 32-bit compilation.
Version 2.0
Updated the NVVM IR to version 2.0, which is incompatible with NVVM IR version 1.x.
Removed address space conversion intrinsics. The IR verifier on 2.0 IR will give an error when these intrinsics are present. Clients of libNVVM are advised to use the addrspacecast instruction instead.
Stricter error checking on the supported data layouts.
Older style loop unroll pragma metadata on loop backedges is no longer supported. Clients are advised to use the new loop pragma metadata defined by the LLVM community.
Shared variable initialization with non-undef values is no longer supported. In 1.x versions these initializers were silently ignored. This change makes the 2.0 version incompatible with 1.x versions.
19. Notices
19.1. Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer shall obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
19.2. OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
19.3. Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.