
Kernels from Scatter-Gather Type Operations

GPU Coder™ also supports the concept of reductions - an important exception to the rule that loop iterations must be independent. A reduction variable accumulates a value that depends on all the iterations together, but is independent of the iteration order. Reduction variables appear on both sides of an assignment statement, such as in summation, dot product, and sort. The following example shows the typical usage of a reduction variable x:

x = ...; % Some initialization of x
for i = 1:n
  x = x + d(i);
end

The variable x in each iteration gets its value either before entering the loop or from the previous iteration of the loop. This serial order of execution is not suitable for parallel execution, because each iteration depends on the result of the previous one. An alternative approach is to employ a binary tree-based approach.

In the tree-based approach, you can execute every horizontal level of the tree in parallel over a certain number of passes. When compared to sequential execution, the binary tree does require more memory because each pass requires an array of temporary values as output. The performance benefit that you receive far outweighs the cost of the increased memory usage. GPU Coder creates reduction kernels by using this tree-based approach wherein each thread block reduces a portion of the array. Parallel reduction requires partial result data exchanges between thread blocks. In older CUDA® devices, this data exchange was achieved by using shared memory and thread synchronization. Starting with the Kepler GPU architecture, CUDA provides the shuffle (shfl) instruction and fast device memory atomic operations that make reductions even faster. Reduction kernels that GPU Coder creates use the shfl_down instruction to reduce across a warp (32 threads). Then, the first thread of each warp uses the atomic operation instructions to update the reduced value.

For more information on these instructions, refer to the NVIDIA® documentation.
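
To make this pattern concrete, the following is a minimal hand-written CUDA sketch of the same strategy, not GPU Coder output: the names warpReduceSum and sumKernel are placeholders chosen for illustration, the block size is assumed to be a multiple of the warp size, and atomicAdd on double requires a device with compute capability 6.0 or later.

// Illustrative warp-shuffle reduction (hand-written sketch, not generated code).
__inline__ __device__ double warpReduceSum(double val)
{
    // Fold the 32 lanes of a warp down to lane 0 in log2(32) = 5 steps.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

__global__ void sumKernel(const double *v, double *s, int n)
{
    // Grid-stride loop: each thread accumulates a partial sum of its slice.
    double partial = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        partial += v[i];
    }

    // Reduce the partial sums across each warp with shfl_down.
    partial = warpReduceSum(partial);

    // The first thread (lane 0) of each warp updates the global result
    // atomically; *s must be zero-initialized on the device before launch.
    if ((threadIdx.x & 31u) == 0u) {
        atomicAdd(s, partial);
    }
}

This sketch mirrors the structure described above: a tree-style fold within each warp, followed by atomic updates that combine results across warps and thread blocks.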

Vector Sum Example

This example shows how to create CUDA reduction type kernels by using GPU Coder. Suppose that you want to create a vector v and compute the sum of its elements. You can implement this example as a MATLAB® function.

function s = VecSum(v)
    s = 0;
    for i = 1:length(v)
       s = s + v(i);
    end
end

Prepare vecSum for Kernel Creation

GPU Coder requires no special pragma to infer reduction kernels. In this example, use the coder.gpu.kernelfun pragma to generate CUDA reduction kernels. Use the modified VecSum function.

Note

Using the coder.gpu.kernel pragma for loops containing reductions is not recommended.

function s = VecSum(v) %#codegen
    s = 0;
    
    coder.gpu.kernelfun();
    for i = 1:length(v)
       s = s + v(i);
    end
end

Generating CUDA Code

When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs the vector sum calculation. The following is a snippet of vecSum_kernel1.

static __global__ __launch_bounds__(512, 1) void vecSum_kernel1(const real_T *v,
  real_T *s)
{
  uint32_T threadId;
  uint32_T threadStride;
  uint32_T thdBlkId;
  uint32_T idx;
  real_T tmpRed;
  thdBlkId = (threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x)
    + threadIdx.x;
  threadId = ((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y) +
              blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) + thdBlkId;
  threadStride = gridDim.x * blockDim.x * (gridDim.y * blockDim.y) * (gridDim.z *
    blockDim.z);
  if (!((int32_T)threadId >= 512)) {
    tmpRed = 0.0;
    for (idx = threadId; threadStride < 0U ? idx >= 511U : idx <= 511U; idx +=
         threadStride) {
      tmpRed += v[idx];
    }

    tmpRed = workGroupReduction1(tmpRed, 0.0);
    if (thdBlkId == 0U) {
      atomicOp1(s, tmpRed);
    }
  }
}

Before calling VecSum_kernel1, two cudaMemcpy calls transfer the vector v and the scalar s from the host to the device. The kernel has one thread block containing 512 threads per block, consistent with the size of the input vector. A third cudaMemcpy call copies the result of the computation back to the host. The following is a snippet of the main function.

  cudaMemcpy((void *)gpu_v, (void *)v, 4096ULL, cudaMemcpyHostToDevice);
  cudaMemcpy((void *)gpu_s, (void *)&s, 8ULL, cudaMemcpyHostToDevice);
  VecSum_kernel1<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_v, gpu_s);
  cudaMemcpy(&s, gpu_s, 8U, cudaMemcpyDeviceToHost);

Note

For better performance, GPU Coder gives priority to parallel kernels over reductions. If your algorithm contains a reduction inside a parallel loop, GPU Coder infers the reduction as a regular loop and generates kernels for it.

1-D Reduction Operations on the GPU

You can use the gpucoder.reduce function to generate CUDA code that performs efficient 1-D reduction operations on the GPU. The generated code uses the CUDA shuffle intrinsics to implement the reduction operation.

For example, to find the sum and max elements of an array A:

function s = myReduce(A)
   s = gpucoder.reduce(A, {@mysum, @mymax}); 
end

function c = mysum(a, b)
   c = a+b;
end

function c = mymax(a, b)
   c = max(a,b);
end

For code generation, the gpucoder.reduce function has these requirements:

  • The input must be of numeric or logical data type.

  • The function passed through the @handle must be a binary function that accepts two inputs and returns one output. The inputs and outputs must be of the same data type.

  • The function must be commutative and associative.

Note

For some inputs that are of the integer data type, the code generated for the gpucoder.reduce function may contain intermediate computations that reach saturation. In such cases, the results from the generated code may not match the simulation results from MATLAB.
