Gaussian Blur

This kernel writing tutorial is split into two parts – one to build the actual kernel routine, which can be leveraged as a template for many other use cases, and another to initialize demo data and call the kernel.

Source file: tests/static_data_tests/gaussian_blur_3x3.cpp

Concepts demonstrated

This tutorial addresses the following concepts:

  • Implementing a simple kernel.

  • Using memCpy to copy tensor data from DDR to OCM.

  • Staging the weights tensor for broadcast from the OCM to the core array.

  • Using the YX iterator pattern to “walk” through tiles in an RoI (region of interest).

  • Performing a convolution.

  • Target machine: q8 EPU.

Gaussian blur algorithm

Performing a 3x3 Gaussian Blur is a specific case of convolution of a tensor in array memory and a static 3x3 matrix (weights tensor). The matrix values are constants designed to produce a Gaussian blur of the original input tensor.
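As a point of reference, the convolution itself can be sketched on the host in plain C++, independent of the EPU API. The function name `gaussianBlur3x3` is hypothetical, and clamping at the image boundary is an assumption made for this illustration rather than a description of what the hardware does. The weights are the 1-2-1 / 2-4-2 / 1-2-1 pattern divided by 16.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference only: 3x3 Gaussian blur over a row-major float image.
// Boundary handling (clamp to the nearest valid pixel) is an assumption of this sketch.
std::vector<float> gaussianBlur3x3(const std::vector<float>& in, int height, int width) {
  static constexpr float kWeights[3][3] = {
    {1.0f / 16, 2.0f / 16, 1.0f / 16},
    {2.0f / 16, 4.0f / 16, 2.0f / 16},
    {1.0f / 16, 2.0f / 16, 1.0f / 16}};
  std::vector<float> out(in.size(), 0.0f);
  for(int y = 0; y < height; y++) {
    for(int x = 0; x < width; x++) {
      float acc = 0.0f;
      for(int dy = -1; dy <= 1; dy++) {
        for(int dx = -1; dx <= 1; dx++) {
          // Clamp neighbor coordinates so edge pixels reuse their nearest valid neighbor.
          const int yy = std::min(std::max(y + dy, 0), height - 1);
          const int xx = std::min(std::max(x + dx, 0), width - 1);
          acc += kWeights[dy + 1][dx + 1] * in[static_cast<std::size_t>(yy) * width + xx];
        }
      }
      out[static_cast<std::size_t>(y) * width + x] = acc;
    }
  }
  return out;
}
```

Because the weights sum to 1, a constant image passes through unchanged, which makes a convenient sanity check.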

Applying the algorithm

Because the full input tensor of 1080h x 1920w (x 2 bytes per element) is too large to fit in core memory, we’ll perform the convolution iteratively on smaller, equally sized regions of interest (RoIs) until we’ve processed the entire tensor and written the convolved result to an output array allocated in OCM.

Iterating through RoIs of an input tensor, a tile at a time (also called a walk of the tensor), is supported natively in the API.

The process we’ll follow amounts to a repeatable recipe that developers can use to perform an iterative convolution for a wide variety of use cases.

Writing the kernel routine

The kernel routine performs the convolution within the EPU, copying the output back to OCM one RoI at a time, then copies the result back into DDR memory on the host. Later, in Part 2 below, we’ll see how to initialize data and actually call the kernel from your host code.

Data components

The tutorial involves three tensors:

Data | Size | Description | Var
--------------- | --------- | ----------- | ----------
Input tensor | 1080x1920 | Input tensor. A grayscale image. | ddrIn
Output tensor | 1080x1920 | Output tensor. The Gaussian blur that results from the convolution performed by the kernel. | ddrOut
Weights tensor | 3x3 | The ‘mask’ tensor that operates on the input file. Created algorithmically in host code and passed by reference to the kernel. Its values determine values in the output tensor. | ddrWeights

Overview

The steps required to implement the Gaussian blur kernel are summarized below.

The tutorial assumes that the 1080h x 1920w input tensor already exists in DDR.

  1. Copy tensor data from DDR to OCM.

For this example, we start with a 1080h x 1920w grayscale image in DDR and copy it into OCM after allocating space for the input tensor, the output tensor, and the weights tensor. (The weights tensor we’ll create in Part 2 is also copied to OCM in this step.)

  2. Set up the outer loop to flow each RoI into the core.

For this example, we’ve decided to divide the input tensor in OCM into 30 identically sized (1080h x 64w) RoIs, since the full tensor won’t fit in core memory.

  3. Broadcast the weights tensor, flow an RoI of the input image into the array, perform the convolution, and send the convolved output to OCM, one RoI at a time.

These basic steps are applicable to all kinds of convolution-based use-cases, and a Gaussian blur is a great example to start with.

Let’s look at each step in detail and how it corresponds to the code example.

Copy tensor data to OCM

The input image is a 1080h x 1920w grayscale image. To bring data in from DDR to OCM, we’ll use the familiar memCpy API.

Since an image of size 1080 * 1920 * 2 bytes = ~4 MB fits in our OCM (On-Chip Memory, ~8 MB available on a q8), we don’t need to partition the image before copying from DDR.
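The size arithmetic above can be checked at compile time. This is a minimal sketch: the constant names are hypothetical, and the OCM capacity is assumed to be 8 MiB as a stand-in for the "~8 MB" figure, so check the actual value for your device.

```cpp
#include <cassert>
#include <cstdint>

// Image dimensions and element size from the tutorial (FixedPoint16 is 2 bytes).
constexpr std::int64_t kHeight          = 1080;
constexpr std::int64_t kWidth           = 1920;
constexpr std::int64_t kBytesPerElement = 2;

// Assumption: "~8 MB" is treated as 8 MiB here.
constexpr std::int64_t kAssumedOcmBytes = 8 * 1024 * 1024;

// Bytes needed for a dense height x width tensor.
constexpr std::int64_t tensorBytes(std::int64_t h, std::int64_t w, std::int64_t bytesPerElement) {
  return h * w * bytesPerElement;
}

constexpr std::int64_t kImageBytes = tensorBytes(kHeight, kWidth, kBytesPerElement);

static_assert(kImageBytes == 4147200, "1080 * 1920 * 2 bytes");
static_assert(kImageBytes <= kAssumedOcmBytes, "one image fits without partitioning");
// The kernel also keeps an output tensor of the same size in OCM.
static_assert(2 * kImageBytes <= kAssumedOcmBytes, "input plus output fit under the assumed capacity");
```

Under the assumed capacity, the input and output tensors together leave little headroom, which is worth keeping in mind before adding further OCM buffers.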

Note

When an image/tensor exceeds the size of the OCM, memCpy lets you specify coordinates that define smaller regions of the source image to transfer, referred to as Regions of Interest (RoIs).

First, we allocate space in OCM for both the input tensor and the output tensor that will be created by the convolution.

Next we do a straight memCpy from DDR to OCM for both the input tensor and the weights tensor.

With tensors in OCM, we’re ready to bring the input into the core in a loop, one RoI at a time.


The outermost loop brings the 1080p image from OCM into core memory, but only one RoI at a time, so we don’t exceed the capacity of core memory.

In the snippet below, we use the YX iterator pattern to fetch data from the OCM to the core registers (see Iterators for more about the patterns supported in the API).

    fetchTilesInRoi<DataRoiShapeDesc, IteratorType::YX_BORDER>(ocmIn, qData, 0, 0, 0, widthOffset);

Note

We’ve chosen the YX_BORDER iterator. The BORDER option ensures that each fetched tile includes its bordering elements, since a blur requires adjacent values in the calculation. Without the border, elements along the four edges of the input RoI would each be missing three of the nearest-neighbor elements required for the blur (corner elements would be missing five).
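To make the effect of a missing border concrete, the sketch below counts how many of a pixel's eight neighbors fall outside a region fetched without a border. `missingNeighbors` is a hypothetical helper written for this illustration and is not part of the API.

```cpp
#include <cassert>

// Hypothetical helper: number of the 8 neighbors of (y, x) that lie outside a
// height x width region, i.e. values a 3x3 blur could not read if the region
// were fetched without a border.
constexpr int missingNeighbors(int y, int x, int height, int width) {
  int missing = 0;
  for(int dy = -1; dy <= 1; dy++) {
    for(int dx = -1; dx <= 1; dx++) {
      if(dy == 0 && dx == 0) continue;  // skip the pixel itself
      const int yy = y + dy;
      const int xx = x + dx;
      if(yy < 0 || yy >= height || xx < 0 || xx >= width) missing++;
    }
  }
  return missing;
}

// Checked against a 1080h x 64w RoI.
static_assert(missingNeighbors(500, 32, 1080, 64) == 0, "interior pixel has all neighbors");
static_assert(missingNeighbors(500, 0, 1080, 64) == 3, "left-edge pixel misses three");
static_assert(missingNeighbors(0, 0, 1080, 64) == 5, "corner pixel misses five");
```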

In the example below, we’ve chosen to split the 1080p image along the width axis into 30 RoIs:

typedef OcmTensor<FixedPoint16<numFracBits>, 1, inputDepth, height, 64> OcmRoiShape;
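The RoI count and the column offset of each RoI follow directly from the image width and the RoI width. The sketch below reproduces that arithmetic in standalone C++. The definition of `roundUpToNearestMultiple` given here is an assumption that mirrors the name used in the kernel code later in this tutorial, and the constants are hypothetical stand-ins for the tensor shape attributes.

```cpp
#include <cassert>
#include <cstdint>

// Assumed definition: round value up to the next multiple of `multiple`.
constexpr std::int32_t roundUpToNearestMultiple(std::int32_t value, std::int32_t multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

constexpr std::int32_t kImageCols = 1920;  // width of the full tensor
constexpr std::int32_t kRoiCols   = 64;    // width of one RoI

// Number of RoIs needed to cover the full width.
constexpr std::int32_t kNumRois = roundUpToNearestMultiple(kImageCols, kRoiCols) / kRoiCols;

// Starting column of a given RoI.
constexpr std::int32_t widthOffset(std::int32_t roiNum) { return roiNum * kRoiCols; }

static_assert(kNumRois == 30, "1920 / 64");
static_assert(widthOffset(0) == 0, "first RoI starts at column 0");
static_assert(widthOffset(1) == 64, "second RoI starts one RoI width later");
static_assert(widthOffset(kNumRois - 1) + kRoiCols == kImageCols, "last RoI ends at the image edge");
```

Rounding up before dividing keeps the count correct for widths that are not an exact multiple of the RoI width, although 1920 divides evenly by 64.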

Allocating an Array

On a q8, the amount of data we can hold in the array is the number of array cores, 8 * 8 = 64 (one tile), multiplied by the size of the register file, which is 1024, multiplied by the core register width, which is 4 bytes. Because this example uses FixedPoint16, each 4-byte register holds two 2-byte elements, so we can store up to 64 * 1024 * 2 = 131,072 elements.
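The capacity calculation can be written out as a compile-time check. This is a sketch with hypothetical constant names. It assumes a depth of 1 for the grayscale RoI and does not account for any storage the border elements may need.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kCores            = 8 * 8;  // cores in the q8 array (one tile)
constexpr std::int64_t kRegistersPerCore = 1024;   // register file size
constexpr std::int64_t kRegisterBytes    = 4;      // core register width
constexpr std::int64_t kElementBytes     = 2;      // FixedPoint16

// Total number of 2-byte elements the array registers can hold.
constexpr std::int64_t arrayCapacityElements() {
  return kCores * kRegistersPerCore * (kRegisterBytes / kElementBytes);
}

// Elements in one 1080h x 64w RoI (depth assumed to be 1).
constexpr std::int64_t kRoiElements = 1080 * 64;

static_assert(arrayCapacityElements() == 131072, "64 * 1024 * 2");
static_assert(kRoiElements == 69120, "1080 * 64");
static_assert(kRoiElements <= arrayCapacityElements(), "one RoI fits in the array");
// The full image would not fit, which is why the kernel walks it RoI by RoI.
static_assert(1080 * 1920 > arrayCapacityElements(), "full tensor exceeds array capacity");
```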

Recall that to fit data efficiently into an array, we’re going to read from OCM in RoIs of size 1080h x 64w.

typedef OcmTensor<FixedPoint16<numFracBits>, 1, inputDepth, height, 64> OcmRoiShape;

On a q8, this will allow us to store 1,620 tiles in a qVar_t<FixedPoint16<7>> array as denoted in the snippet below:

  qVar_t<FixedPoint16<numFracBits>> qData[OcmRoiShape::NUM_TILES];

Note

The NUM_TILES attribute will automatically calculate the number of tiles to allocate based on a Tensor shape.

Performing Iteration and Gaussian Blur

Within the iteration, we perform these tasks:

  1. Stage the weights data for broadcast.

  2. Make the convolution API call, which also automatically “pops” the previously staged weights matrix off the broadcast bus to apply the blur to each element in the core Array.

Finally, the snippet below iterates over the RoIs and performs the blur on each one:

  /* Ensure the number of RoIs encompasses all of the data in the width dimension */
  constexpr std::int32_t numRois =
    roundUpToNearestMultiple(DdrInOutTensorShape::NUM_COLS, OcmRoiShape::NUM_COLS) / (OcmRoiShape::NUM_COLS);
  debugPrint(numRois, "numRois");

  for(std::int32_t roiNum = 0; roiNum < numRois; roiNum++) {
    std::int32_t widthOffset = roiNum * OcmRoiShape::NUM_COLS;
    debugPrint(widthOffset, "widthOffset");
    // First fetch starts at width offset 0
    // Second fetch starts at width offset 64 (OcmRoiShape::NUM_COLS)
    // Third fetch starts at width offset 128
    //! [Fetching Data]
    fetchTilesInRoi<DataRoiShapeDesc, IteratorType::YX_BORDER>(ocmIn, qData, 0, 0, 0, widthOffset);
    //! [Fetching Data]
    for(std::int32_t tileNum = 0; tileNum < OcmRoiShape::NUM_TILES; tileNum++) {
      BroadcastStream::stage<OcmWeightTensorShape, RoiShapeDesc>(ocmWeights);
      qData[tileNum] = NN::depthwiseConvTileBlockFx16<FixedPoint16<numFracBits>, 3>(qData[tileNum]);
    }
    writeTilesInRoi<DataRoiShapeDesc, IteratorType::YX_NO_BORDER>(qData, ocmOut, 0, 0, 0, widthOffset);
  }

The inner loop runs until every tile in the current RoI has been convolved, and the outer loop repeats until all RoIs have been processed.
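The fetch-with-border, convolve, write-without-border structure of the loop can be imitated on the host to see why processing RoI by RoI gives the same result as processing the whole image at once. The sketch below is plain C++ and does not use the EPU API. All names are hypothetical, and clamping at the image boundary is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<int>;  // row-major

// Clamped read from the full image (boundary policy is an assumption).
int pixelAt(const Image& img, int height, int width, int y, int x) {
  y = std::min(std::max(y, 0), height - 1);
  x = std::min(std::max(x, 0), width - 1);
  return img[static_cast<std::size_t>(y) * width + x];
}

// "Fetch": copy one RoI plus a one-pixel border into a local buffer.
Image fetchRoiWithBorder(const Image& in, int height, int width, int colBegin, int roiCols) {
  const int localW = roiCols + 2;
  Image local(static_cast<std::size_t>(height + 2) * localW);
  for(int y = -1; y <= height; y++)
    for(int x = -1; x <= roiCols; x++)
      local[static_cast<std::size_t>(y + 1) * localW + (x + 1)] = pixelAt(in, height, width, y, colBegin + x);
  return local;
}

// Blur one RoI at a time; each RoI is convolved using only its local buffer.
Image blurByRoi(const Image& in, int height, int width, int roiCols) {
  static constexpr int kWeights[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};  // sums to 16
  Image out(in.size(), 0);
  for(int colBegin = 0; colBegin < width; colBegin += roiCols) {
    const int   cols   = std::min(roiCols, width - colBegin);
    const Image local  = fetchRoiWithBorder(in, height, width, colBegin, cols);
    const int   localW = cols + 2;
    for(int y = 0; y < height; y++) {
      for(int x = 0; x < cols; x++) {
        int acc = 0;
        for(int dy = 0; dy < 3; dy++)
          for(int dx = 0; dx < 3; dx++)
            acc += kWeights[dy][dx] * local[static_cast<std::size_t>(y + dy) * localW + (x + dx)];
        // "Write": store only the interior (no border) back to the output.
        out[static_cast<std::size_t>(y) * width + colBegin + x] = acc / 16;
      }
    }
  }
  return out;
}
```

Calling `blurByRoi` with an RoI width equal to the image width processes the image in a single pass, so comparing that result with a narrower RoI width checks that the strips stitch together without seams.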

Note that we wrap up the kernel routine by copying the new tensor back from OCM to DDR, then freeing up the OCM memory where the input and output tensors were stored.

Initializing and calling the kernel

The kernel we built executes on the EPU, but some additional setup code is required on the host to initialize data and actually run it.

Overview

The three steps required to create the input tensors and call the kernel are summarized below:

  1. Allocate memory in DDR for both the input and output tensors

  2. Build and “pack” the weights tensor

  3. Call the kernel routine (passing pointers to the tensors we just built)

Allocate tensor memory in DDR

To set up before calling the kernel, we’ll first initialize the input tensor and allocate space for the output tensor. Note that we’ll pass pointers to these locations into the kernel routine in Step 3, so it can both copy the inputs into OCM and copy the output tensor back to DDR.

  DdrInOutTensorShape  ddrIn;
  DdrInOutTensorShape  ddrOut;
  DdrInOutTensorShape  ddrOutExp;
  DdrWeightTensorShape ddrWeights;

  DdrInOutTensorShape::allocate(ddrIn);
  DdrWeightTensorShape::allocate(ddrWeights);
  DdrInOutTensorShape::allocate(ddrOut);
  DdrInOutTensorShape::allocate(ddrOutExp);

  populateTensorConst(ddrIn, 127 << numFracBits);

  clearTensor(ddrWeights);
  generateWeightTensor(ddrWeights);

  TensorArg<DdrInOutTensorShape>  inputArg{&ddrIn};
  TensorArg<DdrWeightTensorShape> weightArg{&ddrWeights};
  TensorArg<DdrInOutTensorShape>  outputArg{&ddrOut, &ddrIn, 127 << numFracBits};
  packageKernel(OUTPUT_PREFIX, FUNC_ARG(gaussianBlurFx16Test), inputArg, weightArg, outputArg);

  DeviceBufferRef in, weights, out;

  auto device = DeviceManager().getDevice(callKernelGlobals.simConfig);
  device.loadKernel();
  device.allocateAndCopyToDevice(ddrIn, in);
  device.allocateAndCopyToDevice(ddrWeights, weights);
  device.allocate<DdrInOutTensorShape>(out);

  device.runKernel(ENTRYPOINT(gaussianBlurFx16Test), in, weights, out);

  device.copyBufferFromDevice(out, ddrOut);

  auto nativeCompareVisitor =
    [](const auto& t1, const auto& t2) { return nativeCompare<DdrInOutTensorShape>(t1, t2, 127 << numFracBits); };

  runtime_assert(compareTensors(ddrOut, ddrOutExp, nativeCompareVisitor), "tensor mismatch");

We also call the function that initializes the weights tensor here:

generateWeightTensor(ddrWeights);  //initialize 3x3 static tensor

Build and “pack” the weights tensor

The weights tensor is a constant 3x3 array. The Gaussian weights for a 3x3 tensor are shown below. The kernel convolves the input data with these weights to produce each output element.

Gaussian weights (each value divided by 16):

1 2 1
2 4 2
1 2 1

Although the two-dimensional tensor format shown might be the best way to visualize the 3x3 matrix, it is actually passed into API methods as a linear array, so we’ll “pack” it by mapping each data element from its two-dimensional position into its corresponding one-dimensional “packed” position.

We use a scheme that allows you to place the weights into a linear tensor of shape: batch = 1, channel = 1, height = 1, width = 12.

If the 2D array pattern is numbered like this (position 4 would be the center weight of a 3x3 Kernel)…

0 1 2
3 4 5
6 7 8

The positions above are mapped to fit a linear DdrTensor<FixedPoint16<7>, 1, 1, 1, 12>, like this:

[4, x, x, x, 1, 7, 5, 3, 2, 6, 8, 0]

The 'x' elements shown above represent 0’s and are ignored by the convolution operation.

You can see this operation performed in the code snippet below, which is run on the host machine prior to kernel execution:

void generateWeightTensor(DdrWeightTensorShape& ddrWeights) {
#ifndef __epu__
  /* Weights are packed in the following order
    Assuming the following format:
    0 1 2
    3 4 5
    6 7 8
    [4, x, x, x, 1, 7, 5, 3, 2, 6, 8, 0] where x is 0
  */
  auto  weights          = DdrWeightTensorShape::cast(ddrWeights);
  float scaleFactor      = 16.0;
  (*weights)[0][0][0][0] = floatToFixed<numFracBits>(4.0 / scaleFactor);
  for(std::int32_t i = 4; i < numberOfWeightsForGaussianBlurFx16; i++) {
    (*weights)[0][0][0][i] =
      (i < 8) ? floatToFixed<numFracBits>(2.0 / scaleFactor) : floatToFixed<numFracBits>(1.0 / scaleFactor);
  }
#endif  // !__epu__
}

Notice:

  • the center position in the 2D array is packed into the 0th position by default.

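  • the 8 other values are mapped from 2D into the linear array starting at index 4: first the four edge neighbors in N, S, E, W order (positions 1, 7, 5, 3), then the four corners (positions 2, 6, 8, 0).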

Note also that it’s convenient to map the data algorithmically simply because the symmetry in the 2D Gaussian along the YX and diagonal axes makes the values fall into discrete index ranges in the linear packing pattern. We could just as easily build the linear weights array explicitly.
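An explicit version of the packing can be sketched as a lookup table that records which 2D position lands in each linear slot. This is standalone host C++ for illustration. `toFixed` is a hypothetical stand-in for `floatToFixed<numFracBits>` whose rounding behavior is assumed, and `packWeights` and `kPackedSource` are names invented for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kNumFracBits = 7;  // matches FixedPoint16<7>

// Hypothetical stand-in for floatToFixed<numFracBits>: scale, then round to nearest.
constexpr std::int16_t toFixed(double value) {
  return static_cast<std::int16_t>(value * (1 << kNumFracBits) + (value >= 0 ? 0.5 : -0.5));
}

// 2D position (0..8, row-major) stored at each packed index; -1 marks an unused slot.
constexpr std::array<int, 12> kPackedSource = {4, -1, -1, -1, 1, 7, 5, 3, 2, 6, 8, 0};

// Packs a row-major 3x3 kernel into the 12-element linear layout.
std::array<std::int16_t, 12> packWeights(const std::array<double, 9>& kernel3x3) {
  std::array<std::int16_t, 12> packed{};  // unused slots stay 0
  for(std::size_t i = 0; i < packed.size(); i++) {
    if(kPackedSource[i] >= 0) {
      packed[i] = toFixed(kernel3x3[static_cast<std::size_t>(kPackedSource[i])]);
    }
  }
  return packed;
}
```

For the Gaussian kernel this yields raw values of 32 for the center, 16 for the four edge neighbors, and 8 for the four corners. Unlike the range-based loop, the table form also works for 3x3 kernels that are not symmetric.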

Calling the kernel

Finally, we’ll call the kernel routine we built in Part 1 above, passing it references to the three tensor objects we just initialized in DDR.

  DdrInOutTensorShape  ddrIn;
  DdrInOutTensorShape  ddrOut;
  DdrInOutTensorShape  ddrOutExp;
  DdrWeightTensorShape ddrWeights;

  DdrInOutTensorShape::allocate(ddrIn);
  DdrWeightTensorShape::allocate(ddrWeights);
  DdrInOutTensorShape::allocate(ddrOut);
  DdrInOutTensorShape::allocate(ddrOutExp);

  populateTensorConst(ddrIn, 127 << numFracBits);

  clearTensor(ddrWeights);
  generateWeightTensor(ddrWeights);

  TensorArg<DdrInOutTensorShape>  inputArg{&ddrIn};
  TensorArg<DdrWeightTensorShape> weightArg{&ddrWeights};
  TensorArg<DdrInOutTensorShape>  outputArg{&ddrOut, &ddrIn, 127 << numFracBits};
  packageKernel(OUTPUT_PREFIX, FUNC_ARG(gaussianBlurFx16Test), inputArg, weightArg, outputArg);

  DeviceBufferRef in, weights, out;

  auto device = DeviceManager().getDevice(callKernelGlobals.simConfig);
  device.loadKernel();
  device.allocateAndCopyToDevice(ddrIn, in);
  device.allocateAndCopyToDevice(ddrWeights, weights);
  device.allocate<DdrInOutTensorShape>(out);

  device.runKernel(ENTRYPOINT(gaussianBlurFx16Test), in, weights, out);

  device.copyBufferFromDevice(out, ddrOut);

  auto nativeCompareVisitor =
    [](const auto& t1, const auto& t2) { return nativeCompare<DdrInOutTensorShape>(t1, t2, 127 << numFracBits); };

  runtime_assert(compareTensors(ddrOut, ddrOutExp, nativeCompareVisitor), "tensor mismatch");

Further Improvements

It’s possible to further optimize the code demonstrated in the tutorial.

For example, by utilizing ping-pong buffers (i.e. starting the fetch of the next RoI from OCM before the current RoI has been computed), the memory fetching overhead can be further reduced since subsequent RoIs are fetched asynchronously.

We can also consider using int8 as a data type: at one byte per element instead of two, it halves the data moved per element and results in a more efficient data flow.

Key concepts

This tutorial demonstrated several foundational concepts that you’ll use repeatedly to implement kernel routines and process data in the core array.

  • Broadcast requirements and conventions: The broadcast function is used to transfer constant data that is common to every core calculation, a tile at a time.

    It’s important to stage data before each call to any API method whose function includes “popping” data off the broadcast bus. That’s why the example repeatedly stages the weights tensor data within the inner loop. After running an operation on the core Array, the constant data must be staged again.

  • Borders: By definition, the blur is calculated for a given pixel in the input tensor by using the values of adjacent pixels. Because of that, when iterating through RoIs, the algorithm still needs pixels outside of the RoI for the calculation. Using the BORDER version of the iterator is what automatically brings in those border pixels for purposes of the calculation. The architecture supports this without the need to allocate additional core memory, since there are border cores built into the hardware specifically for this.

  • Weights: Weights in the mask tensor are represented by the FixedPoint16 type with 7 fractional bits (2^7 = 128 fractional steps per unit). We broadcast this tensor to the current tile, since each core in the tile uses it to do the convolution.
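The fixed-point representation can be illustrated with generic Q-format arithmetic. This is a sketch in standalone C++ with hypothetical names. It shows how raw 16-bit values map to real numbers and how a weighted sum of a constant input works out, but it makes no claim about how the EPU rounds or accumulates internally.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kNumFracBits = 7;  // FixedPoint16<7>: step of 1/128

// Real value represented by a raw 16-bit fixed-point number.
constexpr double toReal(std::int16_t raw) {
  return static_cast<double>(raw) / (1 << kNumFracBits);
}

// Generic fixed-point multiply: widen, multiply, then drop the extra fractional bits.
constexpr std::int32_t fixedMul(std::int16_t a, std::int16_t b) {
  return (static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b)) >> kNumFracBits;
}

constexpr std::int16_t kPixel  = 127 << kNumFracBits;  // 16256, the constant input used by the host code
constexpr std::int16_t kCenter = 32;                   // 4/16 in raw form
constexpr std::int16_t kEdge   = 16;                   // 2/16 in raw form
constexpr std::int16_t kCorner = 8;                    // 1/16 in raw form

// Weighted sum for one output element when every input element equals kPixel.
constexpr std::int32_t blurOfConstant() {
  return fixedMul(kPixel, kCenter) + 4 * fixedMul(kPixel, kEdge) + 4 * fixedMul(kPixel, kCorner);
}

static_assert(toReal(kCenter) == 0.25, "32 / 128");
static_assert(toReal(kPixel) == 127.0, "16256 / 128");
static_assert(blurOfConstant() == kPixel, "weights sum to 1, so a constant input is unchanged");
```

This matches the host-side check in the tutorial, which compares the output against the same constant, 127 << numFracBits, that was used to fill the input.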