Gaussian Blur

This kernel writing tutorial is split into two parts – one to build the actual kernel routine, which can be leveraged as a template for many other use cases, and another to initialize demo data and call the kernel.

Concepts demonstrated

This tutorial addresses the following concepts:

  • Implementing a simple kernel.

  • Using memCpy to copy tensor data from DDR to OCM.

  • Using the YX iterator pattern to “walk” through tiles in an RoI (region of interest).

  • Target machine: q8 EPU.

Gaussian blur algorithm

A 3x3 Gaussian blur is a specific case of convolution between a tensor in array memory and a static 3x3 matrix (the weights tensor). The matrix values are constants chosen to produce a Gaussian blur of the original input tensor.
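To make the convolution concrete, here is a minimal sketch in plain C++ of blurring a single pixel. It is not the kernel's implementation: the weights (1, 2, 1 / 2, 4, 2 / 1, 2, 1, divided by 16) are an assumption, chosen because they are a common 3x3 approximation of a Gaussian, and the names kGaussian3x3 and blurPixel are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed weights: a common 3x3 binomial approximation of a Gaussian.
// The weights sum to 16, so the result is divided by 16 to preserve brightness.
constexpr std::array<std::array<int, 3>, 3> kGaussian3x3 = {{
    {1, 2, 1},
    {2, 4, 2},
    {1, 2, 1},
}};

// Blur one interior pixel (y, x) of a row-major image with `width` columns.
// The caller must pass an interior pixel so that all 8 neighbors exist.
int blurPixel(const std::vector<int>& image, std::size_t width,
              std::size_t y, std::size_t x) {
    assert(y >= 1 && x >= 1 && x + 1 < width);
    assert((y + 2) * width <= image.size());  // row y + 1 must exist
    int sum = 0;
    for (std::size_t ky = 0; ky < 3; ++ky) {
        for (std::size_t kx = 0; kx < 3; ++kx) {
            sum += kGaussian3x3[ky][kx] *
                   image[(y + ky - 1) * width + (x + kx - 1)];
        }
    }
    return sum / 16;
}
```

With these weights a constant image is left unchanged by the blur, which is why a constant-valued input is a convenient test case.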

Applying the algorithm

Because the full input tensor of 1080h x 1920w (x2 bytes) is too large to fit in core memory, we’ll perform the convolution iteratively, on smaller, equally sized regions of interest (RoIs), until we’ve processed the entire tensor and written the convolved result to an output tensor allocated in OCM.

Iterating through RoIs of an input tensor, a tile at a time (also called a walk of the tensor), is supported natively in the API.

The process we’ll follow amounts to a repeatable recipe that developers can use to perform an iterative convolution for a wide variety of use cases.

Writing the kernel routine

The kernel routine performs the convolution within the EPU, copying the output back to OCM one RoI at a time, and then copies the result back into DDR memory on the host. Later, in the second part of this tutorial, we’ll see how to initialize data and actually call the kernel from your host code.

Data components

The tutorial involves three tensors:





  • Input tensor. A grayscale image.

  • Output tensor. The Gaussian blur that results from the convolution performed by the kernel.



The steps required to implement the Gaussian blur kernel are summarized below.

The tutorial assumes that the 1080h x 1920w input tensor already exists in DDR.

  1. Copy tensor data from DDR to OCM

For this example, we start with a 1080h x 1920w grayscale image in DDR and copy it into OCM after allocating space for the input tensor and the output tensor.

  2. Set up the outer loop to flow each RoI into the core.

For this example, we’ve decided to divide the input tensor in OCM into 30 identically sized (1080h x 64w) RoIs, since the full tensor won’t fit in core memory.
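The split along the width axis can be sketched in plain C++ as follows. This is only an illustration of the arithmetic (1920 / 64 = 30 RoIs), not SDK code, and the function name roiStartColumns is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the starting column of each RoI when a tensor with `width` columns
// is split into equally sized RoIs of `roiWidth` columns each.
std::vector<std::size_t> roiStartColumns(std::size_t width, std::size_t roiWidth) {
    assert(roiWidth > 0 && width % roiWidth == 0);  // equal-sized RoIs only
    std::vector<std::size_t> starts;
    for (std::size_t x = 0; x < width; x += roiWidth) {
        starts.push_back(x);
    }
    return starts;
}
```

For a 1920-wide tensor and 64-wide RoIs this yields 30 start columns, from 0 up to 1856.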

  3. Flow an RoI of the input image into the array, perform the convolution, and send the convolved output to OCM, one RoI at a time.

These basic steps are applicable to all kinds of convolution-based use-cases, and a Gaussian blur is a great example to start with.

Let’s look at each step in detail and how it corresponds to the code example.

Copy tensor data to OCM

The input image is a 1080h x 1920w grayscale image. To bring data in from DDR to OCM, we’ll use the familiar memCpy API.

Because an image of 1080 * 1920 * 2 bytes (about 4 MB) fits in OCM (on-chip memory, about 8 MB available on a q8), we don’t need to partition the image before copying it from DDR.
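The size check behind that decision is simple arithmetic, sketched here in plain C++. The helper names are hypothetical, and the capacity is passed in as a parameter because the exact usable OCM size is not stated beyond "about 8 MB".

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of a dense height x width tensor.
std::size_t tensorBytes(std::size_t height, std::size_t width,
                        std::size_t bytesPerElement) {
    assert(bytesPerElement > 0);
    return height * width * bytesPerElement;
}

// True when a tensor of `bytes` bytes can be copied whole into a memory of
// `capacityBytes` bytes, so it does not need to be partitioned into RoIs.
bool fitsWithoutPartitioning(std::size_t bytes, std::size_t capacityBytes) {
    return bytes <= capacityBytes;
}
```

For the tutorial's image, tensorBytes(1080, 1920, 2) is 4,147,200 bytes, which is comfortably below a capacity of roughly 8 MB.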


When an image or tensor exceeds the size of the OCM, memCpy lets you specify coordinates that define smaller regions of the source image to transfer, referred to as regions of interest (RoIs).

First, we allocate space in OCM for both the input tensor and the output tensor that will be created by the convolution.

Next we do a straight memCpy from DDR to OCM for the input tensor.

  OcmInOutTensorShape ocmIn;
  OcmInOutTensorShape ocmOut;

  memCpy(ddrIn, ocmIn);

With tensors in OCM, we’re ready to bring the input into the core in a loop, one RoI at a time.

Set up the outer loop to flow each RoI into the core

The outermost loop brings the 1080p image from OCM into core memory, but only one RoI at a time, so we don’t exceed the capacity of core memory.

We’re using the YX iterator pattern, shown in the snippet below, to fetch data from OCM into the core registers (see Iterators for more about the patterns supported in the API).


We’ve chosen the YX_BORDER iterator. The BORDER option ensures that each tile we operate on includes the bordering elements from the image, since a blur uses adjacent values in its calculation. Without the border, the elements along all four edges of the input RoI would be missing at least three of the nearest-neighbor elements required for the blur.
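To see why the border matters, the following plain C++ sketch counts how many of an element's 8 nearest neighbors lie outside an RoI. It is an illustration only and uses no SDK calls; the function name missingNeighbors is hypothetical.

```cpp
#include <cassert>

// Count how many of the 8 nearest neighbors of element (y, x) fall outside
// an RoI of `height` x `width` elements, that is, how many values a 3x3 blur
// could not read if no border were supplied.
int missingNeighbors(int y, int x, int height, int width) {
    assert(height > 0 && width > 0);
    assert(y >= 0 && y < height && x >= 0 && x < width);
    int missing = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dy == 0 && dx == 0) continue;  // skip the element itself
            const int ny = y + dy;
            const int nx = x + dx;
            if (ny < 0 || ny >= height || nx < 0 || nx >= width) {
                ++missing;
            }
        }
    }
    return missing;
}
```

For a 1080h x 64w RoI, an element on an edge is missing three neighbors, a corner element is missing five, and an interior element is missing none.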

In the example below, we’ve chosen to split the 1080p image along the width axis into 30 RoIs:

typedef OcmTensor<FixedPoint16<numFracBits>, 1, inputDepth, height, 64> OcmRoiShape;

Allocating an Array

On a q8, the amount of data we can store is bounded by the number of array cores, 8 * 8 = 64 (one tile), multiplied by the size of the register file, 1024 registers, multiplied by the core register width, 4 bytes. Because this example uses FixedPoint16, each 4-byte register holds two 2-byte elements, so we can store up to 64 * 1024 * 2 = 131,072 elements.
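The capacity calculation can be written out as compile-time arithmetic. This is a sketch in plain C++ using only the figures quoted above; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kArrayCores = 8 * 8;       // one tile of cores on a q8
constexpr std::size_t kRegistersPerCore = 1024;  // register file size
constexpr std::size_t kRegisterBytes = 4;        // core register width
constexpr std::size_t kElementBytes = 2;         // FixedPoint16

// Maximum number of FixedPoint16 elements the array can hold at once.
constexpr std::size_t arrayCapacityElements() {
    return kArrayCores * kRegistersPerCore * (kRegisterBytes / kElementBytes);
}

// True when an RoI of `height` x `width` elements fits in the array.
bool roiFitsInArray(std::size_t height, std::size_t width) {
    assert(height > 0 && width > 0);
    return height * width <= arrayCapacityElements();
}

static_assert(arrayCapacityElements() == 131072, "64 * 1024 * 2 elements");
```

A 1080h x 64w RoI has 69,120 elements and fits; the full 1080h x 1920w tensor has 2,073,600 elements and does not.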

Recall that to fit data efficiently into the array, we’re going to read from OCM in RoIs of size 1080h x 64w.

typedef OcmTensor<FixedPoint16<numFracBits>, 1, inputDepth, height, 64> OcmRoiShape;

On a q8, this will allow us to store 1,620 tiles in a qVar_t<FixedPoint16<7>> array as denoted in the snippet below:


The NUM_TILES attribute will automatically calculate the number of tiles to allocate based on a Tensor shape.

Performing Iteration and Gaussian Blur

Within the iteration, we just need to make use of the image::blur function to apply the blur to each element in the core Array. This function requires a template parameter for kernel size (3 in this case) and it applies a standard Gaussian blur on the tile. The output of the blur is written back into the input tile.


Finally, the snippet below shows the iteration over our RoIs and the blur applied within it:

The iteration continues until all tiles in the RoI have been exhausted.

Note that we wrap up the kernel routine by copying the new tensor back from OCM to DDR, then freeing up the OCM memory where the input and output tensors were stored.

Initializing and calling the kernel

The kernel we built executes on the EPU, but some additional setup code is required on the host to initialize data and actually run it.


The steps required to create the input tensors and call the kernel are summarized below:

  1. Allocate memory in DDR for both the input and output tensors

  2. Call the kernel routine (passing pointers to the tensors we just built)

Allocate tensor memory in DDR

We’ll first initialize the input tensor and allocate space for the output tensor. Note that we’ll pass pointers to these locations into the kernel routine in the next step, so it can both copy the inputs into OCM and copy the output tensor back to DDR.

  DdrInOutTensorShape ddrIn;
  DdrInOutTensorShape ddrOut;
  DdrInOutTensorShape ddrOutExp;


  populateTensorConst(ddrIn, 127 << numFracBits);

  TensorArg<DdrInOutTensorShape> inputArg{&ddrIn};
  TensorArg<DdrInOutTensorShape> outputArg{&ddrOut, &ddrIn, 127 << numFracBits};
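The expression 127 << numFracBits above encodes the integer 127 as a fixed-point value. The plain C++ sketch below illustrates that encoding under two assumptions: that the raw value is the number multiplied by 2^fracBits, which is the usual fixed-point convention, and that numFracBits is 7, as in the FixedPoint16<7> type mentioned earlier. The helper names are hypothetical and are not part of the SDK.

```cpp
#include <cassert>
#include <cstdint>

// Encode a non-negative integer as a raw 16-bit fixed-point value with
// `fracBits` fractional bits (assumed convention: raw = value * 2^fracBits).
std::int16_t toFixed16(int value, int fracBits) {
    assert(value >= 0 && value <= INT16_MAX);
    assert(fracBits >= 0 && fracBits < 15);
    const std::int32_t raw = static_cast<std::int32_t>(value) << fracBits;
    assert(raw <= INT16_MAX);  // the encoded value must fit in 16 bits
    return static_cast<std::int16_t>(raw);
}

// Recover the integer part of a non-negative raw fixed-point value.
int integerPart(std::int16_t raw, int fracBits) {
    assert(raw >= 0 && fracBits >= 0 && fracBits < 15);
    return raw >> fracBits;
}
```

Under these assumptions, 127 with 7 fractional bits is stored as the raw value 16256, which still fits in a signed 16-bit integer.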

Calling the kernel

Finally, we’ll call the kernel routine we built above, passing it references to the tensor objects we just initialized in DDR.

  packageKernel(OUTPUT_PREFIX, FUNC_ARG(gaussianBlurFx16Test), inputArg, outputArg);

  DeviceBufferRef in, out;

  auto device = DeviceManager().getDevice(callKernelGlobals.deviceConfig);
  device.allocateAndCopyToDevice(ddrIn, in);

  device.runKernel(ENTRYPOINT(gaussianBlurFx16Test), in, out);

  device.copyBufferFromDevice(out, ddrOut);

Further Improvements

It’s possible to further optimize the code demonstrated in the tutorial.

For example, by utilizing ping-pong buffers (i.e. starting the fetch of the next RoI from OCM before the current RoI has been computed), the memory fetching overhead can be further reduced since subsequent RoIs are fetched asynchronously.
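The buffer-swapping pattern behind ping-pong buffering can be sketched in plain C++ as below. This is a simplified illustration, not SDK code: fetchRoi and computeOnRoi are hypothetical stand-ins, and the fetch here is an ordinary synchronous call, whereas on hardware it would be issued asynchronously so that it overlaps with the compute step.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Roi = std::vector<int>;

// Hypothetical stand-in for fetching RoI `index` from OCM.
Roi fetchRoi(std::size_t index) {
    return Roi(4, static_cast<int>(index));
}

// Hypothetical stand-in for the per-RoI computation (here, a simple sum).
int computeOnRoi(const Roi& roi) {
    int sum = 0;
    for (int v : roi) sum += v;
    return sum;
}

// Process all RoIs using two buffers: while one buffer is being computed on,
// the other receives the next RoI.
std::vector<int> processAllRois(std::size_t roiCount) {
    std::vector<int> results;
    if (roiCount == 0) return results;

    std::array<Roi, 2> buffers;
    std::size_t current = 0;
    buffers[current] = fetchRoi(0);  // prime the first buffer

    for (std::size_t i = 0; i < roiCount; ++i) {
        const std::size_t next = 1 - current;
        if (i + 1 < roiCount) {
            // In a real ping-pong scheme this fetch would run asynchronously
            // and overlap with the compute on buffers[current] below.
            buffers[next] = fetchRoi(i + 1);
        }
        results.push_back(computeOnRoi(buffers[current]));
        current = next;  // swap roles of the two buffers
    }
    return results;
}
```

The key design point is that only two buffers are ever live, so the extra memory cost is one additional RoI regardless of how many RoIs are processed.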

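We can also consider using int8 as the data type. Each element then takes one byte instead of the two bytes of a FixedPoint16, which halves the amount of data moved per element and results in a more efficient data flow.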

Key concepts

This tutorial demonstrated the foundational concepts of borders and RoIs. By definition, the blur for a given pixel in the input tensor is calculated from the values of its adjacent pixels. Because of that, when iterating through RoIs, the algorithm still needs pixels outside the RoI for the calculation. The BORDER version of the iterator automatically brings in those border pixels. The architecture supports this without allocating additional core memory, since border cores are built into the hardware specifically for this purpose.