Gaussian Blur
========================================
This kernel writing tutorial is split into two parts -- one to build the actual kernel routine, which can be leveraged as a template for many other use cases, and another to initialize demo data and call the kernel.
Source file: *tests/static_data_tests/gaussian_blur_3x3.cpp*
Concepts demonstrated
---------------------------
This tutorial addresses the following concepts:
* Implementing a simple **kernel**.
* Using ``memCpy`` to copy tensor data from DDR to OCM.
* Staging the **weights tensor** for **broadcast** from the OCM to the core array.
* Using the **YX iterator** pattern to "walk" through tiles in an RoI (region of interest).
* Performing a **convolution**.
* Target machine: q8 EPU.
Gaussian blur algorithm
---------------------------
Performing a 3x3 Gaussian blur is a specific case of convolution between a tensor in array memory and a static 3x3 matrix (the weights tensor). The matrix values are constants chosen to produce a Gaussian blur of the original input tensor.
Applying the algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^
Because the full input tensor of 1080h x 1920w (x2 bytes) is too large to fit in core memory, we'll perform the convolution iteratively, on smaller, equally sized *regions of interest* (RoIs), until we've processed and written the entire convolved array to an *output array* allocated in OCM memory.
Iterating through RoIs of an input tensor, a tile at a time (also called a *walk* of the tensor), is supported natively in the :ref:`API `.
The process we'll follow amounts to a **repeatable recipe** that developers can use to perform an iterative convolution for a wide variety of use cases.
Writing the kernel routine
-----------------------------------
The kernel routine performs the convolution within the EPU, copying the output back to OCM one RoI at a time, then copies the result back into DDR memory on the host. Later, in :ref:`Part 2 ` below, we'll see how to initialize data and actually call the kernel from your host code.
Data components
^^^^^^^^^^^^^^^^^^
The tutorial involves three tensors:
.. list-table::
:header-rows: 1
* - Data
- Size
- Description
- Variable
* - Input tensor
- 1080x1920
- A grayscale image.
- ddrIn
* - Output tensor
- 1080x1920
- The Gaussian blur that results from the convolution performed by the kernel.
- ddrOut
* - Weights tensor
- 3x3
- The 'mask' tensor that operates on the input tensor. Created algorithmically in host code and passed by reference to the kernel. Its values determine the values in the output tensor.
- ddrWeights
Overview
^^^^^^^^^^^^^^^^^^^
The steps required to implement the Gaussian blur kernel are summarized below.
The tutorial assumes that the 1080h x 1920w input tensor already exists in DDR.
1. Copy tensor data from DDR to OCM
For this example, we start with a 1080h x 1920w grayscale image in DDR and copy it into OCM after allocating space for the input tensor, the output tensor, and the weights tensor. (The weights tensor we'll create in :ref:`Part 2 ` is also copied to OCM in this step.)
2. Set up the outer loop to flow each RoI into the core.
For this example, we've decided to divide the input tensor in OCM into 30 identically sized (1080h x 64w) RoIs, since the full tensor won't fit in core memory.
3. Broadcast the weights tensor, flow an RoI of the input image into the array, perform the convolution, and send the convolved output back to OCM, one RoI at a time.
These basic steps are applicable to all kinds of convolution-based use cases, and a Gaussian blur is a great example to start with.
Let's look at each step in detail and how it corresponds to the code example.
Copy tensor data to OCM
^^^^^^^^^^^^^^^^^^^^^^^^^
The input image is a ``1080x1920`` grayscale image. To bring data in from DDR to OCM, we'll use the familiar ``memCpy`` :ref:`API `.
Since we can fit an image of size ``1080 * 1920 * 2 bytes = ~4 MB`` in our OCM (On-Chip Memory, ~8 MB available on a q8), we **don't** need to partition the image before copying from DDR.
.. note::
When an image/tensor exceeds the size of the OCM, ``memCpy`` lets you specify coordinates to define smaller regions of the source image to transfer, referred to as :ref:`Regions of Interest (RoI's) `.
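As a quick sanity check, the arithmetic behind that claim can be written down directly (sizes from the text; the 8 MB OCM figure is approximate, and the constant names are illustrative):

```cpp
#include <cstddef>

constexpr std::size_t kHeight       = 1080;
constexpr std::size_t kWidth        = 1920;
constexpr std::size_t kBytesPerElem = 2;  // FixedPoint16
constexpr std::size_t kImageBytes   = kHeight * kWidth * kBytesPerElem;  // 4,147,200 bytes, ~4 MB
constexpr std::size_t kOcmBytes     = 8 * 1024 * 1024;  // ~8 MB OCM on a q8 (approximate)

static_assert(kImageBytes < kOcmBytes,
              "the full image fits in OCM, so no RoI split is needed for the copy");
```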
First, we allocate space in OCM for both the input tensor and the output tensor that will be created by the convolution.
Next we do a straight ``memCpy`` from DDR to OCM for both the input tensor and the weights tensor.
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: Input memCpy
:end-before: Input memCpy
With tensors in OCM, we're ready to bring the input into the core in a loop, one RoI at a time.
Set up the outer loop
^^^^^^^^^^^^^^^^^^^^^^^^^
The outermost loop brings the 1080p image from OCM into core memory, but only one RoI at a time, so we don't exceed the capacity of core memory.
We use the YX iterator pattern, shown in the snippet below, to fetch data from OCM into the core registers (see :ref:`Iterators` for more about patterns supported in the API).
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: Fetching Data
:end-before: Fetching Data
.. note::
We've chosen the YX_BORDER iterator. The BORDER option ensures that when we perform operations on a tile, the borders from the image are included, since a blur requires adjacent values in its calculation. Without the border, elements along the four edges of the input RoI would be missing the nearest-neighbor elements required for the blur.
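To see why, count how many of the 8 neighbors used by the blur fall outside an RoI for pixels on its boundary. This is a standalone arithmetic check, not API code:

```cpp
// Counts how many of a pixel's 8 nearest neighbors fall outside an h x w region.
// Edge (non-corner) pixels are missing 3 neighbors; corner pixels are missing 5.
int missingNeighbors(int y, int x, int h, int w) {
    int missing = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dy == 0 && dx == 0) continue;  // the pixel itself
            int yy = y + dy, xx = x + dx;
            if (yy < 0 || yy >= h || xx < 0 || xx >= w) ++missing;
        }
    return missing;
}
```

The BORDER iterator supplies exactly these missing values from outside the RoI.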
In the example below, we split the 1080p image along the width axis into 30 RoIs:
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: OcmRoi
:end-before: OcmRoi
Allocating an Array
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
On a q8, we can only store data equivalent to the number of array cores, which is 8 * 8 = 64 (one tile), multiplied by the size of the register file, which is 1024, multiplied by the core register width, which is 4 bytes. Because this example uses FixedPoint16, each 4-byte register holds two 2-byte elements, so we can store up to 64 * 1024 * 2 = 131,072 elements.
Recall that to fit data efficiently into an array, we'll read from OCM with RoIs of size 1080h x 64w.
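Plugging in those numbers confirms one RoI fits within the core array's capacity (a sketch; the constants mirror the tutorial's q8 figures, and the names are illustrative):

```cpp
#include <cstddef>

// q8 figures from the text.
constexpr std::size_t kCores       = 8 * 8;  // one tile of array cores
constexpr std::size_t kRegisters   = 1024;   // register file depth per core
constexpr std::size_t kElemsPerReg = 2;      // two 2-byte FixedPoint16 elements per 4-byte register
constexpr std::size_t kMaxElems    = kCores * kRegisters * kElemsPerReg;  // 131,072

constexpr std::size_t kRoiHeight = 1080;
constexpr std::size_t kRoiWidth  = 1920 / 30;               // 30 RoIs of 64 columns each
constexpr std::size_t kRoiElems  = kRoiHeight * kRoiWidth;  // 69,120

static_assert(kMaxElems == 131072, "capacity arithmetic from the text");
static_assert(kRoiElems <= kMaxElems, "one 1080h x 64w RoI fits in the core array");
```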
.. image:: ../../static/img/RoI_1920x1080_30_v3.svg
:alt: tensor
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: OcmRoi
:end-before: OcmRoi
On a q8, this will allow us to store 1,620 tiles in a ``qVar_t>`` array as denoted in the snippet below:
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: Allocate Array
:end-before: Allocate Array
.. note::
The ``NUM_TILES`` attribute will automatically calculate the number of tiles to allocate based on a Tensor shape.
Performing Iteration and Gaussian Blur
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Within the iteration, we perform these tasks:
#. Stage the weights data for broadcast.
#. Make the convolution API call, which also automatically "pops" the previously staged weights matrix off the broadcast bus to apply the blur to each element in the core Array.
.. image:: ../../static/img/gaussian_RoI_walk_v1_r1.svg
:alt: tensor
Finally, the iteration over our RoIs and the blur itself can be seen in the snippet below:
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: Algo
:end-before: Algo
The iteration continues until all tiles in the RoI have been exhausted.
Note that we wrap up the kernel routine by copying the new tensor back from OCM to DDR, then freeing up the OCM memory where the input and output tensors were stored.
Initializing and calling the kernel
----------------------------------------
The kernel we built executes on the EPU, but some additional setup code is required on the host to initialize data and actually run it.
Overview
^^^^^^^^^^
The three steps required to create the input tensors and call the kernel are summarized below:
#. Allocate memory in DDR for both the input and output tensors
#. Build and "pack" the weights tensor
#. Call the kernel routine (passing pointers to the tensors we just built)
Allocate tensor memory in DDR
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To set up before calling the kernel, we'll first initialize the input tensor and allocate space for the output tensor. Note that we'll pass pointers to these locations into the kernel routine in Step 3, so it can both copy the inputs into OCM and copy the output tensor back to DDR.
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: Main
:end-before: Main
We also call the function that initializes the weights tensor here:
.. code-block:: c++
generateWeightTensor(ddrWeights); //initialize 3x3 static tensor
Build and "pack" the weights tensor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The weights tensor is a constant 3x3 array. Gaussian weights for a 3x3 tensor are shown below. The kernel convolves the input data with these weights to produce each element of the output.
.. image:: ../../static/img/gaussian_weights.png
:alt: Gaussian Weights
Although the two-dimensional tensor format shown might be the best way to visualize the 3x3 matrix, it is actually passed into API methods as a linear array, so we'll "pack" it by mapping each data element from its two-dimensional position into its corresponding one-dimensional "packed" position.
We use a scheme that allows you to place the weights into a linear tensor of shape: batch = 1, channel = 1, height = 1, width = 12.
If the 2D array pattern is numbered like this (position 4 is the center weight of the 3x3 kernel)...
.. code-block::
0 1 2
3 4 5
6 7 8
The positions above are mapped to fit the linear ``DdrTensor, 1, 1, 1, 12>``, like this:
.. code-block::
[4, x, x, x, 1, 7, 5, 3, 2, 6, 8, 0]
The ``'x'`` elements shown above represent 0's and are ignored by the convolution operation.
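The mapping just described can be sketched as a small host-side helper. The index table mirrors the layout shown above; the function name is illustrative, not an API call:

```cpp
#include <array>

// Packs a row-major 3x3 weights array into the 12-element linear layout:
// [center, 0, 0, 0, N, S, E, W, NE, SW, SE, NW]
// 2D positions (row-major):  0 1 2
//                            3 4 5
//                            6 7 8
std::array<float, 12> packWeights3x3(const std::array<float, 9>& w) {
    // Source 2D index for each packed slot; -1 marks the padded zero slots.
    constexpr int kMap[12] = {4, -1, -1, -1, 1, 7, 5, 3, 2, 6, 8, 0};
    std::array<float, 12> packed{};
    for (int i = 0; i < 12; ++i)
        packed[i] = (kMap[i] >= 0) ? w[kMap[i]] : 0.0f;
    return packed;
}
```

Packing the test pattern ``{0, 1, ..., 8}`` reproduces the linear layout shown above, with zeros in the ``x`` slots.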
You can see this operation performed in the code snippet below, which is run on the host machine prior to kernel execution:
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: Pack Weights
:end-before: Pack Weights
Notice:
* the center position in the 2D array is packed into the 0th position by default.
* the 8 other values are mapped in a NSEW sequence from 2D into the linear array starting at index 4.
Note also that it's convenient to map the data algorithmically because the symmetry of the 2D Gaussian along the YX and diagonal axes makes the values fall into discrete index ranges in the linear packing pattern. We could just as easily build the linear weights array explicitly.
Calling the kernel
^^^^^^^^^^^^^^^^^^^^
Finally, we'll call the kernel routine we built in Part 1 above, passing it references to the three tensor objects we just initialized in DDR.
.. literalinclude:: ../../../tests/static_data_tests/gaussian_blur_3x3.cpp
:start-after: Main
:end-before: Main
Further Improvements
--------------------
It's possible to further optimize the code demonstrated in the tutorial.
For example, by utilizing ping-pong buffers (i.e. starting the fetch of the next RoI from OCM before the current RoI has been computed), the memory fetching overhead can be further reduced since subsequent RoIs are fetched asynchronously.
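The ping-pong idea can be sketched with two buffers that alternate roles. In this standalone sketch, `startFetch` and `computeRoi` are synchronous stand-ins, not EPU API calls; on real hardware, the fetch would be kicked off asynchronously and proceed while the previously fetched buffer is being computed:

```cpp
#include <vector>

using Buffer = std::vector<int>;

// Stand-in for fetching one RoI from OCM (hypothetical; synchronous here).
Buffer startFetch(int roi) { return Buffer(4, roi); }

// Stand-in for the convolution on one fetched RoI (hypothetical).
int computeRoi(const Buffer& b) { return b[0]; }

int processAll(int numRois) {
    int checksum = 0;
    Buffer buffers[2];                 // the two "ping-pong" buffers
    buffers[0] = startFetch(0);        // prime the pipeline with RoI 0
    for (int roi = 0; roi < numRois; ++roi) {
        Buffer& current = buffers[roi % 2];
        if (roi + 1 < numRois)         // begin fetching the next RoI into
            buffers[(roi + 1) % 2] = startFetch(roi + 1);  // the other buffer
        checksum += computeRoi(current);  // on HW, compute overlaps the fetch
    }
    return checksum;
}
```

Because the compute step never touches the buffer being filled, the fetch latency of RoI *n+1* hides behind the computation of RoI *n*.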
We can also consider using int8 as a data type, which results in a more efficient data flow.
Key concepts
-------------
This tutorial demonstrated several foundational concepts that you'll use repeatedly to implement kernel routines and process data in the core array.
* **Broadcast requirements and conventions:** The broadcast function is used to transfer constant data that is common to every core calculation, a tile at a time.
It's important to stage data before each call to any API method whose function includes "popping" data off the broadcast bus. That's why the example repeatedly stages the weights tensor data within the inner loop. After running an operation on the core Array, the constant data must be staged again.
* **Borders:** By definition, the blur is calculated for a given pixel in the input tensor by using the values of adjacent pixels. Because of that, when iterating through RoIs, the algorithm still needs pixels outside of the RoI for the calculation. Using the BORDER version of the iterator automatically brings in those border pixels for the calculation. The architecture supports this without the need to allocate additional core memory, since there are border cores built into the hardware specifically for this purpose.
* **Weights:** Weights in the mask tensor are represented by the FixedPoint type with a scale of 2^7 (seven fractional bits). We broadcast this tensor to the current tile, since each core in the tile uses it to perform the convolution.
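As an illustration of that 2^7 scale, each real-valued weight w is stored as round(w * 128). Assuming the common binomial 3x3 Gaussian weights (1/16, 2/16, 4/16), they encode as 8, 16, and 32; the helper names below are illustrative, not part of the API:

```cpp
#include <cmath>
#include <cstdint>

// Encodes a real-valued weight as fixed point with 7 fractional bits (scale 2^7 = 128).
std::int16_t toFixedPoint7(double v) {
    return static_cast<std::int16_t>(std::lround(v * 128.0));
}

// Decodes a fixed-point value back to a real number.
double fromFixedPoint7(std::int16_t f) { return f / 128.0; }
```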