Creating a Simple Kernel

This guide shows how to create a simple kernel that flows a tensor from DDR memory into the EPU array via the on-chip memory (OCM), performs an operation on the tensor, and then flows the result back out to DDR memory.
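
To make the data path concrete before touching the SDK, the sketch below models the three memory levels as plain C++ buffers. It is only an illustration under stated assumptions: Buffer, copyLevel, and runFlowModel are hypothetical names, not QIL types or functions, and the real transfers are performed by the SDK calls introduced in the steps of this guide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the memory levels; these are NOT QIL types.
constexpr std::size_t kElems = 8 * 8;  // matches the 8x8 tensor used in this guide
using Buffer = std::array<std::int32_t, kElems>;

// Models one transfer between two memory levels (for example DDR -> OCM).
inline void copyLevel(const Buffer& src, Buffer& dst) {
  assert(&src != &dst);  // a transfer always moves data between distinct buffers
  dst = src;
}

// Models the whole kernel: DDR -> OCM -> array, add one, array -> OCM -> DDR.
inline Buffer runFlowModel(const Buffer& ddrIn) {
  Buffer ocmIn{}, arrayData{}, ocmOut{}, ddrOut{};
  copyLevel(ddrIn, ocmIn);       // inbound copy from DDR to OCM
  copyLevel(ocmIn, arrayData);   // flow from OCM into the array
  for (auto& value : arrayData) {
    value += 1;                  // the operation performed on the array
  }
  copyLevel(arrayData, ocmOut);  // flow from the array back to OCM
  copyLevel(ocmOut, ddrOut);     // outbound copy from OCM to DDR
  return ddrOut;
}
```

Calling runFlowModel on a buffer holding 0 through 63 returns a buffer holding 1 through 64. The model ignores tiling entirely; it only shows the order of the transfers.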

Requirements

  1. quadric SDK

Getting Started

Follow the steps below to create a simple kernel:

  1. Add the appropriate QIL includes to the top of your .cpp file.

    // Include the Quadric Intermediate Language (QIL) headers.
    #include <quadric/host.h>
    #include <quadric/qil.h>
    
  2. Define the shapes of the tensors you wish to flow into the EPU:

    /*
     * Define the shapes of the tensors. Note that there is a DdrTensor shape
     * and also an OcmTensor shape. The types DdrInOutShape and OcmInOutShape
     * contain the attributes of the data in DDR and OCM respectively.
     */
    
    typedef DdrTensor<std::int32_t, 1, 1, 8, 8> DdrInOutShape;
    typedef OcmTensor<std::int32_t, 1, 1, 8, 8> OcmInOutShape;
    
  3. Define a kernel function with a return type of void that takes the DDR tensors as pointer arguments:

    EPU_ENTRY void myKernel(DdrInOutShape::ptrType ddrInpPtr, DdrInOutShape::ptrType ddrOutPtr) {
    

    Note

    The EPU_ENTRY attribute specifies that the function runs on the EPU.

  4. Because the DDR tensors are passed in as pointers, recreate the tensor objects from them as shown below:

      DdrInOutShape ddrInp(ddrInpPtr);
      DdrInOutShape ddrOut(ddrOutPtr);
    
  5. Create OCM tensors and allocate them:

      OcmInOutShape ocmInp;
      OcmInOutShape ocmOut;
      // Create an instance of the On Chip Memory (OCM) Memory Allocator
      MemAllocator ocmMem;
      ocmMem.allocate(ocmInp);
      ocmMem.allocate(ocmOut);
    
  6. Copy the data from DDR to OCM using the memCpy API:

      memCpy(ddrInp, ocmInp);
    
  7. Create a qVar_t array to store the tiles, then flow the data from the OCM into the EPU array using the fetchAllTiles API:

      qVar_t<std::int32_t> input[OcmInOutShape::NUM_TILES];
      fetchAllTiles(ocmInp, input);
    
  8. Perform an operation on each tile of data. The example below adds 1 to every element of each tile:

      for(std::int32_t tileNum = 0; tileNum < OcmInOutShape::NUM_TILES; tileNum++) {
        // debugPrint(input[tileNum], "Data Before Addition"); // Uncomment to view debug data (Archsim only)
        input[tileNum] += 1;
        // debugPrint(input[tileNum], "Data After Addition"); // Uncomment to view debug data (Archsim only)
      }
    
  9. Flow the data out from the array to the OCM using the writeAllTiles API:

      writeAllTiles(input, ocmOut);
    
  10. Finally, write the data to DDR:

      memCpy(ocmOut, ddrOut);
    
  11. Perform the host-side setup to launch the kernel with data:

    // Perform actions on the host computer
    HOST_MAIN(
      // Create DDR Tensors on the host computer, one for input, one for output
      DdrInOutShape ddrInPtr; DdrInOutShape ddrOutPtr;
      // Allocate tensors on the host computer.
      DdrInOutShape::allocate(ddrInPtr);
      DdrInOutShape::allocate(ddrOutPtr);
    
      // Populate the Tensor sequentially.
      populateTensorSequential(ddrInPtr);
      // Wrap the Tensor in a TensorArg meant to pass to the kernel.
      TensorArg<DdrInOutShape> inputArg{&ddrInPtr};
      TensorArg<DdrInOutShape> outputArg{&ddrOutPtr};
      // Launch the kernel.
      callKernel(OUTPUT_PREFIX, ENTRYPOINT(myKernel), inputArg, outputArg););
    

    Note the usage of HOST_MAIN. The code defined inside it runs only on the host computer (i.e., the development machine).

  12. Launch the kernel using the quadric SDK:

    $ docker run -w /ws -v `pwd`:/ws -it  quadric.io/graphsim:0.8.8 source simple_flow.cpp
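
Once the kernel has run, the output tensor should hold the input values plus one. The sketch below shows one way to express that check in plain C++. It is not part of the quadric SDK: makeSequential and isInputPlusOne are hypothetical helpers, and the fill pattern (0, 1, 2, ...) is only an assumption about what populateTensorSequential produces. Reading values back out of a DdrTensor is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helpers; neither function is part of the quadric SDK.

// Assumed fill pattern: element i holds the value i (0, 1, 2, ...).
inline std::vector<std::int32_t> makeSequential(std::size_t count) {
  std::vector<std::int32_t> data(count);
  for (std::size_t i = 0; i < count; ++i) {
    data[i] = static_cast<std::int32_t>(i);
  }
  return data;
}

// Returns true when every output element equals the matching input element plus one.
inline bool isInputPlusOne(const std::vector<std::int32_t>& input,
                           const std::vector<std::int32_t>& output) {
  assert(input.size() == output.size());  // both tensors share one shape
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (output[i] != input[i] + 1) {
      return false;
    }
  }
  return true;
}
```

For the 8x8 tensor in this guide, the reference input would be makeSequential(64), and isInputPlusOne would be called with that input and the values copied out of the output tensor.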
    

A complete example of the kernel described in this tutorial can be seen below:

/*
 * QUADRIC.IO CONFIDENTIAL
 * __________________
 *
 * [2020] quadric.io Incorporated
 * All Rights Reserved.
 *
 * NOTICE: All information contained herein is, and remains
 * the property of quadric.io Incorporated and its suppliers,
 * if any. The intellectual and technical concepts contained
 * herein are proprietary to quadric.io Incorporated
 * and its suppliers and may be covered by U.S. and Foreign Patents,
 * patents in process, and are protected by trade secret or copyright law.
 * Dissemination of this information or reproduction of this material
 * is strictly forbidden unless prior written permission is obtained
 * from quadric.io Incorporated.
 */

//! [Adding QIL Header]
// Include the Quadric Intermediate Language (QIL) headers.
#include <quadric/host.h>
#include <quadric/qil.h>
//! [Adding QIL Header]

//! [Adding Tensor Defs]
/*
 * Define the shapes of the tensors. Note that there is a DdrTensor shape
 * and also an OcmTensor shape. The types DdrInOutShape and OcmInOutShape
 * contain the attributes of the data in DDR and OCM respectively.
 */

typedef DdrTensor<std::int32_t, 1, 1, 8, 8> DdrInOutShape;
typedef OcmTensor<std::int32_t, 1, 1, 8, 8> OcmInOutShape;
//! [Adding Tensor Defs]

//! [Adding Kernel Defs]
EPU_ENTRY void myKernel(DdrInOutShape::ptrType ddrInpPtr, DdrInOutShape::ptrType ddrOutPtr) {
  //! [Adding Kernel Defs]

  //! [Recreate Tensor Obj]
  DdrInOutShape ddrInp(ddrInpPtr);
  DdrInOutShape ddrOut(ddrOutPtr);
  //! [Recreate Tensor Obj]

  //! [Create OcmTensors]
  OcmInOutShape ocmInp;
  OcmInOutShape ocmOut;
  // Create an instance of the On Chip Memory (OCM) Memory Allocator
  MemAllocator ocmMem;
  ocmMem.allocate(ocmInp);
  ocmMem.allocate(ocmOut);
  //! [Create OcmTensors]

  //! [Do inbound Memcpy]
  memCpy(ddrInp, ocmInp);
  //! [Do inbound Memcpy]

  //! [Do alloc and fetch]
  qVar_t<std::int32_t> input[OcmInOutShape::NUM_TILES];
  fetchAllTiles(ocmInp, input);
  //! [Do alloc and fetch]

  //! [Add one]
  for(std::int32_t tileNum = 0; tileNum < OcmInOutShape::NUM_TILES; tileNum++) {
    // debugPrint(input[tileNum], "Data Before Addition"); // Uncomment to view debug data (Archsim only)
    input[tileNum] += 1;
    // debugPrint(input[tileNum], "Data After Addition"); // Uncomment to view debug data (Archsim only)
  }
  //! [Add one]

  //! [Flow out data]
  writeAllTiles(input, ocmOut);
  //! [Flow out data]

  //! [Write to DDR]
  memCpy(ocmOut, ddrOut);
  //! [Write to DDR]
}

//! [Perform host setup]
// Perform actions on the host computer
HOST_MAIN(
  // Create DDR Tensors on the host computer, one for input, one for output
  DdrInOutShape ddrInPtr; DdrInOutShape ddrOutPtr;
  // Allocate tensors on the host computer.
  DdrInOutShape::allocate(ddrInPtr);
  DdrInOutShape::allocate(ddrOutPtr);

  // Populate the Tensor sequentially.
  populateTensorSequential(ddrInPtr);
  // Wrap the Tensor in a TensorArg meant to pass to the kernel.
  TensorArg<DdrInOutShape> inputArg{&ddrInPtr};
  TensorArg<DdrInOutShape> outputArg{&ddrOutPtr};
  // Launch the kernel.
  callKernel(OUTPUT_PREFIX, ENTRYPOINT(myKernel), inputArg, outputArg););

//! [Perform host setup]