Data Access Patterns

In a deployed EPU-host solution, your code has access to three distinct memory segments:

  • Array (Core local memory on the EPU)

  • OCM (on-chip memory on the EPU)

  • DDR (RAM in the host system)

OCM is intermediary between DDR and array memory, so you can transfer data between DDR and OCM, and between OCM and the core array, but not directly between DDR and the core array.
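
The routing rule above can be sketched on the host in plain C++. This is only an illustrative model: `Segment` and `transferPath` are hypothetical names, not part of the SDK.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of the three memory segments (not SDK types).
enum class Segment { Ddr, Ocm, Array };

// Returns every segment a transfer touches, endpoints included.
inline std::vector<Segment> transferPath(Segment from, Segment to) {
  std::vector<Segment> path{from};
  // DDR and the core array are never adjacent, so route through OCM.
  const bool needsOcmHop = from != to && from != Segment::Ocm && to != Segment::Ocm;
  if(needsOcmHop) {
    path.push_back(Segment::Ocm);
  }
  if(from != to) {
    path.push_back(to);
  }
  assert(path.front() == from && path.back() == to);
  return path;
}
```

A DDR-to-array transfer therefore yields three entries (DDR, OCM, array), while an OCM-to-array transfer yields two.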

Each of the three memory segments has specific access rules and usage restrictions. That includes the distinct data flow pattern(s) appropriate to use with each. The SDK has APIs specific to particular segments that not only support read/write operations, but help ensure the rules associated with each type are observed.

Data Movement

The diagram below depicts a high level layout of the EPU and the data access patterns available to access data from the subsystem:

[Figure: data access]
  • Flows: Data is fetched sequentially. Optionally, you can specify an access pattern at compile time.

  • Random Access (RAU): Data is accessed at specified addresses. Only applies to transferring between OCM and the Core Array.

  • Broadcast: OCM data flows to every core in the array, i.e. each core gets its own copy of the data. Only applies to fetching data from OCM into the Core Array.

Data Sharing Between Cores

It is often helpful to pass data between individual cores in the array. The SDK provides five APIs for this: qAllNeighbors<>, qNorth<>, qSouth<>, qEast<> and qWest<>. qAllNeighbors<> copies the source register of each core to the ports of all of its neighbors (north, south, east, and west of the core). The other four APIs copy in one direction only (e.g. qNorth<> copies the source register to the north neighbor’s port).
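
To make the neighbor exchange concrete, here is a minimal host-side sketch in standard C++ of what a "publish to all neighbors, then sum the four ports" step computes. It does not use the SDK: `Grid`, `valueAt`, `sumOfNeighbors` and the 4x4 size are hypothetical, and treating positions outside the array as 0 is an assumption of this sketch, not a statement about EPU edge behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDim = 4;  // hypothetical small array, for illustration only
using Grid = std::array<std::array<std::int32_t, kDim>, kDim>;

// Reads src at (r, c). Positions outside the array read as 0 (assumption of this sketch).
inline std::int32_t valueAt(const Grid& src, std::ptrdiff_t r, std::ptrdiff_t c) {
  const auto n = static_cast<std::ptrdiff_t>(kDim);
  if(r < 0 || c < 0 || r >= n || c >= n) {
    return 0;
  }
  return src[static_cast<std::size_t>(r)][static_cast<std::size_t>(c)];
}

// Each cell receives the sum of the values held by its north, south, west and east neighbors.
inline Grid sumOfNeighbors(const Grid& src) {
  Grid out{};
  const auto n = static_cast<std::ptrdiff_t>(kDim);
  for(std::ptrdiff_t r = 0; r < n; ++r) {
    for(std::ptrdiff_t c = 0; c < n; ++c) {
      out[static_cast<std::size_t>(r)][static_cast<std::size_t>(c)] =
        valueAt(src, r - 1, c) + valueAt(src, r + 1, c) + valueAt(src, r, c - 1) + valueAt(src, r, c + 1);
    }
  }
  return out;
}
```

With every cell set to 1, interior cells end up with 4, edge cells with 3, and corner cells with 2.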

Data movement DDR <-> OCM

Data is copied between OCM and DDR memory using the memCpy API, which also provides an optional mechanism for partitioning (tensor) data from DDR when its size exceeds the capacity of OCM.

Copy DDR to OCM

Below, see an example of using memCpy to copy from DDR to OCM:

  MemAllocator                          ocmMem;
  DdrTensor<std::int32_t, 1, 2, 12, 15> ddrTensor1;
  OcmTensor<std::int32_t, 1, 2, 12, 15> ocmTensor1;
  ocmMem.allocate(ocmTensor1);  // Allocate space for the OCM tensor.

  // copy data from DDR Tensor to OCM
  memCpy(ddrTensor1,   // Source:      DDR tensor
         ocmTensor1);  // Destination: OCM tensor

Handling tensors larger than OCM capacity

If the DDR tensor data exceeds the OCM size, you can partition it into smaller regions before transferring to the OCM. The memCpy API provides optional region of interest (ROI) offset parameters which specify the starting offsets in the DDR tensor.

Below, see an example of using memCpy with ROI offsets when the OCM tensor shape is smaller than the DDR tensor:

  MemAllocator                          ocmMem2;
  DdrTensor<std::int32_t, 1, 2, 25, 63> ddrTensor2;
  OcmTensor<std::int32_t, 1, 2, 12, 63> ocmTensor2;
  ocmMem2.allocate(ocmTensor2);  // Allocate space for the OCM tensor.

  // copy data from DDR (starting at an offset) to OCM
  memCpy(ddrTensor2,  // Source:      DDR tensor
         ocmTensor2,  // Destination: OCM tensor
         0,           // DDR batch offset
         0,           // DDR channel offset
         4,           // DDR height offset
         0);          // DDR width offset.

[Figure: memCpy ROI]

Warning

In the current codebase, memCpy requires the OCM width to be the same as DDR width.
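
The effect of the ROI offsets can be modeled on the host with ordinary C++. The sketch below is not the SDK implementation: `Tensor4` and `copyRoi` are hypothetical, and the row-major batch, channel, height, width layout is an assumption made for illustration. It copies the destination-shaped window of the source that starts at the given offsets, and asserts the width restriction from the warning above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side tensor stored row-major as batch, channel, height, width.
struct Tensor4 {
  std::size_t batches, channels, height, width;
  std::vector<std::int32_t> data;

  Tensor4(std::size_t b, std::size_t c, std::size_t h, std::size_t w)
    : batches(b), channels(c), height(h), width(w), data(b * c * h * w, 0) {}

  std::size_t index(std::size_t b, std::size_t c, std::size_t y, std::size_t x) const {
    return ((b * channels + c) * height + y) * width + x;
  }
};

// Copies the dst-shaped region of src that starts at the given offsets.
inline void copyRoi(const Tensor4& src, Tensor4& dst, std::size_t bOff, std::size_t cOff, std::size_t hOff,
                    std::size_t wOff) {
  assert(dst.width == src.width && wOff == 0);  // width restriction from the warning above
  assert(bOff + dst.batches <= src.batches && cOff + dst.channels <= src.channels);
  assert(hOff + dst.height <= src.height);
  for(std::size_t b = 0; b < dst.batches; ++b) {
    for(std::size_t c = 0; c < dst.channels; ++c) {
      for(std::size_t y = 0; y < dst.height; ++y) {
        for(std::size_t x = 0; x < dst.width; ++x) {
          dst.data[dst.index(b, c, y, x)] = src.data[src.index(b + bOff, c + cOff, y + hOff, x + wOff)];
        }
      }
    }
  }
}
```

With a 1x2x25x63 source, a 1x2x12x63 destination and a height offset of 4, destination row 0 holds source row 4 and destination row 11 holds source row 15.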

Copy OCM to DDR

Below, see an example of using memCpy to copy from OCM to DDR:

  memCpy(ocmTensor1,   // Source:      OCM tensor
         ddrTensor1);  // Destination: DDR tensor

Data movement OCM <-> Array

There are three distinct approaches to transferring data between the core array and OCM:

  • Flows: Using Fetch and Write APIs

  • Random Access: Using the RAU API

  • Broadcast: Using the Broadcast API

Flow

Fetch and write flow patterns are used to move data consecutively between OCM and the core array. The data flows into the array one tile at a time.

Note

A tile is the set of data elements that fills the core array exactly once, one element per core. For a q8 (an 8x8 array of cores), a tile consists of 64 elements.
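
As a rough illustration of the tile definition, the number of tiles needed to cover a tensor can be estimated as below. `tilesAlong` and `tileCount` are hypothetical helpers, and rounding partial tiles up to whole tiles is an assumption of this sketch; the SDK's own NUM_TILES attribute may be computed differently.

```cpp
#include <cassert>
#include <cstddef>

// Tiles needed to cover `extent` elements with tiles that are `tileDim` elements wide.
// Partial tiles are counted as whole tiles (assumption of this sketch).
constexpr std::size_t tilesAlong(std::size_t extent, std::size_t tileDim) {
  return (extent + tileDim - 1) / tileDim;
}

// Tile count for a batch x channel x height x width tensor on a coreDim x coreDim array.
constexpr std::size_t tileCount(std::size_t batches, std::size_t channels, std::size_t height, std::size_t width,
                                std::size_t coreDim) {
  return batches * channels * tilesAlong(height, coreDim) * tilesAlong(width, coreDim);
}

static_assert(tileCount(1, 3, 8, 8, 8) == 3, "a 1x3x8x8 tensor covers three q8 tiles");
```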

Fetch API

The Fetch API flows data into the array from the OCM.

The following example flows a 1x3x8x8 tensor into the array using the fetchAllTiles API:

  constexpr std::uint32_t WIDTH  = 8;
  constexpr std::uint32_t HEIGHT = 8;
  constexpr std::uint32_t DEPTH  = 3;

  // Defining the Tensor shape as a typedef avoids being verbose when the shape information is required later in the
  // code.
  typedef OcmTensor<std::int32_t, 1, DEPTH, HEIGHT, WIDTH> OcmTensorShape3;
  MemAllocator                                             ocmMem3;
  OcmTensorShape3                                          ocmTensor3;

  // Allocate local memory on the array. The tensor typedef provides helper enums such as NUM_TILES.
  qVar_t<std::int32_t> qData[OcmTensorShape3::NUM_TILES];

  // Fetch all the data from OCM into the Array
  fetchAllTiles(ocmTensor3, qData);

The order in which tiles flow into and out of the array is controlled by the iterator pattern specified in the fetch and write APIs.

Several iterators are available, each walking a tensor in a dimension-specific way. Iterators are currently denoted by their major and minor dimensional “walk” of the tensor, with dimensions labeled X, Y, and Z:

  • X represents width

  • Y represents height

  • Z represents depth

[Figure: tile flow]

The iterator pattern can be specified to fetchAllTiles as a template parameter:

  // flows in data in a YX pattern as shown in the diagram above
  fetchAllTiles<IteratorType::YX_NO_BORDER>(ocmTensor3, qData);

The supported iterator types are as follows:

enum class IteratorType { XY_NO_BORDER = 0, YX_NO_BORDER, YX_BORDER, ZY_BORDER, ZY_NO_BORDER, ROW, ROW_ZY, LINE };

As with the memCpy API, an ROI can be used to fetch from within an OCM tensor, using the fetchTilesInRoi API:

  typedef OcmTensor<std::int32_t, 1, 2, 25, 63>                       OcmTensorShape4;
  typedef OcmTensor<typename OcmTensorShape4::elemType, 1, 2, 12, 23> RoiShape;
  OcmTensorShape4                                                     ocmTensor4;
  // the local memory is allocated to the size of RoiShape instead of ocmTensor4
  qVar_t<std::int32_t> qRoiData[RoiShape::NUM_TILES];

  // Flow in data from ocmTensor4 starting at row offset 5 and column offset 16
  fetchTilesInRoi<RoiShape, IteratorType::XY_NO_BORDER>(ocmTensor4, qRoiData, 0, 0, 5, 16);

Warning

The width offset in the tensor must be a multiple of the number of columns in the array (e.g. 16 in the case of Q16).
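
A quick way to reason about this restriction is a pair of compile-time helpers like the ones sketched below. Both names are hypothetical and nothing here calls the SDK; the sketch only shows the arithmetic behind "multiple of the number of columns".

```cpp
#include <cassert>
#include <cstddef>

// True when widthOffset is a multiple of the number of array columns.
constexpr bool isAlignedWidthOffset(std::size_t widthOffset, std::size_t arrayCols) {
  return widthOffset % arrayCols == 0;
}

// Largest aligned offset that does not exceed widthOffset.
constexpr std::size_t alignWidthOffsetDown(std::size_t widthOffset, std::size_t arrayCols) {
  return widthOffset - widthOffset % arrayCols;
}

static_assert(isAlignedWidthOffset(16, 16), "16 is a valid width offset on a 16-column array");
static_assert(!isAlignedWidthOffset(5, 16), "5 is not a valid width offset on a 16-column array");
```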

Write API

You can transfer data from the core array to OCM, or to a specified ROI of the OCM.

Similar to the fetch APIs, the write functions flow data from the array to OCM. The following example writes data to an ROI of the OCM tensor:

  writeTilesInRoi<RoiShape, IteratorType::XY_NO_BORDER>(qRoiData, ocmTensor4, 0, 0, 5, 16);

Warning

The width offset in the tensor must be a multiple of the number of columns in the array (e.g. 16 in the case of Q16).

Iterators

Iterators are used to bring tiles into core memory for processing and to write the output back to OCM, while providing the developer with extensive control over the order and orientation in which tiles are staged and processed. This page discusses the fundamentals, underlying conventions, and use of iterators – the API functions that traverse (“walk”) sequentially through each tile in a tensor, in a specific order determined by the function parameters.

The API includes several iterator methods that each iterate over tensors in a dimension-specific sequence. Iterators operate on one tile of data per iteration.

Because they are fundamental to tensor processing, it’s important to understand the conventions, terminology, and program flow underlying iterator use, regardless of the particular algorithm you’re implementing.

Tensor conventions, directions, and labels

Some important underlying concepts should be established before we discuss iterators in detail – particularly notation, the coordinate system, and some precise terminology.

Orientation and direction

Tensors and iterators are described using the coordinate system below. Dimensions are denoted in X, Y, Z, where…

  • X represents width, numbered from left to right.

  • Y represents height, numbered from top to bottom.

  • Z represents depth (or channel).

tensor

Note that you might see the same axis labeled differently in code or documentation depending on the context. The difference is typically only semantic. Here are some common variations:

Dimension   EPU Hardware             Typical naming
X           core array east-west     width or ROW
Y           core array north-south   height or COL
Z           core register            depth or channel

Note

By convention, tensors in the API are described by denoting their dimensions in the format height x width (y * x)

Tiles

A tile is the amount of data that fills every core in the array exactly once. For instance, a q8 EPU tile is an 8x8 array of 64 data elements, and a q16 EPU tile is a 16x16 array of 256 data elements.

[Figure: tensor]

Regions of Interest

After a large input tensor is copied from DDR to OCM, we typically want to flow smaller subsections of it, one at a time, into core memory. This means subdividing the input tensor into fixed-size regions of interest (RoI’s). The ability to define, manage, and manipulate RoI’s (small regions of larger tensors) is supported directly by functions in the API, by design. You can define an RoI to be almost any size or shape and copy it into the core array, as long as it does not exceed the capacity of the core registers in the EPU.

RoI Fundamentals
  • What is it? An RoI is a subsection of an input tensor in OCM.

  • Why is it used? To subdivide input data into smaller tensors that can be copied into core memory.

  • What size are they? An RoI can be defined as almost any size or shape, provided:

    • It doesn’t exceed core memory capacity.

    • It doesn’t exceed the dimensions of the input tensor.
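
The two conditions in the list above can be sketched as simple checks. This is a hedged illustration, not SDK code: `Shape2`, `roiFitsInTensor` and `roiFitsInCoreMemory` are hypothetical, the 4096-byte figure is the per-core localMem size quoted elsewhere in this document, and the assumption that each tile of the RoI occupies one element-sized slot per core is ours.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2D shape (one channel plane of a tensor or RoI).
struct Shape2 {
  std::size_t height;
  std::size_t width;
};

constexpr std::size_t kLocalMemBytes = 4096;  // per-core localMem size quoted in this document

// The RoI, placed at (rowOff, colOff), must stay inside the input tensor.
constexpr bool roiFitsInTensor(Shape2 tensor, Shape2 roi, std::size_t rowOff, std::size_t colOff) {
  return rowOff + roi.height <= tensor.height && colOff + roi.width <= tensor.width;
}

// Assumes every tile of the RoI needs one element-sized slot in each core's local memory.
constexpr bool roiFitsInCoreMemory(Shape2 roi, std::size_t channels, std::size_t elemBytes, std::size_t coreDim) {
  return channels * ((roi.height + coreDim - 1) / coreDim) * ((roi.width + coreDim - 1) / coreDim) * elemBytes <=
         kLocalMemBytes;
}
```

For example, a 12x23 RoI at row offset 5 and column offset 16 fits inside a 25x63 tensor, but the same RoI at row offset 14 does not.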

Below is a sample 64x64 tensor subdivided into 8 RoI’s of 64x8 elements. The grid’s resolution is 8x8 elements, the definitive shape of a q8 tile.

[Figure: tensor]

The Big Picture – Tensors, RoI’s and Tiles

To confirm your understanding of the previous sections, let’s pull all the components of the organizational hierarchy together.

The diagram below shows the hierarchy of tensors and other data components, such as RoI’s and tiles, that are fundamental to the API.

Glossary

Tensor: The raw, full-size input tensor, typically initialized in DDR and copied to OCM where it’s further subdivided into RoIs for processing in the EPU. Example: An image used to perform a blur, edge detection, or other similar imaging algorithm.

RoI: Region of Interest. A tensor subsection of the raw input tensor, defined by the developer. Size and shape are user-defined based on the use case and available capacity in core registers. An RoI can be dynamically defined and is really just another tensor extracted from a larger tensor dataset.

tile: An n x n slice of data processed within the array. Tile size is static, and determined by the dimensions of the core array in the target hardware. A q8 core array defines a tile as 8x8 elements. A q16 core array defines a tile as 16x16 elements.

element: A single unit of data. Following the height x width convention noted above, a tensor described as 1024x256 is 1024 elements along the Y axis and 256 elements along the X axis, for a total of 262,144 elements.

Before continuing to the next sections, it’s important to understand the material in all the sections above. RoI’s, tiles, xyz coordinates are fundamental to understanding how to use iterators.

Understanding iterators

A good grasp of the conventions used to describe tensors, RoI’s, and tiles is an important foundation for understanding how iterators traverse (walk) data, and how data flows within the core.

The end-to-end flow of tensor data from OCM into the array and back, via RoI’s, typically looks like this:

tensor

The numbered processes above perform these tasks:

  1. Tensor data (often just an RoI of a larger tensor) is flowed into the core registers using an iterator.

  2. An iterator walks the tensor, loading a tile at a time into the core array, performing the specified operation, then repeating that sequence until all the tiles have been processed. The output is written back to core registers.

  3. Using an iterator, that output data is flowed from the core registers back into OCM, with the specified offset to position the output RoI appropriately in the larger tensor.

The most common use cases require processing the entire input tensor by repeating steps 1 to 3 on consecutive, adjacent RoI’s until the entire processed version is written back to OCM. (See gaussian_blur.md)

Tensor Walk

The sections above have repeatedly referred to the walk of a tensor, which is just the sequence in which tiles in the array are traversed by the iterator.

Iterators are described using the X, Y, Z coordinate system discussed above, and are denoted by their major and minor dimensional “walk” of the tensor, i.e. XY, YX, ZY. The first letter names the dimension that is walked first: an XY iterator walks along X first, then Y.

For example, a YX iterator walks tiles of 8x8 elements in this order:

tile flow

A YX iterator would walk through tiles of a 64x64 tensor in the order indicated by the numbered tiles below:

[Figure: tensor]

YX Iterator Example

Let’s step through the walk of the tensor illustrated above:

  1. Iteration starts at (0,0), then walks down each tile on the Y axis until it reaches the last tile in the column.

  2. The iteration then increments the x position by one tile and continues from the top of the tensor at (8,0), then walks down the Y axis until it reaches the last tile in the column.

  3. The iteration increments the x position by one tile and continues from the top of the tensor at (16,0), then walks down the Y axis until it reaches the last tile in the column and so on…

Simply put, the tensor is walked column by column.

An XY iterator would walk a 64x64 element tensor in the order indicated by the numbered tiles below:

[Figure: tensor]

XY Iterator Example

Let’s again step through the iteration:

  1. Iteration starts at (0,0), the first tile, then walks along the X axis until it hits the far right 8x8 tile in the row.

  2. Increments Y position, restarts from the left at (0,8), the ninth tile, then walks along the X axis, tile by tile, until it hits the far right tile in the row.

  3. Increments Y position, restarts from the left at (0,16), the 17th tile, then walks along the X axis until it hits the far right tile in the row and so on…

Simply put, the iterator walks the tensor tiles row by row.
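
The two walks described above can be reproduced with a pair of nested loops. The following host-side sketch is only a model of the visiting order; `TileOrigin`, `walkYX` and `walkXY` are hypothetical names, and origins are given as (x, y) to match the coordinates used in the steps above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using TileOrigin = std::pair<std::size_t, std::size_t>;  // (x, y) of a tile's top-left element

// YX walk: step down Y within a column of tiles, then move one tile along X.
inline std::vector<TileOrigin> walkYX(std::size_t width, std::size_t height, std::size_t tileDim) {
  assert(tileDim > 0);
  std::vector<TileOrigin> order;
  for(std::size_t x = 0; x < width; x += tileDim) {
    for(std::size_t y = 0; y < height; y += tileDim) {
      order.emplace_back(x, y);
    }
  }
  return order;
}

// XY walk: step along X within a row of tiles, then move one tile along Y.
inline std::vector<TileOrigin> walkXY(std::size_t width, std::size_t height, std::size_t tileDim) {
  assert(tileDim > 0);
  std::vector<TileOrigin> order;
  for(std::size_t y = 0; y < height; y += tileDim) {
    for(std::size_t x = 0; x < width; x += tileDim) {
      order.emplace_back(x, y);
    }
  }
  return order;
}
```

For a 64x64 tensor with 8x8 tiles, the ninth tile visited is (8,0) in the YX walk and (0,8) in the XY walk, matching the step-by-step descriptions.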

Using Iterators

The API simplifies iteration by handling much of the tedious index math and flow control for you.

The following iterator patterns are supported in the API:

  • YX iterator_inplane_yx.hpp (Border/NoBorder)

  • XY iterator_inplane_xy_noborder.hpp (NoBorder only)

  • ZY iterator_indepth_zy.hpp (Border only)

A typical iterator call looks like this:

  // copy tiles from RoI in OCM to core registers
  fetchTilesInRoi<StreamSide::SOUTH, DataRoiShapeDesc, IteratorType::YX_BORDER>(ocmIn, qData, 0, 0, 0, widthOffset);

Iterator Parameters

Each iterator call takes the following template parameters:

  • StreamSide (See also: common_defs.hpp) Value can be NORTH, SOUTH, EAST, or WEST. Specifies the direction data is “pushed” onto the core, which determines the orientation of the tensor data. (defaults to SOUTH)

  • RoiShape Shape of the “Region of Interest” within the tensor data. Ex: you only want to focus on a 2x2 portion of an 8x8 image.

  • DataShape Shape of the entire dataset being iterated over. This is usually the shape of the full input tensor itself.

And the following function parameters:

  • Data (Tensor)

  • TileAction (Lambda function)

All tiles are consumed within each RoI in the pattern specified by the iterator – one tile per iteration.

Understanding the StreamSide Parameter

Iterators let you specify the direction in which tile data is loaded into and/or written from the core, which effectively determines the orientation of array data before and/or after core operations.

tensor

By default, data is flowed in (fetched) from the SOUTH side, which preserves the original orientation of the input data in the tile array. Conversely, data is flowed out (written) from the NORTH side by default. Specifying the flow direction is an efficient way to transpose the tile array as it’s transferred between OCM and core memory.

Example Flows

Below are samples of the same tensor loaded into the core array in each of the four directions. Notice the vertical transposition between SOUTH and NORTH, and the horizontal transposition between EAST and WEST.

Flow in from SOUTH (the original data orientation)

[Figure: tensor]

Flow in from NORTH

[Figure: tensor]

Note the vertical transposition between SOUTH and NORTH.

Flow in from EAST

[Figure: tensor]

Flow in from WEST

[Figure: tensor]

Note the horizontal transposition between EAST and WEST.

XY Example

The example below shows data flowing from OCM into the core from the SOUTH, using fetchAllTiles, then flowing output data into OCM from the NORTH edge of the core using writeAllTiles. By convention, a ‘fetch’ implies that tiles are coming from OCM into the core. ‘write’ implies that tiles are being copied from the core to OCM.

  qVar_t<int32_t> x[inOutShape::NUM_TILES];

  // Flow in data with the fetchAllTiles iterator
  fetchAllTiles<inOutShape, StreamSide::SOUTH, IteratorType::XY_NO_BORDER>(inp, x);
  // process tile by tile, at every cell: compute tile(r,c) += tile(r-1,c) + tile(r+1,c) + tile(r, c-1) + tile(r, c+1)
  for(std::int32_t tileNum = 0; tileNum < inOutShape::NUM_TILES; tileNum++) {
    qAllNeighbors<> = x[tileNum];
    x[tileNum] += (qEast<> + qSouth<> + qWest<> + qNorth<>);
  }
  // Flow data out to OCM with the writeAllTiles iterator
  writeAllTiles<inOutShape, StreamSide::NORTH, IteratorType::XY_NO_BORDER>(x, out);

Note that inOutShape is used for both input and output since the output is an identically sized tensor. NUM_TILES is a built-in attribute that makes it easy to get the total number of tiles, either for allocating tile arrays or for setting loop bounds. Other valid operators are noted in global_operator_overloading.hpp.

YX Example

The previous example used iterators that fetch all the tiles in the input tensor from OCM.

This example uses a version of the iterator called fetchTilesInRoi, which fetches tiles from a defined RoI in the input tensor (and writes them back using writeTilesInRoi).

/**
   Flows tiles in plane col-major order with borders enabled.
   Left-most column tiles will see their left borders masked - these values are used in the compute. */
std::int32_t south_north_flow_d2_d1(inOutShape& inp, inOutShape& out) noexcept {
  qVar_t<std::int32_t> x[inOutShape::NUM_TILES];

  fetchTilesInRoi<inOutShape, RoiShape, StreamSide::SOUTH, IteratorType::YX_BORDER>(
    inp, x, RoiShapeDesc::d4beg, RoiShapeDesc::d3beg, RoiShapeDesc::d2beg, RoiShapeDesc::d1beg);

  // process tile by tile, at every cell: compute tile(r,c) += tile(r-1,c) + tile(r+1,c) + tile(r, c-1) + tile(r, c+1)
  for(std::int32_t tileNum = 0; tileNum < inOutShape::NUM_TILES; tileNum++) {
    qAllNeighbors<> = x[tileNum];
    x[tileNum] += (qEast<> + qSouth<> + qWest<> + qNorth<>);
  }

  writeTilesInRoi<inOutShape, RoiShape>(
    x, out, RoiShapeDesc::d4beg, RoiShapeDesc::d3beg, RoiShapeDesc::d2beg, RoiShapeDesc::d1beg);
  return 0;
}

Tiles are written out using writeTilesInRoi.

ZY Example

The third iterator pattern walks Z-first and Y secondarily.

This example also uses fetchTilesInRoi and writeTilesInRoi.

  qVar_t<std::int32_t> x[OcmRoiOutputShape::NUM_TILES];

  constexpr std::size_t roiHeight = BorderedOcmRoiShape::NUM_ROWS;
  constexpr std::size_t numRois   = DdrInOutShape::NUM_ROWS / roiHeight;

  for(std::size_t i = 0; i < numRois; i++) {
    // Copy the next RoI from DDR into OCM, then flow it into the array.
    memCpy(ddrInp, ocmInp1, 0, 0, i * roiHeight);
    fetchTilesInRoi<BorderedOcmRoiShape, true, IteratorType::ZY_BORDER>(ocmInp1, x, 0, 0, 0, 0);
    // Use a separate tile index t so the RoI counter i is not shadowed.
    for(std::size_t t = 0; t < BorderedOcmRoiShape::NUM_TILES; t++) {
      qAllNeighbors<> = x[t];
      x[t] += (qEast<> + qSouth<> + qWest<> + qNorth<>);
    }
    // Write the processed RoI back to OCM, then copy it out to DDR.
    writeTilesInRoi<OcmRoiOutputShape>(x, ocmOut2, 0, 0, 0, 0);
    memCpy(ocmOut2, ddrOut, 0, 0, i * roiHeight);
    // Zero out x
    for(std::size_t t = 0; t < OcmRoiOutputShape::NUM_TILES; t++) {
      x[t] = 0;
    }
  }

Tiles are written out using writeTilesInRoi.
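
The loop above divides the DDR tensor height by the RoI height, which works when the height divides evenly. As a hedged sketch of how the row offsets could be planned when it does not, the helper below lists each chunk's starting row and row count, with a shorter final chunk. `RowChunk` and `planRowChunks` are hypothetical and ignore the border rows that a bordered RoI shape would add.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One RoI-sized slice of the tensor along the height axis.
struct RowChunk {
  std::size_t offset;  // first row of the chunk in the full tensor
  std::size_t rows;    // number of rows in the chunk
};

// Splits totalRows into consecutive chunks of at most roiRows rows each.
inline std::vector<RowChunk> planRowChunks(std::size_t totalRows, std::size_t roiRows) {
  assert(roiRows > 0);
  std::vector<RowChunk> chunks;
  for(std::size_t offset = 0; offset < totalRows; offset += roiRows) {
    chunks.push_back({offset, std::min(roiRows, totalRows - offset)});
  }
  return chunks;
}
```

For a 25-row tensor and a 12-row RoI this gives chunks starting at rows 0, 12 and 24, the last one covering a single row.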

More Examples:
  • iterator_zy_roi_baseline.cpp

  • iterator_xy_noborder_roi_baseline.cpp

  • iterator_yx_roi_baseline.cpp

RAU (Random Access)

There are instances where data must be fetched from or written to OCM in a sequence that differs from the order in which it’s stored.

Typical use cases include:

  • Ego motion correction/image rotation: Data coordinates to be fetched are not known until run-time, but some particular rotation must be applied to the coordinates before they are fetched.

  • Image up/down scaling. Data might have to be fetched/written in a strided pattern (each core fetches every other element for instance).

  • Unaligned access. The flow APIs (fetch/write) only allow data to be fetched from aligned start addresses (the start address must be a multiple of the number of columns in the array, e.g. 16 in Q16). For unaligned access, one would use RAU.

Flow and broadcast patterns are efficient but they assume sequential data transfer, so the SDK provides the RAU load/store functions to use when random access is required. RAU load/store let you specify offsets to select specific regions of data to transfer.

Below, see an example of a RAU load:

Each core loads data from a given location in OCM. That location is specified using four different offsets – batch, channel, row and column.

  MemAllocator                                  ocmMem;
  typedef OcmTensor<std::int32_t, 1, 2, 12, 15> ocmTensorShape;
  ocmTensorShape                                ocmTensor;
  ocmMem.allocate(ocmTensor);  // Allocate space for the OCM tensor.

  const qVar_t<std::uint32_t> batchOffset = 0;
  const qVar_t<std::uint32_t> chOffset    = 0;
  // Each array core can access its own row position value via qRow<>. Each core accesses every other row.
  const qVar_t<std::uint32_t> rowOffset   = qRow<> * 2;
  // Each array core can access its own column position value via qCol<>. Each core accesses every other column.
  const qVar_t<std::uint32_t> colOffset   = qCol<> * 2;
  // Configure the system to perform a Random access
  Rau::config(ocmTensor);

  // Fetch a tile worth of data. Each core has a unique offset.
  const qVar_t<std::int32_t> qData = Rau::Load::oneTile(batchOffset, chOffset, rowOffset, colOffset, ocmTensor);

Below, see an example of a RAU store:

  Rau::Store::oneTile(batchOffset, chOffset, rowOffset, colOffset, qData, ocmTensor);
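
To see what the per-core offsets in the load example select, the following host-side sketch gathers one tile from a 2D row-major buffer, with core (r, c) reading element (r * stride, c * stride). It is a plain C++ model under stated assumptions: `Tile`, `gatherStrided` and the 4x4 array size are hypothetical, and batch and channel offsets are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kCoreDim = 4;  // hypothetical array size, for illustration only
using Tile = std::array<std::array<std::int32_t, kCoreDim>, kCoreDim>;

// Core (r, c) loads the element at row r * stride, column c * stride of a rows x cols tensor.
inline Tile gatherStrided(const std::vector<std::int32_t>& src, std::size_t rows, std::size_t cols,
                          std::size_t stride) {
  assert(src.size() == rows * cols);
  assert((kCoreDim - 1) * stride < rows && (kCoreDim - 1) * stride < cols);
  Tile tile{};
  for(std::size_t r = 0; r < kCoreDim; ++r) {
    for(std::size_t c = 0; c < kCoreDim; ++c) {
      tile[r][c] = src[(r * stride) * cols + (c * stride)];
    }
  }
  return tile;
}
```

With an 8x8 source whose elements hold their own linear index and a stride of 2, core (1,1) receives 18 and core (3,3) receives 54.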

Broadcast to Array

The Broadcast API provides one-way data transfer functionality to populate the array core from OCM. Specifically, it broadcasts tensor data from the OCM as a series of sequential values to all array cores simultaneously. The broadcast data is consumed by the array in sequential chunks.

Broadcast is designed specifically to enable every core to consume the same loop-invariant data object simultaneously from OCM, typically to perform a convolution while minimizing clock cycles.

Another way to think about broadcast is that every core consumes a copy of broadcast data in parallel, storing it in the core’s dedicated 64-bit broadcast register.

Each call to broadcast copies 64 bits of OCM data to each core. The source object itself can be larger, but it must then be consumed with multiple broadcast calls, since only 64 bits of that object are broadcast per function call.

Comparing Broadcast to Iterators

Feature                   Iterator                 Broadcast
Max. transfer size        65536 bytes              64 bits
Max. source object size   available OCM            available OCM
Copy direction            OCM <-> core registers   OCM -> core broadcast registers
Source data disposition   unchanged                unchanged
Edge flow options         N, S, E, W               n/a (hardwired to WEST)

Broadcast steps

The diagram above illustrates the broadcast implementation process. In this example, each element in the input array is an i32, which means 2 elements can be consumed per broadcast (2 x 32-bits).

The steps to use broadcast are:

  1. OCM Setup: Linearize the tensor according to the array size

  2. Broadcast Setup: Set up hardware for broadcast

  3. Array-Side Consumption: Consume the broadcast in 64 bit blocks

Broadcast Vocabulary

These terms are used repeatedly to describe broadcast in documentation. They carry specific meaning in the context of the broadcast API.

broadcast v. Send 64-bits of a data object to every core, causing each to consume that data into its broadcast register.

broadcast register n. The 64-bit memory register in each core that is the dedicated destination for broadcast data.

consume v. Describes the action taken by every core when broadcast is called. That is, each core consumes a copy of the broadcast data. It’s not distinct from broadcast, it’s another way to describe what happens from the perspective of the cores when broadcast is called.

loop-invariant adj. The property of data (like a tensor) that indicates it remains unchanged through iteration, eg. a constant weights tensor broadcast to all cores in convolution.

POP v. The function attribute that indicates the broadcast function should ‘pop’ the next 64 bits from the staged variable.

source data n. The tensor (or other data object) in OCM used as broadcast input. It’s first serialized, staged, then broadcast in 64-bit chunks.

staging n. The required preparation for broadcast that designates the source array to be used as input the next time broadcast is called. Staging is a single function call: BroadcastStream::stage<WeightTensorShape, RoiShapeDesc>(Weights);

How Broadcast Uses Core Registers

By design, data from a broadcast is consumed by the broadcast register in every core. Recall that each core has two unique memory registers:

  • localMem - 4096 bytes (1024 x 32-bit registers)

  • broadcast register - 64 bits (2 x 32-bit registers)

Note

Regardless of the core array size of the target EPU, the broadcast registers are fixed to 64 bits wide.

You can also think about a broadcast from the receiver’s point of view – that is, from the perspective of the cores, each of which simultaneously consumes the same staged data object from OCM, 64-bits at a time. Since a single call to broadcast copies only 64-bits into each core, consuming any source object larger than 64-bits requires multiple calls to broadcast.

Ultimately, the size of that source tensor is up to you. While typical use cases entail broadcasting much smaller tensors, the only architectural limit to the size of the staged source data is the available OCM space.

OCM Setup

Broadcast setup is very similar to other OCM -> Array flows. Before broadcast, the tensor must be linearized (the size of the innermost dimension must be the same as the array size, e.g. 8 for an 8x8 array).

Below, see an example of an appropriate tensor:

typedef DdrTensor<std::int32_t, 1, 1, Epu::coreDim, Epu::coreDim> DdrInOutShape;
typedef OcmTensor<std::int32_t, 1, 1, Epu::coreDim, Epu::coreDim> OcmInOutShape;

using RoiShape = DdrInOutShape;

typedef OcmTensor<char, RoiShape::NUM_BCH, RoiShape::NUM_CHN, RoiShape::NUM_ROWS, RoiShape::NUM_COLS> RoiShapeDesc;
OcmInOutShape ocmInp;

Broadcast Setup

Assuming the tensor defined above was properly allocated and the memory copied from DDR, we can now configure hardware for broadcast of the tensor:

  // Broadcast data.
  BroadcastStream::stage<OcmInOutShape, RoiShapeDesc>(ocmInp);

Array-Side Consumption

To read a broadcast, you pop data from the broadcast bus using the qBroadcast variable. The broadcast consumer API exposes 64 bits of data at a time. This means that the number of data elements exposed by each pop depends on the size of the elements.

Data Element Size   Number of Elements per pop
32 bits             2
16 bits             4
8 bits              8
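
The table follows directly from dividing the 64-bit broadcast width by the element width. A minimal sketch of that arithmetic is shown below; `elementsPerPop` and `popsNeeded` are hypothetical helpers written for this illustration and are not SDK functions.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBroadcastBits = 64;  // width of one broadcast, as stated above

// Number of elements of type T exposed by a single pop.
template <typename T>
constexpr std::size_t elementsPerPop() {
  static_assert(sizeof(T) * CHAR_BIT <= kBroadcastBits, "element must fit in one broadcast");
  return kBroadcastBits / (sizeof(T) * CHAR_BIT);
}

// Number of pops needed to consume elementCount elements of type T (rounded up).
template <typename T>
constexpr std::size_t popsNeeded(std::size_t elementCount) {
  return (elementCount + elementsPerPop<T>() - 1) / elementsPerPop<T>();
}

static_assert(elementsPerPop<std::int32_t>() == 2, "two 32-bit elements per pop");
```

For instance, 64 elements of 32 bits each take 32 pops, while 9 elements of 16 bits each take 3 pops.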

The broadcast is consumed via the qBroadcast variable. Continuing from the OCM setup example above, we will now walk through the broadcast of the values.

First, we set up variables to consume the broadcast:

  constexpr std::int32_t numBroadcasts32 = DdrInOutShape::linearElemCount / getNumberOfWeightRegisters<std::int32_t>();

  // Array to hold all the weights
  qVar_t<std::int32_t> arr[getNumberOfWeightRegisters<std::int8_t>()];
  // Our output data
  qVar_t<std::int32_t> out[1];
  // Initialize output to zero
  out[0] = 0;

getNumberOfWeightRegisters() is provided as a convenience to get the size of a qVar_t array required to hold all values from a single broadcast.

We then retrieve values from the broadcast in sequential order:

  for(std::int32_t i = 0; i < numBroadcasts32; i++) {
    arr[0] = qBroadcast<0, std::int32_t, BroadcastAction::POP>;
    arr[1] = qBroadcast<1, std::int32_t>;

    out[0] += (arr[0] + arr[1]);
  }

In the example above we calculate the number of qBroadcast() "pop" calls required with the numBroadcasts32 variable and loop until the entire broadcast is read. Because the type being broadcast is 32 bits, we can get two elements from each pop of the 64 bit buffer.

The qBroadcast variable takes three template arguments:

  1. a uint8_t index (for which element within the broadcast to retrieve)

  2. The type being retrieved

  3. A BroadcastAction: (default) BroadcastAction::PEEK or BroadcastAction::POP

qBroadcast yields a qVar_t<T>, where T is the type given as the second template argument, filled with the data at the specified index. If BroadcastAction::POP is specified, the next elements are first retrieved from the broadcast queue. If BroadcastAction::PEEK is specified (which is the default), the read uses the current contents of the buffer.

Note

A BroadcastAction::POP must occur before the first read. Setup does not prime the ports for reading.

The qVar_t variables are populated with the same value in all cores after the assignment from a broadcast.

Note

An invalid index for the type specified (e.g. index 2 on a 32 bit type) will result in a compile time error.

There are two ways to use broadcast:

  • OPTION 1 Call broadcast directly

    • If you call qBroadcast directly, then you may call it one or more times to consume the source data in 64-bit increments.

      For int16 input, that will look similar to this:

      for(std::int32_t i = 0; i < numBroadcasts16; i++) {
        arr[0] = qBroadcast<0, std::int16_t, BroadcastAction::POP>;  // pop the next 64 bits, then read element 0
        arr[1] = qBroadcast<1, std::int16_t>;                        // peek at elements 1 to 3 of the same 64 bits
        arr[2] = qBroadcast<2, std::int16_t>;
        arr[3] = qBroadcast<3, std::int16_t>;
      }

      qBroadcast<> returns a qVar_t.

  • OPTION 2 Use convolution

    • If you invoke broadcast indirectly, as part of a convolution, then you can let the convolution manage the mechanics of calling broadcast.

      That will look similar to this:

      qData[tileNum] = NN::convTileBlockFx16<FixedPoint16<numFracBits>, 3> (an expanded example is shown below)

The data that is currently staged is used automatically as input by the next Broadcast call whether it’s direct or part of a convolution.

The example below loops through the tiles in an RoI (region of interest), staging a predefined weights tensor then letting convolution manage the broadcast:

for(std::int32_t tileNum = 0; tileNum < OcmRoiShape::NUM_TILES; tileNum++) {
  // Stage the weights so the convolution's broadcasts start from the first 64 bits.
  BroadcastStream::stage<WeightTensorShape, RoiShapeDesc>(Weights);
  qVar_t<FixedPoint16<numFracBits>> input[4];
  input[0]       = qData[tileNum];
  // The convolution issues the broadcast calls internally.
  qData[tileNum] = NN::convTileBlockFx16<FixedPoint16<numFracBits>, 3>(input, 1);
}

Note

You may have noticed that the way staged data is accessed over successive broadcasts makes each broadcast analogous to a 64-bit ‘POP’ from a FIFO queue. The primary distinction is that data is not removed from the source with each POP. Instead, broadcast maintains an internal pointer that ‘bookmarks’ the 64 bits most recently broadcast. The pointer advances by 8 bytes on each POP, so broadcast knows where to fetch the ‘next’ 64 bits automatically.
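The bookmark behavior described in this note can be sketched as a host-side model. The class below is purely illustrative: `BroadcastModel` and its members are hypothetical names, not SDK APIs, and returning `false` at the end of the source stands in for the hang that a real over-read may cause.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side model of staging and broadcast; NOT the SDK API.
class BroadcastModel {
public:
    // Staging installs a new source and rewinds the bookmark to its start.
    void stage(const std::vector<std::uint8_t>& source) {
        source_ = source;
        next_ = 0;
        primed_ = false;
    }

    // POP: load the next 64 bits and advance the bookmark by 8 bytes.
    // The source itself is left untouched. Returns false when fewer than
    // 8 bytes remain (the model's stand-in for a hang on real hardware).
    bool pop() {
        if (source_.size() - next_ < sizeof(current_)) return false;
        std::memcpy(&current_, source_.data() + next_, sizeof(current_));
        next_ += sizeof(current_);
        primed_ = true;
        return true;
    }

    // PEEK: re-read the current 64 bits without moving the bookmark.
    std::uint64_t peek() const {
        assert(primed_ && "a POP must occur before the first read");
        return current_;
    }

    std::size_t bookmark() const { return next_; }
    std::size_t sourceSize() const { return source_.size(); }

private:
    std::vector<std::uint8_t> source_;
    std::size_t next_ = 0;
    std::uint64_t current_ = 0;
    bool primed_ = false;
};
```

With a 16-byte source, two `pop()` calls succeed and a third returns `false`, while `sourceSize()` stays at 16 throughout. Calling `stage()` again rewinds the bookmark, so the next `pop()` yields the first 64 bits of the newly staged source.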

Broadcast API Functions

There are two available options for broadcasting data to the cores:

  1. Explicit broadcast. Call the broadcast function directly. To trigger a broadcast, you can call qBroadcast<>. The qBroadcast<> function takes three template arguments:

    1. A uint8_t index that selects which element within the broadcast to retrieve.

    2. The data type being retrieved (i8, i16, or i32).

    3. BroadcastAction:

      1. PEEK (the default): reads elements from the current broadcast.

      2. POP: triggers a broadcast.

  2. Convolution (implicit broadcast). Call the convolution function, which manages calls to qBroadcast for you as part of the convolution operation, for example: qData[tileNum] = NN::convTileBlockFx16<FixedPoint16<numFracBits>, 3>

Let’s look at the specifics of implementing these functions.

Broadcast Notes and Cautions

Here’s a quick review and a deeper dive into the inner workings of staging and broadcast to help wrap up the concepts discussed so far.

Understanding these points will help you both avoid common errors and confidently apply broadcast to implement a wide range of algorithms. The 64-bit transfer limit in particular entails some more subtle implications worth emphasizing:

  1. Broadcasting data larger than 64 bits. As noted above, any source object larger than 64 bits requires multiple calls to broadcast. The convolution function manages broadcast for you, consuming the entire source object in 64-bit increments.

  2. Determining the ‘next’ read offset into the source data. To support multiple calls to broadcast, the API maintains and updates a pointer to the ‘next’ 64 bits in the source data object, so each subsequent call continues the ‘walk’ through your source array where the previous call left off. The API handles the offset for you, but when you call broadcast directly, your code is responsible for tracking the loop iterations required to broadcast the entire object.

  3. No accumulation in the core. A broadcast call destroys any existing data in the broadcast register of each core. If you need that data, copy it from the broadcast register into LocalMem before the next broadcast call.

  4. Calling the staging function instantiates a new source object. After staging data:

    1. You can invoke broadcast repeatedly until the staged object has been broadcast in its entirety. Successive calls step through the source data automatically.

    2. If you call the staging function before the current source data has all been broadcast, the next call to broadcast will ‘pop’ the first 64 bits of the new source object.

  5. Tracking broadcast iterations through the source tensor. Your code must both ensure that the staged source data is broadcast completely and trap any attempt to broadcast beyond the last element of the source. The number of pops required equals the number of 64-bit elements in your input tensor.

  6. The API can help you avoid miscalculating the number of iterations required to consume the source data:

    constexpr std::int32_t numBroadcasts8  = DdrInOutShape::linearElemCount / getNumberOfWeightRegisters<std::int8_t>();
    
    constexpr std::int32_t numBroadcasts16 = DdrInOutShape::linearElemCount / getNumberOfWeightRegisters<std::int16_t>();
    

    To get the number of iterations required, use the ratio of linearElemCount (of the input tensor) to getNumberOfWeightRegisters for the desired data type as shown above.

    • Too few iterations => elements are left over in the staged source

    • Too many iterations => the processor will hang

Note

Attempting to POP data beyond the extent of the currently staged input data may cause the system to hang.
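The iteration arithmetic can be checked on the host with a small sketch. Here `elemsPerBroadcast` is a hypothetical stand-in for getNumberOfWeightRegisters, assumed (not confirmed by the SDK documentation) to equal the number of elements of the given type that fit in one 64-bit broadcast. The other helper names are likewise illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the SDK's getNumberOfWeightRegisters<T>().
// ASSUMPTION: it equals the number of T elements in one 64-bit broadcast.
template <typename T>
constexpr std::int32_t elemsPerBroadcast() {
    return static_cast<std::int32_t>(8 / sizeof(T));
}

// Number of POPs needed to consume linearElemCount elements of type T.
template <typename T>
constexpr std::int32_t numBroadcasts(std::int32_t linearElemCount) {
    return linearElemCount / elemsPerBroadcast<T>();
}

// Elements that integer division would leave unbroadcast.
template <typename T>
constexpr std::int32_t leftoverElems(std::int32_t linearElemCount) {
    return linearElemCount % elemsPerBroadcast<T>();
}

// Runtime variant that rejects tensors that do not fill whole 64-bit words.
template <typename T>
std::int32_t checkedNumBroadcasts(std::int32_t linearElemCount) {
    assert(leftoverElems<T>(linearElemCount) == 0 &&
           "tensor must fill a whole number of 64-bit broadcasts");
    return numBroadcasts<T>(linearElemCount);
}

static_assert(numBroadcasts<std::int8_t>(64) == 8, "64 int8 elements -> 8 POPs");
static_assert(numBroadcasts<std::int16_t>(64) == 16, "64 int16 elements -> 16 POPs");
```

Because the ratio uses integer division, a tensor whose element count is not a multiple of the per-broadcast element count would leave elements over. Checking the remainder before entering the loop is one way to catch that case early.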

For a full example, see broadcast_stream_unit.cpp.