# Neural Network Blocks

## Fully Connected Layers

- `template<typename T>`
  `INLINE std::enable_if<!std::is_integral<T>::value, qVar_t<T>>::type fullyConnectedTileBlock(qVar_t<T> qMatrix[], const std::int32_t numCols)`

Computes a fully connected layer in the provided Fx16 precision.

Returns `qOutput`, a single tile output.

Parameters

`qData`

: The activations of the input

`numCols`

: The width of the weight matrix in the FC layer

Template Parameters

`T`

: The type of the input values

- `template<typename T, std::uint32_t shiftAmount = 0>`
  `INLINE std::enable_if<std::is_integral<T>::value, qVar_t<T>>::type fullyConnectedTileBlock(qVar_t<std::int8_t> qMatrix[], const std::int32_t numCols)`

Computes a fully connected layer in int8 precision.

Returns `qOutput`, a single tile output.

Parameters

`qData`

: The activations of the input

`numCols`

: The width of the weight matrix in the FC layer

Template Parameters

`T`

: The type of the input values

`shiftAmount`

: The amount to shift the output of the FC by
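As a rough host-side illustration of what an int8 fully connected output with a `shiftAmount` might compute, the sketch below accumulates int8 products in 32 bits and then rescales the result. The name `fcInt8Reference`, the right-shift direction, and the saturation back to int8 are all assumptions made for illustration; they are not taken from the SDK.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference, not part of the SDK.
// Computes one fully connected output: dot(weights, activations) >> shiftAmount.
template <std::uint32_t shiftAmount = 0>
std::int8_t fcInt8Reference(const std::vector<std::int8_t>& weights,
                            const std::vector<std::int8_t>& activations) {
  assert(weights.size() == activations.size());
  std::int32_t acc = 0;  // wide accumulator so int8 products cannot overflow
  for (std::size_t i = 0; i < weights.size(); ++i) {
    acc += static_cast<std::int32_t>(weights[i]) *
           static_cast<std::int32_t>(activations[i]);
  }
  // Assumption: the shift is a right shift that rescales the accumulator.
  acc >>= shiftAmount;
  // Assumption: the result is saturated to the int8 range.
  acc = std::max<std::int32_t>(-128, std::min<std::int32_t>(127, acc));
  return static_cast<std::int8_t>(acc);
}
```

For example, `fcInt8Reference<2>({10, 10}, {10, 10})` accumulates 200 and shifts it down to 50, which fits in int8 again.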

## Upsampling

- `template<typename T, IfNotIntegerTy<T> = 0>`
  `INLINE qVar_t<T> interpolationTileBlock(qVar_t<T> qData)`

Computes upsampling (nearest neighbor or bilinear) in the provided Fx16 precision.

Returns `qOutput`, a single tile output.

Parameters

`qData`

: The activations of the input

Template Parameters

`T`

: The type of the input values

- `template<FracRepType shiftAmount = 0, typename T, IfIntegerTy<T> = 0>`
  `INLINE qVar_t<T> interpolationTileBlock(qVar_t<std::int8_t> qData)`

Computes upsampling (nearest neighbor or bilinear) in int8 precision.

Returns `qOutput`, a single tile output.

Parameters

`qData`

: The activations of the input

Template Parameters

`shiftAmount`

: The amount to shift the output of the interpolation by

`T`

: The type of the input values
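To make the nearest-neighbor mode concrete, here is a minimal host-side sketch on a row-major image. The function `upsampleNearest` and its integer `factor` argument are hypothetical and only illustrate the indexing; the tile-based SDK functions above take no such arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of the SDK.
// Upsamples a row-major height x width image by an integer factor
// using nearest-neighbor sampling.
template <typename T>
std::vector<T> upsampleNearest(const std::vector<T>& in, std::size_t height,
                               std::size_t width, std::size_t factor) {
  assert(in.size() == height * width && factor > 0);
  const std::size_t outW = width * factor;
  std::vector<T> out(height * factor * outW);
  for (std::size_t y = 0; y < height * factor; ++y) {
    for (std::size_t x = 0; x < outW; ++x) {
      // Each output pixel copies the input pixel found by integer division.
      out[y * outW + x] = in[(y / factor) * width + (x / factor)];
    }
  }
  return out;
}
```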

## Convolutions

- `template<typename T, std::uint32_t filterSize = 1, bool useVaping = false>`
  `INLINE qVar_t<T> convTileBlockFx16(qVar_t<T> qData[], std::uint32_t numInCh)`

Computes convolutions in the provided Fx16 precision.

Returns `qOutput`, a single output channel.

Parameters

`qData`

: The activations of the input

`numInCh`

: The number of input channels in the convolution

Template Parameters

`T`

: The type of the input values

`filterSize`

: The filter size of the convolution

`useVaping`

: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

- `template<typename T, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, bool useVaping = false>`
  `INLINE qVar_t<T> convTileBlockInt8(qVar_t<std::int8_t> qData[], std::uint32_t numInCh)`

Computes convolutions in int8 precision.

Returns `qOutput`, a single output channel.

Parameters

`qData`

: The activations of the input

`numInCh`

: The number of input channels in the convolution

Template Parameters

`T`

: The type of the input values

`filterSize`

: The filter size of the convolution

`shiftAmount`

: The amount to shift the output of the convolution by

`useVaping`

: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

- `template<typename T, std::uint32_t filterSize = 3, bool useVaping = false>`
  `INLINE qVar_t<T> depthwiseConvTileBlockFx16(qVar_t<T> qData)`

Computes depthwise convolutions in the provided Fx16 precision.

Returns `qOutput`, a single output channel.

Parameters

`qData`

: The activations of the input

Template Parameters

`T`

: The type of the input values

`filterSize`

: The filter size of the convolution

`useVaping`

: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

- `template<typename T, std::uint32_t filterSize = 3, FracRepType shiftAmount = 0, bool useVaping = false>`
  `INLINE qVar_t<T> depthwiseConvTileBlockInt8(qVar_t<std::int8_t> qData)`

Computes depthwise convolutions in int8 precision.

Returns `qOutput`, a single output channel.

Parameters

`qData`

: The activations of the input

Template Parameters

`T`

: The type of the input values

`filterSize`

: The filter size of the convolution

`shiftAmount`

: The amount to shift the output of the convolution by

`useVaping`

: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

## Pooling

- `template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t numRowFlowsPerInputWidth = roundUpToNearestMultiple(OcmInTensorShape::NUM_COLS, Epu::numArrayCores) / Epu::numArrayCores, std::int32_t numRowFlowsPerOutputWidth = roundUpToNearestMultiple(OcmOutTensorShape::NUM_COLS, Epu::numArrayCores) / Epu::numArrayCores>`
  `numOutputTiles __pad0__`

Performs a reshape from the input shape to the output shape. Currently, three shape layouts are supported: (C, H, W), (C, 1, H*W), and (1, 1, C*H*W). The implementation:

Uses a Row Iterator to flow in data:

1. Computes the input x, y, z locations for each core.
2. Computes the input location in contiguous memory.
3. Computes the output x, y, z for each core.
4. Computes the output location using the pitched width of the output tensor.
5. Stores the valid data values with `Rau::store`.

The following reshapes are supported and tested:

- 2D tensor shape (C, 1, H * W) to 3D tensor shape (C, H, W)
- 3D tensor shape (C, H, W) to 2D tensor shape (C, 1, H * W)
- 1D tensor shape (1, 1, C * H * W) to 3D tensor shape (C, H, W)
- 3D tensor shape (C, H, W) to 1D tensor shape (1, C, H, W)

Parameters

`[in] ocmIn`

: The Ocm Input

`ocmOut`

: The Ocm output

Template Parameters

`OcmInTensorShape`

: The shape of the input ocm tensor

`OcmOutTensorShape`

: The shape of the output ocm tensor
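The index arithmetic behind such a reshape can be sketched on the host as follows. This is a simplified model, assuming row-major storage in which each stored row is padded to a `pitch` of at least the logical width; `Shape3` and `reshapePitched` are hypothetical names, and the per-core flow and `Rau::store` steps are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type; the SDK's tensor classes are not used here.
struct Shape3 {
  std::size_t c, h, w;  // logical (C, H, W) extents
  std::size_t pitch;    // elements per stored row, pitch >= w
};

// Walks logical elements in (c, h, w) order so that element k of the input
// lands on element k of the output, whatever padding each tensor uses.
template <typename T>
std::vector<T> reshapePitched(const std::vector<T>& in, const Shape3& inShape,
                              const Shape3& outShape) {
  assert(inShape.c * inShape.h * inShape.w ==
         outShape.c * outShape.h * outShape.w);
  assert(inShape.pitch >= inShape.w && outShape.pitch >= outShape.w);
  assert(in.size() == inShape.c * inShape.h * inShape.pitch);
  std::vector<T> out(outShape.c * outShape.h * outShape.pitch, T{});
  const std::size_t total = inShape.c * inShape.h * inShape.w;
  for (std::size_t k = 0; k < total; ++k) {
    // Input (z, y, x) location of logical element k.
    const std::size_t iz = k / (inShape.h * inShape.w);
    const std::size_t iy = (k / inShape.w) % inShape.h;
    const std::size_t ix = k % inShape.w;
    // Output (z, y, x) location of the same logical element.
    const std::size_t oz = k / (outShape.h * outShape.w);
    const std::size_t oy = (k / outShape.w) % outShape.h;
    const std::size_t ox = k % outShape.w;
    // Pitched offsets skip the padding at the end of each stored row.
    const std::size_t src = (iz * inShape.h + iy) * inShape.pitch + ix;
    const std::size_t dst = (oz * outShape.h + oy) * outShape.pitch + ox;
    out[dst] = in[src];
  }
  return out;
}
```

Reshaping a (1, 2, 3) tensor stored with pitch 4 into a (1, 1, 6) tensor stored with pitch 8 moves the six valid values into one row and leaves the two padding slots zeroed.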

- `template<typename T, std::int32_t filterSize, bool useCentered = ((Epu::coreDim % filterSize) > 0), std::enable_if_t<((filterSize == 3) || (filterSize == 5) || (filterSize == 7)) && useCentered, std::int32_t> = 0>`
  `INLINE qVar_t<T> calculateSum(qVar_t<T> qData)`

Performs centered summing on data via rotations.

Performs summing on a tile of data. This first finds the sum column-wise, then row-wise, computing the sum of each 2x2 area. Repeating this process again produces the sum of a 3x3 area, and so on, based on the given filter size.

Parameters

`[in] qData`

: A qVar_t

Template Parameters

`T`

: The type of the data in the qVar_t

`filterSize`

: The size of the filter (e.g. filterSize = 2 for a 2x2 filter)
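One way to picture a centered sum built from rotations is the separable host-side model below: shifted copies of the tile are accumulated along one axis, then along the other. The `Tile` alias, `rotateTile`, and `centeredSum` are hypothetical names, and the wrap-around at tile edges is an assumption of this sketch (a real kernel would need to handle borders explicitly).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a square tile stored row-major.
using Tile = std::vector<int>;

// Circularly shifts a dim x dim tile so that out[y][x] = t[y + dy][x + dx].
Tile rotateTile(const Tile& t, std::size_t dim, int dy, int dx) {
  Tile out(t.size());
  const int n = static_cast<int>(dim);
  for (int y = 0; y < n; ++y) {
    for (int x = 0; x < n; ++x) {
      const int sy = ((y + dy) % n + n) % n;
      const int sx = ((x + dx) % n + n) % n;
      out[y * n + x] = t[sy * n + sx];
    }
  }
  return out;
}

// Centered filterSize x filterSize sum: horizontal neighbors first,
// then vertical neighbors of the partial sums.
Tile centeredSum(const Tile& t, std::size_t dim, int filterSize) {
  assert(filterSize % 2 == 1 && t.size() == dim * dim);
  const int r = filterSize / 2;
  Tile rowSum(t.size(), 0);
  for (int d = -r; d <= r; ++d) {
    const Tile shifted = rotateTile(t, dim, 0, d);
    for (std::size_t i = 0; i < t.size(); ++i) rowSum[i] += shifted[i];
  }
  Tile sum(t.size(), 0);
  for (int d = -r; d <= r; ++d) {
    const Tile shifted = rotateTile(rowSum, dim, d, 0);
    for (std::size_t i = 0; i < t.size(); ++i) sum[i] += shifted[i];
  }
  return sum;
}
```

The separable form needs 2 * filterSize shifted additions instead of filterSize * filterSize.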

- `template<typename T, std::int32_t filterSize, bool useCentered = ((Epu::coreDim % filterSize) > 0)>`
  `INLINE qVar_t<T> calculateAvg(qVar_t<T> qData)`

Performs average pooling on data.

Parameters

`[in] qData`

: A qVar_t

Template Parameters

`T`

: The type of the data in the qVar_t

- `template<typename T, std::int32_t filterSize, bool useCentered = ((Epu::coreDim % filterSize) > 0), std::enable_if_t<((filterSize == 3) || (filterSize == 5) || (filterSize == 7)) && useCentered, std::int32_t> = 0>`
  `INLINE qVar_t<T> calculateMax(qVar_t<T> qData)`

Performs centered maxpooling on data via rotations.

Performs maxpooling on a tile of data. This first finds the max column-wise, then row-wise, computing the max of each 2x2 area. Repeating this process again produces the max of a 3x3 area, and so on, based on the given filter size.

Parameters

`[in] qData`

: A qVar_t

Template Parameters

`T`

: The type of the data in the qVar_t

`filterSize`

: The size of the filter (e.g. filterSize = 2 for a 2x2 filter)
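The growing-window idea described for maxpooling (a 2x2 max first, then a 3x3 max by repeating the step) can be sketched on the host as below. The function `windowMax` is hypothetical, the window here is anchored at its top-left element rather than centered, and clamping at the last row and column is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch, not part of the SDK.
// After k passes, element (y, x) holds the max over the (k+1) x (k+1)
// window whose top-left corner is (y, x), clamped at the borders.
std::vector<int> windowMax(std::vector<int> data, std::size_t height,
                           std::size_t width, std::size_t filterSize) {
  assert(filterSize >= 1 && data.size() == height * width);
  for (std::size_t pass = 1; pass < filterSize; ++pass) {
    std::vector<int> partial(data.size());
    // First step: combine each element with its right-hand neighbor.
    for (std::size_t y = 0; y < height; ++y) {
      for (std::size_t x = 0; x < width; ++x) {
        const std::size_t xn = std::min(x + 1, width - 1);
        partial[y * width + x] =
            std::max(data[y * width + x], data[y * width + xn]);
      }
    }
    // Second step: combine each partial max with the one below it.
    for (std::size_t y = 0; y < height; ++y) {
      for (std::size_t x = 0; x < width; ++x) {
        const std::size_t yn = std::min(y + 1, height - 1);
        data[y * width + x] =
            std::max(partial[y * width + x], partial[yn * width + x]);
      }
    }
  }
  return data;
}
```

Repeating the step works for max because max is idempotent: taking the max of overlapping 2x2 results never counts an element twice.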

- `template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t strideSize, typename T>`
  `INLINE void poolRau(qVar_t<T> qData[], OcmOutTensorShape &ocmOut)`

Pools an array of qVar_t’s into a tensor, compressing by a factor of filterSize. Uses Rau to write data to the output OCM tensor.

NOTE: In the future, when there is support for filters that are unaligned with a tile, elements will need to be written out at different positions on different tiles, which will require the qMaxCond code that is currently commented out.

Parameters

`[in] qData`

: An array of qVar_t’s

`ocmOut`

: The ocm out

Template Parameters

`OcmInTensorShape`

: The shape of the input ocm tensor

`OcmOutTensorShape`

: The shape of the output ocm tensor

`filterSize`

: The size of the filter (e.g. filterSize = 2 for a 2x2 filter)

`T`

: The type of the data in the qVar_t’s

`useOutputDims`

: Whether to use the input tensor shape or the output tensor shape for the kernel

- `template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t repeatFactor = OcmOutTensorShape::NUM_COLS, isOcmTensor<OcmInTensorShape> = 0, isOcmTensor<OcmOutTensorShape> = 0>`
  `INLINE void repeatAndAppend(OcmInTensorShape &ocmIn, OcmOutTensorShape &ocmOut)`

Repeats and appends a linear OCM tensor by a specified factor. The tensor comes in a linearized format and is then replicated by the given factor.

For example, a tensor of dimensions <1, 16, 1, 1> with a factor of 128 is repeated and appended to a tensor of dimensions <1, 16, 1, 128>.

This function works for repeat factors <= 2048

Parameters

`ocmIn`

: The ocm in

`ocmOut`

: The ocm out

Template Parameters

`OcmInTensorShape`

: The shape of the ocm in

`OcmOutTensorShape`

: The shape of the ocm out

`repeatFactor`

: The factor by which the tensor is repeated
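The <1, 16, 1, 1> to <1, 16, 1, 128> example can be modeled on the host as follows, assuming a row-major layout in which each channel value becomes one row of `repeatFactor` copies. The name `repeatAndAppendReference` is hypothetical, and the factor is a runtime argument here rather than a template parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of the SDK.
// Each input value becomes a row of repeatFactor copies, i.e.
// <1, C, 1, 1> -> <1, C, 1, repeatFactor> in row-major order.
template <typename T>
std::vector<T> repeatAndAppendReference(const std::vector<T>& in,
                                        std::size_t repeatFactor) {
  std::vector<T> out;
  out.reserve(in.size() * repeatFactor);
  for (const T& value : in) {
    // Append repeatFactor copies of this channel's value.
    out.insert(out.end(), repeatFactor, value);
  }
  return out;
}
```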

- `template<typename OcmTensorShape, typename T, bool channelwiseSoftmax = (OcmTensorShape::NUM_CHN == 1), isOcmTensor<OcmTensorShape> = 0, std::enable_if_t<channelwiseSoftmax, std::int32_t> = 0>`
  `INLINE void softmax(OcmTensorShape &ocmIn, OcmTensorShape &ocmOut)`

Performs a softmax, or normalized exponential function, on the input OCM tensor and stores the result in the output OCM tensor. A softmax normalizes the input data to a probability distribution proportional to the exponentials of the inputs. In other words, softmax takes the exponential of each input element and then normalizes by dividing each result by the sum of all the exponentials. The output has the same shape as the input but is bounded to the interval (0, 1).

Additionally, an iteration goes through the scores and takes the maximum value for each channel. This maximum is subtracted from each score, which gives the scores a new range of (-inf, 0], so the output of exp is bounded to the interval (0, 1]. This ensures that the softmax sum is bounded (at most Height*Width), which is numerically stable while still ensuring that the largest scores are well represented in fixed-point.

The equation for softmax is:

softmax(x_i) = exp(x_i) / summation(exp(x_j) for all j)

The numerically stable softmax is:

softmax(x_i) = exp(x_i - max(x)) / summation(exp(x_j - max(x)) for all j)

Before performing the exponential function, a check must be made to ensure that the exponential function is only applied to the valid data. If the exponential function is applied to invalid data, which is set to 0, it will produce exp(0) = 1. Due to the way that calculateTileAverage works, these unwanted 1s in the padding and border will be factored into the sum, making the sum larger than intended, and the final results smaller than intended.

In the future, we hope to have a way to check the valid bit during the flow. Then, we can compute the exponential function on the data coming from dataPort and store this result directly into qData without having to perform the math to calculate which elements are valid by hand.

When the exponential function is applied to the input data, each input channel is mapped to a single qVar_t in qChannelSum. That qVar_t is the element-wise sum of the exponentials of the tiles composing the input channel, so the array of qVar_t’s can be passed to calculateTileAverage without complicated indexing.

To compute the sum, calculateTileAverage is called with a denominator of 1, which produces the same effect as a calculateSum function would.

This function works for input tensors that are large and non-aligned.

Parameters

`ocmIn`

: The ocm in

`ocmOut`

: The ocm out

Template Parameters

`OcmTensorShape`

: The shape of the ocm in

`T`

: The type of the input values

`channelwiseSoftmax`

: Whether softmax performs a channelwise reduction
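The max-subtraction and valid-data masking described above can be illustrated with a small floating-point reference. This is only a sketch: `stableSoftmax` and its `valid` mask are hypothetical, and the SDK function works on fixed-point tiles rather than on a flat `std::vector<double>`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical floating-point reference, not the SDK kernel.
// valid[i] marks real data; padding entries are skipped so that
// exp(0) = 1 does not inflate the sum.
std::vector<double> stableSoftmax(const std::vector<double>& x,
                                  const std::vector<bool>& valid) {
  assert(x.size() == valid.size());
  bool any = false;
  double maxVal = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (valid[i] && (!any || x[i] > maxVal)) {
      maxVal = x[i];
      any = true;
    }
  }
  std::vector<double> out(x.size(), 0.0);
  if (!any) return out;
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (valid[i]) {
      out[i] = std::exp(x[i] - maxVal);  // in (0, 1] since x[i] <= maxVal
      sum += out[i];
    }
  }
  for (std::size_t i = 0; i < x.size(); ++i) out[i] /= sum;  // padding stays 0
  return out;
}
```

Because the largest valid score always contributes exp(0) = 1, the sum is at least 1, so the final division is safe even for inputs such as 1000.0 that would overflow a naive exp.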

## Activation Functions

- `template<FracRepType numFracBits, typename T>`
  `INLINE qVar_t<FixedPoint<T, numFracBits>> sigmoid(qVar_t<FixedPoint<T, numFracBits>> qData)`

Performs the sigmoid function on the input, which is a single tile: f(x) = 1/(1 + exp(-x)).

Parameters

`qData`

: A single tile of data

Template Parameters

`T`

: The type of the data

- `template<FracRepType numFracBits, typename T>`
  `INLINE qVar_t<FixedPoint<T, numFracBits>> tanh(qVar_t<FixedPoint<T, numFracBits>> qData)`

Performs the tanh function on the input, which is a single tile: f(x) = (exp(2*x) - 1)/(exp(2*x) + 1).

Parameters

`qData`

: A single tile of data

Template Parameters

`T`

: The type of the data
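For reference, the two formulas quoted for `sigmoid` and `tanh` can be checked with scalar double-precision versions. The names `sigmoidReference` and `tanhReference` are hypothetical, and the clamp at |x| > 20 is an illustrative guard against overflow of exp(2*x), not something taken from the SDK.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar references; the SDK versions operate on fixed-point tiles.
inline double sigmoidReference(double x) {
  return 1.0 / (1.0 + std::exp(-x));
}

// tanh written in the exp(2x) form. For large |x| the result is already
// +/-1 to double precision, so the input is clamped (illustrative bound).
inline double tanhReference(double x) {
  if (x > 20.0) return 1.0;
  if (x < -20.0) return -1.0;
  const double e = std::exp(2.0 * x);
  return (e - 1.0) / (e + 1.0);
}
```

The two are related by tanh(x) = 2 * sigmoid(2x) - 1, which gives a quick consistency check between implementations.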

- `template<ReluMethod reluMethod, typename T, typename std::enable_if_t<reluMethod == ReluMethod::REGULAR, std::int32_t> = 0>`
  `INLINE qVar_t<T> relu(qVar_t<T> qData)`

Performs the ReLU function selected by `reluMethod` on the input, which is a single tile. The vanilla ReLU computes f(x) = max(0, x). The ReLU6 variant computes f(x) = min(max(0, x), 6).

Parameters

`qData`

: A single tile of data

Template Parameters

`reluMethod`

: The ReLU method chosen

`T`

: The type of the data
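A compile-time switch between the two ReLU variants might look like the scalar sketch below. `ReluKind` is a hypothetical stand-in for the SDK's `ReluMethod`; only `ReluMethod::REGULAR` appears in the signature above, so the `RELU6` enumerator name is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for ReluMethod; the RELU6 name is an assumption.
enum class ReluKind { REGULAR, RELU6 };

template <ReluKind kind, typename T>
T reluReference(T x) {
  const T clipped = std::max(T(0), x);  // f(x) = max(0, x)
  return kind == ReluKind::RELU6
             ? std::min(clipped, T(6))  // f(x) = min(max(0, x), 6)
             : clipped;
}
```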

- `template<typename T>`
  `INLINE qVar_t<T> leakyRelu(const qVar_t<T> &qData, T alpha = 0.1)`

Performs the leaky ReLU function on the input, which is a single tile: f(x) = x > 0 ? x : x * alpha.

Parameters

`qData`

: A single tile of data

`alpha`

: The scale term for the negative data

Template Parameters

`T`

: The type of the data
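A scalar version of the leaky ReLU formula, with the same default `alpha` as the signature above, is sketched here. `leakyReluReference` is a hypothetical name and works on plain values rather than tiles.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar reference; alpha defaults to 0.1 as in the signature.
template <typename T>
T leakyReluReference(T x, T alpha = T(0.1)) {
  return x > T(0) ? x : x * alpha;
}
```

Note that with a plain integer `T` the default `alpha` of 0.1 would truncate to 0, so in this sketch a floating-point (or, presumably, fixed-point) type is needed for the negative slope to have any effect.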

## LSTM

- `template<typename OcmInTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape, typename OcmOutputTensorShape, std::int32_t timeWindow = OcmOutputTensorShape::NUM_CHN>`
  `INLINE void lstmVanilla(OcmInTensorShape ocmIn, OcmKernelTensorShape ocmKernel, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias, OcmOutputTensorShape ocmOutput)`

Performs an LSTM block with standard activations. The forget, input, and output gates all use sigmoid as their activations, while the cell gate uses tanh as its activation. A final tanh activation is applied to the newly computed cell state, and the result is used in an elementwise multiply to generate the next hidden state vector.

NOTE: The LSTM implementation needs to be expanded to support more activations than it currently does.

1. The weight packing is assumed to follow the slicing order used in TensorFlow: output weights, input weights, cell weights, forget weights.
2. Due to the small-magnitude values that are computed and used recurrently, the compute is in FixedPoint.
3. Users can specify the representation, but it is recommended that at least 8 fractional bits are used to reduce error accumulation.

Parameters

`[in] ocmIn`

: The x_t activation tensor of length M

`[in] ocmKernel`

: The W ocm weights that interact with x_t of size 4*M by N

`[in/out] ocmCell`

: The Cell Tensor at c_(t-1) which gets updated each call to c_t of size N

`[in] ocmHidden`

: The Hidden Tensor at h_(t-1) which gets updated each call to h_t of size N

`[in] ocmRecurrentKernel`

: The U ocm weights that interact with h_(t-1) of size 4*N by N

`[in] ocmBias`

: The b ocm biases that are used in computing the pre-act gate values of size 4*N

`[out] ocmOutput`

: The outputs of each time step (h_t)

Template Parameters

`timeWindow`

: Number of time steps being processed in the LSTM call

`OcmInTensorShape`

: Shape is of <1, 1, 1, M>

`OcmKernelTensorShape`

: Shape is of <1, 4*M, 1, N> for dense

`OcmCellTensorShape`

: Shape is of <1, 1, 1, N>

`OcmHiddenTensorShape`

: Shape is of <1, 1, 1, N>

`OcmRecurrentKernelTensorShape`

: Shape is of <1, 4*N, 1, N> for dense

`OcmBiasTensorShape`

: Shape is of <1, 4, 1, N> for dense

`OcmOutputTensorShape`

: Shape is of <1, timeWindow, 1, N> for dense
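The gate arithmetic of a single time step can be written as a floating-point reference like the one below. This is a sketch under stated assumptions: `LstmState`, `sigmoidRef`, and `lstmStep` are hypothetical names, the weights are laid out as flattened `[4][N][M]`, `[4][N][N]`, and `[4][N]` arrays (an assumed layout, not the OCM shapes listed above), and the gate blocks follow the output, input, cell, forget order stated in the note.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical floating-point reference for one LSTM time step.
struct LstmState {
  std::vector<double> c;  // cell state c_(t-1), overwritten with c_t
  std::vector<double> h;  // hidden state h_(t-1), overwritten with h_t
};

inline double sigmoidRef(double v) { return 1.0 / (1.0 + std::exp(-v)); }

// W is [4][N][M], U is [4][N][N], b is [4][N], flattened row-major (assumed).
// Gate block order (assumed from the note): output, input, cell, forget.
void lstmStep(const std::vector<double>& x, const std::vector<double>& W,
              const std::vector<double>& U, const std::vector<double>& b,
              LstmState& s) {
  const std::size_t M = x.size();
  const std::size_t N = s.h.size();
  assert(s.c.size() == N && W.size() == 4 * N * M);
  assert(U.size() == 4 * N * N && b.size() == 4 * N);
  // Pre-activations for all gates are computed from h_(t-1) before any update.
  std::vector<double> pre(4 * N);
  for (std::size_t g = 0; g < 4; ++g) {
    for (std::size_t n = 0; n < N; ++n) {
      double acc = b[g * N + n];
      for (std::size_t m = 0; m < M; ++m) acc += W[(g * N + n) * M + m] * x[m];
      for (std::size_t k = 0; k < N; ++k) acc += U[(g * N + n) * N + k] * s.h[k];
      pre[g * N + n] = acc;
    }
  }
  for (std::size_t n = 0; n < N; ++n) {
    const double o = sigmoidRef(pre[0 * N + n]);
    const double i = sigmoidRef(pre[1 * N + n]);
    const double g = std::tanh(pre[2 * N + n]);
    const double f = sigmoidRef(pre[3 * N + n]);
    s.c[n] = f * s.c[n] + i * g;     // new cell state c_t
    s.h[n] = o * std::tanh(s.c[n]);  // next hidden state h_t
  }
}
```

Calling `lstmStep` once per time step and copying `s.h` out after each call would give the per-step outputs h_t.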

- `template<typename OcmInTensorShape, typename DdrKernelTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape, typename OcmOutputTensorShape, std::int32_t timeWindow = OcmOutputTensorShape::NUM_CHN>`
  `INLINE void lstmLargeInput(OcmInTensorShape ocmIn, DdrKernelTensorShape ddrKernel, OcmKernelTensorShape ocmKernelBuffer0, OcmKernelTensorShape ocmKernelBuffer1, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias, OcmOutputTensorShape ocmOutput)`

Performs an LSTM block with standard activations for a large input length (> 1024 elements) and a small hidden length (<= 1024). The forget, input, and output gates all use sigmoid as their activations, while the cell gate uses tanh as its activation. A final tanh activation is applied to the newly computed cell state, and the result is used in an elementwise multiply to generate the next hidden state vector.

NOTE: The LSTM implementation needs to be expanded to support more activations than it currently does.

1. The weight packing is assumed to follow the slicing order used in TensorFlow: output weights, input weights, cell weights, forget weights.
2. Due to the small-magnitude values that are computed and used recurrently, the compute is in FixedPoint.
3. Users can specify the representation, but it is recommended that at least 8 fractional bits are used to reduce error accumulation.
4. Multiple OCM buffers are passed in to ensure double buffering in the kernel.

Parameters

`[in] ocmIn`

: The x_t activation tensor of length M

`[in] ddrKernel`

: The W DDR weights that interact with the input, of size 4*M by N

`[in] ocmKernelBuffer0`

: The W ocm weights buffer 0 that interact with the input. Flown in by Ddr by N

`[in] ocmKernelBuffer1`

: The W ocm weights buffer 1 that interact with the input. Flown in by Ddr by N

`[in/out] ocmCell`

: The Cell Tensor at c_(t-1) which gets updated each call to c_t of size N

`[in] ocmHidden`

: The Hidden Tensor at h_(t-1) which gets updated each call to h_t of size N

`[in] ocmRecurrentKernel`

: The U ocm weights that interact with h_(t-1) of size 4*N by N

`[in] ocmBias`

: The b ocm biases that are used in computing the pre-act gate values of size 4*N

`[out] ocmOutput`

: The outputs of each time step (h_t)

Template Parameters

`timeWindow`

: Number of time steps being processed in the LSTM call

`OcmInTensorShape`

: Shape is of <1, 1, 1, M>

`DdrKernelTensorShape`

: Shape is of <1, factor(DdrKernelTensorShape::NUM_CHN), 1, N> for dense

`OcmKernelTensorShape`

: Shape is of <1, factor of M, 1, N> for dense

`OcmCellTensorShape`

: Shape is of <1, 1, 1, N>

`OcmHiddenTensorShape`

: Shape is of <1, t+1, 1, N>

`OcmRecurrentKernelTensorShape`

: Shape is of <1, 4*N, 1, N> for dense

`OcmBiasTensorShape`

: Shape is of <1, 4, 1, N> for dense

`OcmOutputTensorShape`

: Shape is of <1, timeWindow, 1, N> for dense

- `template<typename OcmInTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape>`
  `INLINE void lstmClipping(OcmInTensorShape ocmIn, OcmKernelTensorShape ocmKernel, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias)`

Performs an LSTM block with clipping activations for a single time series. The forget, input, and output gates all use [0, 1] clipping as their activations, while the cell gate uses [-1, 1] clipping as its activation. A final [-1, 1] clipping activation is applied to the newly computed cell state, and the result is used in an elementwise multiply to generate the next hidden state vector.

NOTE: The LSTM implementation needs to be expanded to support more activations than it currently does.

1. The weight packing is assumed to follow the slicing order used in TensorFlow: output weights, input weights, cell weights, forget weights.
2. Due to the small-magnitude values that are computed and used recurrently, the compute is in FixedPoint.
3. Users can specify the representation, but it is recommended that at least 8 fractional bits are used to reduce error accumulation.

Parameters

`[in] ocmIn`

: The x_t activation tensor of length M

`[in] ocmKernel`

: The W ocm weights that interact with x_t of size 4*M by N

`[in/out] ocmCell`

: The Cell Tensor at c_(t-1) which gets updated each call to c_t of size N

`[in] ocmHidden`

: The Hidden Tensor at h_(t-1) which gets updated each call to h_t of size N

`[in] ocmRecurrentKernel`

: The U ocm weights that interact with h_(t-1) of size 4*N by N

`[in] ocmBias`

: The b ocm biases that are used in computing the pre-act gate values of size 4*N

Template Parameters

`OcmInTensorShape`

: Shape is of <1, 1, 1, M>

`OcmKernelTensorShape`

: Shape is of <1, 4*M, 1, N> for dense

`OcmCellTensorShape`

: Shape is of <1, 1, 1, N>

`OcmHiddenTensorShape`

: Shape is of <1, 1, 1, N>

`OcmRecurrentKernelTensorShape`

: Shape is of <1, 4*N, 1, N> for dense

`OcmBiasTensorShape`

: Shape is of <1, 4, 1, N> for dense
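The clipping activations used by `lstmClipping` can be illustrated per element as below. The helpers `clipUnit`, `clipSigned`, `CellUpdate`, and `clippedCellUpdate` are hypothetical, work in double precision, and take already-computed pre-activation gate values; they only show how clipping replaces sigmoid and tanh in the cell and hidden updates.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical scalar references for the clipping activations.
inline double clipUnit(double v) {    // [0, 1] clipping, used for the gates
  return std::min(1.0, std::max(0.0, v));
}
inline double clipSigned(double v) {  // [-1, 1] clipping, used for the cell
  return std::min(1.0, std::max(-1.0, v));
}

struct CellUpdate {
  double c;  // new cell value c_t
  double h;  // new hidden value h_t
};

// One element of the update, given pre-activation values for the output,
// input, cell, and forget gates plus the previous cell value.
inline CellUpdate clippedCellUpdate(double preO, double preI, double preG,
                                    double preF, double cPrev) {
  const double c = clipUnit(preF) * cPrev + clipUnit(preI) * clipSigned(preG);
  return CellUpdate{c, clipUnit(preO) * clipSigned(c)};
}
```

Clipping avoids evaluating exponentials entirely, at the cost of activations that are only piecewise linear.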