# Neural Network Blocks¶

## Fully Connected Layers¶

- template<typename T, std::int32_t numCols>

INLINE typename std::enable_if<!std::is_integral<T>::value, qVar_t<T>>::type fullyConnectedTileBlock(container::NDArray<qVar_t<T>, numCols> &qMatrix)¶Computes a Fully Connected Layer in the provided Fx16 precision.

- Parameters

qMatrix– The activations of the input- Template Parameters

T– The type of the input values

numCols– The width of the weight matrix in the FC Layer- Returns
qOutput A single tile output

- template<typename T, std::int32_t numCols, std::uint32_t shiftAmount = 0>

INLINE typename std::enable_if<std::is_integral<T>::value, qVar_t<T>>::type fullyConnectedTileBlock(container::NDArray<qVar_t<std::int8_t>, numCols> &qMatrix)¶Computes a Fully Connected Layer in int8 precision.

- Parameters

qMatrix– The activations of the input- Template Parameters

T– The type of the input values

numCols– The width of the weight matrix in the FC Layer

shiftAmount– The amount to shift the output of the FC by- Returns
qOutput A single tile output

- template<typename T, std::int32_t numCols>

INLINE typename std::enable_if<!std::is_integral<T>::value && numCols != 0, qVar_t<T>>::type fullyConnectedTileBlock(qVar_t<T> qMatrix[])¶Computes a Fully Connected Layer in the provided Fx16 precision.

- Parameters

qMatrix– The activations of the input- Template Parameters

T– The type of the input values

numCols– The width of the weight matrix in the FC Layer- Returns
qOutput A single tile output

- template<typename T, std::int32_t numCols, std::uint32_t shiftAmount = 0>

INLINE typename std::enable_if<std::is_integral<T>::value && numCols != 0, qVar_t<T>>::type fullyConnectedTileBlock(qVar_t<std::int8_t> qMatrix[])¶Computes a Fully Connected Layer in int8 precision.

- Parameters

qMatrix– The activations of the input- Template Parameters

T– The type of the input values

numCols– The width of the weight matrix in the FC Layer

shiftAmount– The amount to shift the output of the FC by- Returns
qOutput A single tile output

## Upsampling¶

- template<typename T, IfNotIntegerTy<T> = 0>

INLINE qVar_t<T> interpolationTileBlock(qVar_t<T> qData)¶Computes upsampling (nearest neighbors or bilinear) in the provided Fx16 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values- Returns
qOutput A single tile output

- template<FracRepType shiftAmount = 0, typename T, IfIntegerTy<T> = 0>

INLINE qVar_t<T> interpolationTileBlock(qVar_t<std::int8_t> qData)¶Computes upsampling (nearest neighbors or bilinear) in int8 precision.

- Parameters

qData– The activations of the input- Template Parameters

shiftAmount– The amount to shift the output of the interpolation by

T– The type of the input values- Returns
qOutput A single tile output
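For the nearest-neighbors case, upsampling simply replicates each input element into a block of output elements. A self-contained sketch for a 2x scale on a row-major H x W grid (upsampleNearest2x is an illustrative helper, not the library function):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2x nearest-neighbor upsampling: every input element is
// replicated into a 2x2 patch of the output.
std::vector<int> upsampleNearest2x(const std::vector<int> &in,
                                   std::size_t h, std::size_t w) {
    std::vector<int> out(4 * h * w);
    for (std::size_t y = 0; y < 2 * h; ++y)
        for (std::size_t x = 0; x < 2 * w; ++x)
            out[y * 2 * w + x] = in[(y / 2) * w + (x / 2)];  // map back to source
    return out;
}
```

Bilinear upsampling would instead blend the four nearest source elements; the tile-block versions above also fold in the fixed-point shift for int8.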

## Convolutions¶

- template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, bool useVaping = false>

INLINE qVar_t<T> convTileBlockFx16(container::NDArray<qVar_t<T>, numInCh> &qData)¶Computes convolutions in the provided Fx16 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

numInCh– The number of input channels in the convolution

filterSize– The filter size of the convolution

useVaping– Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy- Returns
qOutput A single output channel

- template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, std::uint32_t reductionFactor = 2>

INLINE qVar_t<T> convTileBlockInt8AndReduce(container::NDArray<qVar_t<std::int8_t>, numInCh> &qData)¶Computes convolutions in int8 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

numInCh– The number of input channels in the convolution

filterSize– The filter size of the convolution

shiftAmount– The amount to shift the output of the convolution by

reductionFactor– The number of partitions in X and Y to reduce the output of the conv over- Returns
qOutput A single output channel
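An int8 convolution output accumulates products over every input channel and filter tap, then shifts to requantize. A minimal per-pixel sketch (convOutputPixel and its runtime-sized containers are illustrative assumptions, not the tile-block API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of one int8 convolution output pixel: patch[c][k] and
// weights[c][k] hold the filterSize*filterSize taps of input channel c.
// Products are accumulated in int32, then right-shifted by shiftAmount.
std::int32_t convOutputPixel(const std::vector<std::vector<std::int8_t>> &patch,
                             const std::vector<std::vector<std::int8_t>> &weights,
                             std::uint32_t shiftAmount) {
    std::int32_t acc = 0;
    for (std::size_t c = 0; c < patch.size(); ++c)
        for (std::size_t k = 0; k < patch[c].size(); ++k)
            acc += static_cast<std::int32_t>(patch[c][k]) *
                   static_cast<std::int32_t>(weights[c][k]);
    return acc >> shiftAmount;  // rescale the wide accumulator
}
```

The reduce variant above additionally sums this result over reductionFactor partitions in X and Y before writing out.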

- template<typename T, std::uint32_t filterSize = 3, bool useVaping = false>

INLINE qVar_t<T> depthwiseConvTileBlockFx16(qVar_t<T> qData)¶Computes Depthwise convolutions in the provided Fx16 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

filterSize– The filter size of the convolution

useVaping– Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy- Returns
qOutput A single output channel

- template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, bool useVaping = false, bool sameWeights = false>

INLINE qVar_t<T> convStrideOf2x2TileBlockInt8(container::NDArray<qVar_t<std::int8_t>, numInCh * filterSize * filterSize> &qData)¶Computes convolutions in int8 precision with a height, width stride of 2, 2.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

numInCh– The number of input channels in the convolution

filterSize– The filter size of the convolution

shiftAmount– The amount to shift the output of the convolution by

useVaping– Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy- Returns
qOutput A single output channel

- template<typename T, std::uint32_t filterSize = 3, FracRepType shiftAmount = 0, bool useVaping = false>

INLINE qVar_t<T> depthwiseConvTileBlockInt8(qVar_t<std::int8_t> qData)¶Computes Depthwise convolutions in int8 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

filterSize– The filter size of the convolution

shiftAmount– The amount to shift the output of the convolution by

useVaping– Whether or not the convolution will implement a Virutal Array Partitioning (VAP) strategy- Returns
qOutput A single output channel

- template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, bool useVaping = false>

INLINE qVar_t<T> convTileBlockFx16(qVar_t<T> qData[])¶Computes convolutions in the provided Fx16 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

numInCh– The number of input channels in the convolution

filterSize– The filter size of the convolution

useVaping– Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy- Returns
qOutput A single output channel

- template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, std::uint32_t reductionFactor = 2>

INLINE qVar_t<T> convTileBlockInt8AndReduce(qVar_t<std::int8_t> qData[])¶Computes convolutions in int8 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

numInCh– The number of input channels in the convolution

filterSize– The filter size of the convolution

shiftAmount– The amount to shift the output of the convolution by

reductionFactor– The number of partitions in X and Y to reduce the output of the conv over- Returns
qOutput A single output channel

- template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, bool useVaping = false, bool sameWeights = false>

INLINE qVar_t<T> convStrideOf2x2TileBlockInt8(qVar_t<std::int8_t> qData[])¶Computes convolutions in int8 precision with a height, width stride of 2, 2.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

numInCh– The number of input channels in the convolution

filterSize– The filter size of the convolution

shiftAmount– The amount to shift the output of the convolution by

useVaping– Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy- Returns
qOutput A single output channel

- template<typename T, std::uint32_t groupSize = 4, std::uint32_t filterSize = 3, std::uint32_t stride = 1, FracRepType shiftAmount = 0, bool useVaping = false, bool sameWeights = false>

INLINE qVar_t<T> groupwiseConvTileBlockInt8(qVar_t<std::int8_t> qData[], std::uint32_t outputIteration = 0)¶Computes Groupwise convolutions in int8 precision.

- Parameters

qData– The activations of the input- Template Parameters

T– The type of the input values

groupSize– The number of channels in a group

filterSize– The filter size of the convolution

stride– The stride of the convolution

shiftAmount– The amount to shift the output of the convolution by

useVaping– Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy- Returns
qOutput A single output channel

## Pooling¶

- template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t numRowFlowsPerInputWidth = roundUpToNearestMultiple(OcmInTensorShape::NUM_COLS, core_array::numArrayCores) / core_array::numArrayCores, std::int32_t numRowFlowsPerOutputWidth = roundUpToNearestMultiple(OcmOutTensorShape::NUM_COLS, core_array::numArrayCores) / core_array::numArrayCores>

numOutputTiles¶Performs a reshape from the input shape to the output shape. Currently, we support 3 shape layouts: (C, H, W), (C, 1, H*W), and (1, 1, C*H*W). The implementation:

Uses a Row Iterator to flow in data:

Computes the input x, y, z locations for each core

Computes the input location in contiguous memory

Computes the output x, y, z for each core

Using the pitched width of the output tensor, computes the output location.

Uses rau::store to write out the valid data values.

The following reshapes are supported and tested:

- 2D (C, 1, H * W) -> 3D (C, H, W)
- 3D (C, H, W) -> 2D (C, 1, H * W)
- 1D (1, 1, C * H * W) -> 3D (C, H, W)
- 3D (C, H, W) -> 1D (1, C, H, W)

- Parameters

ocmIn–[in]The Ocm Input

ocmOut– The Ocm output

- Template Parameters

OcmInTensorShape– The shape of the input ocm tensor

OcmOutTensorShape– The shape of the output ocm tensor

- template<typename T, std::int32_t filterSize, bool useCentered = ((core_array::coreDim % filterSize) > 0), std::enable_if_t<((filterSize == 3) || (filterSize == 5) || (filterSize == 7)) && useCentered, std::int32_t> = 0>

INLINE qVar_t<T> calculateSum(qVar_t<T> qData)¶Performs centered summing on data via rotations.

Performs summing on a tile of data. This first finds the sum column-wise, then row-wise, computing the sum of each 2x2 area. Repeating this process again produces the sum of a 3x3 area, and so on, based on the given filter size.

- Parameters

qData–[in]A qVar_t- Template Parameters

T– The type of the data in the qVar_t

filterSize– The size of the filter (e.g. filterSize = 2 for a 2x2 filter)
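The column-then-row summing described above can be sketched on an ordinary 2D grid, with wrap-around indexing standing in for the core-array rotations (sum2x2 is an illustrative helper, not the library routine):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of rotate-and-accumulate summing: one column-wise
// pass then one row-wise pass leaves each cell holding the sum of a 2x2
// window (wrap-around models the rotation).
std::vector<std::vector<int>> sum2x2(const std::vector<std::vector<int>> &g) {
    std::size_t h = g.size(), w = g[0].size();
    std::vector<std::vector<int>> colSum = g, out = g;
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            colSum[y][x] = g[y][x] + g[(y + 1) % h][x];        // rotate down, add
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            out[y][x] = colSum[y][x] + colSum[y][(x + 1) % w]; // rotate right, add
    return out;
}
```

Repeating both passes widens the window, which is how the larger filter sizes are built up.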

- template<typename T, std::int32_t filterSize, bool useCentered = ((core_array::coreDim % filterSize) > 0)>

INLINE qVar_t<T> calculateAvg(qVar_t<T> qData)¶Performs avgpooling on data.

- Parameters

qData–[in]A qVar_t- Template Parameters

T– The type of the data in the qVar_t

filterSize– The size of the filter (e.g. filterSize = 2 for a 2x2 filter)

- template<typename T, std::int32_t filterSize, bool useCentered = ((core_array::coreDim % filterSize) > 0), std::enable_if_t<((filterSize == 3) || (filterSize == 5) || (filterSize == 7)) && useCentered, std::int32_t> = 0>

INLINE qVar_t<T> calculateMax(qVar_t<T> qData)¶Performs centered maxpooling on data via rotations.

Performs maxpooling on a tile of data. This first finds the max column-wise, then row-wise, computing the max of each 2x2 area. Repeating this process again produces the max of a 3x3 area, and so on, based on the given filter size.

- Parameters

qData–[in]A qVar_t- Template Parameters

T– The type of the data in the qVar_t

filterSize– The size of the filter (e.g. filterSize = 2 for a 2x2 filter)

- template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t strideSize, typename T>

INLINE void poolRau(qVar_t<T> qData[], OcmOutTensorShape &ocmOut)¶Pools an array of qVar_t’s into a tensor, compressing by a factor of strideSize. Uses Rau to write data to the output ocm tensor.

NOTE: In the future, when there is support for filters that are unaligned with a tile, we will need to write out elements at different positions on different tiles and will need to use the qMaxCond code that is currently commented out.

- Parameters

qData–[in]An array of qVar_t’s

ocmOut– The ocm out- Template Parameters

OcmInTensorShape– The shape of the input ocm tensor

OcmOutTensorShape– The shape of the output ocm tensor

strideSize– The stride of the pooling window (e.g. strideSize = 2 for a 2x2 filter)

T– The type of the data in the qVar_t’s

useOutputDims– Whether to use the input tensor shape or the output tensor shape for the kernel

- template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t repeatFactor = OcmOutTensorShape::NUM_COLS, isOcmTensor<OcmInTensorShape> = 0, isOcmTensor<OcmOutTensorShape> = 0>

INLINE void repeatAndAppend(OcmInTensorShape &ocmIn, OcmOutTensorShape &ocmOut)¶This function repeats and appends a linear OCM tensor by a specified factor. The tensor itself will come in a linearized format and then be replicated for the given factor.

For example, a tensor of dimensions <1, 16, 1, 1> with a factor of 128 is repeated and appended to a tensor of dimensions <1, 16, 1, 128>.

This function works for repeat factors <= 2048

- Parameters

ocmIn– The ocm in

ocmOut– The ocm out- Template Parameters

OcmInTensorShape– The shape of the ocm in

OcmOutTensorShape– The shape of the ocm out

repeatFactor– The factor by which the tensor is repeated

- template<typename OcmInTensorShape, typename OcmOutTensorShape, typename T, isOcmTensor<OcmInTensorShape> = 0, isOcmTensor<OcmOutTensorShape> = 0>

INLINE void globalMaxPool(OcmInTensorShape &ocmIn, OcmOutTensorShape &ocmOut)¶This function performs a global max pool on the input OCM tensor and stores the result in the output OCM tensor. A global max pool compresses a tensor by mapping each input channel to a single element that is the max of the values in that channel. The output is a row vector of sequential data.

For example, a tensor of dimensions <1, 512, 8, 8> would be compressed to a tensor of dimensions <1, 1, 1, 512>.

When the data is flown in, each input channel maps to a single qVar_t that is the element-wise maximum across the tiles composing that input channel, so that the array of qVar_t’s can easily be passed to calculateTileMaximum without having to do complicated indexing.

When transferring the data after calculateTileMaximum to qOutput[], there is an outCond that determines the manner in which qOutput[] is populated. qOutput[] is populated sequentially with the max for each input channel. Since each tile in qData[] contains the max for that corresponding input channel, we can copy the 0th element of qData[0], the 1st element of qData[1], and so on into qOutput[0]. When we get to the 65th input channel, we copy the 0th element of qData[64] to qOutput[1]. For the 66th channel, we copy the 1st element of qData[65] to qOutput[1]. And so on for all of the input channels.

This function works for inputs with dimensions:

height, width >= 1

height * width <= 128 * core_array::coreDim * core_array::coreDim

1 <= num_in_ch <= 120

In summary, this works for up to 120 channels and up to 128 tiles per channel. Tiles may be unaligned.

- Parameters

ocmIn– The ocm in

ocmOut– The ocm out- Template Parameters

OcmInTensorShape– The shape of the ocm in

OcmOutTensorShape– The shape of the ocm out

T– The FixedPoint Library Type
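The channel-to-scalar reduction described above is easy to state in isolation: each channel of H*W values collapses to its single maximum, and the maxima form one row vector of length C. A sketch under that assumption (globalMaxPoolRef is an illustrative helper operating on plain vectors, not OCM tensors):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical reference for a global max pool: channels[c] holds the
// H*W values of input channel c; the output is one max per channel.
std::vector<int> globalMaxPoolRef(const std::vector<std::vector<int>> &channels) {
    std::vector<int> out;
    out.reserve(channels.size());
    for (const auto &ch : channels)
        out.push_back(*std::max_element(ch.begin(), ch.end()));
    return out;
}
```

The library version does the same reduction tile-by-tile during the flow, then packs the per-channel maxima sequentially into qOutput[] as described.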

- template<typename OcmTensorShape, typename T, bool channelwiseSoftmax = (OcmTensorShape::NUM_CHN == 1), isOcmTensor<OcmTensorShape> = 0, std::enable_if_t<channelwiseSoftmax, std::int32_t> = 0>

INLINE void softmax(qVar_t<T> qInOutBuffer[])¶This function performs a softmax, or normalized exponential function, on the input data buffer and stores the result in the output data buffer. A softmax normalizes the input data to a probability distribution proportional to the exponentials of the inputs. In other words, softmax takes the exponential function of each input element and then normalizes by dividing each of the elements by the sum of all the exponentials. The output has the same shape as the input but is bound to the interval (0, 1).

Additionally, there is an iteration that goes through the scores and takes the maximum value for each channel. This maximum is subtracted from each score, giving a new range of (-inf, 0] so the output of exp is bound to the interval (0, 1]. This is done to ensure that the softmax has a bounded sum in (0, Height*Width], which is numerically stable while still ensuring that the maximum/largest scores are well represented in fixed-point.

The equation for softmax is:

softmax(x_i) = exp(x_i) / summation(exp(x_j) for all j)

The numerically stable softmax is: softmax(x_i) = exp(x_i - max(x)) / summation(exp(x_j - max(x)) for all j)

Before performing the exponential function, a check must be made to ensure that the exponential function is only applied to the valid data. If the exponential function is applied to invalid data, which is set to 0, it will produce exp(0) = 1. Due to the way that calculateTileAverage works, these unwanted 1s in the padding and border will be factored into the sum, making the sum larger than intended, and the final results smaller than intended.

In the future, we hope to have a way to check the valid bit during the flow. Then, we can compute the exponential function on the data coming from dataPort and store this result directly into qData without having to perform the math to calculate which elements are valid by hand.

When the exponential function is applied to the input data, each input channel is mapped to a single qVar_t in qChannelSum that is the element-wise sum of each of exponentials of the tiles composing that input channel so that the array of qVar_t’s can easily be passed to calculateTileAverage without having to do complicated indexing.

To compute the sum, calculateTileAverage is called with a denominator of 1, which produces the same effect as a calculateSum function would.

This function works for input tensors that are large and non-aligned.

- Parameters

qInOutBuffer– The data buffer- Template Parameters

OcmTensorShape– The shape of the ocm in

T– The type of the input values

channelwiseSoftmax– If Softmax performs channelwise reduction or not
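The max-subtraction trick described above can be captured in a few lines of floating-point reference code (stableSoftmax is an illustrative helper; the library computes this in fixed-point on tiles):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical reference for the numerically stable softmax:
// softmax(x_i) = exp(x_i - max(x)) / sum_j exp(x_j - max(x)).
// Subtracting the maximum keeps every exp() result in (0, 1].
std::vector<double> stableSoftmax(const std::vector<double> &x) {
    double m = *std::max_element(x.begin(), x.end());
    std::vector<double> out(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - m);  // bounded, no overflow
        sum += out[i];
    }
    for (double &v : out) v /= sum;   // normalize to a distribution
    return out;
}
```

Without the subtraction, large scores would overflow exp(); with it, the same large scores are the best-represented values in the fixed-point range.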

- template<typename OcmTensorShape, typename T, bool channelwiseSoftmax = (OcmTensorShape::NUM_CHN == 1), isOcmTensor<OcmTensorShape> = 0>

INLINE void softmax(OcmTensorShape &ocmIn, OcmTensorShape &ocmOut)¶This function performs a softmax, or normalized exponential function, on the input OCM tensor and stores the result in the output OCM tensor. This function works for input tensors that are large and non-aligned.

- Parameters

ocmIn– The ocm in

ocmOut– The ocm out- Template Parameters

OcmTensorShape– The shape of the ocm in

T– The type of the input values

channelwiseSoftmax– If Softmax performs channelwise reduction or not

## Activation Functions¶

- template<FracRepType numFracBits, typename T>

INLINE qVar_t<FixedPoint<T, numFracBits>> sigmoid(qVar_t<FixedPoint<T, numFracBits>> qData)¶This function performs the sigmoid function on the input, which is a single tile: f(x) = 1/(1 + exp(-x)).

- Parameters

qData– A single tile of data- Template Parameters

numFracBits– The number of fractional bits in the FixedPoint representation

T– The type of the data

- template<FracRepType numFracBits, typename T>

INLINE qVar_t<FixedPoint<T, numFracBits>> tanh(qVar_t<FixedPoint<T, numFracBits>> qData)¶This function performs the tanh function on the input, which is a single tile: f(x) = (exp(2*x) - 1)/(exp(2*x) + 1).

- Parameters

qData– A single tile of data- Template Parameters

numFracBits– The number of fractional bits in the FixedPoint representation

T– The type of the data

- template<ReluMethod reluMethod, typename T, typename std::enable_if_t<reluMethod == ReluMethod::REGULAR, std::int32_t> = 0>

INLINE qVar_t<T> relu(qVar_t<T> qData)¶This function performs the vanilla ReLU function on the input, which is a single tile: f(x) = max(0, x).

The ReLU6 overload instead computes f(x) = min(max(0, x), 6).

- Parameters

qData– A single tile of data- Template Parameters

reluMethod– The Relu Method chosen

T– The type of the data

- template<typename T>

INLINE qVar_t<T> leakyRelu(const qVar_t<T> &qData, T alpha = 0.1)¶This function performs the leaky ReLU function on the input, which is a single tile: f(x) = x > 0 ? x : x * alpha.

- Parameters

qData– A single tile of data

alpha– The scale term for the negative data.- Template Parameters

T– The type of the data
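The three activations above have simple scalar definitions; these reference versions (reluRef, relu6Ref, leakyReluRef are illustrative names) show the math the tile-wide library functions apply element-wise:

```cpp
#include <algorithm>

// Scalar references for the ReLU family; the library versions operate on
// a whole qVar_t tile at once.
double reluRef(double x) { return std::max(0.0, x); }                 // f(x) = max(0, x)
double relu6Ref(double x) { return std::min(std::max(0.0, x), 6.0); } // f(x) = min(max(0, x), 6)
double leakyReluRef(double x, double alpha = 0.1) {                   // f(x) = x > 0 ? x : x * alpha
    return x > 0 ? x : x * alpha;
}
```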

## LSTM¶

- template<typename OcmInTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape, typename OcmOutputTensorShape, std::int32_t timeWindow = OcmOutputTensorShape::NUM_CHN>

INLINE void lstmVanilla(OcmInTensorShape ocmIn, OcmKernelTensorShape ocmKernel, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias, OcmOutputTensorShape ocmOutput)¶Performs LSTM Block with Standard Activations. The forget, input, and output gates all use sigmoid as their activations while the cell gate uses tanh as its activation. A final tanh activation is applied on the newly computed cell state, which is then used in an elementwise multiply to generate the next hidden state vector.

NOTE: LSTM implementation needs to be expanded for more activations than what is currently done. 1. The Weight packing is assumed to follow the slicing order of what is done in Tensorflow: output weights, input weights, cell weights, forget weights 2. Due to the small magnitude values computed/recurrently used, the compute is in FixedPoint. 3. Users can specify the representation but it is recommended that at least 8 fractional bits are used to reduce error accumulation.

- Parameters

ocmIn–[in]The x_t activation tensor of length M

ocmKernel–[in]The W ocm weights that interact with x_t of size 4*M by N

ocmCell–[in/out]The Cell Tensor at c_(t-1) which gets updated each call to c_t of size N

ocmHidden–[in]The Hidden Tensor at h_(t-1) which gets updated each call to h_t of size N

ocmRecurrentKernel–[in]The U ocm weights that interact with h_(t-1) of size 4*N by N

ocmBias–[in]The b ocm biases that are used in computing the pre-act gate values of size 4*N

ocmOutput–[out]The outputs of each time step (h_t)- Template Parameters

timeWindow– Number of time steps being processed in the LSTM call

OcmInTensorShape– Shape is of <1, 1, 1, M>

OcmKernelTensorShape– Shape is of <1, 4*M, 1, N> for dense

OcmCellTensorShape– Shape is of <1, 1, 1, N>

OcmHiddenTensorShape– Shape is of <1, 1, 1, N>

OcmRecurrentKernelTensorShape– Shape is of <1, 4*N, 1, N> for dense

OcmBiasTensorShape– Shape is of <1, 4, 1, N> for dense

OcmOutputTensorShape– Shape is of <1, timeWindow, 1, N> for dense
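The standard-activation step described above corresponds to the usual LSTM gate equations (restated here for reference; the library's exact compute order and fixed-point details are as documented, with $\odot$ the elementwise product):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $W$ is the kernel, $U$ the recurrent kernel, and $b$ the bias, packed in the Tensorflow slicing order noted above.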

- template<typename OcmInTensorShape, typename DdrKernelTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape, typename OcmOutputTensorShape, std::int32_t timeWindow = OcmOutputTensorShape::NUM_CHN>

INLINE void lstmLargeInput(OcmInTensorShape ocmIn, DdrKernelTensorShape ddrKernel, OcmKernelTensorShape ocmKernelBuffer0, OcmKernelTensorShape ocmKernelBuffer1, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias, OcmOutputTensorShape ocmOutput)¶Performs LSTM Block with Standard Activations for a large input length ( > 1024 elements) and small hidden length (<= 1024). The forget, input, and output gates all use sigmoid as their activations while the cell gate uses tanh as its activation. A final tanh activation is applied on the newly computed cell state, which is then used in an elementwise multiply to generate the next hidden state vector.

NOTE: LSTM implementation needs to be expanded for more activations than what is currently done. 1. The Weight packing is assumed to follow the slicing order of what is done in Tensorflow: output weights, input weights, cell weights, forget weights 2. Due to the small magnitude values computed/recurrently used, the compute is in FixedPoint. 3. Users can specify the representation but it is recommended that at least 8 fractional bits are used to reduce error accumulation. 4. Multiple Ocm Buffers are passed in to ensure double buffering in the kernel.

- Parameters

ocmIn–[in]The x_t activation tensor of length M

ddrKernel–[in]The W ddr weights that interact with the input. 4*M by N

ocmKernelBuffer0–[in]The W ocm weights buffer 0 that interact with the input. Flown in from Ddr

ocmKernelBuffer1–[in]The W ocm weights buffer 1 that interact with the input. Flown in from Ddr

ocmCell–[in/out]The Cell Tensor at c_(t-1) which gets updated each call to c_t of size N

ocmHidden–[in]The Hidden Tensor at h_(t-1) which gets updated each call to h_t of size N

ocmRecurrentKernel–[in]The U ocm weights that interact with h_(t-1) of size 4*N by N

ocmBias–[in]The b ocm biases that are used in computing the pre-act gate values of size 4*N

ocmOutput–[out]The outputs of each time step (h_t)- Template Parameters

timeWindow– Number of time steps being processed in the LSTM call

OcmInTensorShape– Shape is of <1, 1, 1, M>

DdrKernelTensorShape– Shape is of <1, factor(DdrKernelTensorShape::NUM_CHN), 1, N> for dense

OcmKernelTensorShape– Shape is of <1, factor of M, 1, N> for dense

OcmCellTensorShape– Shape is of <1, 1, 1, N>

OcmHiddenTensorShape– Shape is of <1, t+1, 1, N>

OcmRecurrentKernelTensorShape– Shape is of <1, 4*N, 1, N> for dense

OcmBiasTensorShape– Shape is of <1, 4, 1, N> for dense

OcmOutputTensorShape– Shape is of <1, timeWindow, 1, N> for dense

- template<typename OcmInTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape>

INLINE void lstmClipping(OcmInTensorShape ocmIn, OcmKernelTensorShape ocmKernel, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias)¶Performs LSTM Block with Clipping Activations for a single time series. The forget, input, and output gates all use [0, 1] clipping as their activations while the cell gate uses [-1, 1] clipping as its activation. A final [-1, 1] clipping activation is applied on the newly computed cell state, which is then used in an elementwise multiply to generate the next hidden state vector.

NOTE: LSTM implementation needs to be expanded for more activations than what is currently done. 1. The Weight packing is assumed to follow the slicing order of what is done in Tensorflow: output weights, input weights, cell weights, forget weights 2. Due to the small magnitude values computed/recurrently used, the compute is in FixedPoint. 3. Users can specify the representation but it is recommended that at least 8 fractional bits are used to reduce error accumulation.

- Parameters

ocmIn–[in]The x_t activation tensor of length M

ocmKernel–[in]The W ocm weights that interact with x_t of size 4*M by N

ocmCell–[in/out]The Cell Tensor at c_(t-1) which gets updated each call to c_t of size N

ocmHidden–[in]The Hidden Tensor at h_(t-1) which gets updated each call to h_t of size N

ocmRecurrentKernel–[in]The U ocm weights that interact with h_(t-1) of size 4*N by N

ocmBias–[in]The b ocm biases that are used in computing the pre-act gate values of size 4*N- Template Parameters

OcmInTensorShape– Shape is of <1, 1, 1, M>

OcmKernelTensorShape– Shape is of <1, 4*M, 1, N> for dense

OcmCellTensorShape– Shape is of <1, 1, 1, N>

OcmHiddenTensorShape– Shape is of <1, 1, 1, N>

OcmRecurrentKernelTensorShape– Shape is of <1, 4*N, 1, N> for dense

OcmBiasTensorShape– Shape is of <1, 4, 1, N> for dense