Neural Network Blocks

Fully Connected Layers

template<typename T, std::int32_t numCols>
INLINE typename std::enable_if<!std::is_integral<T>::value, qVar_t<T>>::type fullyConnectedTileBlock(qVar_t<T> qMatrix[])

Computes a Fully Connected Layer in the provided Fx16 precision.

Return

qOutput A single tile output

Parameters
  • qMatrix: The activations of the input

Template Parameters
  • T: The type of the input values

  • numCols: The width of the weight matrix in the FC Layer

template<typename T, std::int32_t numCols, std::uint32_t shiftAmount = 0>
INLINE typename std::enable_if<std::is_integral<T>::value, qVar_t<T>>::type fullyConnectedTileBlock(qVar_t<std::int8_t> qMatrix[])

Computes a Fully Connected Layer in int8 precision.

Return

qOutput A single tile output

Parameters
  • qMatrix: The activations of the input

Template Parameters
  • T: The type of the input values

  • numCols: The width of the weight matrix in the FC Layer

  • shiftAmount: The amount to shift the output of the FC layer by
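
A minimal usage sketch for the int8 overload. The constants, variable names, and the surrounding weight-loading flow are illustrative assumptions; only the fullyConnectedTileBlock signature is taken from the API above.

    constexpr std::int32_t numCols = 64;      // width of the weight matrix (assumed)
    constexpr std::uint32_t shiftAmount = 7;  // requantization shift (assumed)

    qVar_t<std::int8_t> qMatrix[numCols];     // weight tiles, filled by the
                                              // surrounding kernel (not shown)
    qVar_t<std::int32_t> qOutput =
        fullyConnectedTileBlock<std::int32_t, numCols, shiftAmount>(qMatrix);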

Upsampling

template<typename T, IfNotIntegerTy<T> = 0>
INLINE qVar_t<T> interpolationTileBlock(qVar_t<T> qData)

Computes upsampling (nearest neighbors or bilinear) in the provided Fx16 precision.

Return

qOutput A single tile output

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

template<FracRepType shiftAmount = 0, typename T, IfIntegerTy<T> = 0>
INLINE qVar_t<T> interpolationTileBlock(qVar_t<std::int8_t> qData)

Computes upsampling (nearest neighbors or bilinear) in int8 precision.

Return

qOutput A single tile output

Parameters
  • qData: The activations of the input

Template Parameters
  • shiftAmount: The amount to shift the output of the interpolation by

  • T: The type of the input values
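
A hypothetical usage sketch for the int8 overload; qIn would be filled by the surrounding data flow, and the shift value is illustrative.

    qVar_t<std::int8_t> qIn;   // one input tile, filled by the data flow
    qVar_t<std::int8_t> qUp =
        interpolationTileBlock</*shiftAmount=*/4, std::int8_t>(qIn);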

Convolutions

template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, bool useVaping = false>
INLINE qVar_t<T> convTileBlockFx16(qVar_t<T> qData[])

Computes convolutions in the provided Fx16 precision.

Return

qOutput A single output channel

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

  • numInCh: The number of input channels in the convolution

  • filterSize: The filter size of the convolution

  • useVaping: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, bool useVaping = false>
INLINE qVar_t<T> convTileBlockInt8(qVar_t<std::int8_t> qData[])

Computes convolutions in int8 precision.

Return

qOutput A single output channel

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

  • numInCh: The number of input channels in the convolution

  • filterSize: The filter size of the convolution

  • shiftAmount: The amount to shift the output of the convolution by

  • useVaping: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy
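
A hypothetical usage sketch: one output channel of a 3x3 int8 convolution over 32 input channels, requantized by a shift of 6. The constants are illustrative; qData holds one tile per input channel and would be filled by the surrounding flow (not shown).

    constexpr std::uint32_t numInCh = 32;
    qVar_t<std::int8_t> qData[numInCh];
    qVar_t<std::int32_t> qOutCh =
        convTileBlockInt8<std::int32_t, numInCh,
                          /*filterSize=*/3, /*shiftAmount=*/6>(qData);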

template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, std::uint32_t reductionFactor = 2>
INLINE qVar_t<T> convTileBlockInt8AndReduce(qVar_t<std::int8_t> qData[])

Computes convolutions in int8 precision.

Return

qOutput A single output channel

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

  • numInCh: The number of input channels in the convolution

  • filterSize: The filter size of the convolution

  • shiftAmount: The amount to shift the output of the convolution by

  • reductionFactor: The number of partitions in X and Y to reduce the output of the convolution over

template<typename T, std::uint32_t filterSize = 3, bool useVaping = false>
INLINE qVar_t<T> depthwiseConvTileBlockFx16(qVar_t<T> qData)

Computes Depthwise convolutions in the provided Fx16 precision.

Return

qOutput A single output channel

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

  • filterSize: The filter size of the convolution

  • useVaping: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

template<typename T, std::uint32_t numInCh, std::uint32_t filterSize = 1, std::uint32_t shiftAmount = 0, bool useVaping = false>
INLINE qVar_t<T> convStrideOf2x2TileBlockInt8(qVar_t<std::int8_t> qData[])

Computes convolutions in int8 precision with a stride of 2 in both height and width.

Return

qOutput A single output channel

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

  • numInCh: The number of input channels in the convolution

  • filterSize: The filter size of the convolution

  • shiftAmount: The amount to shift the output of the convolution by

  • useVaping: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

template<typename T, std::uint32_t filterSize = 3, FracRepType shiftAmount = 0, bool useVaping = false>
INLINE qVar_t<T> depthwiseConvTileBlockInt8(qVar_t<std::int8_t> qData)

Computes Depthwise convolutions in int8 precision.

Return

qOutput A single output channel

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

  • filterSize: The filter size of the convolution

  • shiftAmount: The amount to shift the output of the convolution by

  • useVaping: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

template<typename T, std::uint32_t groupSize = 4, std::uint32_t filterSize = 3, std::uint32_t stride = 1, FracRepType shiftAmount = 0, bool useVaping = false>
INLINE qVar_t<T> groupwiseConvTileBlockInt8(qVar_t<std::int8_t> qData[], std::uint32_t outputIteration = 0)

Computes Groupwise convolutions in int8 precision.

Return

qOutput A single output channel

Parameters
  • qData: The activations of the input

Template Parameters
  • T: The type of the input values

  • groupSize: The number of channels in a group

  • filterSize: The filter size of the convolution

  • stride: The stride of the convolution

  • shiftAmount: The amount to shift the output of the convolution by

  • useVaping: Whether or not the convolution will implement a Virtual Array Partitioning (VAP) strategy

Pooling

template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t numRowFlowsPerInputWidth = roundUpToNearestMultiple(OcmInTensorShape::NUM_COLS, Epu::numArrayCores) / Epu::numArrayCores, std::int32_t numRowFlowsPerOutputWidth = roundUpToNearestMultiple(OcmOutTensorShape::NUM_COLS, Epu::numArrayCores) / Epu::numArrayCores>
INLINE void reshape(OcmInTensorShape &ocmIn, OcmOutTensorShape &ocmOut)

Performs a reshape from the input shape to the output shape. Currently, three shape layouts are supported: (C, H, W), (C, 1, H*W), and (1, 1, C*H*W). The implementation:

  1. Uses a Row Iterator to flow in data.

  2. Computes the input x, y, z locations for each core.

  3. Computes the input location in contiguous memory.

  4. Computes the output x, y, z locations for each core.

  5. Using the pitched width of the output tensor, computes the output location.

  6. Uses Rau::store to write the valid data values.

The following reshapes are supported and tested:

  +---------------+
  |               |                          +-------+
  |               |                         +-----+  |
  |               |     +---------->        |     |  |
  |               |                         |     |  +
  +---------------+                         +-----+
  2D Tensor Shape                          3D Tensor Shape
  (C, 1, H * W)                            (C, H, W)


      +--------+                           +---------------+
               |                           |               |
    +------+   |                           |               |
    |      |   |        +---------->       |               |
    |      |   +                           |               |
    +------+                               +---------------+
  3D Tensor Shape                         2D Tensor Shape
  (C, H, W)                               (C, 1, H * W)


                                           +--------+
                                          +-------+ |
    +-------------+     +---------->      |       | |
    1D Tensor Shape                       |       | +
    (1, 1, C * H * W)                     +-------+
                                          3D Tensor Shape
                                          (C, H, W)

       +------------+
                    |
    +-----------+   |   +----------->    +-----------------+
    |           |   |                    1D Tensor Shape
    |           |   +                    (1, 1, C * H * W)
    +-----------+
    3D Tensor Shape
    (C, H, W)

Parameters
  • [in] ocmIn: The OCM input

  • [out] ocmOut: The OCM output

Template Parameters
  • OcmInTensorShape: The shape of the input ocm tensor

  • OcmOutTensorShape: The shape of the output ocm tensor
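
Steps 2 through 5 reduce to linearizing a (C, H, W) coordinate and re-deriving a coordinate under the output layout. A hypothetical host-side model of that index math (the kernel performs the equivalent per core, using pitched widths):

    #include <cstdint>

    struct Coord { std::int32_t z, y, x; };

    // Contiguous offset of (z, y, x) in a tensor with height h, width w.
    std::int32_t linearize(Coord c, std::int32_t h, std::int32_t w) {
        return (c.z * h + c.y) * w + c.x;
    }

    // Recover (z, y, x) from a contiguous offset under an (h, w) layout.
    Coord delinearize(std::int32_t idx, std::int32_t h, std::int32_t w) {
        return { idx / (h * w), (idx / w) % h, idx % w };
    }

    // Example: element (z=2, y=1, x=3) of a (4, 8, 8) tensor linearizes to
    // offset 139, which lands at (z=2, y=0, x=11) in a (4, 1, 64) layout.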

template<typename T, std::int32_t filterSize, bool useCentered = ((Epu::coreDim % filterSize) > 0), std::enable_if_t<((filterSize == 3) || (filterSize == 5) || (filterSize == 7)) && useCentered, std::int32_t> = 0>
INLINE qVar_t<T> calculateSum(qVar_t<T> qData)

Performs centered summing on data via rotations.

Performs summing on a tile of data. This first finds the sum column-wise, then row-wise, computing the sum of each 2x2 area. Repeating this process again produces the sum of a 3x3 area, and so on, based on the given filter size.

Parameters
  • [in] qData: A qVar_t

Template Parameters
  • filterSize: The size of the filter (e.g. filterSize = 2 for a 2x2 filter)

  • T: The type of the data in the qVar_t

template<typename T, std::int32_t filterSize, bool useCentered = ((Epu::coreDim % filterSize) > 0)>
INLINE qVar_t<T> calculateAvg(qVar_t<T> qData)

Performs avgpooling on data.

Parameters
  • [in] qData: A qVar_t

Template Parameters
  • T: The type of the data in the qVar_t

template<typename T, std::int32_t filterSize, bool useCentered = ((Epu::coreDim % filterSize) > 0), std::enable_if_t<((filterSize == 3) || (filterSize == 5) || (filterSize == 7)) && useCentered, std::int32_t> = 0>
INLINE qVar_t<T> calculateMax(qVar_t<T> qData)

Performs centered maxpooling on data via rotations.

Performs maxpooling on a tile of data. This first finds the max column-wise, then row-wise, computing the max of each 2x2 area. Repeating this process again produces the max of a 3x3 area, and so on, based on the given filter size.

Parameters
  • [in] qData: A qVar_t

Template Parameters
  • filterSize: The size of the filter (e.g. filterSize = 2 for a 2x2 filter)

  • T: The type of the data in the qVar_t
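
A hypothetical host-side model of one column-then-row pass; plain arrays stand in for the on-device rotate-and-accumulate.

    #include <algorithm>
    #include <array>

    // After one call, t[y][x] holds the max of the 2x2 window anchored at
    // (y, x); a second call extends that to a 3x3 window, and so on.
    template <std::size_t N>
    void maxPass(std::array<std::array<int, N>, N> &t) {
        for (std::size_t y = 0; y + 1 < N; ++y)      // combine row below
            for (std::size_t x = 0; x < N; ++x)
                t[y][x] = std::max(t[y][x], t[y + 1][x]);
        for (auto &row : t)                          // combine right neighbor
            for (std::size_t x = 0; x + 1 < N; ++x)
                row[x] = std::max(row[x], row[x + 1]);
    }

Replacing std::max with addition yields the summing variant used by calculateSum and calculateAvg.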

template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t strideSize, typename T>
INLINE void poolRau(qVar_t<T> qData[], OcmOutTensorShape &ocmOut)

Pools an array of qVar_t’s into a tensor, compressing by a factor of strideSize. Uses Rau to write data to the output ocm tensor.

NOTE: In the future, when there is support for filters that are unaligned with a tile, we will need to write out elements at different positions on different tiles and will need to use the qMaxCond code that is currently commented out.

Parameters
  • [in] qData: An array of qVar_t’s

  • ocmOut: The ocm out

Template Parameters
  • OcmInTensorShape: The shape of the input ocm tensor

  • OcmOutTensorShape: The shape of the output ocm tensor

  • strideSize: The stride of the pooling (e.g. strideSize = 2 compresses by a factor of 2 in each dimension)

  • T: The type of the data in the qVar_t’s

  • useOutputDims: Whether to use the input tensor shape or the output tensor shape for the kernel

template<typename OcmInTensorShape, typename OcmOutTensorShape, std::int32_t repeatFactor = OcmOutTensorShape::NUM_COLS, isOcmTensor<OcmInTensorShape> = 0, isOcmTensor<OcmOutTensorShape> = 0>
INLINE void repeatAndAppend(OcmInTensorShape &ocmIn, OcmOutTensorShape &ocmOut)

This function repeats and appends a linear OCM tensor by a specified factor: the tensor arrives in a linearized format and is replicated the given number of times.

For example, a tensor of dimensions <1, 16, 1, 1> with a factor of 128 is repeated and appended to a tensor of dimensions <1, 16, 1, 128>.

        This function works for repeat factors <= 2048

Parameters
  • ocmIn: The ocm in

  • ocmOut: The ocm out

Template Parameters
  • OcmInTensorShape: The shape of the ocm in

  • OcmOutTensorShape: The shape of the ocm out

  • repeatFactor: The factor by which the tensor is repeated
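
A hypothetical usage sketch mirroring the example above; OcmTensor is a stand-in name for the SDK's OCM tensor type and is an assumption here.

    OcmTensor<1, 16, 1, 1>   ocmIn;    // linear input tensor
    OcmTensor<1, 16, 1, 128> ocmOut;   // output: input repeated 128 times
    // repeatFactor defaults to OcmOutTensorShape::NUM_COLS (128 here)
    repeatAndAppend(ocmIn, ocmOut);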

template<typename OcmInTensorShape, typename OcmOutTensorShape, typename T, isOcmTensor<OcmInTensorShape> = 0, isOcmTensor<OcmOutTensorShape> = 0>
INLINE void globalMaxPool(OcmInTensorShape &ocmIn, OcmOutTensorShape &ocmOut)

This function performs a global max pool on the input OCM tensor and stores the result in the output OCM tensor. A global max pool compresses a tensor by mapping each input channel to a single element that is the max of the values in that channel. The output is a row vector of sequential data.

For example, a tensor of dimensions <1, 512, 8, 8> would be compressed to a tensor of dimensions <1, 1, 1, 512>.

When the data is flown in, each input channel maps to a single qVar_t that is the element-wise maximum of the tiles composing that input channel, so that the array of qVar_t’s can easily be passed to calculateTileMaximum without having to do complicated indexing.

When transferring the data after calculateTileMaximum to qOutput[], there is an outCond that determines the manner in which qOutput[] is populated. qOutput[] is populated sequentially with the max for each input channel. Since each tile in qData[] contains the max for that corresponding input channel, we can copy the 0th element of qData[0], the 1st element of qData[1], and so on into qOutput[0]. When we get to the 65th input channel, we copy the 0th element of qData[64] to qOutput[1]. For the 66th channel, we copy the 1st element of qData[65] to qOutput[1]. And so on for all of the input channels.

This function works for inputs with up to 120 channels and up to 128 tiles per channel. Tiles may be unaligned.

Parameters
  • ocmIn: The ocm in

  • ocmOut: The ocm out

Template Parameters
  • OcmInTensorShape: The shape of the ocm in

  • OcmOutTensorShape: The shape of the ocm out

  • T: The FixedPoint Library Type
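
The output packing described above reduces to simple index arithmetic. A hypothetical model, assuming 64 elements per qVar_t as the description implies:

    #include <cstdint>
    #include <utility>

    // Channel ch's max lives at element (ch % 64) of qData[ch] and is
    // written to element (ch % 64) of output tile qOutput[ch / 64];
    // on device this copy is a conditional store driven by outCond.
    std::pair<std::uint32_t, std::uint32_t> outputSlot(std::uint32_t ch) {
        return { ch / 64,    // index of the qOutput[] tile
                 ch % 64 };  // element position within that tile
    }
    // e.g. channel 64 (the 65th) maps to element 0 of qOutput[1].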

template<typename OcmTensorShape, typename T, bool channelwiseSoftmax = (OcmTensorShape::NUM_CHN == 1), isOcmTensor<OcmTensorShape> = 0, std::enable_if_t<channelwiseSoftmax, std::int32_t> = 0>
INLINE void softmax(OcmTensorShape &ocmIn, OcmTensorShape &ocmOut)

This function performs a softmax, or normalized exponential function, on the input OCM tensor and stores the result in the output OCM tensor. A softmax normalizes the input data to a probability distribution proportional to the exponentials of the inputs. In other words, softmax takes the exponential of each input element and then normalizes by dividing each element by the sum of all the exponentials. The output has the same shape as the input but is bound to the interval (0, 1).

Additionally, there is an iteration that goes through the scores and takes the maximum value for each channel. This maximum is subtracted from each score, giving a new range of (-inf, 0], so the output of exp is bound to the interval (0, 1]. This is done to ensure that the sum of exponentials is bounded (by at most Height*Width), which is numerically stable while still ensuring that the maximum/largest scores are well represented in fixed point.

The equation for softmax is:

softmax(x_i) = exp(x_i) / summation(exp(x_j) for all j)

The numerically stable softmax is: softmax(x_i) = exp(x_i - max(x)) / summation(exp(x_j - max(x)) for all j)

Before performing the exponential function, a check must be made to ensure that the exponential function is only applied to the valid data. If the exponential function is applied to invalid data, which is set to 0, it will produce exp(0) = 1. Due to the way that calculateTileAverage works, these unwanted 1s in the padding and border will be factored into the sum, making the sum larger than intended, and the final results smaller than intended.

In the future, we hope to have a way to check the valid bit during the flow. Then, we can compute the exponential function on the data coming from dataPort and store this result directly into qData without having to perform the math to calculate which elements are valid by hand.

When the exponential function is applied to the input data, each input channel is mapped to a single qVar_t in qChannelSum that is the element-wise sum of each of exponentials of the tiles composing that input channel so that the array of qVar_t’s can easily be passed to calculateTileAverage without having to do complicated indexing.

To compute the sum, calculateTileAverage is called with a denominator of 1, which produces the same effect as a calculateSum function would.

This function works for input tensors that are large and non-aligned.
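
A host-side float reference of the numerically stable form, for clarity (a sketch only; the kernel operates on fixed-point tiles):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Subtract the max before exponentiating so exp() is bounded to (0, 1].
    std::vector<float> softmaxRef(const std::vector<float> &x) {
        const float m = *std::max_element(x.begin(), x.end());
        std::vector<float> y(x.size());
        float sum = 0.0f;
        for (std::size_t i = 0; i < x.size(); ++i) {
            y[i] = std::exp(x[i] - m);  // each term in (0, 1]
            sum += y[i];
        }
        for (auto &v : y) v /= sum;     // normalize to a distribution
        return y;
    }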

Parameters
  • ocmIn: The ocm in

  • ocmOut: The ocm out

Template Parameters
  • OcmTensorShape: The shape of the ocm in

  • T: The type of the input values

  • channelwiseSoftmax: If Softmax performs channelwise reduction or not

Activation Functions

template<FracRepType numFracBits, typename T>
INLINE qVar_t<FixedPoint<T, numFracBits>> sigmoid(qVar_t<FixedPoint<T, numFracBits>> qData)

This function performs the sigmoid function on the input, which is a single tile: f(x) = 1/(1 + exp(-x)).

Parameters
  • qData: A single tile of data

Template Parameters
  • numFracBits: The number of fractional bits in the FixedPoint representation

  • T: The type of the data

template<FracRepType numFracBits, typename T>
INLINE qVar_t<FixedPoint<T, numFracBits>> tanh(qVar_t<FixedPoint<T, numFracBits>> qData)

This function performs the tanh function on the input, which is a single tile: f(x) = (exp(2*x) - 1)/(exp(2*x) + 1).

Parameters
  • qData: A single tile of data

Template Parameters
  • numFracBits: The number of fractional bits in the FixedPoint representation

  • T: The type of the data

template<ReluMethod reluMethod, typename T, typename std::enable_if_t<reluMethod == ReluMethod::REGULAR, std::int32_t> = 0>
INLINE qVar_t<T> relu(qVar_t<T> qData)

This function performs the ReLU function on the input, which is a single tile. The vanilla ReLU computes f(x) = max(0, x); the ReLU6 variant computes f(x) = min(max(0, x), 6).

Parameters
  • qData: A single tile of data

Template Parameters
  • reluMethod: The ReLU method chosen

  • T: The type of the data

template<typename T>
INLINE qVar_t<T> leakyRelu(const qVar_t<T> &qData, T alpha = 0.1)

This function performs the leaky ReLU function on the input, which is a single tile: f(x) = x > 0 ? x : x * alpha.

Parameters
  • qData: A single tile of data

  • alpha: The scale term for the negative data.

Template Parameters
  • T: The type of the data
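
Host-side float references of these activations, for clarity (the device versions operate on qVar_t tiles in fixed point; the Ref names are illustrative):

    #include <algorithm>
    #include <cmath>

    float reluRef(float x)  { return std::max(0.0f, x); }
    float relu6Ref(float x) { return std::min(std::max(0.0f, x), 6.0f); }
    float leakyReluRef(float x, float alpha = 0.1f) {
        return x > 0.0f ? x : x * alpha;
    }
    float sigmoidRef(float x) { return 1.0f / (1.0f + std::exp(-x)); }
    float tanhRef(float x) {
        const float e = std::exp(2.0f * x);
        return (e - 1.0f) / (e + 1.0f);
    }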

LSTM

template<typename OcmInTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape, typename OcmOutputTensorShape, std::int32_t timeWindow = OcmOutputTensorShape::NUM_CHN>
INLINE void lstmVanilla(OcmInTensorShape ocmIn, OcmKernelTensorShape ocmKernel, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias, OcmOutputTensorShape ocmOutput)

Performs LSTM Block with Standard Activations. The forget, input, and output gates all use sigmoid as their activations while the cell gate uses tanh as its activation. A final tanh activation is applied to the newly computed cell state, which is then used in an elementwise multiply to generate the next hidden state vector.

        NOTE: The LSTM implementation needs to be expanded to support more activations than are currently provided.
             1. The weight packing is assumed to follow the slicing order used by TensorFlow:
                 output weights, input weights, cell weights, forget weights
             2. Due to the small-magnitude values computed and recurrently used, the compute is in FixedPoint.
             3. Users can specify the representation, but it is recommended that at least 8 fractional
                 bits are used to reduce error accumulation.

Parameters
  • [in] ocmIn: The x_t activation tensor of length M

  • [in] ocmKernel: The W ocm weights that interact with x_t of size 4*M by N

  • [in/out] ocmCell: The Cell Tensor at c_(t-1), which gets updated to c_t each call, of size N

  • [in] ocmHidden: The Hidden Tensor at h_(t-1), which gets updated to h_t each call, of size N

  • [in] ocmRecurrentKernel: The U ocm weights that interact with h_(t-1) of size 4*N by N

  • [in] ocmBias: The b ocm biases that are used in computing the pre-act gate values of size 4*N

  • [out] ocmOutput: The outputs of each time step (h_t)

Template Parameters
  • timeWindow: Number of time steps being processed in the LSTM call

  • OcmInTensorShape: Shape is of <1, 1, 1, M>

  • OcmKernelTensorShape: Shape is of <1, 4*M, 1, N> for dense

  • OcmCellTensorShape: Shape is of <1, 1, 1, N>

  • OcmHiddenTensorShape: Shape is of <1, 1, 1, N>

  • OcmRecurrentKernelTensorShape: Shape is of <1, 4*N, 1, N> for dense

  • OcmBiasTensorShape: Shape is of <1, 4, 1, N> for dense

  • OcmOutputTensorShape: Shape is of <1, timeWindow, 1, N> for dense
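
A host-side float reference of one time step with these standard activations (a per-element sketch under the gate definitions above; it does not reflect the kernel's weight packing or data layout):

    #include <cmath>

    struct Gates { float f, i, c, o; };  // pre-activation gate values

    // c_t = sigmoid(f) * c_(t-1) + sigmoid(i) * tanh(c)
    // h_t = sigmoid(o) * tanh(c_t)
    void lstmStepRef(const Gates &g, float &cell, float &hidden) {
        auto sig = [](float x) { return 1.0f / (1.0f + std::exp(-x)); };
        cell   = sig(g.f) * cell + sig(g.i) * std::tanh(g.c);
        hidden = sig(g.o) * std::tanh(cell);
    }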

template<typename OcmInTensorShape, typename DdrKernelTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape, typename OcmOutputTensorShape, std::int32_t timeWindow = OcmOutputTensorShape::NUM_CHN>
INLINE void lstmLargeInput(OcmInTensorShape ocmIn, DdrKernelTensorShape ddrKernel, OcmKernelTensorShape ocmKernelBuffer0, OcmKernelTensorShape ocmKernelBuffer1, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias, OcmOutputTensorShape ocmOutput)

Performs LSTM Block with Standard Activations for a large input length (> 1024 elements) and a small hidden length (<= 1024). The forget, input, and output gates all use sigmoid as their activations while the cell gate uses tanh as its activation. A final tanh activation is applied to the newly computed cell state, which is then used in an elementwise multiply to generate the next hidden state vector.

        NOTE: The LSTM implementation needs to be expanded to support more activations than are currently provided.
             1. The weight packing is assumed to follow the slicing order used by TensorFlow:
                 output weights, input weights, cell weights, forget weights
             2. Due to the small-magnitude values computed and recurrently used, the compute is in FixedPoint.
             3. Users can specify the representation, but it is recommended that at least 8 fractional
                 bits are used to reduce error accumulation.
             4. Multiple OCM buffers are passed in to ensure double buffering in the kernel.

Parameters
  • [in] ocmIn: The x_t activation tensor of length M

  • [in] ddrKernel: The W DDR weights that interact with the input, of size 4*M by N

  • [in] ocmKernelBuffer0: The W OCM weight buffer 0 that interacts with the input, flown in from DDR

  • [in] ocmKernelBuffer1: The W OCM weight buffer 1 that interacts with the input, flown in from DDR

  • [in/out] ocmCell: The Cell Tensor at c_(t-1), which gets updated to c_t each call, of size N

  • [in] ocmHidden: The Hidden Tensor at h_(t-1), which gets updated to h_t each call, of size N

  • [in] ocmRecurrentKernel: The U ocm weights that interact with h_(t-1) of size 4*N by N

  • [in] ocmBias: The b ocm biases that are used in computing the pre-act gate values of size 4*N

  • [out] ocmOutput: The outputs of each time step (h_t)

Template Parameters
  • timeWindow: Number of time steps being processed in the LSTM call

  • OcmInTensorShape: Shape is of <1, 1, 1, M>

  • DdrKernelTensorShape: Shape is of <1, factor(DdrKernelTensorShape::NUM_CHN), 1, N> for dense

  • OcmKernelTensorShape: Shape is of <1, factor of M, 1, N> for dense

  • OcmCellTensorShape: Shape is of <1, 1, 1, N>

  • OcmHiddenTensorShape: Shape is of <1, t+1, 1, N>

  • OcmRecurrentKernelTensorShape: Shape is of <1, 4*N, 1, N> for dense

  • OcmBiasTensorShape: Shape is of <1, 4, 1, N> for dense

  • OcmOutputTensorShape: Shape is of <1, timeWindow, 1, N> for dense

template<typename OcmInTensorShape, typename OcmKernelTensorShape, typename OcmCellTensorShape, typename OcmHiddenTensorShape, typename OcmRecurrentKernelTensorShape, typename OcmBiasTensorShape>
INLINE void lstmClipping(OcmInTensorShape ocmIn, OcmKernelTensorShape ocmKernel, OcmCellTensorShape ocmCell, OcmHiddenTensorShape ocmHidden, OcmRecurrentKernelTensorShape ocmRecurrentKernel, OcmBiasTensorShape ocmBias)

Performs LSTM Block with Clipping Activations for a single time series. The forget, input, and output gates all use [0, 1] clipping as their activations while the cell gate uses [-1, 1] clipping as its activation. A final [-1, 1] clipping activation is applied to the newly computed cell state, which is then used in an elementwise multiply to generate the next hidden state vector.

        NOTE: The LSTM implementation needs to be expanded to support more activations than are currently provided.
             1. The weight packing is assumed to follow the slicing order used by TensorFlow:
                 output weights, input weights, cell weights, forget weights
             2. Due to the small-magnitude values computed and recurrently used, the compute is in FixedPoint.
             3. Users can specify the representation, but it is recommended that at least 8 fractional
                 bits are used to reduce error accumulation.

Parameters
  • [in] ocmIn: The x_t activation tensor of length M

  • [in] ocmKernel: The W ocm weights that interact with x_t of size 4*M by N

  • [in/out] ocmCell: The Cell Tensor at c_(t-1), which gets updated to c_t each call, of size N

  • [in] ocmHidden: The Hidden Tensor at h_(t-1), which gets updated to h_t each call, of size N

  • [in] ocmRecurrentKernel: The U ocm weights that interact with h_(t-1) of size 4*N by N

  • [in] ocmBias: The b ocm biases that are used in computing the pre-act gate values of size 4*N

Template Parameters
  • OcmInTensorShape: Shape is of <1, 1, 1, M>

  • OcmKernelTensorShape: Shape is of <1, 4*M, 1, N> for dense

  • OcmCellTensorShape: Shape is of <1, 1, 1, N>

  • OcmHiddenTensorShape: Shape is of <1, 1, 1, N>

  • OcmRecurrentKernelTensorShape: Shape is of <1, 4*N, 1, N> for dense

  • OcmBiasTensorShape: Shape is of <1, 4, 1, N> for dense