EPU Architecture

The quadric EPU is a grid of cores built for high performance, surrounded by an ample memory hierarchy that helps keep data local to compute at all times.

A program through the lens of quadric hardware

Most programs on the EPU follow the pattern explained below:

  1. Data is copied from host memory (DDR) to the On Chip Memory (OCM).

  2. Data from the OCM is then transferred to core memory.

  3. Some mathematical operations are performed on the data.

  4. Data is then transferred back to the OCM.

  5. Data is then finally transferred back to DDR.
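The five steps above can be sketched as a small host-side Python model. This only illustrates the order of data movement and is not the quadric SDK: the names `run_pipeline`, `ocm`, and `core_mem`, and the choice of doubling each value as the "mathematical operation", are assumptions made for the example.

```python
def run_pipeline(ddr_input, op=lambda x: 2 * x):
    """Model the DDR -> OCM -> core -> OCM -> DDR round trip with plain lists."""
    # Step 1: copy data from host memory (DDR) into On Chip Memory (OCM).
    ocm = list(ddr_input)
    # Step 2: move data from the OCM into core memory.
    core_mem = list(ocm)
    # Step 3: perform some mathematical operation on the data.
    core_mem = [op(x) for x in core_mem]
    # Step 4: move the results back to the OCM.
    ocm = list(core_mem)
    # Step 5: copy the results back to DDR.
    ddr_output = list(ocm)
    return ddr_output


result = run_pipeline([1, 2, 3, 4])
print(result)
```

Each stage is a separate copy here to make the point that data never moves directly between DDR and the cores; the OCM always sits in between.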

Note

Examples of full end-to-end programs that perform the steps above can be found here.

Common EPU Configurations

The EPU architecture is built around an NxN grid of cores which work in parallel.

EPUs of various dimensions are often referred to as q8, q16, and q32 for shorthand, as described below:

  • q8: 8 x 8 cores for a total of 64 core elements

  • q16: 16 x 16 cores for a total of 256 core elements

  • q32: 32 x 32 cores for a total of 1024 core elements
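The core counts above follow directly from the grid dimension. As a quick check, here is a minimal sketch (the helper name `core_count` is hypothetical, not part of any quadric API):

```python
def core_count(n):
    """Number of cores in an n x n grid."""
    return n * n


# Map each shorthand name to its total core count.
configs = {f"q{n}": core_count(n) for n in (8, 16, 32)}
print(configs)
```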

Memory Hierarchy

The EPU has three distinct memory segments:

  • Array Memory (Core Local Memory on the EPU)

    • There is a 1024 x 32-bit register file on each core.

    • Example: a q8 has 1024 (depth) * 32 bits (width) * 64 (cores) = 2,097,152 bits, which is 256 KiB (about 0.26 MB) of local memory.

  • On Chip Memory (OCM): memory the EPU uses as an intermediate buffer between the Array and DDR.

    • Default is 8 MB.

  • DDR (RAM in the host system)
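The Array Memory arithmetic can be reproduced for any grid size. The sketch below assumes every configuration uses the same 1024 x 32-bit register file per core, as stated for the q8 example; the function and constant names are hypothetical.

```python
REGISTER_DEPTH = 1024      # entries in each core's register file
REGISTER_WIDTH_BITS = 32   # width of each entry in bits


def array_memory_bytes(n):
    """Total core-local memory, in bytes, for an n x n grid of cores."""
    total_bits = REGISTER_DEPTH * REGISTER_WIDTH_BITS * n * n
    return total_bits // 8


# Total Array Memory per configuration, in KiB (1 KiB = 1024 bytes).
sizes_kib = {f"q{n}": array_memory_bytes(n) // 1024 for n in (8, 16, 32)}
print(sizes_kib)
```

Under that assumption a q8 holds 262,144 bytes (256 KiB) across its 64 cores, and each doubling of the grid dimension quadruples the total.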

Tip

The memories described above are depicted in the diagram below and are accessed through the data-fetching APIs described in Data Access Patterns.

Core memory

Each core in the core array has two dedicated registers:

  • Core Register: 1024 x 32 bits, designed to hold a data element from an input tensor.

  • Broadcast Register: 2 x 32 bits, designed to receive elements from the broadcast bus in parallel with the broadcast registers of other cores. Example use cases include receiving elements of weight tensors in a neural network or other loop-invariant tensors.
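The difference between the two registers can be pictured with a toy model: each core holds its own element, while one shared value reaches every core at once. This is an illustrative sketch only; `broadcast_multiply` is a hypothetical name, and the multiply is just an assumed example of a per-core operation.

```python
def broadcast_multiply(core_values, weight):
    """Every core multiplies its own element by the same broadcast value."""
    # core_values[r][c] models the element held in the core register of core (r, c).
    # weight models a value delivered to every core's broadcast register
    # over the broadcast bus, for example one element of a weights tensor.
    return [[value * weight for value in row] for row in core_values]


tile = [[1, 2], [3, 4]]
scaled = broadcast_multiply(tile, 10)
print(scaled)
```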

Border array

Although your code doesn’t have to manage or allocate them directly, there are also additional cores that form a border around the NxN array. The system uses them to enable calculations that require the values of neighboring elements. The detailed mechanics of using border cores are managed for you through numerous API functions, but your code retains control: most relevant API functions expose a simple binary option to include border values or not in a given operation. These APIs are explained further in Data Access Patterns.
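To make the include-border option concrete, here is a toy model of a four-neighbor sum on an n x n grid. It is a sketch under assumptions, not the quadric API: the function `neighbor_sum`, the `include_border` flag, a border one core wide, and a single constant `border_value` standing in for whatever the border cores hold are all hypothetical.

```python
def neighbor_sum(grid, include_border, border_value=0):
    """Sum each core's four neighbors (up, down, left, right).

    With include_border=True, a neighbor that falls outside the grid reads
    border_value; with include_border=False, that neighbor is skipped.
    """
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            total = 0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    total += grid[rr][cc]    # neighbor is a regular core
                elif include_border:
                    total += border_value    # neighbor is a border core
            out[r][c] = total
    return out


grid = [[1, 2], [3, 4]]
without_border = neighbor_sum(grid, include_border=False)
with_border = neighbor_sum(grid, include_border=True, border_value=1)
print(without_border, with_border)
```

In this model the border matters only for cores on the edge of the grid, which is where a neighbor lookup would otherwise have nothing to read.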