The quadric EPU is a grid of cores built for high performance, surrounded by a memory hierarchy designed to keep data local to compute at all times.
A program through the lens of quadric Hardware¶
Most programs on the EPU follow the pattern explained below:
Data is copied from host memory (DDR) to On-Chip Memory (OCM).
Data is then transferred from OCM to core-local memory.
Mathematical operations are performed on the data.
Data is then transferred back to OCM.
Data is then finally transferred back to DDR.
Examples of full end-to-end programs that perform the behavior above can be found here.
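The five-step pattern above can be sketched as a plain host-side model. This is an illustrative simulation only: the names `run_epu_program`, `copy`, and the stage variables are hypothetical stand-ins, not quadric SDK calls.

```python
# Hypothetical sketch of the five-step EPU data flow: DDR -> OCM ->
# core-local memory -> compute -> OCM -> DDR. Transfers are modeled
# as list copies; names here are illustrative, not SDK APIs.

def copy(src):
    """Model a memory-to-memory transfer as a list copy."""
    return list(src)

def run_epu_program(ddr_input):
    ocm = copy(ddr_input)                        # 1. DDR -> OCM
    core_memory = copy(ocm)                      # 2. OCM -> core-local memory
    core_memory = [x * 2 for x in core_memory]   # 3. compute on the cores
    ocm = copy(core_memory)                      # 4. core-local memory -> OCM
    ddr_output = copy(ocm)                       # 5. OCM -> DDR
    return ddr_output

print(run_epu_program([1, 2, 3]))  # -> [2, 4, 6]
```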
Common EPU Configurations¶
The EPU architecture is built around an NxN grid of cores that work in parallel.
EPUs of various dimensions are often referred to as q8, q16, and q32 for shorthand, as described below:
q8: 8 x 8 cores for a total of 64 core elements
q16: 16 x 16 cores for a total of 256 core elements
q32: 32 x 32 cores for a total of 1024 core elements
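The naming follows directly from the grid arithmetic, which can be checked in a few lines:

```python
# Core counts for the common EPU configurations: an N x N grid
# yields N * N core elements.
configs = {"q8": 8, "q16": 16, "q32": 32}
for name, n in configs.items():
    print(f"{name}: {n} x {n} = {n * n} cores")
# q8: 64, q16: 256, q32: 1024
```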
The EPU has three distinct memory segments:
Array Memory (core-local memory on the EPU)
There is a 1024 x 32-bit register file on each core.
Example: a q8 has 1024 (depth) * 32 bits (width) * 64 (cores) = 0.26 MB of local memory
On-Chip Memory (OCM): memory the EPU uses as an intermediate buffer between the core array and DDR.
The default size is 8 MB.
DDR (RAM in the host system)
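The Array Memory example above generalizes to the other configurations. A small sketch of the arithmetic (the helper name `array_memory_mb` is illustrative):

```python
# Total core-local (Array) memory per configuration: each core has a
# 1024-entry x 32-bit register file, so total bits = 1024 * 32 * cores.
DEPTH, WIDTH_BITS = 1024, 32

def array_memory_mb(n):
    cores = n * n
    total_bits = DEPTH * WIDTH_BITS * cores
    return total_bits / 8 / 1e6  # bits -> bytes -> MB

print(f"q8:  {array_memory_mb(8):.2f} MB")   # ~0.26 MB
print(f"q16: {array_memory_mb(16):.2f} MB")  # ~1.05 MB
print(f"q32: {array_memory_mb(32):.2f} MB")  # ~4.19 MB
```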
The memories described above are depicted in the diagram below and are operated on by the data-fetching APIs described in Data Access Patterns.
Each core in the core array has two dedicated registers:
Core Register: 1024 x 32 bits, designed to hold data elements from an input tensor.
Broadcast Register: 2 x 32 bits, designed to receive elements from the broadcast bus in parallel with the broadcast registers of the other cores. Example use cases are receiving elements of weight tensors in a neural network or other loop-invariant tensors.
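The division of labor between the two registers can be modeled simply: each core works on its own element while a single shared value (e.g. a weight) arrives in every core's broadcast register. The function below is a hypothetical host-side simulation, not an SDK call.

```python
# Illustrative model of the two per-core registers: each core holds its
# own element (core register) and multiplies it by a value delivered to
# all cores over the broadcast bus (broadcast register), such as a
# loop-invariant weight.
def broadcast_multiply(core_values, weight):
    # The same 'weight' lands in every core's broadcast register;
    # each core combines it with its local element in parallel.
    return [v * weight for v in core_values]

print(broadcast_multiply([1, 2, 3, 4], 10))  # -> [10, 20, 30, 40]
```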
Although your code doesn't have to manage or allocate them directly, there are also additional cores that form a border around the NxN array. These are used by the system to enable calculations where the values of neighboring elements are required. The detailed mechanics of using border cores are managed for you through numerous API functions, but your code retains control: most relevant API functions expose a simple binary option to include or exclude border values in a given operation. These APIs are explained further in Data Access Patterns.
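A one-dimensional sketch illustrates why border cores matter for neighbor-dependent operations, and what the include/exclude option changes. Everything here is a hypothetical simulation (the function name, the 3-point sum, and the zero-valued border are assumptions for illustration, not quadric API behavior).

```python
# Illustrative model of border cores: a neighbor-dependent operation
# (here a 1-D 3-point sum) needs values beyond the edges of the array.
# Including the border pads the row with stand-in edge values (zeros
# here); excluding it shrinks the output to interior elements only.
def three_point_sum(row, include_border=True):
    if include_border:
        row = [0, *row, 0]  # border cores supply values at the edges
    return [row[i - 1] + row[i] + row[i + 1] for i in range(1, len(row) - 1)]

print(three_point_sum([1, 2, 3, 4]))         # -> [3, 6, 9, 7]
print(three_point_sum([1, 2, 3, 4], False))  # -> [6, 9]
```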