Graph Mode (Neural Networks)

Intro

The quadric SDK’s Graph Mode is used to schedule frozen neural network graphs for inference on the quadric EPU.

Requirements

  • A configuration file.

  • A frozen neural network graph. Currently the quadric SDK only supports ONNX (.onnx) frozen graphs (exported via PyTorch) and TensorFlow (.pb) frozen graphs. Examples of freezing graphs are given in the PyTorch and TensorFlow examples below.

  • An inference frame, saved in NumPy format (*.npy).
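For example, a single input frame can be saved with NumPy as follows (a minimal sketch; the file name frame.npy and the NCHW shape shown are placeholders for your network's actual input):

import numpy as np

# Hypothetical 1x1x28x28 input frame; shape and dtype must match your network.
frame = np.zeros((1, 1, 28, 28), dtype=np.float32)
np.save('frame.npy', frame)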

Tip

Examples of exporting frozen ONNX and TensorFlow files are shown below.

The Configuration File

Each graph command must be accompanied by a configuration file, expressed as an .ini file. The .ini file controls how the SDK lowers your network from Python to EPU-runnable assembly. The user-configurable settings you should be aware of are listed below:

NUM_CORES
    Meaning: Specifies the quadric architecture (q8, q16, or q32); NUM_CORES is the number of cores along one edge of the core array.
    Valid values: 8, 16, 32

NUM_BORDERS
    Meaning: The number of border cores used by the simulator. This does not currently change: the q8, q16, and q32 all have 4 border cores.
    Valid values: currently NUM_BORDERS=4

DDR_RD_BW
    Meaning: Total DDR read bandwidth in megabits per second (e.g. 256000 Mbps = 32 gigabytes per second).
    Valid values: any integer number of Mbps

DDR_WT_BW
    Meaning: Total DDR write bandwidth in megabits per second (e.g. 256000 Mbps = 32 gigabytes per second).
    Valid values: any integer number of Mbps

QUANT_TYPE
    Meaning: Number system to use.
    Valid values: fx16 or int8. If int8 is specified, you must provide a CALIBRATION_DIR with sample data (as seen in mobilenet).

INST_MEM_SIZE
    Meaning: Instruction memory depth in units of instructions. Do not change.
    Valid values: any integer number of instructions

OCM_SIZE
    Meaning: OCM size in bytes. The default is 8 MB (8388608 bytes) for the q8.
    Valid values: any integer number of bytes

DDR_RD_AVG_EFF
    Meaning: DDR read average efficiency. The default is 75.
    Valid values: any percentage

DDR_RD_MAX_EFF
    Meaning: DDR read maximum efficiency. The default is 80.
    Valid values: any percentage

DDR_WT_AVG_EFF
    Meaning: DDR write average efficiency. The default is 75.
    Valid values: any percentage

DDR_WT_MAX_EFF
    Meaning: DDR write maximum efficiency. The default is 80.
    Valid values: any percentage

TEMP_DIR
    Meaning: Output location for file generation.
    Valid values: path to a folder
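As a quick check on the bandwidth units, converting a DDR_RD_BW or DDR_WT_BW value from Mbps to gigabytes per second is straightforward (a minimal sketch of the arithmetic):

# Convert a bandwidth setting in megabits per second to gigabytes per second.
ddr_rd_bw_mbps = 256000                      # value as written in the .ini file
gbytes_per_sec = ddr_rd_bw_mbps / 8 / 1000   # 8 bits per byte, 1000 MB per GB
print(gbytes_per_sec)                        # 32.0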

PyTorch Example

Requirements

  1. Installation of python3, PyTorch, and torchvision

  2. Python packages onnxruntime, torch, numpy, matplotlib

  3. The quadric SDK

Getting Started

In the tutorial below, we’re going to export an image classification neural network from the PyTorch framework to an ONNX file, then lower that ONNX file through the quadric SDK. It’s recommended that you have a good knowledge of PyTorch before going through this guide. Please see guides on PyTorch here.

Overview

The guide below shows you how to train a simple digit classification neural network using PyTorch, export that network in ONNX format, and finally run the trained network on your host computer through the quadric SDK. This process lets you profile your neural network’s performance on quadric’s EPU architectural simulator.

The PyTorch Network

The network is a convolutional neural network (CNN) designed to classify digits of the MNIST dataset. The layers of the network can be seen below:

Net(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
  (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (gap): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc1): Linear(in_features=64, out_features=10, bias=True)
  (drop1): Dropout(p=0.25, inplace=False)
  (softmax): Softmax(dim=1)
)

Steps

In your Python editor of choice, please copy the following script:

import numpy as np
import matplotlib.pyplot as plt

import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.gap = nn.AdaptiveAvgPool2d((1, 1))
        self.fc1 = nn.Linear(64, 10)
        self.drop1 = nn.Dropout(p=0.25)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = self.pool2(x)
        x = self.gap(x)
        x = self.drop1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.softmax(x)
        return x

net = Net()
print(net)

batch_size_train = 128
batch_size_test = 4

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('data/', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_train, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('data/', train=False, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_test, shuffle=True)

Get random training images

dataiter = iter(train_loader)
images, labels = next(dataiter)

def imshow(img):
    # Convert the CHW tensor to an HWC NumPy array for matplotlib.
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

imshow(torchvision.utils.make_grid(images))

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adadelta(net.parameters(), lr=2.0, rho=0.9)

for epoch in range(24):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i == len(train_loader) - 1:    # print once at the end of each epoch
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / len(train_loader)))
            running_loss = 0.0
net.eval()  # switch to eval mode: disables dropout, uses running BatchNorm stats
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

Show the sample image we’re about to save.

plt.imshow(np.transpose(images[0].numpy(), (1, 2, 0)).reshape(28, 28))
plt.show()

Export the network as an ONNX file

input = images[0].reshape(1, 1, 28, 28)
torch.onnx.export(net, input, "mnist.onnx", verbose=True)
np.save('test.npy', input.numpy())
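Optionally, you can sanity-check the exported file before moving on (a quick sketch, assuming the onnx Python package is installed):

import onnx

# Load the exported graph and verify it is well-formed.
model = onnx.load("mnist.onnx")
onnx.checker.check_model(model)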

Performing an inference on the network using onnxruntime

Once you’ve trained your neural network sufficiently and the loss seems to have flattened, try executing your network using ONNX Runtime by running the code below. It should print the result of classifying the image and then display the image itself.

import onnxruntime as nxrun
import numpy as np
from PIL import Image

sess = nxrun.InferenceSession("mnist.onnx")

input_name = sess.get_inputs()[0].name
data = np.load('test.npy')
result = sess.run(None, {input_name: data})
print(result[0].argmax())
print(result[0])

# Scale the (normalized) frame into a displayable 8-bit range.
img = np.load('test.npy').reshape(28, 28)
Image.fromarray((img * 255).clip(0, 255).astype(np.uint8)).show()

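As a further check, you can compare the ONNX Runtime output against the original PyTorch model (a minimal sketch, assuming net and data from the scripts above are still in scope):

import torch

# Run the same frame through the PyTorch model and compare outputs.
with torch.no_grad():
    torch_out = net(torch.from_numpy(data))
print(np.allclose(result[0], torch_out.numpy(), atol=1e-5))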

Lowering the PyTorch Network

To finally lower the network, we need an .ini file containing the configuration; the sample below targets a q8. You can try changing the NUM_CORES parameter to 8, 16, or 32 to view the change in performance results. In this example, we’ll save the file as mnist.ini.

[DEFAULT]
PROJECT_NAME=mnist
NUM_BORDERS=4
NUM_CORES=8
OCM_SIZE=8388608
DEBUG=0
UNITTEST=1
PROFILE=0
TENSOR_SPLIT=0
FUSE_MAXPOOL=0
PACK_FLOWS=1
QUANT_TYPE=fx16
CALIBRATION_DIR=
FX16_FRAC_BITS=6
LOOP_TRANSFORMATION=0
TEMP_DIR=output

[PREPROC]
USE_PREPROC=0
INP_TRANSFORM=memCpyImageSplit

[SIMULATOR]
INST_MEM_SIZE=16384
DDR_WT_BW=128000
DDR_RD_BW=128000

Finally, execute the quadric SDK in Graph Mode, passing your frozen graph, the frame saved from your dataset above, and the configuration file.

$ docker run -w /ws -v `pwd`:/ws quadric.io/graphsim:0.8 graph mnist.onnx test.npy mnist.ini

The quadric SDK’s output will tell us some profiling information about how the network ran, which is described in more detail here:

TotalCycles: 114,264

predication  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23,659
compute      : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23,600
data_array   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 40,333
mac          : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 20,753
data_external: ▏ 349
data_ocm     : ▇▇▇▇▇▇ 5,570

TensorFlow Example

The quadric SDK can also lower TensorFlow networks saved as frozen graphs in .pb format. This example walks through running inference on a convolutional digit-classification network through the SDK.

Warning

This example requires SDK version 0.8.10

Requirements

  1. Installation of python3, TensorFlow, and tensorflow_datasets

  2. Python package numpy

  3. The quadric SDK

The TensorFlow Network

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2
import numpy as np

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)


def normalize_img(image, label):
    """Normalizes images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255., label


ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)


ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)


model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=32, kernel_size=3,
                           padding='valid', input_shape=(28, 28, 1)),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding='valid'),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(768, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    # The model's final layer applies softmax, so the loss receives
    # probabilities, not logits.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=1,
    validation_data=ds_test,
)

print(model.summary())
# Freezing the graph as a pb
full_model = tf.function(lambda x: model(x))
full_model = full_model.get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

frozen_func = convert_variables_to_constants_v2(full_model)
frozen_func.graph.as_graph_def()
layers = [op.name for op in frozen_func.graph.get_operations()]
print("-" * 60)
print("Frozen model layers: ")
for layer in layers:
    print(layer)
print("-" * 60)
print("Frozen model inputs: ")
print(frozen_func.inputs)
print("Frozen model outputs: ")
print(frozen_func.outputs)
# Save frozen graph to disk
tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                  logdir=".",
                  name="mnist.pb",
                  as_text=False)
# Export a sample image in NCHW layout for the SDK
input = next(ds_test.as_numpy_iterator())[0][0].reshape(1, 1, 28, 28)
np.save('test.npy', input)

This Python script freezes the trained model as mnist.pb and exports a sample image as test.npy, both of which are then fed to the quadric SDK. Note that our input model comes from TensorFlow and our input frame is passed via NumPy.
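Before lowering, you can verify that the frozen graph is readable (a minimal sketch, assuming mnist.pb is in the current directory):

import tensorflow as tf

# Reload the frozen graph and count its nodes as a sanity check.
with tf.io.gfile.GFile("mnist.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())
print(len(graph_def.node))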

Lowering the TensorFlow Network

Save the configuration file below as mnist.ini to instruct the quadric SDK on how to run your neural network.

[DEFAULT]
PROJECT_NAME=mnist_tf
NUM_BORDERS=4
NUM_CORES=8
OCM_SIZE=8388608
DEBUG=0
UNITTEST=1
PROFILE=0
TENSOR_SPLIT=0
FUSE_MAXPOOL=1
PACK_FLOWS=1
QUANT_TYPE=int8
GLOBAL_SCALE=4.00
CALIBRATION_DIR=
FX16_FRAC_BITS=6
PRE_TRANSPOSE_INPUT=1
LOOP_TRANSFORMATION=0
TEMP_DIR=/ws
[PREPROC]
USE_PREPROC=0
CONVERT_TO_INT8_PRE_CODEGEN=1
[SIMULATOR]
INST_MEM_SIZE=16384
DDR_WT_BW=128000
DDR_RD_BW=128000

Finally, we can run the quadric SDK:

$ docker run -w /ws -v `pwd`:/ws quadric.io/graphsim:0.8.8 graph mnist.pb test.npy mnist.ini

The quadric SDK will lower the model, creating a .cpp and .hpp file in /ws (the current directory mounted into the container). These files contain a representation of the network in C++, which can then be compiled down to EPU-runnable assembly using the quadric SDK and QLLVM. The SDK’s lowering performs three key optimizations:

  1. Quantization: using fewer bits to store and compute wherever possible (i.e. using integer math). Here we quantize using int8; a toy illustration follows this list.

  2. Fusing: combining different layers and operations for a net performance benefit

  3. Removing Redundancy: optimizing out extraneous data layout transforms and mathematical operations
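To give a flavor of what int8 quantization means, the toy sketch below maps a float tensor onto 8-bit integers with a single scale factor (this is only an illustration, not the SDK's internal quantization scheme):

import numpy as np

# Symmetric per-tensor quantization: one scale maps floats onto [-127, 127].
x = np.array([0.8, -1.3, 2.0, 0.05], dtype=np.float32)
scale = np.abs(x).max() / 127.0
q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Dequantize to see the approximation error introduced by 8-bit storage.
x_hat = q.astype(np.float32) * scale
print(q)      # int8 values
print(x_hat)  # close to x, within one quantization step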

The quadric SDK’s output will tell us some profiling information about how the network ran, which is described in more detail here:

TotalCycles: 163,035

predication  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50,504
compute      : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38,782
data_array   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 65,967
mac          : ▇▇▇ 5,224
data_external: ▏ 240
data_ocm     : ▇ 2,318