Graph Mode (Neural Networks)¶
Intro¶
The quadric SDK’s Graph Mode schedules frozen neural network graphs for inference on the quadric EPU.
Requirements¶
A frozen neural network graph. Currently the quadric SDK only supports ONNX (.onnx, exported via PyTorch) and TensorFlow (.pb) frozen graphs. The PyTorch example later in this guide shows how to freeze a graph.
An inference frame, saved in NumPy format (*.npy).
Tip
An example of exporting a frozen ONNX file is shown below.
The Configuration File¶
Each graph command must be accompanied by a config file, expressed as an .ini file. The .ini file controls how the SDK lowers your network from Python to EPU-runnable assembly. There are some user-configurable options you should be aware of:
| Setting | Meaning | Valid Values |
|---|---|---|
| | Toggles execution of the code on the quadric q16 M.2 devkit. | 0, 1 |
| | Toggles execution of the code on the quadric q8 FPGA. | 0, 1 |
| | Specifies quadric architecture - q8, q16, or q32. | 8, 16, 32 |
| | The number of border cores specified by the simulator. As of now, this does not change. The q8, q16, and q32 all currently have 4 border cores. | Currently, |
| | DDR read total bandwidth in megabits per second (e.g. 256000 Mbps = 32 gigabytes per second). | Any integer number of Mbps |
| | DDR write total bandwidth in megabits per second (e.g. 256000 Mbps = 32 gigabytes per second). | Any integer number of Mbps |
| | Number system to use. | |
| | Instruction memory depth in units of instructions. Do not change. | Any integer number of instructions |
| | OCM size in bytes; the default is 8 MB for the q8. | Any integer number of bytes |
| | DDR read average efficiency; the default is 75. | Any percentage |
| | DDR read maximum efficiency; the default is 80. | Any percentage |
| | DDR write average efficiency; the default is 75. | Any percentage |
| | DDR write maximum efficiency; the default is 80. | Any percentage |
| | Output location for file generation. | Path to folder |
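The DDR bandwidth settings are given in megabits per second, which is easy to misread as megabytes. The sketch below reproduces the conversion quoted in the table (256000 Mbps = 32 gigabytes per second) using decimal units. The helper names are hypothetical, and the `effective_bandwidth` function assumes that an efficiency percentage simply scales the total bandwidth linearly; how the simulator actually applies the efficiency settings is not stated here.

```python
def mbps_to_gigabytes_per_second(mbps: int) -> float:
    """Convert megabits per second to gigabytes per second (decimal units)."""
    bits_per_second = mbps * 1_000_000
    bytes_per_second = bits_per_second / 8
    return bytes_per_second / 1_000_000_000


def effective_bandwidth(mbps: int, efficiency_percent: int) -> float:
    """ASSUMPTION: efficiency scales the total bandwidth linearly."""
    return mbps_to_gigabytes_per_second(mbps) * efficiency_percent / 100


if __name__ == "__main__":
    print(mbps_to_gigabytes_per_second(256000))  # 32.0, as quoted in the table
    print(mbps_to_gigabytes_per_second(128000))  # 16.0
    print(effective_bandwidth(256000, 75))       # 24.0 under the linear assumption
```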
PyTorch Example¶
Requirements¶
Installation of python3, PyTorch, and torchvision
Python packages onnxruntime, torch, numpy, matplotlib
The quadric SDK
Getting Started¶
In this tutorial, we export an image classification neural network from the PyTorch framework to an ONNX file, then lower that ONNX file through the quadric SDK. A good working knowledge of PyTorch is recommended before going through this guide. Please see guides on PyTorch here.
Overview¶
This guide shows you how to train a simple digit classification neural network using PyTorch, export that network in ONNX format, and then run the trained network from your host computer on the quadric SDK. This process lets you profile your neural network’s performance through quadric’s EPU architectural simulator.
The PyTorch Network¶
The network is a convolutional neural network (CNN) designed to classify digits of the MNIST dataset. The layers of the network can be seen below:
Net(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
  (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (gap): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc1): Linear(in_features=64, out_features=10, bias=True)
  (drop1): Dropout(p=0.25, inplace=False)
  (softmax): Softmax(dim=1)
)
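To see how a 28x28 MNIST image shrinks as it passes through these layers, the following sketch traces the tensor shape by hand using the standard output-size formula for unpadded convolution and pooling. It is an illustration only: the helper names are hypothetical, and it does not call PyTorch or the quadric SDK.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling window (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height: int = 28, width: int = 28):
    """Return (layer, shape) pairs for one single-channel input image."""
    shapes = []
    h, w = conv_out(height, 3), conv_out(width, 3)      # 3x3 conv, stride 1
    shapes.append(("conv1", (32, h, w)))
    h, w = conv_out(h, 2, stride=2), conv_out(w, 2, stride=2)  # 2x2 max pool
    shapes.append(("pool1", (32, h, w)))
    h, w = conv_out(h, 3), conv_out(w, 3)
    shapes.append(("conv2", (64, h, w)))
    h, w = conv_out(h, 2, stride=2), conv_out(w, 2, stride=2)
    shapes.append(("pool2", (64, h, w)))
    shapes.append(("gap", (64, 1, 1)))                   # global average pool
    shapes.append(("linear", (10,)))                     # one score per digit
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:>6}: {shape}")
```

Batch normalization, ReLU, dropout, and softmax are omitted from the trace because they do not change the tensor shape.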
Steps¶
In your Python editor of choice, please copy the following script:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.gap = nn.AdaptiveAvgPool2d((1, 1))
        self.fc1 = nn.Linear(64, 10)
        self.drop1 = nn.Dropout(p=0.25)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = self.pool2(x)
        x = self.gap(x)
        x = self.drop1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.softmax(x)
        return x


net = Net()
print(net)

batch_size_train = 128
batch_size_test = 4

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('data/', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_train, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('data/', train=False, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_test, shuffle=True)
Get random training images¶
dataiter = iter(train_loader)
images, labels = next(dataiter)  # the iterator's .next() method is not available in recent PyTorch


def imshow(img):
    plt.imshow(np.transpose(img, (1, 2, 0)))
    plt.show()


imshow(torchvision.utils.make_grid(images))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adadelta(net.parameters(), lr=2.0, rho=0.9)

for epoch in range(24):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i == len(train_loader) - 1:  # print the average loss once per epoch
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / len(train_loader)))
            running_loss = 0.0

# switch to evaluation mode so dropout is disabled and batch norm uses running statistics
net.eval()
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
Show the sample image we’re about to save.¶
plt.imshow(np.transpose(images[0].numpy(), (1, 2, 0)).reshape(28, 28))
plt.show()
Export the network as an onnx file¶
input = images[0].reshape(1, 1, 28, 28)
torch.onnx.export(net, input, "mnist.onnx", verbose=True)
np.save('test.npy', input.numpy())
Performing an inference on the network using onnxruntime¶
Once your network has trained long enough for the loss to flatten, try executing it with ONNX Runtime using the code below. The code should print the classification result for the saved image and display that image.
import onnxruntime as nxrun
import numpy as np
sess = nxrun.InferenceSession("mnist.onnx")
input_name = sess.get_inputs()[0].name
data = np.load('test.npy')
result = sess.run(None, {input_name: data})
print(result[0].argmax())
print(result[0])
from PIL import Image
Image.fromarray(np.load('test.npy').reshape(28, 28) * 255).show()
Lowering the PyTorch Network¶
To lower the network, we need an .ini file containing a sample configuration for a q8. You can try changing the NUM_CORES parameter in the file below to 8, 16, or 32 to see how the performance results change. In this example, we save the file as mnist.ini:
[DEFAULT]
PROJECT_NAME=mnist
NUM_BORDERS=4
NUM_CORES=8
OCM_SIZE=8388608
DEBUG=0
UNITTEST=1
PROFILE=0
TENSOR_SPLIT=0
FUSE_MAXPOOL=0
PACK_FLOWS=1
QUANT_TYPE=int8
GLOBAL_SCALE=8
CALIBRATION_DIR=
FX16_FRAC_BITS=6
LOOP_TRANSFORMATION=0
USE_VAP=0
DENSE_SHIFT=1
TEMP_DIR=output
[PREPROC]
USE_PREPROC=0
INP_TRANSFORM=memCpyImageSplit
[SIMULATOR]
INST_MEM_SIZE=16384
DDR_WT_BW=128000
DDR_RD_BW=128000
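Before handing a configuration like this to the SDK, it can be useful to sanity-check it on the host. The sketch below reads a shortened copy of the file with Python's standard `configparser` module and checks a few values against the ranges listed in the settings table earlier in this guide. The `check_config` function is hypothetical and is not part of the quadric SDK; the SDK may apply different or additional validation.

```python
import configparser

# Shortened copy of mnist.ini; only the keys checked below are included.
INI_TEXT = """
[DEFAULT]
PROJECT_NAME=mnist
NUM_BORDERS=4
NUM_CORES=8
OCM_SIZE=8388608
CALIBRATION_DIR=
TEMP_DIR=output

[SIMULATOR]
INST_MEM_SIZE=16384
DDR_WT_BW=128000
DDR_RD_BW=128000
"""


def check_config(text: str) -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    defaults = parser["DEFAULT"]
    simulator = parser["SIMULATOR"]

    problems = []
    if defaults.getint("NUM_CORES") not in (8, 16, 32):
        problems.append("NUM_CORES must be 8, 16, or 32")
    if defaults.getint("NUM_BORDERS") != 4:
        problems.append("NUM_BORDERS is expected to be 4")
    if defaults.getint("OCM_SIZE") <= 0:
        problems.append("OCM_SIZE must be a positive number of bytes")
    for key in ("DDR_RD_BW", "DDR_WT_BW"):
        if simulator.getint(key) <= 0:
            problems.append(f"{key} must be a positive number of Mbps")
    return problems


if __name__ == "__main__":
    issues = check_config(INI_TEXT)
    print("OK" if not issues else "\n".join(issues))
```

Note that `configparser` treats `[DEFAULT]` specially: its keys are also visible from every other section, so `parser["SIMULATOR"]["NUM_CORES"]` would resolve to the default value as well.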
Finally, execute the quadric SDK in Graph Mode on the network, passing your frozen graph, the frame saved from your dataset above, and the configuration file.
$ docker run -w /ws -v `pwd`:/ws quadric.io/graphsim:0.9 graph mnist.onnx test.npy mnist.ini
The quadric SDK’s output reports profiling information about how the network ran, which is described in more detail here:
TotalCycles: 114,264
predication : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23,659
compute : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23,600
data_array : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 40,333
mac : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 20,753
data_external: ▏ 349
data_ocm : ▇▇▇▇▇▇ 5,570
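In this output the per-category cycle counts add up exactly to TotalCycles, so each category can be read as a share of the total run. The following sketch parses a copy of the report (with the bars shortened) and prints each category as a percentage. The parsing regex and function name are assumptions based only on the sample above; the SDK's actual output format may vary between versions.

```python
import re

# Copy of the report above with the bar graphics shortened.
PROFILE_TEXT = """\
TotalCycles: 114,264
predication : ▇▇▇ 23,659
compute : ▇▇▇ 23,600
data_array : ▇▇▇▇▇ 40,333
mac : ▇▇ 20,753
data_external: ▏ 349
data_ocm : ▇ 5,570
"""

# name, colon, optional non-digit bar characters, then a comma-grouped integer
LINE_RE = re.compile(r"^(\w+)\s*:\s*\D*?([\d,]+)\s*$")


def parse_profile(text: str):
    """Return (total_cycles, {category: cycles}) parsed from the report."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2).replace(",", ""))
    total = counts.pop("TotalCycles")
    return total, counts


if __name__ == "__main__":
    total, categories = parse_profile(PROFILE_TEXT)
    for name, cycles in sorted(categories.items(), key=lambda kv: -kv[1]):
        print(f"{name:>14}: {cycles:>7,} cycles ({100 * cycles / total:.1f}%)")
    print("categories sum to total:", sum(categories.values()) == total)
```

Sorting by cycle count puts data_array first, which suggests that data movement within the array, rather than MAC or other compute work, is the largest single cost for this small network.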