CS 250B: Modern Computer Systems

Hardware Acceleration Case Study
Neural Network Accelerators

Sang-Woo Jun

Many slides adapted from Hyoukjun Kwon’s Gatech “Designing CNN Accelerators”
Usefulness of Deep Neural Networks

- No need to further emphasize the obvious
Convolutional Neural Network for Image/Video Recognition
ImageNet Top-5 Classification Accuracy Over the Years

15 million images, 1000 classes in the ImageNet challenge

AlexNet, The Beginning

“The first* fast** GPU-accelerated Deep Convolutional Neural Network to win an image recognition contest”

image-net.org “ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2017,” 2017
Convolutional Neural Networks Overview

“Convolution”

“Neural Network”

goldfish: 0.002%
shark: 0.08%
magpie: 0.02%
Palace: 89%
Paper towel: 1.4%
Spatula: 0.001%
Training vs. Inference

- **Training:** Tuning parameters using training data
  - Backpropagation using stochastic gradient descent is the most popular algorithm
  - Training in data centers and distributing the trained model is a common approach*
  - Because training algorithms change rapidly, GPU clusters are the most popular hardware *(Low demand for application-specific accelerators)*

- **Inference:** Determining the class of new input data
  - Uses a trained model to classify each new input
  - Inference usually occurs close to clients
  - Low latency and power efficiency are required *(High demand for application-specific accelerators)*
Deep Neural Networks (“Fully Connected”*)

- Each layer may have a different number of neurons

An Artificial Neuron

- Effectively a dot product: the weight vector is multiplied by the input vector to obtain a scalar
- May apply activation function to output
  - Adds non-linearity
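
The neuron described above can be sketched in a few lines of C. This is a minimal illustration, not code from the original slides: the names `relu` and `neuron_forward` are hypothetical, and ReLU is chosen as the activation only because it needs no math library (a sigmoid could be swapped in the same place).

```c
#include <assert.h>
#include <stddef.h>

/* ReLU activation: passes positive values through, clamps negatives to zero. */
static float relu(float v) { return v > 0.0f ? v : 0.0f; }

/* One artificial neuron: dot product of weights and inputs, then activation.
 * (Hypothetical helper for illustration; no bias term is modeled.) */
float neuron_forward(const float *weights, const float *inputs, size_t n) {
    assert(weights != NULL && inputs != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += weights[i] * inputs[i];  /* one multiply-accumulate (MAC) */
    }
    return relu(sum);                   /* non-linearity applied to the scalar */
}
```

Each loop iteration is one multiply-accumulate, which is the unit of work the accelerator slides later count in MACs.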

Sigmoid

Rectified Linear Unit (ReLU)

Jed Fox, “Neural Networks 101,” 2017
Convolution Layer

Convolution layer

Optional pooling layer

(-1 x 3) + (0 x 0) + (1 x 1) + (-2 x 2) + (0 x 6) + (2 x 2) + (-1 x 2) + (0 x 4) + (1 x 1) = -3
Convolution Example

Zero padding is typically added to the source matrix so that the output keeps the input's dimensions

\[
\begin{array}{ccc}
1 & 2 & 3 \\
-2 & 0 & -1 \\
5 & -2 & 4 \\
\end{array}
\times
\begin{array}{cccc}
0 & 1 & 0 & 1 \\
2 & 4 & 3 & 1 \\
5 & 2 & 7 & 2 \\
4 & 1 & 8 & 4 \\
5 & 0 & 1 & 5 \\
0 & 0 & 0 & 5 \\
\end{array}
= \begin{array}{c}
44 \\
\end{array}
\]

Channel partial sum[0][0] =
\[
1 \times 0 + 2 \times 1 + 3 \times 0 + (-2) \times 2 + 0 \times 4 + (-1) \times 3 + 5 \times 5 + (-2) \times 2 + 4 \times 7
= 44
\]
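
The partial sum above can be checked with a short single-channel sketch. The function name `conv2d_valid` and the row-major layout are assumptions made for illustration; the sketch uses stride 1 and no padding, and (as is usual in CNNs) does not flip the filter.

```c
#include <assert.h>

/* Single-channel 2D convolution, stride 1, no padding, no filter flip.
 * in:  h x w input, row-major
 * k:   r x r filter, row-major
 * out: (h-r+1) x (w-r+1) output, row-major */
void conv2d_valid(const int *in, int h, int w, const int *k, int r, int *out) {
    assert(r > 0 && h >= r && w >= r);
    int oh = h - r + 1, ow = w - r + 1;
    for (int y = 0; y < oh; y++) {
        for (int x = 0; x < ow; x++) {
            int sum = 0;
            for (int j = 0; j < r; j++)        /* filter row */
                for (int i = 0; i < r; i++)    /* filter column */
                    sum += k[j * r + i] * in[(y + j) * w + (x + i)];
            out[y * ow + x] = sum;
        }
    }
}
```

Calling it with the 3x3 filter and the top-left 3x3 window of the input from the example yields 44 for output position [0][0].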
Multidimensional Convolution

- A “feature map” usually has multiple layers
  - An image has R, G, B layers, or “channels”
- One layer has many convolution filters, which create a multichannel output map

![Input feature map](image)

![3x3x3 filter](image)

![Output feature map](image)
Multiple Convolutions

Filter 0

Filter 1

Input feature map

Output feature map 0

Output feature map 1
Example Learned Convolution Filters

Alex Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS, 2012
Multidimensional Convolution

Input image

Filter bank (to be learned)

Feature maps
Computation in the Convolution Layer

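// O is assumed zero-initialized; I is assumed zero-padded to (H+R-1) x (H+R-1),
// so the output feature map keeps the H x H size.
for (int n = 0; n < N; n++) {             // Input feature maps (IFMaps)
  for (int m = 0; m < M; m++) {           // Weight filters
    for (int c = 0; c < C; c++) {         // IFMap/weight channels
      for (int y = 0; y < H; y++) {       // Output feature map row
        for (int x = 0; x < H; x++) {     // Output feature map column
          for (int j = 0; j < R; j++) {   // Weight filter row
            for (int i = 0; i < R; i++) { // Weight filter column
              O[n][m][y][x] += W[m][c][j][i] * I[n][c][y+j][x+i];
            }
          }
        }
      }
    }
  }
}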
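
A self-contained version of this loop nest is sketched below. It is an illustration under stated assumptions, not the slide's reference code: the function name `conv_layer`, the flat row-major arrays, and the convention that `H` is the output height/width while the caller supplies an input already padded to `(H+R-1) x (H+R-1)` are all choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convolution layer over flat row-major arrays (hypothetical helper).
 * I: N x C x P x P inputs, with P = H + R - 1 (already padded by the caller)
 * W: M x C x R x R filters
 * O: N x M x H x H outputs (cleared here before accumulation) */
void conv_layer(const float *I, const float *W, float *O,
                int N, int M, int C, int H, int R) {
    assert(N > 0 && M > 0 && C > 0 && H > 0 && R > 0);
    int P = H + R - 1;
    memset(O, 0, sizeof(float) * (size_t)N * M * H * H);
    for (int n = 0; n < N; n++)                     /* input feature maps */
      for (int m = 0; m < M; m++)                   /* weight filters */
        for (int c = 0; c < C; c++)                 /* channels */
          for (int y = 0; y < H; y++)               /* output row */
            for (int x = 0; x < H; x++)             /* output column */
              for (int j = 0; j < R; j++)           /* filter row */
                for (int i = 0; i < R; i++)         /* filter column */
                  O[((n * M + m) * H + y) * H + x] +=
                      W[((m * C + c) * R + j) * R + i] *
                      I[((n * C + c) * P + (y + j)) * P + (x + i)];
}
```

The total work is N x M x C x H x H x R x R multiply-accumulates, which is where the MAC counts quoted for real networks come from.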
Pooling Layer

- Reduces size of the feature map
  - Max pooling, Average pooling, ...

### Max pooling example

<table>
<tbody>
<tr>
<td>31</td>
<td>7</td>
<td>44</td>
<td>33</td>
</tr>
<tr>
<td>65</td>
<td>35</td>
<td>40</td>
<td>46</td>
</tr>
<tr>
<td>46</td>
<td>29</td>
<td>32</td>
<td>30</td>
</tr>
<tr>
<td>24</td>
<td>49</td>
<td>8</td>
<td>64</td>
</tr>
</tbody>
</table>

<table>
<tbody>
<tr>
<td>65</td>
<td>46</td>
</tr>
<tr>
<td>49</td>
<td>64</td>
</tr>
</tbody>
</table>
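
A minimal sketch of 2x2 max pooling with stride 2, applied to a row-major feature map. The function name `max_pool_2x2` and the requirement that both dimensions are even are assumptions of this sketch.

```c
#include <assert.h>

/* 2x2 max pooling with stride 2 (hypothetical helper for illustration).
 * in:  h x w input, row-major; h and w must be even
 * out: (h/2) x (w/2) output, row-major */
void max_pool_2x2(const int *in, int h, int w, int *out) {
    assert(h > 0 && w > 0 && h % 2 == 0 && w % 2 == 0);
    int ow = w / 2;
    for (int y = 0; y < h; y += 2) {
        for (int x = 0; x < w; x += 2) {
            int best = in[y * w + x];
            /* scan the remaining three cells of the 2x2 window */
            if (in[y * w + x + 1] > best) best = in[y * w + x + 1];
            if (in[(y + 1) * w + x] > best) best = in[(y + 1) * w + x];
            if (in[(y + 1) * w + x + 1] > best) best = in[(y + 1) * w + x + 1];
            out[(y / 2) * ow + (x / 2)] = best;
        }
    }
}
```

Each output cell is the maximum of one non-overlapping 2x2 window, so a 4x4 input shrinks to 2x2.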
Real Convolutional Neural Network
-- AlexNet

Simplified intuition: Higher-order information at later layers

Alex Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS, 2012
Real Convolutional Neural Network -- VGG 16

Contains 138 million weights and 15.5G MACs to process one 224 × 224 input image
There are Many, Many Neural Networks

- GoogLeNet, ResNet, YOLO, ...
  - Share common building blocks, but look drastically different

GoogLeNet (ImageNet 2014 winner)

ResNet (ImageNet 2015 winner)
Beware/Disclaimer on Accelerators

- This field is advancing very quickly and is messy right now
- Lots of papers/implementations always beating each other, with seemingly contradictory results
  - Eyes wide open!
The Need For Neural Network Accelerators

- Remember: “VGG-16 requires 138 million weights and 15.5G MACs to process one $224 \times 224$ input image”
  - CPU at 3 GHz, 1 IPC, (3 Giga Operations Per Second – GOPS): 5+ seconds per image
  - Also significant power consumption!
    - (Optimistically assuming 3 GOPS/thread at 8 threads using 100 W, 0.24 GOPS/W)
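
The back-of-the-envelope numbers above can be reproduced with two small helpers. The function names are hypothetical, and the sketch follows the slide's simplification that one MAC counts as one operation.

```c
#include <assert.h>

/* Time to process one image, treating one MAC as one operation. */
double seconds_per_image(double macs, double ops_per_second) {
    assert(ops_per_second > 0.0);
    return macs / ops_per_second;
}

/* Power efficiency of a multi-threaded CPU in GOPS per watt. */
double gops_per_watt(double gops_per_thread, int threads, double watts) {
    assert(threads > 0 && watts > 0.0);
    return gops_per_thread * threads / watts;
}
```

With 15.5e9 MACs at 3e9 operations per second, `seconds_per_image` gives a little over 5 seconds; `gops_per_watt(3.0, 8, 100.0)` gives 0.24 GOPS/W.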

<table>
<thead>
<tr>
<th></th>
<th>CPU (Intel DuoCore, 2.7GHz)</th>
<th>GPU (GTX480)</th>
<th>mGPU (GT335m)</th>
<th>NeuFlow on Xilinx Virtex 6</th>
<th>NeuFlow on IBM 45 nm process</th>
</tr>
</thead>
<tbody>
<tr>
<td>Real GOPS</td>
<td>1.1</td>
<td>294</td>
<td>54</td>
<td>147</td>
<td>1164</td>
</tr>
<tr>
<td>Power (W)</td>
<td>30</td>
<td>220</td>
<td>30</td>
<td>10</td>
<td>5</td>
</tr>
<tr>
<td>GOPS/W</td>
<td>0.04</td>
<td>1.34</td>
<td>1.8</td>
<td>14.7</td>
<td>230</td>
</tr>
</tbody>
</table>

* Old data (2011), and performance varies greatly by implementation, some reporting 3+ GOPS/thread on an i7. Trend is still mostly true!

Farabet et al., “NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision”
Two Major Layers

- **Convolution Layer**
  - Many small (1x1, 3x3, 11x11, ...) filters
    - Small number of weights per filter, relatively small number in total vs. FC
  - Over 90% of the MAC operations in a typical model

- **Fully-Connected Layer**
  - N-to-N connection between all neurons, large number of weights
Spatial Mapping of Compute Units

- Typically a 2D matrix of Processing Elements
  - Each PE is a simple multiply-accumulator
  - Extremely large number of PEs
  - Very high peak throughput!

- Is memory the bottleneck (Again)?
Memory Access is (Typically) the Bottleneck (Again)

- 100 GOPS requires over 300 billion weight/activation accesses per second
  - Assuming 4-byte floats, 1.2 TB/s of memory accesses

- AlexNet requires 724 million MACs to process a 227 x 227 image, over 2 billion weight/activation accesses
  - Assuming 4-byte floats, that is over 8 GB of weight/activation accesses per image
  - Over 240 GB/s to hit 30 frames per second

- An interesting question:
  - Can CPUs achieve this kind of performance?
  - With SIMD and good caching, maybe, but not at low power
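
The bandwidth figures on this slide follow from simple arithmetic, sketched below. The function names are hypothetical, and the factor of three memory accesses per MAC is an assumption inferred from the slide's own ratio (300 billion accesses for 100 GOPS), with no on-chip reuse at all.

```c
#include <assert.h>

/* Bytes moved for a given number of MACs with no data reuse.
 * accesses_per_mac = 3.0 is an assumption inferred from the slide's numbers. */
double bytes_moved(double macs, double accesses_per_mac, double bytes_per_access) {
    assert(macs >= 0.0 && accesses_per_mac > 0.0 && bytes_per_access > 0.0);
    return macs * accesses_per_mac * bytes_per_access;
}

/* Sustained memory bandwidth needed to reach a target frame rate. */
double required_bandwidth(double bytes_per_image, double frames_per_second) {
    assert(frames_per_second > 0.0);
    return bytes_per_image * frames_per_second;
}
```

For AlexNet, `bytes_moved(724e6, 3.0, 4.0)` is about 8.7e9 bytes per image, and `required_bandwidth` at 30 frames per second comes to roughly 2.6e11 bytes/s, consistent with the "over 240 GB/s" estimate.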

“About 35% of cycles are spent waiting for weights to load from memory into the matrix unit ...” – Jouppi et al., Google TPU
Spatial Mapping of Compute Units 2

- Optimization 1: On-chip network moves data (weights/activations/output) between PEs and memory for reuse
- Optimization 2: Small, local memory on each PE
  - Typically using a Register File, a special type of memory with zero-cycle latency, but at high spatial overhead
- Cache invalidation/work assignment... how?
  - Computation is very regular and predictable

A class of accelerators deals only with problems that fit entirely in on-chip memory. This distinction is important.
Different Strategies of Data Reuse

- **Weight Stationary**
  - Try to maximize local weight reuse

- **Output Stationary**
  - Try to maximize local partial sum reuse

- **Row Stationary**
  - Try to maximize inter-PE data reuse of all kinds

- **No Local Reuse**
  - A single or a few global on-chip buffers, no per-PE register file and its space/power overhead

Weight Stationary

- Keep weights cached in PE register files
  - Effective for convolution especially if all weights can fit in PEs

- Each activation is broadcast to all PEs, and computed partial sum is forwarded to other PEs to complete computation
  - Intuition: Each PE is working on an adjacent position of an input row

Weight stationary convolution for a row in the convolution:

- Partial sum of a previous activation row if any
- Partial sum stored for next activation row, or final sum

nn-X, NeuFlow, and others
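
A software sketch of the weight-stationary idea for one row of a convolution. This is a simplified model, not any particular accelerator's design: the function name, the `MAX_PES` limit, and the sequential simulation of what hardware would do in parallel are all assumptions. PE `k` holds weight `w[k]` for the whole run, each activation is broadcast to every PE, and partial sums are forwarded from PE `k-1` to PE `k`.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PES 16  /* hypothetical PE count for this sketch */

/* 1D weight-stationary convolution: y[o] = sum_k w[k] * x[o + k].
 * y must have room for n - r + 1 outputs. */
void conv1d_weight_stationary(const float *x, size_t n,
                              const float *w, size_t r, float *y) {
    assert(r > 0 && r <= MAX_PES && n >= r);
    float psum[MAX_PES] = {0};            /* one partial-sum register per PE */
    for (size_t t = 0; t < n; t++) {      /* broadcast activation x[t] */
        /* Update from the last PE backwards so each PE reads the
         * partial sum its neighbor produced in the previous step. */
        for (size_t k = r; k-- > 0;) {
            float incoming = (k == 0) ? 0.0f : psum[k - 1];
            psum[k] = incoming + w[k] * x[t];
        }
        if (t + 1 >= r) {
            y[t + 1 - r] = psum[r - 1];   /* last PE emits a finished sum */
        }
    }
}
```

Weights are read from memory once and never move; only activations and partial sums travel.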
Output Stationary

- Keep partial sums cached on PEs – Work on a subset of the output at a time
  - Effective for FC layers, where each output depends on many inputs/weights
  - Also for convolution layers when they have too many layers (channels)

- Each weight is broadcast to all PEs, and input relayed to neighboring PEs
  - Intuition: Each PE is working on an adjacent position in an output sub-space

ShiDianNao, and others
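
A software sketch of the output-stationary idea in one dimension. As with any such model, it is a simplification under stated assumptions: the function name, the `MAX_PES` limit, and the sequential simulation are hypothetical. Each PE owns one output and keeps its partial sum in place, each weight is broadcast to all PEs, and inputs are relayed from neighbor to neighbor.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PES 16  /* hypothetical PE count for this sketch */

/* 1D output-stationary convolution: y[o] = sum_k w[k] * x[o + k].
 * One PE per output element; y must have room for n - r + 1 outputs. */
void conv1d_output_stationary(const float *x, size_t n,
                              const float *w, size_t r, float *y) {
    assert(r > 0 && n >= r && n - r + 1 <= MAX_PES);
    size_t p = n - r + 1;                 /* number of PEs in use */
    float acc[MAX_PES] = {0};             /* stationary partial sums */
    float in[MAX_PES];                    /* input register in each PE */
    for (size_t i = 0; i < p; i++) in[i] = x[i];
    for (size_t k = 0; k < r; k++) {      /* broadcast weight w[k] */
        for (size_t i = 0; i < p; i++) acc[i] += w[k] * in[i];
        /* Relay inputs: each PE takes its right neighbor's value,
         * and the last PE fetches one new input from memory. */
        for (size_t i = 0; i + 1 < p; i++) in[i] = in[i + 1];
        in[p - 1] = (p + k < n) ? x[p + k] : 0.0f;
    }
    for (size_t i = 0; i < p; i++) y[i] = acc[i];
}
```

Partial sums never leave their PE until they are complete, which is why this style suits outputs that depend on many inputs and weights.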
Row Stationary

- Keep as much data related to the same filter row cached as possible, across PEs
  - Filter weights, input, output...
- Not much reuse in a PE
  - Weight stationary if filter row fits in register file

Eyeriss, and others
Row Stationary

- Lots of reuse across different PEs
  - Filter row reused horizontally
  - Input row reused diagonally
  - Partial sum reused vertically

- Even further reuse by interleaving multiple input channels and multiple filters
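
The row-stationary mapping can be sketched in software by treating PE (i, j) as a 1D convolution of filter row `i` with input row `i + j`, accumulated into output row `j`. The function names and the sequential simulation are assumptions of this sketch; it models a single channel and a single filter, without the interleaving mentioned above.

```c
#include <assert.h>

/* Work done inside one PE: 1D convolution of one filter row with one
 * input row, accumulated into a row of partial sums. */
static void pe_row_conv(const float *filter_row, int r,
                        const float *input_row, int w, float *psum_row) {
    for (int x = 0; x + r <= w; x++)
        for (int i = 0; i < r; i++)
            psum_row[x] += filter_row[i] * input_row[x + i];
}

/* 2D convolution (stride 1, no padding) organized as a grid of PEs.
 * in: h x w, filt: r x r, out: (h-r+1) x (w-r+1), all row-major. */
void conv2d_row_stationary(const float *in, int h, int w,
                           const float *filt, int r, float *out) {
    assert(r > 0 && h >= r && w >= r);
    int oh = h - r + 1, ow = w - r + 1;
    for (int k = 0; k < oh * ow; k++) out[k] = 0.0f;
    for (int i = 0; i < r; i++)         /* PE row i: filter row i, reused horizontally */
        for (int j = 0; j < oh; j++)    /* PE column j: builds output row j */
            /* input row i + j is shared along the diagonal of the PE grid;
             * partial sums for output row j accumulate vertically over i */
            pe_row_conv(filt + i * r, r, in + (i + j) * w, w, out + j * ow);
}
```

The three index patterns in the inner call (`i` only, `i + j`, `j` only) correspond to the horizontal, diagonal, and vertical reuse listed on the slide.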
No Local Reuse

- While in-PE register files are fast and power-efficient, they are not space efficient.
- Instead of distributed register files, use the space to build a much larger global buffer, and read/write everything from there.

Google TPU, and others
Google TPU Architecture
Static Resource Mapping

Map And Fold For Efficient Use of Hardware

Replication

AlexNet
Layer 3-5

Folding

AlexNet
Layer 2

Physical PE Array

Unused PEs are Clock Gated

Physical PE Array

Requires a flexible on-chip network

Overhead of Network-on-Chip Architectures

- Eyeriss PE
- Bus
- Mesh
- Crossbar Switch

Throughput
Power Efficiency Comparisons

- Any of the presented architectures reduces memory pressure enough that memory access is no longer the dominant bottleneck
  - Now what’s important is the power efficiency

The goal becomes to reduce DRAM access as much as possible!

Joel Emer et al., “Hardware Architectures for Deep Neural Networks,” tutorial from ISCA 2017
Power Efficiency Comparisons

* Some papers report different numbers [1] where NLR with a carefully designed global on-chip memory hierarchy is superior.


Power Consumption Comparison Between Convolution and FC Layers

- Data reuse in FC is inherently low
  - Unless we have enough on-chip buffers to keep all weights, systems methods are not going to be enough
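
The reuse gap can be made concrete by counting how many MACs each weight participates in for a single input. The helper names and the layer shapes in the usage note are hypothetical examples, not figures from a specific network.

```c
#include <assert.h>

/* Convolution layer with m filters of c x r x r weights producing an e x e output. */
long long conv_weights(int m, int c, int r) {
    assert(m > 0 && c > 0 && r > 0);
    return (long long)m * c * r * r;
}
long long conv_macs(int m, int c, int r, int e) {
    assert(e > 0);
    return conv_weights(m, c, r) * e * e;   /* every weight is used at every output position */
}

/* Fully connected layer: one weight per (input, output) pair. */
long long fc_weights(int n_in, int n_out) {
    assert(n_in > 0 && n_out > 0);
    return (long long)n_in * n_out;
}
long long fc_macs(int n_in, int n_out) {
    return fc_weights(n_in, n_out);         /* every weight is used exactly once per input */
}

/* Average number of MACs that each weight contributes to. */
double macs_per_weight(long long macs, long long weights) {
    assert(weights > 0);
    return (double)macs / (double)weights;
}
```

For a hypothetical 3x3 convolution with 3 input channels, 64 filters, and a 224 x 224 output, each weight is reused 50,176 times; for a hypothetical 4096-to-1000 FC layer, each weight is used once, so every MAC needs a fresh weight from memory.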

Next: Model Compression