
Politehnica University of Timisoara
Faculty of Automation and Computers
Department of Computer and
Information Technology
ACCELERATING TENSOR OPERATIONS FOR MACHINE LEARNING. A HARDWARE IMPLEMENTATION
Diploma Project
Alexandru RACOVAN
Supervisor:
Conf. dr. ing. Lucian PRODAN
Timisoara
June, 2019

Table of Contents
1 Introduction
  1.1 Problem Statement
  1.2 State-of-the-art
  1.3 Outline
2 Theoretical Foundation
  2.1 Mathematical Background
    2.1.1 Vector Spaces
    2.1.2 Matrix Properties
    2.1.3 Basic Tensor Concepts and Operations
  2.2 Machine Learning Concepts
  2.3 Hardware Elements
3 Proposed Solution and Development Methodology
4 Implementation
5 Prototyping
6 Usage and Experimental Results
7 Conclusions

Figure list
2.1 Illustration of a simple deep learning model
2.2 A dummy figure b
2.3 A dummy figure c

Table list
2.1 Dummy table 1

Chapter 1
Introduction
1.1 Problem Statement
If you used a facial recognition system decades ago, you probably do not have fond
memories of the experience. In fact, you had to spend hours teaching a device to
recognize your face by correcting its errors. Today you just glance at your phone
and it unlocks. Artificial Intelligence (AI) is so common nowadays that we take it
for granted without acknowledging the tremendous amount of computation that takes
place behind the display.
The first question that comes to mind after comparing the results of the same
algorithms across different decades is: what happened? On a larger scale we can ask:
Why are some algorithms that were basically invented in the 1950s
through the 1980s only now causing such a transformation
of business and society? [1]
The answer is as simple as it looks: technology finally became powerful enough to
make the promises of AI real.
In essence, artificial intelligence is a huge collection of mathematical formulas
that can learn from their errors. In the case of a facial recognition system,
millions of computations take place every second so that the program can recognize
a face. This requires a level of computational power that either was not available
until a few years ago or was too expensive to buy. If we consider what Marvin
Minsky – a well-known personality in Artificial Intelligence and co-founder of the
Massachusetts Institute of Technology's AI laboratory – regarded as a supercomputer
in 1960, we would conclude that a modern smartphone processor is probably a billion
times faster and a million times cheaper.
Today a dual Intel Xeon E5-2690 v4 system can achieve more than 1 TFLOPS (Tera
Floating-Point Operations Per Second) [2]. But even under these conditions, complex
artificial intelligence systems need more computational power in order to be
feasible for real-time applications.
A Central Processing Unit (CPU) is definitely the most important element of a
computing device. Today it handles scalar or vector instructions, while the
computationally intensive ones are offloaded to external specialized processing
units. Such units are controlled by the CPU and have no control over how
information flows through them. The best-known device that can execute complex
instructions is the Graphics Processing Unit (GPU), designed to manipulate and
alter memory quickly in order to accelerate image creation in a framebuffer. On the
other side there are devices such as the Tensor Processing Unit (TPU), an
application-specific integrated circuit (ASIC) designed to be an Artificial
Intelligence accelerator. The big advantage of such devices is the short time
needed to process specific sets of instructions.
Supplementary instructions have traditionally been treated as a fixed
instruction-set extension that not all CPUs in a family need in order to execute
their tasks. There have been several attempts by Intel and AMD to extend the basic
instruction set of the processor. One of them was the MMX instruction set, which
allowed Pentium CPUs to perform multimedia instructions faster and more
efficiently, but it became outdated quickly due to frequent bottlenecks. In modern
Intel CPUs the Advanced Vector Extensions (AVX) extend the 128-bit SSE registers in
order to allow further parallelism through code vectorization, but this is useful
only for applications that already use SSE instructions.
Data is the fuel of Artificial Intelligence, and in the past four years we created
more than ten times the information gathered between the beginning of the human
race and 2015 [1]. Usually, the data we store now represents information from more
than one dimension of the object or phenomenon it describes. Because of this, we
should be able to process data in more than two dimensions. In mathematics we can
represent those dimensions with tensors. They can model entities ranging from
scalars (tensors of order 0) and vectors (tensors of order 1) up to n-dimensional
data (tensors of order n).
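As a minimal illustration of these tensor orders (using NumPy arrays purely as an
example representation, not as part of the proposed hardware), a scalar, a vector,
a matrix and an order-3 tensor can be written as:

    import numpy as np

    scalar = np.array(3.14)              # order-0 tensor: a single value
    vector = np.array([1.0, 2.0, 3.0])   # order-1 tensor: one dimension
    matrix = np.ones((2, 3))             # order-2 tensor: two dimensions
    tensor3 = np.zeros((2, 3, 4))        # order-3 tensor: three dimensions

    print(scalar.ndim, vector.ndim, matrix.ndim, tensor3.ndim)   # prints: 0 1 2 3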
In order to reduce the latency and increase the throughput of tensor operations in
Artificial Intelligence applications, I propose a Complex Instruction Set Computer
(CISC) architecture specialized in processing data represented as tensors.
1.2 State-of-the-art
Despite the tremendous evolution of AI techniques, some learning algorithms, such
as Deep Neural Networks (DNNs) or Generative Adversarial Networks (GANs), demand
huge computational power and memory in order to process data. On the other side,
if we look at training data gathered over a wireless network, we observe that it is
unevenly distributed over a large number of resource-constrained devices, with each
device owning a different amount of data that usually represents only a small
fraction of the whole [3]. As a result of these two challenging circumstances, it
may be difficult to implement AI applications on user devices.
The main method for centralized learning, when it comes to cloud computing, is to
offload the data to a cloud center where it is processed, after which the results
are sent back to the user. This solution may be very ineffective because of two
major unfavorable conditions. The first drawback of the aforementioned method is
that a malicious third party may attack the cloud center and gain access to the
training data, violating the right to data privacy. On the other hand, the latency
of the data transmission can be very high due to the restricted communication
resources of current technology. In conclusion, processing data through cloud
centers is not reasonable when a real-time experience is required, or in scenarios
where data privacy is of major interest, such as military, health or banking
systems [4].
In the past years a considerable number of computer systems went through a
transition from heterogeneous to pervasive computing. In this evolutionary process
the Graphics Processing Units (GPUs) adopted a general-purpose (GPGPU) attitude
instead of the graphics-specialized one in order to boost the performance of
computing-intensive applications from different domains. Due to their high
efficiency on complex tasks they became fundamental components of High-Performance
Computing (HPC) and Artificial Intelligence systems [5]. The popularity that
Artificial Intelligence and games brought to GPUs led NVIDIA to produce seven new
generations in less than ten years [6], [7], [8]. Each new generation brings a new
micro-architecture, new hardware specifications and also a new version of CUDA.
This continuous change, combined with the many characteristics left undisclosed
beyond what the vendor provides, makes GPUs a not-so-friendly environment for
developers and also gives them an inflexible nature from the portability point of
view.
In his work, Volkov [9] gives substantial reasons supporting the idea that
inaccurate instruction latencies, which may individually be small, can add up to a
large influence on the overall accuracy of performance estimation. People
interested in the detailed latency characteristics of GPU instructions, usually
researchers, are forced to search non-academic sources because of the absence of
literature, especially for the newer generations, and another group of people is
thereby discouraged from interacting directly with such a device. In conclusion,
one of the major drawbacks of GPUs may be the secrecy of the vendors.
The current state of deep learning acceleration with hardware solutions is
dominated by clusters of GPUs used as GPGPUs, which are able to train networks with
up to 1 billion parameters on only 3 machines in just a few days; Coates, Ng et al.
[10] even propose a possible scaling to networks with over 11 billion parameters
using only 16 machines. LeCun et al., Krizhevsky et al. and Ciresan et al. [11],
[12], [13] showed that each step forward in the scale of the machines can drive a
proliferation of new results in supervised visual recognition, and such systems can
even learn to detect objects when trained from unlabeled images alone. Major
contributions to building the largest systems were also made by LeCun and by Dean
et al., using 16000 CPU cores in 1000 machines. The first problem encountered in
building a large GPU/CPU cluster is the communication bottleneck. In a
data-parallel model that employs many GPUs, each GPU keeps a complete copy of the
neural network parameters but computes a gradient using a different subset of the
training data. An important constraint is that all the network parameters must fit
on a single GPU, which creates a limitation due to the memory size: 32 GB – approx.
8 billion floating-point parameters – for a $7000 NVIDIA Quadro GV100 [14]. For a
training image this GPU can compute the gradient in milliseconds, but copying those
values to other machines takes around 10 seconds because of the Ethernet speed.
Parallelizing the gradient computation itself is an effective approach only if all
the GPUs are in the same server and share a high-speed bus, because then each GPU
is responsible for a relatively small part of the whole neural network, reducing
the bandwidth requirements considerably. This method also needs frequent
synchronization and is still inefficient when Ethernet communication is used [12].
The second major problem in building such a large system is that the communication
among different GPUs and the computation management significantly increase the
complexity of the algorithm design, sometimes making it very hard to maximize
performance.
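A minimal NumPy sketch of the data-parallel scheme described above is given below.
The quadratic loss, the number of workers and the shard sizes are hypothetical, and
the gradient exchange that constitutes the actual bottleneck in a real cluster is
reduced here to a simple in-memory average; this is not the implementation used in
[12], only an illustration of the idea.

    import numpy as np

    n_workers, n_params = 4, 1000
    params = np.zeros(n_params)            # full parameter copy kept by every worker
    data = [np.random.rand(256, n_params)  # one data shard per worker
            for _ in range(n_workers)]

    def local_gradient(theta, shard):
        # Hypothetical per-worker gradient of a quadratic loss on its own shard.
        return shard.T @ (shard @ theta - 1.0) / len(shard)

    for step in range(100):
        grads = [local_gradient(params, shard) for shard in data]  # computed in parallel
        params -= 0.01 * np.mean(grads, axis=0)                    # averaging step: this
                                                                   # exchange is the
                                                                   # communication bottleneck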
Since 2015, Google has deployed in its data centers an ASIC called the Tensor
Processing Unit (TPU) with the purpose of accelerating the inference phase of
neural networks. The most interesting feature of this circuit is the Matrix
Multiply Unit (MMU), which consists of 65536 (256×256) 8-bit Multiply-Accumulate
(MAC) units, producing 92 TOPS. Because Deep Neural Networks (DNNs) have a wide
range of applications, TPUs can be reused for solutions in speech, vision, search
ranking, translation and many more, unlike most ASICs. As instructions are sent
relatively slowly over the PCIe bus, Google researchers chose a CISC architecture
for their chip, with a 4-stage pipeline where each instruction executes in a
separate stage [15]. As described in its Patent Application Publication, the Neural
Network Processor has a Matrix Computation Unit (MCU) and a Vector Computation Unit
(VCU). The MCU receives numerous sets of weights and activation inputs for the NN
layers and generates a plurality of accumulated values based on those inputs. The
VCU is coupled to the MCU and applies an activation function to each accumulated
value it receives, generating the activated values for the NN layers [16].
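The behaviour described above can be sketched in a few lines of NumPy. This is only
an illustrative model of the MCU/VCU split, not the actual TPU datapath; the
700 MHz clock used to recover the 92 TOPS figure is taken from the TPU paper [15],
and the ReLU activation is chosen here only as an example.

    import numpy as np

    # Peak throughput: 65536 MACs x 2 ops (multiply + add) x 700 MHz ~ 92 TOPS
    peak_tops = 65536 * 2 * 700e6 / 1e12          # ~91.8

    def mcu(weights, activations):
        # Matrix Computation Unit (illustrative): accumulate weighted inputs.
        return activations @ weights

    def vcu(accumulated, activation=lambda x: np.maximum(x, 0)):
        # Vector Computation Unit (illustrative): apply an activation function.
        return activation(accumulated)

    x = np.random.rand(1, 256)                    # one set of activation inputs
    w = np.random.rand(256, 256)                  # one set of layer weights
    layer_output = vcu(mcu(w, x))                 # activated values for the next layer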
Even if the hardware is described as an accelerator for inference, its versatility
allows it to be used for many other applications, e.g. convolutional or
fully-connected neural network training, linear or logistic regression, clustering,
image processing, crypto-mining and video encoding. The connection between the
controlling machine and the TPU is made through the PCIe bus of the Host Interface.
Three types of data are transmitted over this interface: weights, activations and
control instructions. The sets of weights are stored in off-chip DDR3 DRAM memory
because they are not expected to change over time. The frequent read and write
accesses to the activations led to the creation of a memory buffer, a
high-bandwidth memory placed as close as possible to the MCU. The computations in a
TPU are made with 16-bit floating-point numbers, which can be regarded as a major
drawback because of the range of the numbers that can be expressed, rather than the
precision, which is not considered a big problem for this number representation.
Despite the above observations, Google does not sell the ASIC directly; instead,
multiple TPUs are grouped together into Pods and offered as cloud services for AI.
The latest Movidius chip aims to combine everyday devices with the capabilities of
AI. To do so, they created a Vision Processing Unit capable of making inference
more feasible on devices such as drones, cameras and robots, among others. This SoC
includes vision accelerators, a Neural Compute Engine and imaging accelerators, and
it can perform 4 TOPS with an impressive power consumption of 1.5 W. It also
features 16 programmable Very Long Instruction Word (VLIW) vector processors
intended to accelerate NNs by processing the workloads in parallel [17]. On the
other hand, for processors used in data-center accelerators, Intel proposed the
Lake Crest architecture in order to provide the flexibility needed to support
matrix multiplication and convolution, while improving the efficiency of the core
hardware components. In Deep Learning applications the data-movement instructions
are used frequently, slowing down processing. To solve this problem, in this
architecture the standard cache hierarchy was completely removed and the on-chip
memory is managed directly from software [18].
When it comes to algorithm acceleration, one of the strongest competitors of the
GPU is the Field Programmable Gate Array (FPGA). Its hardware configuration is
reconfigurable, unlike the fixed configuration of the GPU. Deep Learning primitives
such as matrix multiplication and convolution processed on an FPGA give better
performance per Watt than on a GPU in almost every case [19]. Even so, programming
such a device requires a deep understanding of hardware that many people in the
field may not possess, giving the feeling that FPGAs are a specialist architecture.
In the past years, as a consequence of this lack of popularity, many FPGA tools
have also adopted a software-level programming model in order to make programming
more attractive to mainstream software development practitioners. FPGAs are
situated somewhere between General Purpose Processors (GPPs) and ASICs, because
they benefit from a higher level of flexibility while also giving very good
performance. For Deep Learning, the most useful type of FPGA is one that supports
partial dynamic reconfiguration, where some parts of the device can be reconfigured
while others are running. As a result, individual layers can be changed without
disrupting the functionality of the rest. From a bird's-eye view, the difference
between the GPU and the FPGA is that the former has a fixed architecture and the
algorithm must be adapted to it in order to benefit from its performance, while for
the latter the architecture is tailored to the needs of the algorithm, allowing the
user to explore optimization at the algorithm level [20].
1.3 Outline
The previous subsections of this Diploma thesis described the motivation for my
approach, as well as the current top hardware accelerators for Artificial
Intelligence with their benefits and drawbacks. The Theoretical Foundation chapter
comprises the underlying mathematical principles that are considered primitives for
Machine Learning, the elements of the hardware behind the acceleration, and an
overview of the architectures used in the development of the Diploma project.
Building on the previous chapters, I propose a solution which is comprehensively
detailed in the third chapter, namely Proposed Solution and Development
Methodology. The chapters that follow contain the actual implementation of the
accelerator and its prototyping. In the last chapters, the study of my experimental
results is presented alongside the proposed additions that can be made in the
future.


Chapter 2
Theoretical Foundation
2.1 Mathematical Background
This section provides the essential concepts of linear and multilinear algebra that
are used in the rest of the Diploma. Multilinear algebra is fundamentally linear
algebra, but it works with more than one vector space at the same time. The objects
that stand at its foundation are called tensors, which are generalizations of
vectors.
2.1.1 Vector Spaces
The easiest example of a vector space is $\mathbb{R}^n$, representing the $n$-fold
Cartesian product of $\mathbb{R}$ with itself. To be more precise:
$\mathbb{R}^n = \mathbb{R} \times \cdots \times \mathbb{R}$.
A vector $v$ in $\mathbb{R}^n$ is an $n$-tuple of real numbers of the form
$(v_1, v_2, \ldots, v_n)$ with the following basic properties:
$$c\,(v_1, v_2, \ldots, v_n) := (c v_1, c v_2, \ldots, c v_n),$$
where $c$ represents a scalar; and
$$(v_1, v_2, \ldots, v_n) + (w_1, w_2, \ldots, w_n) := (v_1 + w_1, v_2 + w_2, \ldots, v_n + w_n).$$
The additive identity is represented by the zero vector $(0, 0, \ldots, 0)$.
More generally, a vector space $V$ over a field $\mathbb{F}$ is a set
$\{v, w, x, \ldots\}$ of vectors, together with a set $\{a, b, c, \ldots\}$ of
scalars, that is closed under the taking of linear combinations and where
$0v = 0$ and $1v = v$ [21].
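As a small worked example in $\mathbb{R}^3$ (chosen here only for illustration),
these two operations give
$$2\,(1, 2, 3) = (2, 4, 6), \qquad (1, 2, 3) + (4, 5, 6) = (5, 7, 9).$$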
2.1.2 Matrix Properties
This subsection presents the basic matrix operations used in the rest of my
Diploma. A matrix can be considered a tensor of order 2.
The set $\mathbb{F}^n$ consists of vectors $v$ of the form
$$v = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix},$$
where $v_1, v_2, \ldots, v_n \in \mathbb{F}$ are the components of $v$ and
$\mathbb{F}$ denotes either $\mathbb{R}$ or $\mathbb{C}$. Hence, the elements of
$\mathbb{F}^n$ are considered to be column vectors. Since $\mathbb{F}^1 = \mathbb{F}$,
we can say that a scalar is also a vector.
If $\alpha \in \mathbb{F}$ and $v \in \mathbb{F}^n$, then $\alpha v \in \mathbb{F}^n$
is given by
$$\alpha v = \begin{bmatrix} \alpha v_1 \\ \vdots \\ \alpha v_n \end{bmatrix}.$$
If $v, w \in \mathbb{F}^n$, then $v$ and $w$ are linearly dependent if there exists
$\alpha \in \mathbb{F}$ such that one of the following relations holds:
$$w = \alpha v \quad \lor \quad v = \alpha w.$$
Furthermore, the vectors $v_1, v_2, \ldots, v_m$ placed side by side form the matrix
$$A \triangleq [\,v_1 \;\; v_2 \;\; \cdots \;\; v_m\,],$$
which has $n$ rows and $m$ columns. The entries of matrix $A$ are the components of
the vectors $v_1, v_2, \ldots, v_m$. The size of matrix $A$ is $n \times m$, which
can be denoted as $A \in \mathbb{F}^{n \times m}$. If $m = n$, the order of $A$ is
$n$ and $A$ is a square matrix. The row index of matrix $A$ is denoted by $i$ and
the column index by $j$. Respecting this convention, we can use the notations
$\mathrm{row}_i(A)$ and $\mathrm{col}_j(A)$. Hence,
$$A = \begin{bmatrix} \mathrm{row}_1(A) \\ \vdots \\ \mathrm{row}_n(A) \end{bmatrix}
   = [\,\mathrm{col}_1(A) \;\; \cdots \;\; \mathrm{col}_m(A)\,].$$
Partitioned matrices are of the form
$$A = \begin{bmatrix} A_{11} & \cdots & A_{1n} \\ \vdots & \ddots & \vdots \\ A_{m1} & \cdots & A_{mn} \end{bmatrix}.$$

Let $A \in \mathbb{F}^{n \times m}$ and $B \in \mathbb{F}^{m \times l}$. Then
$AB \in \mathbb{F}^{n \times l}$ is the product of $A$ and $B$ and can be written as
$$AB = \begin{bmatrix} \mathrm{row}_1(A)\,B \\ \vdots \\ \mathrm{row}_n(A)\,B \end{bmatrix}
    = [\,A\,\mathrm{col}_1(B) \;\; \cdots \;\; A\,\mathrm{col}_l(B)\,].$$
The multiplication of partitioned matrices is exemplified below:
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}
  \begin{bmatrix} E & F \\ G & H \end{bmatrix}
= \begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{bmatrix}.$$
The identity matrix, denoted by $I_n$, has dimension $n \times n$ and contains 1's
on the diagonal and 0's elsewhere. For $A \in \mathbb{F}^{n \times m}$ it satisfies
the relation
$$A I_m = I_n A = A.$$
A fundamental operation for matrices is the transpose. If $v \in \mathbb{F}^n$, then
the transpose $v^T$ of $v$ is defined to be the row vector
$$v^T \triangleq [\,v_1 \;\; \cdots \;\; v_n\,] \in \mathbb{F}^{1 \times n}.$$
Let $C \in \mathbb{C}^{n \times m}$. Then $C = A + \jmath B$, where
$A, B \in \mathbb{R}^{n \times m}$. Therefore, the complex conjugate $\overline{C}$
of $C$ is
$$\overline{C} \triangleq A - \jmath B,$$
while the complex conjugate transpose $C^{*}$ of $C$ is
$$C^{*} = \overline{C}^{\,T}.$$
The rank of $A \in \mathbb{F}^{n \times m}$ is defined by
$$\operatorname{rank} A \triangleq \dim \mathcal{R}(A),$$
where $\mathcal{R}(A)$ denotes the range (column space) of $A$. It can be seen that
the rank of $A$ is equal to the number of linearly independent columns of $A$ over
$\mathbb{F}$ [22].
If a matrix $A$ is nonsingular and $\operatorname{rank} A = n = m$, then it has a
unique inverse $A^{-1}$ such that
$$A A^{-1} = A^{-1} A = I_n.$$
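A short NumPy sketch of these definitions (provided purely as an illustration, not
as part of the proposed hardware) shows the product, transpose, conjugate transpose
and rank of small matrices:

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])          # A in F^(2x3)
    B = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])               # B in F^(3x2)

    AB = A @ B                               # product, a 2x2 matrix
    At = A.T                                 # transpose, a 3x2 matrix
    C = np.array([[1 + 2j, 3 - 1j]])         # a complex matrix
    C_star = C.conj().T                      # complex conjugate transpose
    rank_A = np.linalg.matrix_rank(A)        # number of linearly independent columns: 2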
2.1.3 Basic Tensor Concepts and Operations
At the beginning of this subsection a new type of vector product is defined, namely
the tensor product, denoted by the symbol $\otimes$. The product $v \otimes w$ of
two vectors $v$ and $w$ is called a tensor of order 2.

Higher-order tensors can be formed by adding a new vector $u$ to the product, giving
us a third-order tensor
$$u \otimes v \otimes w.$$
It is observable that order-0 tensors are scalars, order-1 tensors are vectors,
while order-2 tensors are matrices.
The set of tensors of order $r$, $T^r$, forms a vector space in a natural way: if
$T$ and $S$ are both tensors of order $r$, then $\lambda T + \mu S$ is also a tensor
of order $r$ for any scalars $\lambda$ and $\mu$. Thus
$$T^r = V \otimes V \otimes \cdots \otimes V = V^{\otimes r}.$$
But if tensor $T$ has order $r$ and $S$ is a tensor of order $s$, then the product
$T \otimes S$ is a tensor of order $r + s$. In addition, multiplication of a tensor
product by a scalar $\lambda$ satisfies
$$T \otimes (\lambda S) = (\lambda T) \otimes S = \lambda\,(T \otimes S),$$
and tensor products are distributive over addition [21]:
$$R \otimes (S + T) = R \otimes S + R \otimes T,$$
$$(S + T) \otimes R = S \otimes R + T \otimes R.$$
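As an illustrative check of the order-addition rule (again only a NumPy sketch, not
the hardware representation), the tensor (outer) product of an order-2 and an
order-1 tensor yields an order-3 tensor:

    import numpy as np

    v = np.array([1.0, 2.0])              # order-1 tensor (vector)
    w = np.array([3.0, 4.0, 5.0])         # order-1 tensor (vector)
    M = np.ones((2, 2))                   # order-2 tensor (matrix)

    T2 = np.multiply.outer(v, w)          # v (x) w: order 1 + 1 = 2, shape (2, 3)
    T3 = np.multiply.outer(M, v)          # M (x) v: order 2 + 1 = 3, shape (2, 2, 2)

    print(T2.ndim, T3.ndim)               # prints: 2 3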
2.2 Machine Learning Concepts
The Machine Learning approach suggests that the difficulties encountered by systems
based on hard-coded knowledge can be overcome by giving systems the ability to
acquire their own knowledge. This capability can be obtained by finding patterns in
raw sets of inputs. The performance of machine learning algorithms is heavily
influenced by the representation used for the data: each piece of information
included in the representation is considered a feature. This dependence on the
representation is in general an obstacle when dealing with computers.
Many problems can be solved with artificial intelligence by designing the right set
of features to extract for that particular type of problem. For simple tasks, such
as speaker identification from a voice sample, feature extraction is not hard, but
when the system is part of a more complex one, the selection is much more difficult
to make.

Deep learning solves the principal obstacle of representation learning by
expressing complicated representations in terms of simpler ones, enabling the
system to assemble complex concepts out of simpler concepts. Figure 2.1 exemplifies
how a system based on deep learning can recognize a person by combining elementary
concepts such as object parts, which are in turn defined in terms of corners and
contours.
Figure 2.1: Illustration of a simple deep learning model – a fully connected
network with an input layer (I1 ... In), a hidden layer (H1 ... Hn) and an output
layer (O1 ... On).
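A minimal NumPy sketch of the forward pass through the model in Figure 2.1 follows;
the layer sizes and the sigmoid activation are chosen here purely as an example and
are not part of the proposed design.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_in, n_hidden, n_out = 4, 3, 2            # I1..In, H1..Hn, O1..On
    W1 = np.random.randn(n_in, n_hidden)       # input  -> hidden weights
    W2 = np.random.randn(n_hidden, n_out)      # hidden -> output weights

    x = np.random.rand(n_in)                   # one input sample
    h = sigmoid(x @ W1)                        # hidden layer activations
    y = sigmoid(h @ W2)                        # output layer activations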

2.3 Hardware Elements
AND HERE IS RANDOM TEXT
Figure 2.2: A dummy figure b
AND HERE
Figure 2.3: A dummy figure c
THIS IS A RANDOM TEXT
Table 2.1: Dummy table 1

Chapter 3
Proposed Solution and
Development Methodology

Chapter 4
Implementation

Chapter 5
Prototyping

Chapter 6
Usage and Experimental Results

Chapter 7
Conclusions

Bibliography
[1] FORBES AI: ISSUE 01, https://www.forbes.com/sites/intelai/2018/07/17/the-rise-in-computing-power-why-ubiquitous-artificial-intelligence-is-now-a-reality/#5db10df61d3fl
[2] Dr Donald Kinghorn: Intel Core-i9 7900X and 7980XE Skylake-X Linux Linpack Performance, https://www.pugetsystems.com/labs/hpc/Intel-Core-i9-7900X-and-7980XE-Skylake-X-Linux-Linpack-Performance-1059/
[3] J. Park, S. Samarakoon, M. Bennis, M. Debbah. Wireless network intelligence at the edge. CoRR, vol. abs/1812.02858, 2018.
[4] W. House. Consumer data privacy in a networked world: A framework for protecting privacy and promoting innovation in the global digital economy. J. Privacy and Confidentiality, 2013.
[5] Yehia Arafa, Abdel-Hameed Badawy, Gopinath Chennupati, Nandakishore Santhi, Stephan Eidenbenz. Instructions' Latencies Characterization for NVIDIA GPGPUs, 2019.
[6] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2010, pp. 235–246.
[7] X. Mei, X. Chu. Dissecting GPU memory hierarchy through microbenchmarking. IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 1, pp. 72–86, Jan. 2017.
[8] X. Mei, X. Chu. A micro-benchmark suite for AMD GPUs. 39th International Conference on Parallel Processing Workshops, 2010, pp. 387–396.
[9] V. Volkov. A microbenchmark to study GPU performance models. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '18. New York, NY, USA: ACM, 2018, pp. 421–422.
[10] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, A. Ng. Deep learning with COTS HPC systems. Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):1337–1345, 2013.
[11] Y. LeCun, F. J. Huang, L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. Computer Vision and Pattern Recognition, volume 2, pp. 97–104, 2004.
[12] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.
[13] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column deep neural networks for image classification. Computer Vision and Pattern Recognition, pp. 3642–3649, 2012.
[14] Bob Sherbin: Jensen Huang Keynotes NVIDIA's 2018 GPU Technology Conference, https://blogs.nvidia.com/blog/2018/03/26/live-jensen-huang-keynote-2018-gtc
[15] D. Patterson and the Google TPU Team. In-Datacenter Performance Analysis of a Tensor Processing Unit. Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.
[16] J. Ross et al., inventors; Google Inc., assignee. Neural Network Processor. United States Patent Application No. 14/844,524, Sep. 3, 2015.
[17] Paul Alcorn: Intel Unveils Movidius Myriad X Vision Processing Unit, https://www.tomshardware.com/news/intel-movidius-vpu-ai-inference,35327.html
[18] Naveen Rao: Intel® Nervana™ Neural Network Processors (NNP) Redefine AI Silicon, https://www.intel.ai/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon/#gs.h3pbus

[19] J. Fowers, G. Brown, P. Cooke, G. Stitt. A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 47–56. ACM, 2012.
[20] G. Lacey, G. W. Taylor, S. Areibi. Deep Learning on FPGAs: Past, Present, and Future. arXiv preprint arXiv:1602.04283, 2016.
[21] P. Renteln. Manifolds, Tensors, and Forms. Cambridge University Press, 2014.
[22] D. S. Bernstein. Matrix Mathematics: Theory, Facts, and Formulas. Princeton University Press, 2009.
[23] M. Goossens, F. Mittelbach, A. Samarin. The LaTeX Companion. Addison-Wesley, Reading, Massachusetts, 1993.
[24] A. Einstein. Zur Elektrodynamik bewegter Körper [On the electrodynamics of moving bodies]. Annalen der Physik, 322(10):891–921, 1905.
