Computing System and Network Architectures in High
Frequency Trading Financial Applications
Sorin Zoican, Marius Vochin
University POLITEHNICA of Bucharest, Romania
[anonimizat], [anonimizat]
Abstract: In this paper, we investigate computing systems and network architectures dedicated to high frequency trading applications and evaluate their performance. Both a high processing speed and a low network latency are important for high-frequency traders. The financial market literature suggests, however, that extremely high speeds discourage other traders from participating in the market, thereby harming the quality of financial markets. We find that existing medium-cost technology is enough to promote an optimal trading speed, and we therefore argue that further investment in low latency technology is inefficient from a technical and economical point of view.
Keywords – compute unified device architecture, high frequency trading algorithms, network latency.
I. INTRODUCTION
Algorithmic trading is computer-based trading widely used by financial investors. High frequency trading (HFT) is a subset of algorithmic trading that relies on very fast computers to process information (e.g., stock price changes) and send buy or sell orders before competitor traders do. Such orders are transmitted over a communication network that links the trader's server with the exchange's servers, as illustrated in Figure 1.
Figure 1. High frequency trading system
High frequency traders seek "arbitrage opportunities": price misalignments that create (short-lived) profit opportunities. Low latency is important for HFTs: first to process new information quickly, then to act on it (buy or sell) before anybody else does. As a consequence, they need high-performance communication across a network and a powerful computer. A low network latency is necessary to process market events before the competitors, and computational power is required to make the decision faster. Nowadays, processing and decision times are of the order of microseconds or less. Financial trades are made using complex algorithms running on supercomputers and exchanging data over a low latency network in microseconds [1].
While speed allows HFTs to maintain a competitive advantage over their peers, it is not straightforward that speed is good for the market. In [2] the authors find that a low enough trading latency (sub-millisecond) can increase trading costs, i.e., generate a higher bid-ask spread. In short, it is not optimal for financial markets to become dominated by HFT speculative strategies. How small should the overall latency (including computing time and data transfers over the network) be?
This paper evaluates medium-cost high frequency trading solutions from a technical perspective and emphasizes the two main components of an HFT platform: the network latency and the computational power obtained using a heterogeneous architecture based on the Compute Unified Device Architecture (CUDA). The paper investigates the limits in computation and transfer speed from the architectural and technological point of view and suggests an answer to the fundamental question: is it necessary to further increase the speed of HFT systems, or are today's solutions enough, with more speed being undesirable due to its major implications for the financial system?
II. HIGH FREQUENCY TRADING ALGORITHMS
The class of online algorithms suitable for HFT information-processing applications comprises the following algorithms: the online mean algorithm, the online variance algorithm and the online regression algorithm [3]. For the first two algorithms we consider a single input (new information) $X$, a parameter $\alpha$ (known as the forgetting factor) which measures the weight given to past information relative to the most recent piece of information, and the current time index $n$.
The time-weighted mean of the information signals, $\hat{X}$, is:
$$\hat{X}(n) = \alpha X + (1-\alpha)\,\hat{X}(n-1) \quad (1)$$
with $\hat{X}(0) = 0$.
The variance $\hat{V}$ is:
$$\hat{X}(n) = \alpha X + (1-\alpha)\,\hat{X}(n-1) \quad (2)$$
$$\hat{V}(n) = \alpha\,[X - \hat{X}(n)]^2 + (1-\alpha)\,\hat{V}(n-1) \quad (3)$$
with $\hat{X}(0) = 0$ and $\hat{V}(0) = 1$.
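As a minimal sketch (in Python, with an assumed forgetting factor `alpha`), relations (1)-(3) reduce to two constant-time update functions applied to each incoming tick:

```python
# Online mean and variance with a forgetting factor, per relations (1)-(3).
# alpha trades responsiveness to new ticks against smoothing of the past.

def online_mean(x, prev_mean, alpha):
    # Relation (1): X_hat(n) = alpha*X + (1 - alpha)*X_hat(n-1)
    return alpha * x + (1.0 - alpha) * prev_mean

def online_variance(x, prev_mean, prev_var, alpha):
    # Relations (2)-(3): update the mean first, then the variance
    mean = online_mean(x, prev_mean, alpha)
    var = alpha * (x - mean) ** 2 + (1.0 - alpha) * prev_var
    return mean, var

# Starting from X_hat(0) = 0 and V_hat(0) = 1, a single tick X = 2 with
# alpha = 0.5 gives mean 1.0 and variance 0.5*(2-1)^2 + 0.5*1 = 1.0.
mean, var = online_variance(2.0, 0.0, 1.0, 0.5)
```

Because each update touches only the previous state, the cost per feed is constant, which is what makes these algorithms suitable for the microsecond budgets discussed above.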
The third algorithm projects noisy signals $Y$ on a set of variables $X$, to separate information from the noise. We consider the matrix
$$\mathbf{X} = \begin{bmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}$$
and the vector
$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}$$
with $X_1,\ldots,X_n$ and $Y_1,\ldots,Y_n$ the last $n$ values of the inputs $X$ and $Y$. The regression algorithm finds a correlation factor $\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$ such that $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\varepsilon}$ represents noise. This correlation factor is determined as:
$$\boldsymbol{\beta} = \mathbf{M}(n)^{-1}\,\mathbf{V}(n) \quad (4)$$
with
$$\mathbf{M}(n) = \alpha\,\mathbf{M}(n-1) + (1-\alpha)\,\mathbf{X}^T\mathbf{X} \quad (5)$$
$$\mathbf{V}(n) = \alpha\,\mathbf{V}(n-1) + (1-\alpha)\,\mathbf{X}^T\mathbf{Y} \quad (6)$$
Relation (5) may be rewritten as:
$$\mathbf{M}(n) = \alpha\,\mathbf{M}(n-1) + (1-\alpha)\begin{bmatrix} n & n\hat{X}(n) \\ n\hat{X}(n) & n\widehat{X^2}(n) \end{bmatrix} \quad (7)$$
where $\hat{X}(n)$ and $\widehat{X^2}(n)$ are the means of the last $n$ inputs of $X$ and of the last $n$ values of $X^2$, respectively.
Following the same calculations, relation (6) may be rewritten as:
$$\mathbf{V}(n) = \alpha\,\mathbf{V}(n-1) + (1-\alpha)\begin{bmatrix} n\hat{Y}(n) \\ n\hat{P}(n) \end{bmatrix} \quad (8)$$
where $\hat{Y}(n)$ and $\hat{P}(n)$ are the means of the last $n$ inputs of $Y$ and of the last $n$ values of the product $P = XY$, respectively.
Initially
$$\mathbf{M}(0) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad \mathbf{V}(0) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.$$
From relation (4) we have:
$$\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \frac{1}{\Delta}\begin{bmatrix} M_{22}(n) & -M_{12}(n) \\ -M_{21}(n) & M_{11}(n) \end{bmatrix}\begin{bmatrix} V_1(n) \\ V_2(n) \end{bmatrix} \quad (9)$$
with $\Delta = M_{11}(n)M_{22}(n) - M_{12}(n)M_{21}(n)$.
Relation (1) is the online mean algorithm, relations (2) and
(3) represent the online variance algorithm and relations (7),
(8) and (9) illustrate the online regression algorithm (based on
the online mean algorithm).
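A compact sketch of the regression update, relations (7)-(9), using plain 2x2 arithmetic (function names and the state layout are illustrative, not from the paper):

```python
# Online regression state: a 2x2 matrix M and a 2-vector V, per (7)-(9).

def update_MV(M, V, n, x_mean, x2_mean, y_mean, p_mean, alpha):
    # Relation (7): blend M(n-1) with the moment matrix of the last n inputs
    M = [[alpha * M[0][0] + (1 - alpha) * n,
          alpha * M[0][1] + (1 - alpha) * n * x_mean],
         [alpha * M[1][0] + (1 - alpha) * n * x_mean,
          alpha * M[1][1] + (1 - alpha) * n * x2_mean]]
    # Relation (8): blend V(n-1) with the running means of Y and P = X*Y
    V = [alpha * V[0] + (1 - alpha) * n * y_mean,
         alpha * V[1] + (1 - alpha) * n * p_mean]
    return M, V

def beta(M, V):
    # Relation (9): beta = M^-1 * V via the closed-form 2x2 inverse
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    b0 = (M[1][1] * V[0] - M[0][1] * V[1]) / det
    b1 = (M[0][0] * V[1] - M[1][0] * V[0]) / det
    return b0, b1
```

Since the matrix is only 2x2, the inverse costs a handful of multiplications, so each feed is again processed in constant time.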
Section IV will evaluate the execution time of these
specific HFT algorithms.
III. COMPUTING SYSTEM AND NETWORK ARCHITECTURE FOR APPLICATIONS IN FINANCE
This section presents the framework for parallel computing based on the graphical processing units existing in the computing system, and the network architecture for an HFT solution.
Graphics processing units (GPUs) are now extensively used in high-performance computing, based on a data-parallel programming model that may launch tens of thousands of threads simultaneously, each executing the same code on different data. The technology that integrates the specific hardware and programming model is known as the Compute Unified Device Architecture (CUDA, proprietary to Nvidia) or the Open Computing Language (OpenCL, open source). Both approaches have similar performance [4]. In this paper we focus on the CUDA approach. The cores in the GPU may be used to perform general-purpose computations in parallel. CUDA devices have several multiprocessors with many cores each, and several memory blocks: global memory, constant memory, shared memories and registers. All threads to be executed are organized in a grid comprising blocks. The threads in blocks belonging to the same multiprocessor run in parallel. A CUDA program has one or more sections that are executed on either the CPU or the CUDA GPU. The sections with no data parallelism are implemented on the host CPU, and the sections with data parallelism run on the CUDA GPU as data-parallel functions, called kernels. Figure 2 illustrates the basic CUDA architecture (CUDA-Tesla), the grid organization, and how the code executes in a hybrid CPU-GPU computing system [6]. This computing manner speeds up program execution. Many improvements have been made in CUDA architectures to achieve more computational power [5, 6].
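To make the kernel idea concrete, the following Python sketch emulates, sequentially on the host, how a one-dimensional CUDA grid maps block and thread indices onto data. On a real GPU the two loops would run as parallel hardware threads; the names here are illustrative:

```python
# Sequential emulation of a CUDA kernel launch: a grid of blocks, each
# holding block_dim threads, all executing the same code on different data.

def launch_kernel(kernel, grid_dim, block_dim, *args):
    for block_idx in range(grid_dim):          # blocks of the grid
        for thread_idx in range(block_dim):    # threads of one block
            # The classic CUDA global-index idiom:
            # i = blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * block_dim + thread_idx
            kernel(i, *args)

def saxpy(i, a, x, y, out):
    # Data-parallel body: each "thread" touches exactly one element.
    if i < len(out):                           # guard for a partial last block
        out[i] = a * x[i] + y[i]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1] * 8
out = [0] * 8
launch_kernel(saxpy, 2, 4, 2.0, x, y, out)     # 2 blocks x 4 threads
```

The serial parts of a program stay on the host, while a call such as the launch above is where the GPU's data parallelism is exploited.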
The CUDA architectures have continuously evolved, as illustrated in Figure 3 [5]. The CUDA-Fermi architecture introduced 64-bit floating point operations and is the first complete GPU computing architecture. A set of special arithmetic units, Fused Multiply-Add (FMA) units and Special Function Units (SFU), computes these 64-bit floating point operations.
The CUDA-Kepler architecture [7] has a set of new features compared with the CUDA-Fermi architecture. It is characterized by an increased GPU utilization and a more efficient design for parallel programs. Figure 4 shows the main architectural improvements introduced by the CUDA-Kepler architecture compared with the CUDA-Fermi architecture. The main features of the Kepler architecture are:
- Dynamic parallelism – the GPU has the capability to generate threads for itself, synchronize the results and control the scheduling in hardware.
- Hyper queues (Hyper-Q) – enable multiple CPU cores to launch work on a single GPU at the same time, over up to 32 connections managed in hardware.
- Grid management unit – manages grids to support dynamic parallelism (prioritizes and queues the pending or suspended grids until they are ready to execute).
- NVIDIA GPUDirect – enables multiple GPUs on a single computer, or on different computers connected in a network, to transfer data without needing the CPU memory.
The CUDA-Maxwell architecture [8] retains the same CUDA programming model as the previous CUDA-Fermi and CUDA-Kepler architectures but introduces a new multiprocessor design that improves control logic partitioning, workload balancing, thread scheduling, the number of instructions per clock cycle, and energy efficiency.
Figure 2. CUDA hardware architecture, grid organization and program flow
Figure 3. The evolution of CUDA architectures
Figure 4. Architectural improvements in Kepler CUDA
The CUDA-Pascal architecture [9] introduces a unified memory between CPU and GPU and a new high-bandwidth, energy-efficient transfer interface called NVLink. This interface allows data transfers up to 12 times faster than the traditional PCI Express interface. Table 1 shows the main features and performance of the CUDA architectures.
Table 1 – The features of CUDA architectures

                  Cores   Clock (MHz)   Memory bandwidth (GB/s)   Memory size (GB)   Gflops**   Power (W)
TESLA C870          128          1350                      76.8                1.5        120         170
FERMI GTX480        480          1401                     177.4                  2        598         100
KEPLER GK110       2496           837                     288.0                  8       2995         150
MAXWELL GM200      3072          1000                     336.5                 12       6070         250
PASCAL* GP100        NA            NA                      1000                 16         NA          NA

* announced to be 10x faster than Maxwell, expected by the second quarter of 2016
** single precision
A low latency network solution must be designed to achieve the best performance of an HFT system. This network should have a scalable architecture whose performance adjusts to meet financial market requirements. The main features of such an architecture are [10]:
- Predictability
- Low latency and reliability
- Direct connection from the LAN servers to the WAN link
- Simplified network topologies
- Point-to-point shortest routes instead of hierarchical ones
- Support for increasing bandwidth
- Security for transferring large data volumes
- Diversity and self-healing
Network congestion occurs when an ingress interface transmits data faster than the egress interface can forward it, and/or when multiple ingress interfaces transmit concurrently to one egress interface.
Buffering and queuing are the leading sources of latency [10, 11]. However, buffers are necessary when congestion exists, so that traffic is not dropped, causing retransmissions or data losses. The sizing and allocation of buffers require a balance between congestion management and low latency. An HFT system is characterized by micro-bursts (short time periods, typically microseconds) during which the bandwidth of the egress interface may be overrun. Micro-bursts may cause packet delays or even drops. The solution is to oversize the network bandwidth, to minimize the possibility of overflowing the buffer capacity.
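As a rough, illustrative calculation (the traffic figures below are assumptions, not from the text), the buffer a micro-burst consumes is simply the excess arrival rate integrated over the burst duration:

```python
# Buffer occupancy during a micro-burst: bits arriving faster than the
# egress interface can drain, converted to bytes. Numbers are illustrative.

def burst_buffer_bytes(ingress_bps, egress_bps, burst_seconds):
    excess_bps = max(ingress_bps - egress_bps, 0)   # rate the buffer absorbs
    return excess_bps * burst_seconds / 8           # bits -> bytes

# Two 10 Gb/s ingress ports bursting into one 10 Gb/s egress port for
# 50 microseconds leave 10 Gb/s of excess traffic to absorb.
needed = burst_buffer_bytes(2 * 10e9, 10e9, 50e-6)
```

Even a tens-of-microseconds burst can require tens of kilobytes of buffering per egress port, which is why oversizing bandwidth (reducing the excess term) is preferred over deep buffers that add queuing latency.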
The needed HFT network performance is achieved using high-speed communication, such as 10 or 40 Gigabit Ethernet, custom PCI Express, and FPGA technology, to obtain sub-microsecond data transfer times [10, 11].
In an HFT trading solution the network latency is caused by: LAN switches, the operating system's software network stack (TCP/UDP), and the transport radio or fiber metropolitan links that are set up for traders operating across multiple venues. Software decapsulation and encapsulation of application protocols, such as FIX (Financial Information eXchange), will cause messaging latencies. There are also functions such as multiplexing and message filtering by customer, venue, or traded security symbol. Solutions based on Cisco's switches afford, at best, on the order of 250 ns latency. The Nexus 7000 series features an FPGA upgrade option, into which it is possible to load proprietary user code (e.g., a FIX/TCP offload) [12].
A media-agnostic FPGA-based switch with an optional motherboard (such as xCelor's XPM2) has a switching latency averaging 2.5 ns through the FPGA, in contrast to 200+ ns through a Xeon processor. Using port multiplexing and multicast filtering, this switch can deliver outbound orders to the exchange engine at 90 ns latency with zero jitter [13]. Enyx offers a comprehensive line of FPGA-based appliances with bandwidth-optimized multiplexed Ethernet over radio or fiber, multi-venue distribution, filtering per symbol and other parameters, improving bandwidth utilization by approximately 40%, and overcoming radio link reliability issues. The HFT network solution is shown in Figure 5.
Figure 5. Network solution for an HFT system
The Cisco Nexus 3000 switches [14] have a simplified management that enhances network visibility and improves monitoring; they are low-latency, highly programmable, high-density switches. They are suitable for high performance computing, high-frequency trading, massively scalable data centers, cloud networks and virtualization. They have large buffers and table sizes, with 10, 25, 40, 50 and 100 Gigabit Ethernet connectivity.
The Cisco Nexus 7000 switches [15] are modular switches running the Cisco NX operating system, with open source programmable tools to deploy software-defined networks. They provide application awareness with 10, 40, and 100 Gigabit Ethernet interfaces and performance analytics. The design of these switches applies the following principles: infrastructure scalability, operational continuity and transport flexibility. The number of slots is up to 18, with a bandwidth of up to 1.3 Tb per second per slot. The number of ports is between 192 and 3072, depending on the line rate density (10, 40 or 100 Gigabit).
The Cisco Catalyst 4900M switch [16] has eight low latency 10 Gigabit Ethernet ports and provides the services required for a collapsed LAN.
The Cisco ASA 5585-X [17] is a next-generation firewall, with high scalability, improved performance and security. The main features of these firewalls are: throughput up to 40 Gbps, up to 10 million concurrent firewall connections, VPN clustering and load balancing, AES-encrypted VPN throughput up to 5 Gbps, a maximum of 12 integrated network ports and 72 GB of memory.
The Cisco Unified Computing System [18] is a next-generation data center platform that integrates computing, networking, storage access and virtualization into a low latency, lossless 10 Gigabit Ethernet and Intel x86 architecture server system, designed to reduce total cost and increase business agility. This solution may be enhanced by adding CUDA GPUs to the Intel server architecture.
The network latency is very low, so the trading latency will be imposed mainly by the HFT algorithm execution time, as will be discussed in the next section.
IV. COMPLEXITY AND PERFORMANCE EVALUATION
The HFT algorithms presented in Section II have been implemented, and their execution time has been evaluated on a computing system with an Intel Core i7 4960 processor @3.6 GHz (53 Gflops) and on the different CUDA platforms presented above. Intel processors use Intel Hyper-Threading Technology to deliver two processing threads per physical core, so more work is done in parallel, completing tasks sooner. The operating system scheduling policy has a great impact on the execution time.
The results are illustrated in Figure 6.
Figure 6. The execution time for HFT algorithms on the Intel i7 platform
One can observe that the execution time varies between 10 ms and 90 ms, depending on the HFT algorithm complexity. The execution time is calculated for one iteration (one feed/order).
The speedup may be calculated using Amdahl's law [19]:
$$speedup = \frac{1}{f + \frac{1-f}{N}}$$
where $f$ is the serial part of the code (that could not be parallelized) and $N$ is the number of processors (cores) running in parallel. Using the above formula, we can compare the execution time on the Intel processor and on CUDA, considering the ratio
$$r = \frac{T}{T_1}$$
of the execution time $T$ of a given workload on the Intel processor to the execution time $T_1$ of the same workload on one CUDA core. The above relation then becomes:
$$speedup = \frac{r}{f + \frac{1-f}{N}}$$
The serial fraction of the code is mainly given by the memory transfers (see Figure 2). If the memory bandwidth is large, then the time spent transferring data between CPU and GPU memories will be short, and we can assume that $f$ is negligible; therefore $speedup \approx rN$. The largest number of cores in a multiprocessor that can run in parallel is 32 (the warp length), and it depends on the number of registers in the multiprocessor, the number of registers used by each thread, and the amount of shared memory used by each thread. The maximum number of cores (threads) that can run in parallel is:
$$maxthreads = \min\left\{\frac{\text{registers per multiprocessor}}{\text{registers per thread}},\ \frac{\text{shared memory per multiprocessor}}{\text{shared memory per thread}},\ 32\right\}$$
As the numbers of registers and shared memory locations used by the above HFT algorithms are up to 100, while a multiprocessor provides them by the thousands, it results that $maxthreads = 32$. The number of cores running in parallel is $N = 32 \cdot (\text{number of multiprocessors})$.
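The bound above can be written as a small Python sketch (the per-thread and per-multiprocessor resource figures are illustrative assumptions in the spirit of the text, not vendor data):

```python
# maxthreads = min(register ratio, shared-memory ratio, warp length).

def max_threads(regs_per_mp, regs_per_thread,
                shmem_per_mp, shmem_per_thread, warp=32):
    return min(regs_per_mp // regs_per_thread,
               shmem_per_mp // shmem_per_thread,
               warp)

def parallel_cores(n_multiprocessors, warp=32):
    # N = 32 * (number of multiprocessors) once maxthreads hits the warp cap
    return warp * n_multiprocessors

# With ~100 registers / shared-memory locations per thread and thousands
# per multiprocessor, the warp length of 32 is the binding limit.
threads = max_threads(65536, 100, 49152, 100)
```

For the Maxwell GM200 of Table 2 (12 multiprocessors) this gives N = 384 cores available to the HFT kernels.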
We should be aware that the Intel processor and the CUDA cores execute one instruction in a different number of processor cycles, as shown in Table 2 (assuming that the CUDA occupancy is maximal and that the Intel processor delivers 53 Gflops).
Table 2. The ratio r (CUDA/Intel)

                  Number of multiprocessors   Gflops per core      r
TESLA C870                         4                   1         0.02
FERMI GTX480                      10                   2         0.04
KEPLER GK110                      16                   6         0.11
MAXWELL GM200                     12                  16         0.30
The CUDA grid must be carefully chosen to obtain the largest occupancy. For HFT applications, a one-dimensional grid is defined with a number of blocks that is a multiple of $32 \cdot (\text{number of multiprocessors})$. The number of feeds to be processed is equal to the number of blocks defined in the grid, and it should be correlated with the number of ports in the Internet interface (up to 3072 in the proposed solution).
The execution time using CUDA technology is illustrated in Figure 7, and the speedup compared with the Intel platform is illustrated in Figure 8.
Figure 7. The computational time of HFT algorithms using MAXWELL CUDA technology

From Figure 7 one can observe that the computational time is lower than one millisecond.

Figure 8. Comparison between Intel i7 and different CUDA platforms

Considering the network latency (for the above presented solution) negligible, the trading time will be lower than one millisecond.
V. CONCLUSION

This paper evaluates the performance of high frequency trading solutions to answer the question of whether more speed is necessary, or whether the currently available solutions, with reasonable costs, are adequate for financial market requirements. This work found that the computational time of basic HFT algorithms using CUDA technology is below the threshold deemed by financial research as reasonable to maintain the good health of financial markets. Network latency, for an HFT-dedicated architecture, is practically negligible compared with the computational time. For basic HFT algorithms and network architectures with a low latency design, the current medium-cost computing platforms and network solutions are good enough to achieve the performance imposed by financial market regulations. Further work should investigate the possibility of extending the computational power to implement more complex HFT algorithms and analyze more feeds in parallel. The new CUDA-Pascal architecture, announced for the second quarter of 2016, may achieve this goal, as it is claimed to be ten times faster than the previous CUDA-Maxwell architecture, which is already suitable for implementing basic HFT algorithms in sub-millisecond time.
ACKNOWLEDGEMENT
The authors are truly grateful to Marius Zoican for his
support and feedback on high frequency trading.
REFERENCES
[1] Michael J. McGowan, "The Rise of Computerized High Frequency Trading: Use and Controversy", Duke Law & Technology Review, No. 016, 2010, http://scholarship.law.duke.edu/cgi/viewcontent.cgi?article=1211&context=dltr, accessed February 2016
[2] Menkveld, Albert J. and Zoican, Marius A., "Need for Speed? Exchange Latency and Liquidity" (June 2, 2015), Tinbergen Institute Discussion Paper 14-097/IV/DSF78, available at SSRN: http://ssrn.com/abstract=2442690 or http://dx.doi.org/10.2139/ssrn.2442690
[3] Jacob Loveless, Sasha Stoikov, and Rolf Waeber, "Online Algorithms in High-frequency Trading", ACM Queue, October 7, 2013, Volume 11, Issue 8, https://queue.acm.org/detail.cfm?id=2534976, accessed February 2016
[4] Sangjin Han, "On Fair Comparison between CPU and GPU", http://www.eecs.berkeley.edu/~sangjin/2013/02/12/CPU-GPU-comparison.html, accessed February 2016
[5] Marc Hamilton, "GPU Technology: Past, Present, Future", http://www.nvidia.com.tw/content/PDF/GTC/keynote/marc-hamilton-nvidia-keynote.pdf, accessed February 2016
[6] "Programming Guide, CUDA Toolkit Documentation", http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz3zOJeunTV, accessed February 2016
[7] "Tuning CUDA Applications for Kepler", Application Note, DA-06288-001_v7.5, September 2015, http://lutgw1.lunet.edu/cuda/pdf/Kepler_Tuning_Guide.pdf, accessed February 2016
[8] "Tuning CUDA Applications for Maxwell", Application Note, DA-07173-001_v7.5, September 2015, http://lutgw1.lunet.edu/cuda/pdf/Maxwell_Tuning_Guide.pdf, accessed February 2016
[9] "Nvidia: Pascal Is 10X Maxwell, Launching in 2016", http://wccftech.com/nvidia-pascal-gpu-gtc-2015/#ixzz3zOHsvY9S, accessed February 2016
[10] "A Primer on Real World High Frequency Trading Taxonomy", https://www.arista.com/assets/data/pdf/HFT/HFTTradingNetworkPrimer.pdf, accessed February 2016
[11] Abhishek Sharma, "Cisco Low Latency Solution", http://www.ciscoknowledgenetwork.com/files/245_Cisco_Low_Latency_collateral.pdf?PRIORITY_CODE=, accessed February 2016
[12] "Algo Speed High-Frequency Trading Solution", http://www.cisco.com/c/dam/en_us/solutions/industries/docs/finance/c22-658397-01_sOview.pdf, accessed February 2016
[13] Aly Khaled, "FPGA-based acceleration for high-frequency trading", http://www.eetasia.com/STATIC/PDF/201408/EEOL_2014AUG14_PL_TA_01.pdf?SOURCES=DOWNLOAD, accessed February 2016
[14] "Cisco Nexus 3000 Series Switches", http://www.cisco.com/c/en/us/products/switches/nexus-3000-series-switches/index.html, accessed February 2016
[15] "Cisco Nexus 7000 Series Switches", http://www.cisco.com/c/en/us/products/switches/nexus-7000-series-switches/index.html, accessed February 2016
[16] "Cisco Catalyst 4900M Switch", http://www.cisco.com/c/en/us/products/switches/catalyst-4900m-switch/index.html, accessed February 2016
[17] "Cisco ASA 5585-X Next-Generation Firewall Data Sheet", http://www.cisco.com/c/en/us/products/collateral/security/asa-5500-series-next-generation-firewalls/datasheet-c78-730903.html, accessed February 2016
[18] "Cisco Unified Computing System", http://www.cisco.com/c/dam/en/us/solutions/collateral/data-center-virtualization/unified-computing/at_a_glance_c45-523181.pdf, accessed February 2016
[19] Rodgers, David P. (June 1985), "Improvements in multiprocessor system design", ACM SIGARCH Computer Architecture News (New York, NY, USA: ACM) 13 (3): 225-231, doi:10.1145/327070.327215, ISBN 0-8186-0634-7, ISSN 0163-5964