Technical Military Academy of Bucharest
Department of Computer Engineering

PHD THESIS

Multicore CPUs for Parallelizing Network Intrusion Prevention

Scientific Adviser: Prof. Victor Valeriu Patriciu
Author: Radu Velea

Bucharest, 2017
Abstract

Here goes the abstract about MySuperProject.
Contents

Acknowledgements
Abstract
1 `Plan` – Work in Progress
  1.1 Introduction
    1.1.1 Problem Statement
    1.1.2 Thesis Objectives and Outline
  1.2 Network Intrusion and Detection Systems
    1.2.1 Definitions of General Concepts
2 CPU Parallelism in Network Security
  2.1 Multitasking in Security Solutions
    2.1.1 Linux Network Stack
    2.1.2 Evolution of Parallel Hardware and APIs
  2.2 Multithreaded Intrusion Detection and Prevention Systems
    2.2.1 Snort
    2.2.2 Suricata
    2.2.3 The Need for More Parallelism
  2.3 Power and Performance Analysis Tools
    2.3.1 VTune
    2.3.2 PowerTOP
    2.3.3 Intel® Power Gadget
3 GPGPU Impact on Security
  3.1 Introduction to GPU Programming
    3.1.1 OpenCL
    3.1.2 Vulkan
    3.1.3 Early Experiments in Intrusion Detection and Prevention Systems
    3.1.4 Performance Considerations for GPGPU in Network Security
  3.2 Proof of Concepts
    3.2.1 Stateless Firewalls
    3.2.2 Stateful Firewalls
    3.2.3 Technological Advancements
  3.3 Modeling a Hybrid Intrusion Detection System
    3.3.1 Proposed Software Stack
    3.3.2 Estimated Hardware Requirements
4 String Matching
  4.1 Classical Algorithms
    4.1.1 Boyer-Moore
    4.1.2 Knuth-Morris-Pratt
    4.1.3 Aho-Corasick
    4.1.4 Other Algorithms and Notable Variations
  4.2 Parallel String Matching
    4.2.1 Using OpenCL in Network Intrusion Detection
    4.2.2 Parallel Malware Detection for Laptops and Ultrabooks
    4.2.3 Conclusions and Lessons Learned
5 Encryption
6 Tracing and Visualization
7 Conclusions
List of Figures
Bibliography
Chapter 1

`Plan` – Work in Progress

1.1 Introduction

1.1.1 Problem Statement

1.1.2 Thesis Objectives and Outline

1.2 Network Intrusion and Detection Systems

1.2.1 Definitions of General Concepts
Chapter 2

CPU Parallelism in Network Security

The current chapter aims to provide a succinct overview of the state of the art in software security solutions and how they leverage the parallelism available in modern hardware. The focus of the following sections is to present existing solutions – along with a technical description and performance evaluation (where possible) – and to suggest new directions for improvement. The discussion will be centered around the Linux Network Stack and what it offers for firewalls and intrusion prevention systems. The decision to focus on Linux is facilitated by the high availability of documentation and the fact that the kernel is widespread in the wild and used by many operating systems: Ubuntu, Red Hat, Android, Chrome OS, Yocto Project, etc.

The subsequent sections describe the evolution of parallel hardware and APIs, and how they relate to commercial software in general and network security tools in particular. A summary of the use of multicore CPUs will be presented, along with their role in designing security solutions. Advantages and limitations will be pointed out and discussed. Own contributions and considerations on the topic will be presented along with an analysis of power and performance tools that are used to benchmark networks and network-related software in parallel environments.

The chapter will conclude with some directions regarding the future evolution of parallel security solutions.
2.1 Multitasking in Security Solutions

2.1.1 Linux Network Stack

An important subject to address at the start of this chapter is the Linux Network Stack, which has evolved incrementally over many kernel releases. The historical details are not very important and will only be briefly addressed; the focus will be on current aspects that can be used to implement proofs of concept and that affect system and network security. In order to understand the big picture it is useful to keep in mind this map of the Linux kernel¹:

Figure 2.1: Linux Kernel Map
Parallelism in the Linux kernel comes in the form of kernel threads. These tasks run concurrently on multicore CPUs and work collaboratively in the same address space. Kernel threads are specialized to work in certain subsystems and handle asynchronous functions. Synchronization is done through a special kernel API that resembles some mechanisms found in userspace libraries such as POSIX threads. Combined with a complex interrupt system, these elements allow the kernel to efficiently balance a substantial amount of work. For example, when the network interface receives a new packet it just saves it and defers the processing for later (in case more packets come along and have to be saved). Packet processing can then be done either in interrupt (tasklet) or process context (workqueue). The Completely Fair Scheduler (CFS) handles load balancing of kernel and userspace tasks: it makes sure each task gets a fair share of run time on the CPU (according to its priority and needs) and that run-queues on each active CPU core are evenly balanced. Recent improvements in process scheduling try to take into consideration the hardware capabilities of the cores and to calibrate performance and power consumption (see: Energy Efficient Scheduling for heterogeneous systems²). Task scheduling can affect the way applications interact with the network stack. This is specifically relevant for applications that rely on real-time interception and manipulation of network traffic. More often than not, single-threaded applications will not be able to cope with modern network loads. The following paragraphs will present some popular technologies that can be used to intercept, inspect and pass network data to more specialized tools.

¹ http://www.makelinux.net/kernel_map/
² https://www.linaro.org/blog/core-dump/energy-aware-scheduling-eas-project/
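The "save now, process later" pattern described above can be mirrored in userspace. The following sketch (an illustrative analogy written for this text, not actual kernel code – all names are ours) uses POSIX threads to model a receive path that quickly enqueues packets while a worker thread drains and processes the queue later, much like a tasklet or workqueue defers packet processing:

```c
#include <pthread.h>

/* Userspace analogy of deferred packet processing: a "top half"
 * enqueues work quickly, a worker thread (the "bottom half")
 * drains the queue later. Assumes the producer never gets more
 * than QUEUE_MAX items ahead of the worker. */
#define QUEUE_MAX 1024

struct packet_queue {
    int items[QUEUE_MAX];
    long head, tail;
    int closed;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
};

static void queue_init(struct packet_queue *q) {
    q->head = q->tail = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* "Interrupt" side: just save the packet and return immediately. */
static void enqueue_packet(struct packet_queue *q, int pkt) {
    pthread_mutex_lock(&q->lock);
    q->items[q->tail % QUEUE_MAX] = pkt;
    q->tail++;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Deferred side: process packets whenever they become available. */
static void *worker(void *arg) {
    struct packet_queue *q = arg;
    long processed = 0;
    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->head == q->tail && !q->closed)
            pthread_cond_wait(&q->nonempty, &q->lock);
        if (q->head == q->tail && q->closed)
            break;
        int pkt = q->items[q->head % QUEUE_MAX];
        q->head++;
        pthread_mutex_unlock(&q->lock);
        (void)pkt;          /* real code would inspect/filter pkt here */
        processed++;
        pthread_mutex_lock(&q->lock);
    }
    pthread_mutex_unlock(&q->lock);
    return (void *)processed;
}

/* Demo driver: enqueue n packets, close the queue, join the worker,
 * and return how many packets the worker processed. */
long run_demo(int n) {
    struct packet_queue q;
    pthread_t tid;
    void *ret;
    queue_init(&q);
    pthread_create(&tid, NULL, worker, &q);
    for (int i = 0; i < n; i++)
        enqueue_packet(&q, i);
    pthread_mutex_lock(&q.lock);
    q.closed = 1;
    pthread_cond_broadcast(&q.nonempty);
    pthread_mutex_unlock(&q.lock);
    pthread_join(tid, &ret);
    return (long)ret;
}
```

The split keeps the producer path short – exactly the property the kernel wants on the interrupt path – while the expensive inspection work happens at the worker's pace.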
At the logical level, network traffic can be inspected through the sk_buff interface. sk_buff is the data structure used to handle incoming and outgoing traffic in Linux sockets. The Linux kernel provides a rich API for managing these objects and their behavior in relation to existing network protocols. Kernel developers can use the Netfilter¹ framework to insert hooks into the networking stack and inspect or detour traffic. This mechanism is used by utility programs such as iptables and in the creation of stateful firewalls in Linux. From within a kernel module the hooks can be installed using nf_register_hook – this function allows the developer to specify the point of interception and callback functions. In order to process the traffic in userspace, a series of libraries and mechanisms have been implemented.
Berkeley Packet Filter² refers to a mechanism for filtering network traffic developed in 1992 at the Lawrence Berkeley Laboratory by Steven McCanne and Van Jacobson. BPF relies on a set of policies that efficiently separate interesting data from other traffic and thus make better use of the CPU and memory. BPF defines an abstract machine and instruction set that is used to express the network filters[1]. Over the years BPF has been used to implement other filters, for example system call manipulation for virtual or sandboxed applications. Since Linux kernel version 3.19, BPF is being referred to as extended BPF (eBPF) and it is used in other, non-networking related areas (such as tracing points, sockets, etc.).
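To make the abstract-machine idea concrete, the sketch below implements a toy interpreter for a few BPF-style instructions (load byte, jump-if-equal, return). The opcode names and encoding are invented for illustration and do not match the real BPF instruction format documented in the kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy BPF-style virtual machine: a filter program runs over a packet
 * and its return value decides accept (nonzero) or drop (zero).
 * The encoding here is simplified for illustration only. */
enum { OP_LDB,   /* acc = pkt[k]                                  */
       OP_JEQ,   /* if (acc == k) pc += jt; else pc += jf         */
       OP_RET }; /* return k                                      */

struct insn { uint8_t op, jt, jf; uint32_t k; };

uint32_t run_filter(const struct insn *prog, size_t len,
                    const uint8_t *pkt, size_t pkt_len) {
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        case OP_LDB:
            acc = (i->k < pkt_len) ? pkt[i->k] : 0;  /* bounds-checked load */
            break;
        case OP_JEQ:
            pc += (acc == i->k) ? i->jt : i->jf;     /* relative branch */
            break;
        case OP_RET:
            return i->k;
        }
    }
    return 0; /* fell off the end of the program: drop */
}

/* Example filter: accept only if byte 9 (the IPv4 protocol field in
 * a raw IP header) equals 6, i.e. the payload carries TCP. */
const struct insn tcp_only[] = {
    { OP_LDB, 0, 0, 9 },
    { OP_JEQ, 0, 1, 6 },   /* equal: fall through to accept */
    { OP_RET, 0, 0, 1 },   /* accept */
    { OP_RET, 0, 0, 0 },   /* drop   */
};
```

The real in-kernel interpreter follows the same shape: a tight loop over a validated program, with no backward jumps, which is what makes BPF filters safe to run in kernel context.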
A side effect of using BPF is that data no longer has to be copied across memory; the analysis is performed in place. This gives BPF a big speed-up compared to other interfaces. The current userspace API enables the following options of setsockopt: SO_ATTACH_FILTER, SO_DETACH_FILTER, SO_LOCK_FILTER. An interesting observation is the use case for SO_LOCK_FILTER – the option allows locking the filter attached to a socket. Once set, a filter cannot be removed or changed. A process can thus set up a socket, attach a filter, lock it, then drop privileges and be assured that the filter will be kept until the socket is closed[2].
Tcpdump is a popular Linux program that analyzes network traffic. It relies on a pcap library implementation (libpcap) to perform this task. Tcpdump works on most modern operating systems (the Windows port is called WinPcap) and is designed to work with BPF in order to reduce the amount of captured traffic. The intercepted packets can be written to a file in a format that is usable by other tools such as Snort or Wireshark. Libpcap is a library that provides users with an implementation of an API to capture network traffic and monitor the network at the data link layer. The library is written in the C programming language, but has bindings in other languages such as Java or Python. Both the library and tcpdump are developed by the same open source group and are actively maintained. Pcap can be integrated in multithreaded programs and used to monitor parallel and distributed applications[3]. That said, libpcap may not perform well in high-speed data acquisition scenarios[4]. The performance and rate of packet capture will depend on the underlying CPU and I/O.

¹ https://www.netfilter.org/
² Linux Socket Filtering aka Berkeley Packet Filter: https://www.kernel.org/doc/Documentation/networking/filter.txt
Libpcap depends on the communication between kernelspace and userspace – more specifically, it needs to access or copy the kernel's internal data buffers and make them available to userspace applications. Recent versions of libpcap take advantage of the Linux zero-copy mechanism. This removes the overhead of copying buffers from kernelspace to userspace. Zero-copy is supported in Linux through a series of system calls and Linux-specific APIs. For other types of communication between userspace and kernelspace, a firewall developer would have to use either the ioctl interface (used also by libpcap) or netlink sockets. ioctl is a system call used to communicate with a kernel device. Given Linux's philosophy that everything should be a file or a process, device drivers can be opened from userspace and the resulting file descriptor passed to an ioctl call. The driver will then handle the call and its parameters internally and communicate with the caller via memory buffers. An important use case for ioctl is the creation of loadable kernel modules to accelerate network processing. Using the aforementioned netfilter API, we can create a stateful firewall and manipulate its policies via ioctl, or copy sensitive packets to userspace, where they can be processed more efficiently (using a special library or taking advantage of the multicore CPU or GPU architecture). The Netlink socket API is a relatively new Linux kernel interface that acts as an inter-process communication (IPC) mechanism. It cannot be used to communicate across the network, but can be used to transmit data between kernelspace and userspace. The description of the standard can be found in RFC 3549: http://tools.ietf.org/html/rfc3549. Netlink is designed to be a more flexible successor to ioctl.
2.1.2 Evolution of Parallel Hardware and APIs

Performance of network intrusion detection/prevention systems is affected by hardware performance as well as by the software layer. Some important software factors:

- Scalable API
  - The API used should be able to handle multiple protocols and provide granular access to data
  - Drivers and libraries should be portable across hardware to minimize auditing efforts and maintenance
- Heterogeneous Environments
  - Heterogeneous environments can provide better performance and flexibility and take advantage of all the available resources
- Open Standard
  - Open source projects are transparent to the end user and help build trust
  - Large communities have better chances of finding critical bugs
  - Proprietary extensions can be developed for profit; this is a valid business model for tools such as Snort[5]
Hardware factors:

- Power Consumption
  - In the long run, optimizing power consumption can save more money than the original investment in infrastructure
  - Controlling the power and temperature of network sensors reduces maintenance costs and increases hardware lifespan
- Parallelism
  - Modern computing algorithms seldom rely on single-process solutions
  - Parallelism can be achieved at instruction level (SIMD vector instructions), core level (multicore CPUs, GPUs), or multi-process level (distributed systems, clusters)
  - Important factors are the number of work-items/threads and their frequency
- Memory
  - Memory is an important resource in any computing system. Its capacity and speed of access can create serious bottlenecks in high-speed and compute-intensive tasks such as deep packet inspection
  - Specialized memory controllers and technologies can allow memory sharing and data transfers between peripheral devices, sometimes without the intervention of the CPU
The next paragraphs discuss several APIs for designing parallel software security solutions in heterogeneous environments. These APIs are implemented as software libraries or compiler extensions, and they were designed to map onto existing and future hardware architectures. While some of their targeted platforms may be GPUs, we will currently focus on their CPU-parallelism characteristics and provide further details about GPUs in the ensuing chapters.
OpenCL

OpenCL is a framework for writing programs in heterogeneous and parallel environments. It is an open standard and runs on most modern hardware architectures and operating systems. OpenCL is supported by major vendors such as Intel®, ARM, and AMD and has features that allow programmers to run code on multiple GPUs and the CPU within the same computing context or process. OpenCL's memory consistency model is based on the ISO C11 programming language. OpenCL is provided as a library-level API that sets up the environment for parallel execution on a device (CPU, GPU, or accelerator). The API is set to run on the host (the machine that designates the tasks and handles resource management). Execution of parallel instructions is done through function-like constructs called kernels. Kernel code is a subset of the C language with some extensions for image and SIMD operations (vectors, floats, etc.). Kernel code is compiled on the host machine before being sent to the device for execution. The same kernel code is executed by every device thread (or work-item) running in the current context. Similar to CUDA, device memory is grouped in three layers. In OpenCL the keywords for device memory types are global, local and private. The keywords are mapped by the compiler over the underlying hardware architecture. Given that OpenCL's goal is to provide a portable framework for heterogeneous platforms, memory limitations vary dramatically from implementation to implementation. This makes writing portable, performance-efficient algorithms very problematic. A major bottleneck for OpenCL applications is memory transfers between host and device. We need to ensure that memory is kept aligned and allocated contiguously in order to reduce the overhead of memory transactions. struct arguments cannot be passed to a kernel function, and bit-field struct members are currently not supported. GPUs are bad at executing branch instructions. Modern graphics hardware will execute all branches and discard the incorrect paths in order to avoid a costly flush of the instruction pipeline. We must take care to reduce the number of if statements and other conditional jumps in the kernel code in order to get the best performance. OpenCL kernel code has limitations when it comes to pointer arithmetic and dereferencing global/local memory. These limitations may be removed in the future. Recursion is also not supported in OpenCL. Vendor implementations may impose extra limitations. Despite these restrictions, OpenCL is an attractive solution for offloading compute-intensive work to the GPU while analyzing network traffic. Platforms that support OpenCL are candidates for hybrid NIDPS implementations: applications that collaboratively execute parallel tasks on the CPU and GPU.
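To illustrate the data-parallel execution model without requiring an OpenCL runtime, the sketch below emulates on the CPU what each work-item would do: every "work-item" id scans one fixed-size chunk of a payload buffer for a one-byte signature and writes its hit count to a private output slot, so no synchronization is needed. All names here are ours; in a real OpenCL kernel the outer loop over id disappears and get_global_id(0) supplies it:

```c
#include <stddef.h>
#include <stdint.h>

/* CPU emulation of a data-parallel OpenCL-style kernel. Each
 * "work-item" scans its own chunk of the payload for a signature
 * byte; in real OpenCL the `id` loop is implicit and `id` comes
 * from get_global_id(0). */
void scan_kernel(size_t id, const uint8_t *payload, size_t total,
                 size_t chunk, uint8_t sig, uint32_t *hits) {
    size_t start = id * chunk;
    size_t end = (start + chunk < total) ? start + chunk : total;
    uint32_t n = 0;
    for (size_t i = start; i < end; i++)
        n += (payload[i] == sig);  /* branchless accumulate, GPU-friendly */
    hits[id] = n;                  /* private output slot: no sync needed */
}

/* Host-side emulation: "launch" one work-item per chunk, serially,
 * then reduce the per-item counts on the host. */
uint32_t count_signature(const uint8_t *payload, size_t total,
                         size_t chunk, uint8_t sig, uint32_t *hits,
                         size_t n_items) {
    uint32_t sum = 0;
    for (size_t id = 0; id < n_items; id++) {
        scan_kernel(id, payload, total, chunk, sig, hits);
        sum += hits[id];
    }
    return sum;
}
```

The per-item private output and the branchless inner loop reflect the constraints described above: independent work-items, coalesced reads over a contiguous buffer, and as few conditional jumps as possible.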
Heterogeneous System Architecture

Heterogeneous System Architecture (HSA) is an interesting upcoming technology developed by the HSA Foundation (currently led by AMD) that aims to provide a unified view of a system's computing elements. In other words, it allows a programmer to transparently create and run software while making use of the full computing potential of the underlying hardware. The difference in workflow from conventional applications can be better described through the following diagrams:

Figure 2.2: On the left: steps performed when offloading calculations to the GPU on a non-HSA system. On the right: steps performed when offloading calculations to the GPU on an HSA system, using the HSA functionality. Source: [6]

As of February 2015, only AMD's Kaveri desktop processors implement HSA in hardware. Additional characteristics of HSA can be found in this whitepaper: [7]. The implications of this technology for host intrusion detection systems and anti-viruses could be significant. Using the GPU to perform compute-intensive tasks could increase performance and power savings and create a better user experience by freeing the CPU to handle OS or application related processes[8].
OpenMP

OpenMP stands for Open Multi-Processing and is an API for multithreaded, shared-memory programming. The standard is available on multiple platforms and operating systems and is supported by most modern compilers. OpenMP is usable in the C/C++ and FORTRAN programming languages. OpenMP is under active development and most hardware and software vendors are currently investing in OpenMP implementations: AMD, Intel®, IBM, Nvidia, Red Hat, Texas Instruments, Oracle, etc. The main usage of OpenMP is in HPC applications, where it can deliver great performance on multiprocessor machines and grids. In distributed computing, OpenMP can be used together with MPI (Message Passing Interface). In networking, OpenMP can be used to parallelize compute-intensive work done by firewalls and intrusion detection systems[9].
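As a minimal illustration of the OpenMP model, the sketch below parallelizes a signature-counting pass such as a firewall might run over captured traffic. The pragma splits the loop iterations across CPU threads and the reduction clause merges the per-thread partial counts; compiled without OpenMP support, the pragma is simply ignored and the code runs serially with the same result:

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many bytes of a captured buffer match a signature byte.
 * With OpenMP enabled (e.g. gcc -fopenmp) the iterations are split
 * across threads and the partial counts are combined by the
 * reduction clause; without it, the pragma is ignored and the loop
 * runs serially, producing the same answer. */
long count_matches(const uint8_t *buf, size_t len, uint8_t sig) {
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < (long)len; i++)
        count += (buf[i] == sig);
    return count;
}
```

The appeal for security tools is exactly this incremental adoption path: a hot loop identified by profiling can be parallelized with one directive, without restructuring the surrounding code.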
2.2 Multithreaded Intrusion Detection and Prevention Systems

Threads are a fundamental concept for parallel programming. A thread is the smallest entity that can be scheduled by the operating system. Every process that runs on a system has at least one thread. Unlike processes, threads share by default parts of the address space (heap, data, code). This makes synchronization much easier and faster and maps well to problems that require a shared memory model.

Figure 2.3: On the CPU, parallelism relies on the usage of processes and threads.

The APIs that were presented in the previous section (OpenMP, CPU-side OpenCL) are wrappers over system threads. In userspace, threads have system-specific implementations inside libraries like POSIX threads or as Windows threads. Security software using threads usually links against these system libraries. The following sections will briefly describe some projects that perform intrusion detection tasks and use threads.
2.2.1 Snort

Snort is an open source NID/PS (it handles both detection and prevention), originally developed by Martin Roesch in 1998. In 2013 his company, Sourcefire, was acquired by Cisco Systems; the tool is under continuous development, with versions available for home users as well as industry members. Snort works by setting the local Ethernet interface into promiscuous mode and performing real-time analysis and logging on the intercepted traffic. Logs can be stored in clear text and output to the console, or the task can be delegated to tools such as Barnyard2[10] (an open source interpreter for Snort binary output) in order to reduce I/O time in high-traffic situations. Snort can also interact with MySQL databases, and users can perform SQL queries on the logs.

Its system of alerts and actions is based on a set of rules. These rules are stored locally on the machine and are updated periodically via a set of scripts. The user is free to write a new set of rules or use ones created by the community. Rule matching involves a lot of string comparisons. Snort handles this through a very efficient implementation of the Aho-Corasick algorithm[11].

Profiling with perf during several Snort analysis sessions showed about 60% of computing time spent in string matching operations. This figure confirmed earlier studies on the performance of intrusion detection systems[12][13]. Work on improving Snort's string matching performance has led to some interesting articles. One recent approach reduces compute times by up to 5x at the expense of memory consumption[14]. With version 3.0, Snort's design changed to incorporate multithreading: the user has an option to configure the number of threads that are available for packet processing.
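The cost of rule matching is easiest to see in a naive multi-pattern scanner like the sketch below (names and patterns are ours, for illustration), which tries every rule pattern at every payload offset. An Aho-Corasick automaton, as used by Snort, instead walks the payload once regardless of how many patterns the rule set contains:

```c
#include <string.h>

/* Naive multi-pattern search: for each payload position, try every
 * pattern. Cost grows with (payload length x number of patterns),
 * which is why Snort builds an Aho-Corasick automaton that matches
 * all patterns in a single pass over the payload instead. */
int match_any(const char *payload, const char *const *patterns, int n) {
    size_t plen = strlen(payload);
    for (size_t pos = 0; pos < plen; pos++)
        for (int r = 0; r < n; r++) {
            size_t m = strlen(patterns[r]);
            if (m > 0 && m <= plen - pos &&
                memcmp(payload + pos, patterns[r], m) == 0)
                return r;            /* index of first matching rule */
        }
    return -1;                       /* no rule content matched */
}
```

With thousands of rules the inner loop dominates, which matches the profiling result above: the bulk of Snort's CPU time goes to string matching, and the automaton-based approach is what keeps it tractable.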
2.2.2 Suricata

Suricata is an open source NID/PS that uses a high degree of parallelism – its designers took multi-core and multi-processor architectures into consideration from inception. It is supported by industry-level partners such as FireEye and Proofpoint™. Suricata focuses on high performance and can achieve speeds of up to 10 Gb/s on commodity hardware. Suricata integrates a highly parallel architecture that can use multiple CPU threads and has a CUDA module for GPU processing; it can perform data analysis in real time as well as offline. Other notable features include a wide range of output formats and configuration options, heuristics based on host reputation, multiple pattern-matching algorithms and multiple protocol parsers[15][16].

Suricata's multithreaded model can have its drawbacks – a local analysis similar to the one mentioned in the Snort section showed that more than 10% of computing time was spent synchronizing threads (in pthread mutex lock and unlock functions).

That being said, a proof of concept was developed in 2013 at Altera that ported parts of Suricata's IDS/IPS engine to FPGAs and accelerated them using OpenCL[17]. The experiment showcased the benefits of running OpenCL applications on inherently parallel hardware such as the Stratix® V FPGA from Nallatech.
2.2.3 The Need for More Parallelism

Since the advent of the multi-core CPU, software has begun leveraging the power of parallelism with the main purpose of making applications faster and more responsive. The growth in computing power comes hand in hand with the amount of data generated by modern devices. This has implications in all fields of computer science and drives the improvement of storage, transport and protection of data.

Cloud computing aims to offer users and organizations the ability to store and access their data remotely while respecting their privacy and integrity. Cloud vendors have thus become an attractive target for malicious entities that want to compromise their security and disrupt their business.

Other generators of big data and traffic are the devices that make up the Internet of Things; given the heterogeneous nature of the hardware and software involved, efficiently protecting them from the eyes and ears of outside parties becomes an even greater challenge.

While existing technologies provide decent performance to solve these problems, new paradigms and hardware changes come with the promise of increased throughput, lower power consumption and better scalability. The next chapter will continue the discussion about the benefits of parallelism and introduce the GPU as an upcoming technology that will shape the network and security ecosystems.
2.3 Power and Performance Analysis Tools

In order to benchmark network security solutions we need to identify a series of tools and procedures for measuring the most important performance metrics. This section describes existing tools for analyzing parallel security solutions and network applications. I will discuss the mechanisms used in industry and academia and cite some recent work on this topic.

The purpose of this section is to set some guidelines for assessing the performance of current parallel security solutions and to help profile future prototypes. I will not describe conventional network analysis tools such as Wireshark because of the abundant sources existing elsewhere. I will focus on power and performance analysis because I consider it more important in areas where high performance computing and network security mix.
2.3.1 VTune

VTune is a software product developed by Intel®. Its main function is performance analysis for software running on Intel® x86 and x64 architectures. VTune is used as part of Intel® Parallel Studio (Intel® Parallel Studio XE 2015); it provides performance information on a per-thread basis and helps identify bottlenecks during application run time (dynamic analysis).

An early example of VTune usage can be found in this paper[18]; here VTune was used to benchmark encryption applications and network software at the instruction level. Software libraries responsible for traffic capture and analysis (e.g. pcap) have benefited from VTune – used to identify bottlenecks and drive optimization efforts[19].

While not a central element of academic work, VTune is cited frequently in works that need to provide performance metrics or similar measurements. It can be used in architectural studies where the most important thing is the big picture or some abstract computing model, such as collaborative network security in the cloud[20], or for tuning the performance of low-level algorithms and highly parallel workloads[21].
2.3.2 PowerTOP

PowerTOP is an open source tool used for measuring electrical power consumption on UNIX operating systems. PowerTOP analyzes the programs, device drivers, and kernel options running on a computer and estimates the power consumption resulting from their use, as well as CPU frequency scaling and other useful data.

Academic works cite PowerTOP in a similar manner to VTune. With a growing emphasis on green computing, most software and hardware vendors try to get the most out of battery life on mobile devices or to reduce data center maintenance costs. Software security solutions have a compute-intensive nature, whether they analyze network traffic or protect individual hosts; identifying power-hungry elements and optimizing or reporting their activity is a major win.

In networking, PowerTOP can be used to analyze the performance of various sensors (e.g. wireless sensors[22]). A new type of denial of service attack that causes fraudulent energy consumption in cloud infrastructures was documented in 2015[23]. The information reported by PowerTOP and its internal database can help identify these kinds of attacks and allow vendors to adjust their security policies.
2.3.3 Intel® Power Gadget

Intel® Power Gadget is a free power monitoring tool for x86 processors. The tool works similarly to PowerTOP and provides an estimation of the actual power consumed by the CPU socket. For more accurate figures on power, hardware instrumentation is advised.

Most of the experiments presented in this thesis do not require a high level of accuracy for power measurements. The expected difference between serial and parallel technologies should be significant enough to overshadow any noise our measuring instruments might add. The reader should look over these experiments and take away the relative proportion of the values, keeping in mind that the software tools might have varying resolutions. A full disclaimer can be found on the vendor's website¹.

¹ https://software.intel.com/en-us/articles/intel-power-gadget-20
Chapter 3
GPGPU Impact on Security
The previous chapter briefly described some of the applications of CPU parallelism in network security and how modern operating systems and NIDPSs take advantage of multicore architectures. The need for speed in detecting incoming threats cannot be met by using commodity CPUs alone. Enterprise solutions rely on specialized compute sensors inside their network infrastructure to implement and enforce restrictive security policies. Academic research has tried to find extra computing resources that could help users and organizations feel safer on the Internet. To this end, GPUs have started to be used for general purpose programming in applications that are not related to graphics. This is called GPGPU and it is a growing trend among software and hardware manufacturers alike.
The GPU's design is inherently parallel and provides a better platform for SIMD algorithms than the multicore CPU. A GPU has many small cores (compute units) that together support thousands of threads. This makes it ideal for executing embarrassingly parallel code. Besides many parallel compute units, GPU architecture contains dedicated memory (Video RAM), which is faster than regular RAM. Other hardware elements like overlays and decoding units perform different tasks.
GPUs can be dedicated or integrated into the chipset. Dedicated GPUs tend to be more powerful and have their own memory resources, while integrated GPUs share some of the chipset resources. CPU communication with dedicated GPUs depends heavily on the quality of the interconnect buses available in the system. Memory transfers are known to be a common problem that causes performance degradation in GPGPU computing. Integrated graphics have fewer compute units, but can work in closer collaboration with the CPU (due to the unified memory model) and consume less power. We will go into further detail on CPU-GPU interactions in the sections to come.
GPUs are present in most consumer form factors: automotive boards, mobile phones, tablets and other low-end devices, laptops, desktop PCs, servers, etc. Conventional applications that are GPU-intensive usually involve graphics processing for video and other media (computer games, design software, CAD2, etc.). The current thesis will describe some of the unconventional use cases for GPU computing in software security and network security.
2computer-aided design
The current chapter will reiterate some of the problems facing security applications on host systems, as well as on the network, and present existing solutions for optimizing performance using parallel computation (GPGPU, hybrid CPU + GPU). Original and personal contributions will be highlighted accordingly and the author will share some of his thoughts on the state of the art and future directions.
3.1 Introduction to GPU Programming
GPU programming comes in many flavors: conventional APIs like OpenGL(ES) and DirectX provide graphics developers with the means to create complex scenes that accurately mimic the visual and physical characteristics of real life.
At some point during their evolution, graphics processors began to be used for other computational purposes besides graphics: high performance computing, signal processing, medical imaging, cryptography, etc. To meet these requirements, stakeholder organizations got together and created new standards and programming interfaces to help developers along the way.
A popular standard for parallelizing code on the CPU is OpenMP. It is based on a fork-join model and allows programmers to write a parallel implementation on top of the original, serial, code. This has many advantages when it comes to development speed and debugging. OpenMP development is backed by companies such as AMD, Intel®, IBM, etc. OpenMP can work well in distributed environments together with MPI (message passing interface) or other APIs. Newer versions can run on GPUs and accelerators, making OpenMP a versatile tool when it comes to optimizing existing solutions[24]. Academic work has shown the potential of OpenMP to improve network intrusion detection systems through parallelism[9].
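The fork-join model behind OpenMP can be sketched without the C pragmas themselves: a master thread forks workers over a partitioned iteration space, then joins their partial results, much like a `parallel for` with a reduction clause. The following is a minimal Python emulation of that pattern; the chunking scheme and worker count are illustrative choices, not part of any standard.

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join_sum(data, workers=4):
    # fork: split the iteration space into contiguous chunks, one per worker
    chunk = (len(data) + workers - 1) // workers
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, pieces))
    # join: combine the partial results, as an OpenMP reduction would
    return sum(partials)

print(fork_join_sum(list(range(1000))))  # same result as the serial sum: 499500
```

The appeal noted above (keeping the serial code intact) is visible here: the worker body is the unchanged serial `sum`, and only the partitioning and the final reduction are new.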
CUDA is a parallel computing framework from NVIDIA. It is one of their most popular technologies, exposed to users as an application programming interface on top of the underlying drivers and hardware. CUDA appeared in 2007 and has evolved to work with other APIs such as OpenGL, OpenACC and OpenCL. It has a wide range of applicability and has been used extensively in academia and industry to leverage the benefits of GPU computing; examples include accelerating encryption ciphers[25] and improving real-time automated fingerprint identification systems[26].
Another novel and interesting technology is OpenCL – a framework for writing programs in heterogeneous and parallel environments. OpenCL allows users to write highly parallel code that can run on a myriad of devices – with the option of using them as part of the same compute context. OpenCL is an open standard, developed by the Khronos consortium. Sister APIs include OpenGL and Vulkan for graphics processing. Recently these two standards have been augmented to permit the use of compute shaders (similar to OpenCL or CUDA kernels) for general purpose computing.
Similar technologies are available on mobile devices, and there have been efforts to use them to improve various security aspects in Android or in devices that are part of the Internet of Things. Modern smartphones possess a graphics processing module on chip that usually shares some amount of memory with the main CPU. This reduces the memory bandwidth penalty for CPU-GPU transfers that plagues so many parallelization efforts on desktop and server hardware.
Technologies that can be used to program mobile GPUs on Android are RenderScript and OpenCL (from the 2.0 standard onward). While GPU code is still more efficient than CPU code in terms of computing time and power consumption[27][28], the general limitations in memory size and battery capacity on mobile devices have so far impeded significant efforts to improve most security processes using the GPU.
3.1.1 OpenCL
This section is dedicated to a detailed description of OpenCL internals. Throughout this thesis OpenCL will be presented as the framework of choice when parallelizing compute-intensive tasks on the GPU. The reason behind this is the availability of the API across multiple hardware architectures and operating systems. OpenCL has an interesting quality: the parallel code can also run on multi-core CPUs. This fact opens up interesting options regarding hybrid implementations and allows deployment on hardware platforms that do not have a GPU (even though such deployment is expected to be met with severe performance penalties).
The OpenCL standard was first published in 2008, with implementations being made available to the public since 2009. The consortium responsible for elaborating the standard committed to publishing an updated version of the standard every 18 months. Some key elements that were taken into consideration for the new standard were portability, continuous integration and interaction with other APIs (OpenGL). Wrappers over the original standard were developed in the form of WebCL. The standard enjoys both proprietary and open source implementations. Most applications on the market employing OpenCL operate in the fields of computer aided engineering and video/image encoding. Academic research around the standard is mostly associated with high performance computing (HPC):
- medical imaging
- meteorology and climatology
- molecular and particle dynamics
- cryptography
- Bitcoin mining
The OpenCL API is implemented as a driver and a userspace library. The driver manages interactions with the dedicated hardware and the library acts as a support for application development. Code written for OpenCL devices can be grouped into two categories: host and device code.
Host code runs on the CPU. It can be written in C/C++ and linked directly with the OpenCL library. Bindings in other programming languages include C#, Java, Go and Python. Host code is responsible for setting up the OpenCL environment (called a context). The host API can perform platform queries to determine the best options for parallelization and can allocate the necessary resources for the GPU to use: memory objects, synchronization elements, runtime compilation, etc. Host code manages the lifetime of OpenCL objects and handles the transfers to and from the GPU – this aspect can play a critical role in application performance, regardless of the parallel code running on the GPU.
Device code runs on the GPU or accelerator. The code is loaded by the host either in binary format or as plain text (if the code is in text format it will require runtime compilation). Device code is meant to be executed in parallel. The central element of the device code is the __kernel function. This function resembles a C99 function. The kernel does not return a value and uses C-like base types and arguments. The arguments are passed from the host code. Each kernel instance is designed to run on a single hardware element (GPU thread or work item). From a logical point of view we could think of these functions as thread start-up functions from Windows threads or pthreads. In an ideal scenario, the kernel code should contain the most compute-intensive elements of the application. The number and layout of the work items placed into execution are decided from the host code and are dynamic. OpenCL supports multi-dimensional execution patterns (the work items are organized into groups of up to three dimensions). Local groups of threads can benefit from hardware extensions such as local memory or caching. GPU memory access patterns need to be taken into consideration when queuing a group of work items for execution.
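To make the host/device split concrete, the sketch below pairs a minimal OpenCL-style kernel (held only as a source string, as a host would before runtime compilation) with a plain Python loop that emulates how the runtime launches one kernel instance per work item. The `scale` kernel, its buffers and the emulator are hypothetical illustrations, not taken from any cited system.

```python
# Device code: a C99-like __kernel function, kept by the host as text
# until runtime compilation (illustrative example only).
KERNEL_SRC = """
__kernel void scale(__global const float *src, __global float *dst, float k) {
    size_t gid = get_global_id(0);   /* this instance's work-item id */
    dst[gid] = src[gid] * k;
}
"""

def launch_emulated(global_size, kernel_body, *args):
    # The runtime conceptually runs one kernel instance per work item;
    # on a GPU these iterations execute in parallel, here sequentially.
    for gid in range(global_size):
        kernel_body(gid, *args)

def scale_body(gid, src, dst, k):
    dst[gid] = src[gid] * k   # mirrors the kernel source above

src = [1.0, 2.0, 3.0, 4.0]
dst = [0.0] * len(src)
launch_emulated(len(src), scale_body, src, dst, 2.0)
print(dst)  # [2.0, 4.0, 6.0, 8.0]
```

The emulation makes the logical model explicit: each work item sees the same function body, differs only in its global id, and never returns a value to the host except through the output buffer.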
An image of the relation between the platform, device and work-item abstractions can be seen below:
Figure 3.1: OpenCL abstractions. Source:[29]
OpenCL's memory model reflects the memory hierarchies available in GPUs. Host memory is usually RAM. Global memory is VRAM and is shared by all the executing GPU threads. On low-end devices such as smartphones, tablets or ultrabooks, global memory represents the memory shared between the CPU and GPU: the L3 cache for Intel's unified memory model architectures (where integrated graphics lack dedicated memory). Local and private memory represent various levels of caching available on the hardware. Depending on the architecture they may overlap or be absent in hardware. OpenCL's logic presents a uniform interface to the programmer that can hide this condition.
Figure 3.2: Memory hierarchy of AMD GPUs: representative of the OpenCL memory model. Source:[30]
Personal contributions involving OpenCL required careful tuning of memory allocations and accesses. The next chapters will provide measurements that highlight the impact of memory on OpenCL performance.
3.1.2 Vulkan
Vulkan is a new API for graphics that aims to supersede OpenGL in terms of performance and CPU usage. The standard emerged in early 2016 and is supported by major vendors such as AMD, Intel and Nvidia. Vulkan can use GLSL shaders and OpenCL kernels; it compiles them into an intermediary representation called SPIR-V (Standard Portable Intermediate Representation). This new approach helps solve some compatibility issues across multiple platforms.
Implementation control of Vulkan applications is explicitly passed to the developer. Unlike OpenGL, the programmer no longer has to deal with a single state machine contained within a single context, but rather with multiple objects and command buffers. Error checking, memory management and synchronization become the responsibility of the user. In doing this, the creators of the standard put more power in the hands of the users and made driver development for Vulkan simpler and less demanding (compared to other graphics APIs).
Vulkan keeps some of the abstractions defined in OpenGL and OpenCL and adds new features:
- devices (abstraction for the underlying hardware and driver)
- queues (for execution – each queue is associated with a device)
- command buffers (objects that record commands that can later be submitted to a queue for execution)
- synchronization objects (fences, semaphores, barriers, events)
As a future goal, any work done with OpenCL or CUDA could in theory be performed by Vulkan. An interesting case study could be created by comparing implementations from these APIs back-to-back.
3.1.3 Early Experiments in Intrusion Detection and Prevention Systems
Network intrusion detection systems (NIDS) are comprised of software solutions and dedicated hardware designed to monitor network traffic and alert the system administrator of any suspicious activity. Network intrusion prevention systems (NIPS) add a proactive element to NIDS, and are capable (in theory) of taking decisive action in order to prevent unauthorized activities inside the network. NIDS can be thought of as complex logging mechanisms that can serve to identify malicious behavior and perform corrective actions at a later date. NIPS, on the other hand, must be able to respond in real time to security threats. The type of response can be static (based on a set of predefined rules), or dynamic (following complex algorithms that use machine learning or similar artificial intelligence schemes).
The National Institute of Standards and Technology (NIST) has issued a document describing some of the best practices in deploying NIDS and NIPS within an organization's network. Different types of detection and protection mechanisms can be found there[31] and are helpful in understanding the problem at length.
Host Intrusion Detection and Prevention Systems (HIDS) comprise software firewalls, monitoring and antivirus software that are designed to run on and protect a single machine. Besides incoming and outgoing traffic, these systems also monitor internal operating system configuration and behavior. Deployed within a network they act as a `last line of defense' against hackers. The rest of this section will describe some open source intrusion detection and prevention systems, highlight their latest features and comment on their performance and usage of parallel and GPU-specific algorithms.
Efforts to use GPUs to accelerate intrusion detection can be traced back to 2008, when Gnort, a version of Snort 2.2.1 that used GPU processing to accelerate Aho-Corasick computation, was first published[32].
Since then there has been significant work to use upcoming technologies to further leverage the advantage that highly-parallel GPU computation offers. Studies have focused on both CUDA[33][34] and OpenCL[35][36] to perform deep packet inspection as part of NIDS and to advocate the future usage of GPUs to aid network sensors and firewalls. Porting Suricata's 2.2.2 CUDA code has been planned since 2013, but as of this writing no progress has been made in this direction. A possible explanation for this delay is the man-hour cost of undertaking such a task, compared to the expected benefit: OpenCL performance on commodity hardware was trailing at the time by a short distance behind CUDA[37].
The Aho-Corasick string matching automaton has been customized to work with CUDA's memory model. Version 2.0 of Suricata possesses a single kernel (SCACCudaSearch in util-mpm-ac-cuda-kernel.cu) that handles the above task.
3.1.4 Performance Considerations for GPGPU in Network Security
Figures presented in academic works on the performance of GPU firewalls or packet processing models usually state the penalties and bottlenecks encountered when working with GPU hardware and APIs.
A relevant example of this is GASPP[38] – a GPU-accelerated stateful packet processing framework. Following a similar path to the one described before, the authors created a modular system capable of assisting traffic processing applications, intrusion detection and prevention systems, packet encryption applications, and traffic classification tools. The strategy relied on caching incoming packets and pushing them to the GPU opportunistically – using a well defined metric, some packets could be processed immediately by the CPU to avoid the overhead of data transfers.
Memory bandwidth is in fact a serious limitation when dealing with high-speed network traffic (10 Gbit/s or more). Some data processing operations do not fit well with the GPU programming model:
- Conditional processing and branching operations
- Inconsistent execution paths and divergent workloads (fluctuating network load)
- Operation scheduling for hard real-time packets (that cannot wait for the transfer buffer to get full or a timer to expire)
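The opportunistic CPU-or-GPU dispatch idea described for GASPP can be sketched as a small decision function. The thresholds, the microsecond budget and the function name below are invented for illustration and do not reproduce the paper's actual metric.

```python
def dispatch(deadline_us, batch_fill, batch_capacity, transfer_cost_us=150):
    """Decide where one packet is processed (all numbers are hypothetical).

    A hard real-time packet that cannot absorb the host-to-GPU transfer
    latency is handled immediately on the CPU; other packets are batched
    for the GPU, and a full batch triggers a flush to the device."""
    if deadline_us < transfer_cost_us:
        return "cpu"                 # too urgent to wait for a transfer
    if batch_fill + 1 >= batch_capacity:
        return "gpu-flush"           # batch is full, push it to the GPU
    return "gpu-batch"               # buffer the packet for a later transfer
```

In a real system the metric would also weigh current GPU occupancy and bus bandwidth; the point here is only the shape of the decision that lets latency-critical packets bypass the batching path.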
Nevertheless the high throughput of modern GPUs can prove cost-effective in processing large amounts of data at once – unified memory model hardware may in the future bridge the memory gap and allow for efficient real-time traffic processing. Heterogeneous APIs are evolving in a direction that targets all market segments: mobile, client and server. Developers are expected to be able to access GPUs for general purpose computation regardless of the underlying architecture.
3.2 Proof of Concepts
The current section will list the most recent work done in academia around the concept of a GPU-powered firewall. The discussion will focus on recent trends and the current state of the art. GPUs have been used to accelerate network applications for quite some time. An area that can benefit from parallelization in modern firewalls is packet classification. Firewalls can be implemented in hardware or software, but in the current context we will refer just to software firewalls.
3.2.1 Stateless Firewalls
Stateless firewalls perform static packet filtering actions based on a set of predefined rules. They are not able to adapt and detect new traffic patterns and emerging threats. This rigidity can be a drawback in some scenarios, but they make up for this shortcoming by being generally cheaper to build, maintain and configure. An interesting proof of concept was presented in [39]. The solution relied on Netfilter hooks and a kernel module to create a stateless firewall. The model forwarded network traffic to user space where it was processed on the GPU using CUDA. The processing involved applying a series of rules to each packet and issuing a verdict based on the nature of the packet. Each packet was processed by a single GPU block that created multiple threads – in this way the authors were able to leverage both the data-parallel and function-parallel capabilities of the GPU. The setup was thus able to reduce the packet delay caused by firewall overhead through the use of parallelism. GPU processing has also been used to augment network sensors in clusters with considerably high data traffic. It has been shown that a single Nvidia M2070 GPU can handle in excess of 11 million packets per second[40].
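The per-packet, rule-list style of processing used by such stateless designs maps naturally onto one GPU block per packet. A CPU-side sketch of the classification step, with a made-up three-rule policy (first match wins), could read:

```python
# Hypothetical rule set: (protocol, destination port, verdict), first match wins.
RULES = [
    ("tcp", 22, "DROP"),
    ("tcp", 80, "ACCEPT"),
    ("udp", 53, "ACCEPT"),
]
DEFAULT_VERDICT = "DROP"

def classify(packet, rules=RULES):
    # apply the static rules in order and return the first matching verdict
    for proto, port, verdict in rules:
        if packet["proto"] == proto and packet["dst_port"] == port:
            return verdict
    return DEFAULT_VERDICT

def classify_batch(packets):
    # packets are independent of one another, so this loop is exactly the
    # part a GPU can execute with one block (or work item) per packet
    return [classify(p) for p in packets]

print(classify_batch([{"proto": "tcp", "dst_port": 80},
                      {"proto": "tcp", "dst_port": 443}]))  # ['ACCEPT', 'DROP']
```

Because each packet's verdict depends only on that packet and the shared, read-only rule set, the batch loop is embarrassingly parallel, which is what makes the single-block-per-packet mapping in [39] effective.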
3.2.2 Stateful Firewalls
Unlike their counterparts, stateful firewalls attempt to perform packet filtering based on dynamic rules and connection parameters such as protocol type, port number, timing, etc. They are generally slower and more expensive to maintain, but can provide enhanced protection against certain types of attackers. GPGPU computing has been used to improve firewall performance by implementing SIMD algorithms that handle complex scenarios and traffic patterns[41]. Besides throughput, power consumption is another consideration when designing firewalls and NIDPSs. Commercial, off-the-shelf GPUs consume considerably less power than CPUs when performing similar tasks. This is because modern CPUs are designed around the concept of race-to-sleep: the CPU finishes its computational tasks as soon as possible, then goes into a low-power state and reduces its frequency. This behavior does not map well onto the constraints of network traffic analysis – where constant asynchronous events can keep the CPU ramped up for long periods. On the other hand, a GPU's parallel design allows it to outperform even dedicated routers in terms of energy efficiency[42].
3.2.3 Technological Advancements
Commercial hardware manufacturers are following the trend described above and investing in new technologies that enable CPUs and GPUs to share work in a more seamless way. Heterogeneous System Architecture from AMD is a novel computer architecture that integrates the GPU and CPU inside the same computing context. The two devices can thus share memory and interoperate more efficiently. The new programming model is based on previous experiences with Close to Metal and OpenCL. AMD targets all market segments (mobile, desktop, server, cloud, etc.) with this new approach and aims to provide better performance and reduced power consumption with its new parallel hardware1.
On the same note, Nvidia is investing in unified memory technologies such as OpenACC (Open Accelerators) and includes support in CUDA and its new Maxwell microarchitecture[43]2.
Intel® CPU architecture has evolved towards a unified memory model with its integrated graphics: Intel® HD Graphics and Iris™. Support for OpenCL on these platforms can be found in closed source drivers and SDKs, but is also implemented in open source projects such as Beignet3.
The general consensus moving forward is to invest in hybrid technologies that can efficiently leverage the computing power of either the CPU or the GPU and present a uniform programming model interface that is easy for developers to learn and use.
3.3 Modeling a Hybrid Intrusion Detection System
Modern security software has to be proactive in order to adapt to 0-day attacks. Security of critical infrastructure cannot be expected to rely on a set of predefined rules and behaviors. On the other hand, a very strict security policy risks producing false positives and impacting the ability to use the protected infrastructure efficiently. Efforts have been made to bring elements of artificial intelligence (especially machine learning) into the fray. The results are intrusion detection and prevention systems that are self-adaptive and able to deal well with new threats. The constraints of real-life scenarios make these technologies applicable to small infrastructures such as private clouds, or to purely educational purposes[44][45].
In order to solve the problem posed by high network traffic and intense computation we propose a theoretical model for an intrusion detection system that relies on the technologies discussed in the previous sections. The main goal would be to intercept and analyze network traffic using the CPU and GPU and, where possible, drop dangerous traffic or log suspicious activity. As the reference operating system for our model we have selected Linux, based on the fact that it is open source and its derivatives are popular across market segments.
3.3.1 Proposed Software Stack
From a software point of view, our model would have to be split between kernel and userspace. The kernel part would be responsible for intercepting and forwarding network traffic to a group of userspace processes that can be configured to either perform real-time analysis or just log events. Additional logic can be integrated in the form of peer-to-peer protocols that can further protect individual nodes against malicious intrusion[46].
1http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/
2https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
3http://www.intel.com/content/www/us/en/architecture-and-technology/visual-technology/graphics-overview.html
Interception of network traffic on a Linux machine can be achieved by creating a kernel module that uses Netfilter hooks. Userspace libraries such as libpcap can also be used to provide a uniform interface across different machines and make the IDS adaptable to Linux kernel changes. The kernel module would likewise be capable of terminating suspicious connections based on feedback from userspace.
Communication between kernel and userspace can be achieved with ioctl (if the kernel module is implemented as a character device) or netlink sockets. Netlink sockets allow each process to receive a buffer of data corresponding to a subset of the intercepted traffic. The whole system will therefore behave as a multiple instruction multiple data (MIMD) machine. The security expert is thus free to create a new userspace component to handle new scenarios and enforce different policies.
Once the data buffer reaches the process in userspace it gets analyzed. The analysis can be performed with whatever resources the machine has available. In order to achieve a maximum level of parallelism we can use OpenMP for CPU computation and OpenCL for the GPU. OpenCL's programming model allows the user to run code on multiple GPUs and CPUs in a single context. The result of the analysis can then be logged to disk and, if any red flags are raised during this process, an email can be sent to the system administrator and the connection can be terminated (albeit with a certain delay).
The drawback of this system is that it is not capable of real-time intrusion prevention, due to the window created between sending data from kernel to userspace, processing and interpreting the data, and contacting the kernel module again if there is a problem. The advantage would be that there is little impact on network speed and the forensic data is gathered asynchronously.
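The kernel-to-userspace path described above can be sketched as two small functions: a round-robin demultiplexer standing in for the netlink distribution of traffic subsets, and a naive signature scan standing in for the OpenMP/OpenCL analysis stage. Both are simplified, hypothetical stand-ins, not the proposed implementation itself.

```python
def demultiplex(buffers, n_workers):
    # stand-in for netlink delivery: each userspace process receives a
    # subset of the intercepted traffic (round-robin for illustration)
    queues = [[] for _ in range(n_workers)]
    for i, buf in enumerate(buffers):
        queues[i % n_workers].append(buf)
    return queues

def analyze(queue, signatures):
    # stand-in for the parallel scan stage: flag every buffer containing
    # a known signature, to be logged or reported back to the kernel side
    return [(buf, sig) for buf in queue for sig in signatures if sig in buf]

queues = demultiplex(["pkt0", "pkt1", "evil pkt2", "pkt3"], 2)
alerts = [hit for q in queues for hit in analyze(q, ["evil"])]
print(alerts)  # [('evil pkt2', 'evil')]
```

Because each worker owns a disjoint subset of buffers and may run a different analysis component, the ensemble behaves as the MIMD machine described earlier, and the alert list corresponds to the asynchronous log/alert path rather than to in-line blocking.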
Figure 3.3: Hybrid multiplexing architecture
3.3.2 Estimated Hardware Requirements
The multi-process and multi-device architecture can be tuned to either provide maximum throughput or have as little impact on system resources as the administrator desires. Depending on the nature of the traffic and the existing policies for monitoring, components can be added or turned off – OpenCL allows users to add multiple devices into a computing context or to split (and share) a single device across contexts. A unified memory architecture would be able to reduce the overhead imposed by memory transfers between CPU and GPU. Intel 6th generation processors share L3 cache memory (up to 8 MB in size) with the integrated GPU. The corresponding GPU model can be Intel® HD or Iris graphics running at about 1 GHz with 16 to 24 execution units. Adding a dedicated graphics card from ATI or Nvidia would result in a system that can perform up to 7000 GFLOPS.
Figure 3.4: Role distribution and resource sharing for the Hybrid IDS
Recent work in academia suggests that GPUs will start to play a more significant role in the field of intrusion detection and prevention. Several experiments and proofs of concept will be presented in this thesis; all of them conclude that existing IDPSs can benefit from parallelism and perform generally better in terms of throughput and power consumption than CPU-only solutions. The advantages GPUs offer in augmenting network sensors are their ability to offload compute-intensive tasks and their redeployment flexibility (if network traffic is low, they can be reassigned to other tasks – especially useful in cloud infrastructures).
Heterogeneous parallel APIs are being actively developed by major hardware vendors, making developers more comfortable in porting existing solutions to the new architectures.
These combined facts encourage the author of this document to predict the expansion of GPU usage in networking applications such as firewalls and IDPSs in the following years.
Chapter 4
String Matching
As presented in the previous chapters, string matching plays a big part in the process of intrusion detection: network intrusion detection systems spend a great deal of their time trying to match malware signatures to incoming traffic. The same can be said about antivirus software. Given these facts, it has become paramount for an application to perform efficient pattern matching as part of securing a system or network.
4.1 Classical Algorithms
Network analysts and security professionals have worked to integrate and adapt existing algorithms to their current needs. The following are some common goals to take into consideration when looking to adopt a string matching algorithm for a security solution:
- Algorithmic complexity
  - Theoretical bounds on time and space can help developers make a decision before actually implementing the algorithm in software.
  - It can provide a good estimate of how the algorithm will scale on future workloads.
- Resources used
  - Memory usage can affect the performance of an algorithm.
  - Algorithms that use a lot of memory can perform poorly on devices with limited resources.
  - Parallel algorithms might require special hardware to give maximum performance.
  - Branching, lookup and hashing operations might have different throughput depending on the type of data and hardware in use.
- Deterministic behavior
  - Some applications might require exact pattern matching.
  - Approximate matches could improve speed, but result in more false positives.
- Pattern type
  - Algorithms can be tuned for specific pattern types or alphabet sizes.
  - Application scope might require multiple pattern matching.
The current chapter will try to address all of these issues as it gives an expanded definition of the most widespread pattern matching algorithms used in industry and academia. In the search for more speed and less power consumption, researchers have turned to parallelism in order to improve application performance and user experience. GPGPU has provided a platform for many experiments and state of the art implementations of parallel string matching.
4.1.1 Bo y er-Mo ore
The Boyer-Moore (BM) string search algorithm was developed in 1977 [47]. It is considered the standard benchmark for string matching, and variations of it are implemented in multiple applications. The logic behind any efficient string matching algorithm is to minimize the number of character or byte comparisons when searching for the pattern inside the input text. This reduces computation time and memory accesses, resulting in a direct performance gain on all platforms.
The algorithm includes a preprocessing stage before the actual search can be conducted. The original version operates on a single pattern and builds a lookup table from it. The lookup table contains integer values that are used to skip as many characters as possible during the search stage. A trademark of this algorithm is the fact that the search starts from the end of the pattern. The two strings are aligned and successive comparisons occur until a mismatch is detected. The text character that caused the mismatch is used to determine the number of characters that will be skipped in the following iteration and where the next comparisons will start.
The asymptotic complexity of the algorithm is O(m + n) if the pattern is not included in the text and O(m*n) otherwise (worst case: a match occurs at every position), where m is the size of the pattern and n the length of the input. In theory, the speedup over a plain brute-force approach increases with the pattern size. The original implementation requires an additional O(m) buffer to store the lookup table.
The values in the lookup table are based on a couple of rules that try to determine the maximum number of characters that can be safely skipped. The most basic rule is called the bad character rule: once a mismatch is detected, the pattern is shifted to the next occurrence of the current text character, or past it if that character is not present in the pattern. A lookup table for this rule requires an entry for each character in the set, pointing to the first equal instance to the left of it in the pattern; the pattern is then shifted by that amount. More complex rules are the good suffix rule and the Galil rule [48].
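The bad character rule can be sketched in a few lines. The fragment below is an illustrative simplification, not any of the implementations discussed later in this chapter: it uses only the bad character table and omits the good suffix and Galil rules.

```python
def bad_char_table(pattern):
    """Rightmost occurrence of each character in the pattern."""
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = i  # later occurrences overwrite earlier ones
    return table

def bm_search(text, pattern):
    """Boyer-Moore search using only the bad-character rule."""
    table = bad_char_table(pattern)
    m, n = len(pattern), len(text)
    matches, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1  # compare right to left, from the end of the pattern
        if j < 0:
            matches.append(i)
            i += 1
        else:
            # shift so the mismatching text character aligns with its
            # rightmost occurrence in the pattern, or jumps past it
            i += max(1, j - table.get(text[i + j], -1))
    return matches
```

On a mismatch the shift grows with the distance between the mismatch position and the character's rightmost occurrence, which is why longer patterns tend to produce larger skips.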
BM implementations can be found in antivirus software as well as network intrusion detection and prevention software. Some variations, such as Boyer-Moore-Horspool (BMH), trade some space to achieve an average-case complexity of O(n) [49]. Snort and Suricata rely on optimized versions of the algorithm to match their rule sets against network traffic. The algorithm in its original form does not lend itself to parallelization. Academic research and industry implementations focus on exploiting the data-parallel characteristics of their respective applications and customizing BM to map closely to parallel architectures; an example can be found in [50]. Later sections will provide details on personal contributions and work on using BM on parallel and heterogeneous hardware.
4.1.2 Knuth-Morris-Pratt
Knuth-Morris-Pratt (KMP) is a string searching algorithm that was developed in 1970 [51] and enjoys significant popularity. KMP character comparisons begin from the start of the pattern (as opposed to BM). The pattern is processed and a table is built with information about the previously matched characters: if, for example, we reach index k in the pattern, that means the previous k-1 characters matched k-1 text characters. The algorithm uses this information to reduce the number of comparisons.
The "partial matc h" table acts as a failure function and determines the n um b er of
redundan t comparisons that can b e skipp ed in the ev en t of a c haracter mismatc h. Eac h
k -th en try in the table is built based on the structure of the 0 – k-1 pattern substring.
The longer the sux and prex of this substring matc h, the more c haracters can b e
skipp ed if k do es not matc h the input (b y mo ving forw ard un til the prex o v erlaps the
p osition of the substring sux – and then con tin uing the searc h). An imp ortan t concept
here is the b or der . A b order of a string X is a another string Y that is b oth a prop er
sux and a prop er prex for X . The nul l string is considered a b order of length 0 for
ev ery non-n ull string. the prepro cessing stage of KMP , th us, rev olv es around nding
the longest b order for eac h substring con tained in the pattern.
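The preprocessing and search stages can be sketched as below. This is a minimal textbook rendering for illustration, not one of the tuned implementations discussed later; `failure[k]` holds the length of the longest border of the prefix ending at position k.

```python
def border_table(pattern):
    """failure[k] = length of the longest proper border of pattern[:k+1]."""
    m = len(pattern)
    failure = [0] * m
    k = 0  # length of the current border
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure

def kmp_search(text, pattern):
    """Forward scan; a text character is never read twice."""
    failure = border_table(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep scanning for overlapping matches
    return matches
```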
The complexity of KMP is O(n), with at most 2*n comparisons being performed. The preprocessing stage has O(m) complexity. KMP works well in scenarios where the pattern and input text are composed over a small alphabet. For example, DNA sequence matching can involve very long patterns and inputs but only a four-character alphabet ({A, C, G, T}).
In security-related searches a significant level of randomness is expected, and alphabets can be as large as two- or even four-byte character sets (though most often a 256-symbol alphabet is used). In practice the number of positive pattern matches is negligible compared to the size of the analyzed data when scanning for vulnerabilities on disk or on the network. The difference between KMP's 2*n comparisons and BM's n + m is clearly noticeable in software. This makes KMP a good second choice at best, though some implementations may be tuned to fit custom scenarios. Comparative analyses of parallel KMP and BM implementations tend to reinforce this point [52].
Both BM and KMP are single-pattern algorithms, but they can be extended to multiple-pattern matching; in this respect KMP takes up less memory for its preprocessed data structures, which may render it more suitable for low-end devices that run on limited resources [53]. Forward comparisons mean that, unlike generic BM, KMP can benefit from some hardware optimizations such as efficient caching or SIMD instructions.
4.1.3 Aho-Corasick
Both BM and KMP build helper tables that facilitate efficient pattern matching as new input text is fed. Their internal representation is similar to an automaton: on each transition a number of characters is skipped and the current index is adjusted (a new state). Aho-Corasick (AC) takes this concept to the next level and creates a complex automaton that can solve multiple pattern matches simultaneously. The algorithm was designed in 1975 and presented as a new way of improving the speed of library bibliographic searches [11]. The algorithm takes many keywords as input and builds a data structure similar to a trie: each node can represent a single character or an atom composed of a combination of characters. The transitions between the nodes represent the relations between the individual patterns that compose the automaton.
AC can take whole dictionaries as input and use them to build a string matching automaton. The search complexity is linear in the total length of the patterns, the input strings and the eventual matches. Implementations of the algorithm can be found in many software programs that have to perform string searches (the Unix command fgrep is an example). In security applications AC is used to match entire databases of virus signatures or other known malware against network traffic or file contents. Snort, Suricata and ClamAV have their own implementations of AC for rule matching.
Academic research around AC has produced a great number of variations. Compute speed (parallel speedup) and memory footprint (automaton size and compression) are usually the most important aspects that govern AC optimizations. Unlike the previous two algorithms, AC does not skip any of the input bytes when a mismatch occurs: it simply advances into a new state and uses the next input byte to compute a new transition, and so on.
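The classic construction and this no-backtracking search loop can be sketched as follows. The fragment below is a schematic Python rendering with dictionary-based goto transitions; the production implementations discussed later use dense transition tables instead.

```python
from collections import deque

def build_automaton(patterns):
    """Trie plus BFS-computed fail links (classic Aho-Corasick)."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    queue = deque(goto[0].values())  # depth-1 states keep fail = root
    while queue:  # breadth-first fail-link construction
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit matches reachable via fail link
    return goto, fail, out

def ac_search(text, goto, fail, out):
    """One pass over the input; the automaton never backs up."""
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # mismatch: follow fail links, consume no input
        state = goto[state].get(ch, 0)
        for p in out[state]:
            matches.append((i - len(p) + 1, p))
    return matches
```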
This aspect can be exploited when designing a parallel implementation of AC, as each transition between states is independent of other transitions within the automaton. Parallel proofs of concept have used this idea to create highly tuned implementations for performing string matching in intrusion detection systems on the GPU [54]. Suricata has an AC module written in CUDA for this purpose.
Experiments with AC in security have focused on achieving fast and scalable performance on parallel systems. Examples include multi-threaded systems [55], GPU offloading for network sensors [56], distributed applications [57] and embedded platforms [58]. Gnort, a GPU derivative of Snort, also uses a parallel version of AC [32].
Personal contributions to the state of the art are included later in this chapter, where I discuss at length some of the experiments performed with parallel AC automatons on multi-core CPUs and GPUs.
4.1.4 Other Algorithms and Notable Variations
// TODO
4.2 Parallel String Matching
The decision to parallelize a string matching algorithm used inside a piece of security software needs to take into account the runtime realities of that software. The workflow of a network intrusion detection and prevention system can differ significantly from that of a host intrusion system or an antivirus. There is a risk that the computational effort is spread across a lengthy timeline (spikes of variable granularity) in a way that makes parallelization ineffective. If the solution targets a specific customer, the application developers have to take that customer's deployment capabilities into account. For example, creating a parallel CUDA solution restricts its usage to NVIDIA hardware exclusively. While performance might be greatly improved, the solution might fail to reach a significant segment of users, thus limiting its overall impact.
4.2.1 Using OpenCL in Network Intrusion Detection
In order to cover as much of the existing ecosystem as possible, I decided to focus on accelerating multi-pattern matching algorithms using OpenCL and to test their performance across a multitude of form factors. The proposed implementation combines notions from other works [59][60][61] and creates a case study that allows the reader to compare the performance of many versions of parallel Aho-Corasick automatons.
Aho-Corasick compresses into a GPU-friendly structure that can be loaded into local memory and shared between individual work items. At the time of writing, OpenCL implementations are available on all major operating systems and hardware architectures. My experiments start out with a classic version of Aho-Corasick:
Figure 4.1: AC automaton for input patterns: abc, b, ba, c; the curved lines represent fail links.
The automaton is transformed along the way. A valuable optimization for parallel environments is described in [62]. The internal data structure grows as new patterns (NIDS rules) are added. Once this stage is complete, the required links between the nodes are built and a breadth-first search copies the relevant data into a contiguous region of memory. In other words, the automaton is transformed from a radix tree into a multi-linked list:
Figure 4.2: AC automaton as radix tree; indexes represent the order of the element in the list.
Figure 4.3: AC automaton as list.
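The transformation can be sketched as below. This is a schematic Python model of the idea only (breadth-first renumbering into one contiguous list), not the actual memory layout used in the OpenCL experiments; the node fields shown here are illustrative.

```python
from collections import deque

def flatten(goto, fail):
    """Relabel states in breadth-first order and emit one contiguous list.

    Each flat node stores its fail index plus a child map of flat indices,
    so the whole automaton can be copied into device-visible memory as a
    single block and shared between work items.
    """
    order, index = [], {}
    queue = deque([0])
    while queue:  # breadth-first numbering, starting at the root
        s = queue.popleft()
        index[s] = len(order)
        order.append(s)
        queue.extend(goto[s].values())
    flat = []
    for s in order:  # emit nodes in their new, contiguous order
        flat.append({
            "fail": index[fail[s]],
            "next": {ch: index[t] for ch, t in goto[s].items()},
        })
    return flat
```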
Using the new structure we can now build an embarrassingly parallel implementation to perform string matching. For testing purposes I used packet capture files from online sources and rules from the Snort library. OpenCL execution parameters were selected to match a series of real-life scenarios such as high-load traffic, small packets, big log files, etc. I compared the GPU performance with the best CPU implementation: serial, and parallelized with threads (Windows threads and pthreads). The test hardware consisted of two systems with Intel® i7-2600K CPUs running at 3.40 GHz (4 cores, 8 logical processors with hyper-threading). The graphics cards used to run OpenCL kernels were an AMD Radeon HD 6850 for the desktop, and an NVIDIA GeForce 820M and AMD Radeon 7650M for the laptops. For Android I used a Nexus 5 device, with a quad-core 2.3 GHz Krait 400 CPU and an Adreno 330 GPU. A detailed description of the exact test environment can be found in the article I published on this topic in the proceedings of the 2015 Control Systems and Computer Science (CSCS) International Conference [63].
The most relevant results are presented below. The figures provide an argument for the use of parallelism inside NIDS applications.
Figure 4.4: Aho-Corasick performance comparison across hardware.
The performance gain was bigger when the input data was larger and more compact (rather than multiple kernel executions on smaller chunks).
Figure 4.5: Performance after increasing GPU work item count.
I corroborated the performance tests with a detailed power analysis. The analysis showed that not only was the parallel implementation faster, it was also more efficient in terms of the total energy consumed. The CPUs on both the desktop and the Android device experienced overheating and throttling under heavy loads, which further reduced their efficiency in the long run.
Figure 4.6: W/s chart showing GPU and CPU power consumption while running AC on desktop.
Figure 4.7: W/s comparison of power consumption between pthreads CPU and GPU running OpenCL.
While complex NIDS are expected to run on dedicated hardware and server-class CPUs, this study is relevant because it shows that OpenCL development on low-end devices has the potential to improve platform security and create new vectors of defense in the form of host intrusion detection systems or low-power antivirus software. The next section of this chapter will follow up on this idea and provide more measurement data to back up this claim.
My study concludes that the overall benefits of using GPUs to improve network traffic scans are high availability, reduced power consumption and large throughput. One of the lessons learned is that string matching algorithms like the ones used in security applications scale well in parallel environments (multi-core CPUs and GPUs). Even though the GPU environment is non-uniform and the performance of specific vendor drivers might vary dramatically from one platform to another, there is growing interest in the use of GPUs and parallel technologies in the security industry.
4.2.2 Parallel Malware Detection for Laptops and Ultrabooks
The previous section highlighted the benefits OpenCL parallel implementations can bring to security applications that rely on large amounts of string matching. The current section is dedicated to the next step along this research path: integrating an OpenCL processing module into an existing solution. The previous study showed that OpenCL can reduce power consumption and offload critical computation when the amount of work is large. Low-end consumer devices seldom face such scenarios when analyzing network traffic. On the other hand, host intrusion detection applications (antivirus software) are expected to perform CPU-intensive scans regularly, even on battery-powered devices. These devices usually have integrated graphics that possess significant computational power and share memory with the CPU. Integrated GPUs are not currently being used to their fullest ability in security software. Their lack of dedicated memory may put them at a disadvantage against discrete GPUs, but it also means they can work more closely with the CPU and be part of hybrid implementations (a point argued earlier as an advantage of OpenCL over other parallel frameworks).
On individual hosts, antiviruses track down and protect against viruses and other malware. Malware is a generic term for any piece of software that causes inconvenience or otherwise harms the user. Malware can propagate over the network or through any other I/O port available on the device (USB sticks, optical disc storage, etc.). Antivirus software detects malicious code by scanning the hard drive or system memory. The usual approaches are static analysis and dynamic analysis. In static analysis the contents of the filesystem or memory are matched against a set of known patterns that represent virus signatures. In dynamic analysis the antivirus builds a logical model of what a piece of code is trying to do and flags it as a threat based on a series of metrics. Large virus signature databases are kept and updated for the purpose of providing identification for static analysis tools. Speed and accuracy are the critical elements governing the performance of antiviruses. Based on this statement I designed a hybrid approach that leverages CPU and GPU compute capabilities in order to accelerate pattern matching for malware signatures. The solution presented focuses on improving performance and reducing power consumption of string matching algorithms on devices such as ultrabooks and laptops.
The string matching algorithm selected for this implementation is a variation of Boyer-Moore-Horspool (BMH), tuned for multi-pattern matching. As in the network intrusion case, actual malware detections are few and far between compared to the scanning frequency. To exploit this, I selected an approximate string matching version of the algorithm. The algorithm preprocesses the malware database and creates 4-byte hashes from each signature. The hashes are not unique, and collisions can happen between signatures and harmless content. During the actual parsing, input text is hashed as well and matched against the precomputed BMH table. Once a match is recorded, the actual signature is checked byte by byte against the input stream. This method drastically reduces the number of byte comparisons that have to be performed. By setting the size of the hashed chunks to 32- or 64-byte-long patterns we further reduce the amount of work that needs to be done. Furthermore, all this hashing and matching can be done in parallel; results simply need to be communicated to the next level for any additional string matching. This fits well into OpenCL's computing model, as each of these operations can be encapsulated into a __kernel function. Since the output of this function is boolean, it can be encoded as a bitmask into a 64-bit integer value (the largest scalar data type available in OpenCL's kernel language). This means that each GPU thread can perform up to 64 checks at a time, thus reducing the number of memory transfers between the CPU and GPU (which still have to be handled through OpenCL APIs even though both share the same memory). The results of the GPU computation are then interpreted by the CPU: if any hash matched the input, the CPU performs additional string matching to make sure we do not have a false positive.
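The split between the two stages can be sketched as below. This is a host-side Python model of the idea only: the hash function, chunk size and helper names are placeholders of my own, not the actual signature hash or the OpenCL __kernel code of the implementation.

```python
def hash4(chunk):
    """Toy 4-byte hash standing in for the signature hash described
    above (the real function is not reproduced here)."""
    h = 0
    for b in chunk:
        h = ((h << 5) ^ (h >> 2) ^ b) & 0xFFFFFFFF
    return h

def gpu_stage(data, sig_hashes, base):
    """Model of one GPU work item: 64 window checks packed into one
    64-bit mask, so a single integer travels back to the CPU."""
    mask = 0
    for bit in range(64):
        pos = base + bit
        if pos + 4 <= len(data) and hash4(data[pos:pos + 4]) in sig_hashes:
            mask |= 1 << bit  # candidate only; may be a hash collision
    return mask

def cpu_verify(data, mask, base, signatures):
    """Exact byte-for-byte confirmation of every flagged window,
    filtering out false positives caused by collisions."""
    hits = []
    for bit in range(64):
        if mask >> bit & 1:
            pos = base + bit
            for sig in signatures:
                if data[pos:pos + len(sig)] == sig:
                    hits.append((pos, sig))
    return hits
```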
Figure 4.8: Hybrid malware scan workflow.
The code was tested on an ultrabook with an Intel® Core™ i7-6600U with integrated HD Graphics 520. The integrated GPU has 24 compute units clocked at 1050 MHz. The implementation is OS-agnostic. I used a malware database of approximately 30,000 signatures. For the tests, I scanned infected content and Windows/Linux system files. I simulated the behavior of antivirus software by selecting only a fixed amount of input data to be parsed at one time (some antivirus software, like ClamAV, buffers many files together; others have separate contexts for individual files). The performance summary can be seen below:
Figure 4.9: Hybrid pattern matching performance on small buffers.
Figure 4.10: Hybrid pattern matching performance on large buffers.
The results show that for small buffer sizes the GPU performance is suboptimal. Upon closer examination it was found that in these scenarios over 98% of computation time was spent performing OpenCL library calls on the CPU (on average about 0.3 milliseconds, regardless of the amount scanned). This severe performance penalty fades away as the amount of work increases. As we start to use larger buffer sizes the speedup approaches 2x. If we exceed the shared memory size (L3) we start to experience additional performance degradation. This is not an important factor, though, as most antivirus software tries to keep a small memory footprint and seldom uses more than 100 MB of memory at one time [64]. This means that the test scenarios used fall well within the memory boundaries encountered in conventional software.
I measured the power consumed by this new setup with tools similar to those used for the previous OpenCL proof of concept (the tools and methodology are described in Section 2.3). I selected a 16 MB buffer size as the most representative size for this study:
Figure 4.11: Comparison of power consumption between hybrid and CPU-only implementations.
The figure shows power savings of up to 25% on a CPU running at 3200 MHz for the duration of the test. The combined power and performance figures clearly advocate the superiority of the hybrid implementation. Opportunistic usage of the GPU can benefit security applications running in resource-constrained environments.
The contents of this study have been presented at the second International Conference on Computer and Applications (ICCA) in Dubai, United Arab Emirates (September 2017).
4.2.3 Conclusions and Lessons Learned
A strong motivation for the work presented in this chapter was the fact that a major bottleneck in network intrusion detection and prevention software is string matching. Static analysis rule matching still plays a significant part in network and host security tools, because it is a reliable way of detecting previously analyzed threats. The risk of false positives is reduced: if any signatures match regular files or traffic, they can simply be whitelisted on a case-by-case basis.
At the time of the first study presented, OpenCL support on Android was limited (there were many shortcomings, for example the lack of local memory). There were a couple of similar studies on string matching using CUDA and a few using OpenCL, but almost none provided an in-depth analysis across form factors: desktop, laptop, mobile. It is counterintuitive that by powering up the GPU and temporarily increasing power draw you can actually reduce the overall energy consumption. This particular concept is called race to halt and is well documented in books and articles that tackle the delicate topic of power saving and green computing. The idea is that if you can perform a task faster using more resources (higher frequencies, more cores) and then go to sleep, you are better off than if you had spent more time at lower frequencies. That is an excellent concept to keep in mind when writing software for battery-powered devices. The problem is that intrusion detection software will still be a burden on battery-powered devices. Vendors are trying alternative methods of securing their devices, by keeping their operating systems closed or by moving the compute-intensive scans into the cloud. Nevertheless, as OpenCL implementations become more mature, GPU-based security solutions will become more popular.
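The arithmetic behind race to halt is simple, since energy is power integrated over time. The wattages and durations below are illustrative numbers chosen for the sketch, not measurements from this study:

```python
# Race to halt: finishing faster at higher power can still save energy
# over a fixed time window, because the device then idles longer.
def scan_energy(active_w, active_s, idle_w, window_s):
    """Total energy (joules) over a window: busy phase + idle remainder."""
    return active_w * active_s + idle_w * (window_s - active_s)

# Low clocks: 10 W for 8 s, then idle at 1 W -> 82 J over a 10 s window.
slow = scan_energy(active_w=10.0, active_s=8.0, idle_w=1.0, window_s=10.0)

# Race to halt: 20 W for 3 s, then idle at 1 W -> 67 J over the same window.
fast = scan_energy(active_w=20.0, active_s=3.0, idle_w=1.0, window_s=10.0)
```

Even though the peak draw doubles, the shorter busy phase wins: the faster run consumes less total energy, which is the behavior observed in the power measurements above.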
String matching algorithms can be tuned to match most types of signatures, even for encrypted malware [65]. Encrypted malware requires a small piece of code to perform the decoding of the payload and move execution to the entry point. For this reason encrypted code usually relies on rudimentary algorithms. These algorithms are enough to bypass plain signature matching detection systems, but the underlying pattern still remains recognizable. By hashing the raw bytes and applying rotation and XOR operations it is possible to cancel out the effects of the encryption. The article cited above gives a few examples of this procedure. The derived BMH algorithm used for the parallel malware detection POC builds on this technique to cover these kinds of threats. The tests yielded positive detections for files that contained trojans, worms, file infectors and other types of encrypted viruses.
The parallel malware module was deployed on laptops with dedicated graphics (ATI and NVIDIA). The results were consistent with the data obtained from the integrated-graphics tests. Even though performance was gained due to the superior level of parallelism and core frequency, the OpenCL driver still suffered from setup penalties, which ruined the performance for small buffers (for files smaller than 1 MB, the GPU scan could be hundreds of times slower than the CPU version).
The POC remains a good candidate for large files such as statically linked executables and network or application logs. Plans to include it in a commercial security solution are being considered.
Chapter 5
Encryption
Chapter 6
Tracing and Visualization
Chapter 7
Conclusions
List of Figures
2.1 Linux Kernel Map . . . 3
2.2 On the left: steps performed when offloading calculations to the GPU on a non-HSA system. On the right: steps performed when offloading calculations to the GPU on an HSA system, using the HSA functionality. Source: [6] . . . 8
2.3 On the CPU, parallelism relies on the usage of processes and threads. . . . 9
3.1 OpenCL abstractions. Source: [29] . . . 16
3.2 Memory hierarchy of AMD GPUs: representative of the OpenCL memory model. Source: [30] . . . 17
3.3 Hybrid multiplexing architecture . . . 22
3.4 Role distribution and resource sharing for Hybrid IDS . . . 23
4.1 AC automaton for input patterns: abc, b, ba, c; the curved lines represent fail links. . . . 28
4.2 AC automaton as radix tree; indexes represent the order of the element in the list. . . . 29
4.3 AC automaton as list. . . . 29
4.4 Aho-Corasick performance comparison across hardware. . . . 30
4.5 Performance after increasing GPU work item count. . . . 30
4.6 W/s chart showing GPU and CPU power consumption while running AC on desktop. . . . 31
4.7 W/s comparison of power consumption between pthreads CPU and GPU running OpenCL. . . . 31
4.8 Hybrid malware scan workflow. . . . 33
4.9 Hybrid pattern matching performance on small buffers. . . . 34
4.10 Hybrid pattern matching performance on large buffers. . . . 34
4.11 Comparison of power consumption between hybrid and CPU-only implementations. . . . 35
Bibliography
[1] Stev en McCanne and V an Jacobson. The BSD P ac k et Filter: A New Arc hitecture
for User-lev el P ac k et Capture. In USENIX winter , v olume 93, 1993.
[2] Suc hakrapani Datt Sharma. Extending the extended BPF KeBPF – UeBPF. 2015.
[3] Vitor Duarte and Nuno F arruca. Using libPcap for monitoring distributed applica-
tions. In High Performanc e Computing and Simulation (HPCS), 2010 International
Confer enc e on , pages 9297. IEEE, 2010.
[4] Neil Desai. Increasing p erformance in high sp eed NIDS. A lo ok at Snort's Internals ,
2002.
[5] Snort. https://www.sno rt.o rg/ , 2017. A ccessed: F ebruary 2017.
[6] Heterogeneous System Arc hitecture. https://en.wikip edia.o rg/w/index.php?title=
Heterogeneous_System_Architecture&oldid=763501263 , 2017. A ccessed: F ebruary
2017.
[7] George Kyriazis. Heterogeneous system arc hitecture: A tec hnical review. AMD
F usion Develop er Summit , 2012.
[8] Lisa T Su. Arc hitecting the future through heterogeneous computing. In Solid-State
Cir cuits Confer enc e Digest of T e chnic al Pap ers (ISSCC), 2013 IEEE International ,
pages 811. IEEE, 2013.
[9] Rob ert Benner, Victor TE Ec hev erria, Uzoma On unkw o, Ja y P atel, and Da vid
Zage. Harnessing man y-core pro cessors for scalable, highly ecien t, and adaptable
rew all solutions. In Computing, Networking and Communic ations (ICNC), 2013
International Confer enc e on , pages 637641. IEEE, 2013.
[10] Noah Dietric h. Snort 2.9. 7. x on Ubun tu 12 and 14 with Barn y ard2, PulledP ork,
and BASE, 2015.
[11] Alfred V Aho and Margaret J Corasic k. Ecien t string matc hing: an aid to
bibliographic searc h. Communic ations of the A CM , 18(6):333340, 1975.
[12] Hung-Jen Liao, Ch un-Hung Ric hard Lin, Ying-Chih Lin, and Kuang-Y uan T ung.
In trusion detection system: A comprehensiv e review. Journal of Network and
Computer Applic ations , 36(1):1624, 2013.
41
BIBLIOGRAPHY 42
[13] Sp yros An tonatos, K ostas G Anagnostakis, and Ev angelos P Mark atos. Generating
realistic w orkloads for net w ork in trusion detection systems. In A CM SIGSOFT
Software Engineering Notes, volume 29, pages 207–215. ACM, 2004.
[14] Yisi Xu and Derek Pao. Space-time tradeoff in the Aho-Corasick string matching algorithm. In Communications and Network Security (CNS), 2015 IEEE Conference on, pages 713–714. IEEE, 2015.
[15] Aleksandar Milenkoski, Marco Vieira, Samuel Kounev, Alberto Avritzer, and Bryan D Payne. Evaluating computer intrusion detection systems: A survey of common practices. ACM Computing Surveys (CSUR), 48(1):12, 2015.
[16] Michal Ennert, Eva Chovancová, and Zuzana Dudláková. Testing of IDS model using several intrusion detection tools. Journal of Applied Mathematics and Computational Mechanics, 14(1), 2015.
[17] Altera. Accelerating Open Source Security Using OpenCL®. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/po/ss-1056-1.0.pdf, 2013. Accessed: February 2017.
[18] Dinesh C Suresh, Satya R Mohanty, Walid A Najjar, Laxmi N Bhuyan, and Frank Vahid. Loop level analysis of security and network applications. In Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-03). Citeseer, 2003.
[19] Loris Degioanni, Mario Baldi, Fulvio Risso, and Gianluca Varenni. Profiling and optimization of software-based network-analysis applications. In Computer Architecture and High Performance Computing, 2003. Proceedings. 15th Symposium on, pages 226–234. IEEE, 2003.
[20] Zhen Chen, Wenyu Dong, Hang Li, Peng Zhang, Xinming Chen, and Junwei Cao. Collaborative network security in multi-tenant data center for cloud computing. Tsinghua Science and Technology, 19(1):82–94, 2014.
[21] Akshay Desai, Krishna Ankalgi, Harish Yamanur, and Siddalingesh S Navalgund. Parallelization of AES algorithm for disk encryption using CBC and ICBC modes. In Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on, pages 1–7. IEEE, 2013.
[22] Georgios-Grigorios Mplemenos and Ioannis Papaefstathiou. Fast and power-efficient hardware implementation of a routing scheme for WSNs. In Wireless Communications and Networking Conference (WCNC), 2012 IEEE, pages 1710–1714. IEEE, 2012.
[23] Massimo Ficco and Francesco Palmieri. Introducing fraudulent energy consumption in cloud infrastructures: a new generation of denial-of-service attacks. IEEE Systems Journal, 2015.
[24] Chunhua Liao, Yonghong Yan, Bronis R De Supinski, Daniel J Quinlan, and Barbara Chapman. Early experiences with the OpenMP accelerator model. In International Workshop on OpenMP, pages 84–98. Springer, 2013.
BIBLIOGRAPHY 43
[25] Radu-Daniel Tomoiaga and Mircea Stratulat. Accelerating solution proposal of AES using a graphic processor. Advances in Electrical and Computer Engineering, 11(4):99–104, 2011.
[26] Salih Gorgunoglu, Ilhami Muharrem Orak, Abdullah Cavusoglu, and Mehmet Gok. Examination of Speed Contribution of Parallelization for Several Fingerprint Pre-Processing Algorithms. Advances in Electrical and Computer Engineering, 14(2):38, 2014.
[27] Mohammad Ahmed Alomari and Khairulmizam Samsudin. A framework for GPU-accelerated AES-XTS encryption in mobile devices. In TENCON 2011-2011 IEEE Region 10 Conference, pages 144–148. IEEE, 2011.
[28] Zhi Qi, Wen Wen, Wei Meng, Ya Zhang, and Longxing Shi. An energy efficient OpenCL implementation of a fingerprint verification system on heterogeneous mobile device. In Embedded and Real-Time Computing Systems and Applications (RTCSA), 2014 IEEE 20th International Conference on, pages 1–8. IEEE, 2014.
[29] A Look at Altera's OpenCL SDK for FPGAs. http://www.anandtech.com/show/7334/a-look-at-alteras-opencl-sdk-for-fpgas/, 2013. Accessed: April 2017.
[30] OpenCL™ Optimization Case Study: Fast Fourier Transform, Part II. http://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-fast-fourier-transform-part-ii/, 2011. Accessed: April 2017.
[31] Karen Scarfone and Peter Mell. Guide to intrusion detection and prevention systems (IDPS). NIST Special Publication, 800(2007):94, 2007.
[32] Giorgos Vasiliadis, Spiros Antonatos, Michalis Polychronakis, Evangelos P Markatos, and Sotiris Ioannidis. Gnort: High performance network intrusion detection using graphics processors. In International Workshop on Recent Advances in Intrusion Detection, pages 116–134. Springer, 2008.
[33] Jyotsna Sharma and Maninder Singh. CUDA based Rabin-Karp Pattern Matching for Deep Packet Inspection on a Multicore GPU. International Journal of Computer Network and Information Security, 7(10):70, 2015.
[34] Gaurav Bhamare and Satish Banait. Faster Multipattern Matching System on GPU Based on Aho-Corasick Algorithm. 2014.
[35] Nhat-Phuong Tran, Myungho Lee, Sugwon Hong, and Jaeyoung Choi. High throughput parallel implementation of Aho-Corasick algorithm on a GPU. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages 1807–1816. IEEE, 2013.
[36] Cheng-Liang Hsieh, Lucas Vespa, and Ning Weng. A high-throughput DPI engine on GPU via algorithm/implementation co-optimization. Journal of Parallel and Distributed Computing, 88:46–56, 2016.
[37] Asad Mohhamad and Vikram Garg. A Comparative Study of GPU Computing by using CUDA and OpenCL. International Journal of Computer Applications, 128(17), 2015.
[38] Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, and Sotiris Ioannidis. GASPP: A GPU-Accelerated Stateful Packet Processing Framework. In USENIX Annual Technical Conference, pages 321–332, 2014.
[39] Kamal Chandra Reddy, Ankit Tharwani, Ch Krishna, et al. Parallel firewalls on general-purpose graphics processing units. arXiv preprint arXiv:1312.4188, 2013.
[40] Wenji Wu and Phil DeMar. A GPU-accelerated network traffic monitoring and analysis system. In Computer Communications Workshops (INFOCOM WKSHPS), 2013 IEEE Conference on, pages 77–78. IEEE, 2013.
[41] Abhaya Kumar Sahoo, Amardeep Das, and Mayank Tiwary. Firewall engine based on graphics processing unit. In Advanced Communication Control and Computing Technologies (ICACCCT), 2014 International Conference on, pages 758–763. IEEE, 2014.
[42] Linus Blomquist and Hampus Engström. GPU based IP forwarding, 2015.
[43] Mark Harris. Unified memory in CUDA 6. GTC On-Demand, NVIDIA, 2013.
[44] Praveen Kumar Rajendran. Hybrid intrusion detection algorithm for private cloud. Indian Journal of Science and Technology, 8(35), 2015.
[45] Razvan Rughinis, George Milescu, Mircea Bardac, and Nicolae Tapus. Optimization of performance monitoring and attack detection in all optical networks.
[46] Laura Gheorghe, Razvan Rughinis, and Nicolae Tapus. Security Infrastructure for Wireless Sensor Networks. University "Politehnica" of Bucharest Scientific Bulletin, Series C: Electrical Engineering, 73(2):103–116, 2011.
[47] Robert S Boyer and J Strother Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762–772, 1977.
[48] Zvi Galil. On improving the worst case running time of the Boyer-Moore string matching algorithm. Communications of the ACM, 22(9):505–508, 1979.
[49] R Nigel Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501–506, 1980.
[50] Yosang Jeong, Nhat-Phuong Tran, Myungho Lee, Dukyun Nam, Jik-Soo Kim, and Soonwook Hwang. Parallelization and Performance Optimization of the Boyer-Moore Algorithm on GPU. KIISE Transactions on Computing Practices, 21(2):138–143, 2015.
[51] Donald E Knuth, James H Morris Jr, and Vaughan R Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.
[52] Charalampos S Kouzinopoulos and Konstantinos G Margaritis. String matching on a multicore GPU using CUDA. In Informatics, 2009. PCI'09. 13th Panhellenic Conference on, pages 14–18. IEEE, 2009.
[53] Deepak Venugopal and Guoning Hu. Efficient signature based malware detection on mobile devices. Mobile Information Systems, 4(1):33–49, 2008.
[54] Antonino Tumeo, Oreste Villa, and Donatella Sciuto. Efficient pattern matching on GPUs for intrusion detection systems. In Proceedings of the 7th ACM International Conference on Computing Frontiers, pages 87–88. ACM, 2010.
[55] S Arudchutha, T Nishanthy, and Roshan G Ragel. String matching with multicore CPUs: Performing better with the Aho-Corasick algorithm. In Industrial and Information Systems (ICIIS), 2013 8th IEEE International Conference on, pages 231–236. IEEE, 2013.
[56] Nhat-Phuong Tran, Myungho Lee, Sugwon Hong, and Minho Shin. Memory efficient parallelization for Aho-Corasick algorithm on a GPU. In High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, pages 432–438. IEEE, 2012.
[57] Antonino Tumeo, Oreste Villa, and Daniel G Chavarria-Miranda. Aho-Corasick string matching on shared and distributed-memory parallel architectures. IEEE Transactions on Parallel and Distributed Systems, 23(3):436–443, 2012.
[58] Manel Abdellatif, Chamseddine Talhi, Abdelwahab Hamou-Lhadj, and Michel Dagenais. On the use of mobile GPU for accelerating malware detection using trace analysis. In Reliable Distributed Systems Workshop (SRDSW), 2015 IEEE 34th Symposium on, pages 42–46. IEEE, 2015.
[59] Nhat-Phuong Tran, Myungho Lee, Sugwon Hong, and Jongwoo Bae. Performance optimization of Aho-Corasick algorithm on a GPU. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International Conference on, pages 1143–1152. IEEE, 2013.
[60] Xinyan Zha and Sartaj Sahni. Highly compressed Aho-Corasick automata for efficient intrusion detection. In Computers and Communications, 2008. ISCC 2008. IEEE Symposium on, pages 298–303. IEEE, 2008.
[61] Shima Soroushnia, Masoud Daneshtalab, Juha Plosila, and Pasi Liljeberg. Heterogeneous parallelization of Aho-Corasick algorithm. In 8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014), pages 153–160. Springer, 2014.
[62] Cheng-Hung Lin, Chen-Hsiung Liu, Lung-Sheng Chien, and Shih-Chieh Chang. Accelerating pattern matching using a novel parallel algorithm on GPUs. IEEE Transactions on Computers, 62(10):1906–1916, 2013.
[63] Radu Velea, Victor-Valeriu Patriciu, and Florina Gurzau. Using OpenCL to Improve String Matching for Network Intrusion Detection. In Control Systems and Computer Science (CSCS), 2015 20th International Conference on, pages 447–452. IEEE, 2015.
[64] Giorgos Vasiliadis and Sotiris Ioannidis. GrAVity: a massively parallel antivirus engine. In International Workshop on Recent Advances in Intrusion Detection, pages 79–96. Springer, 2010.
[65] Mircea Ciubotariu. Virus Cryptoanalysis. Virus Bulletin, 2003.