"ALEXANDRU IOAN CUZA" UNIVERSITY OF IAȘI
FACULTY OF COMPUTER SCIENCE

BACHELOR'S THESIS

Url classification through Hidden Markov Models

proposed by
Alexandru Martiniuc

Session: July 2019
Scientific coordinator: Conf. Dr. Gavriluț Dragoș
Approved,
Bachelor's thesis advisor,
Conf. Dr. Gavriluț Dragoș

Date: ………………………. Signature: ……………………….
Declaration regarding the originality of the bachelor's thesis content

I, the undersigned, Martiniuc Alexandru, domiciled in Romania, Iași county, Iași municipality, calea Buzăului, nr. 25, bl. A, et. 5, ap. 45, born on 1 January 2018, identified by CNP [anonimizat], graduate: [anonimizat], Faculty of Computer Science, specialization Computer Science, class of 2018, declare on my own responsibility, being aware of the consequences of false statements in the sense of art. 326 of the New Penal Code and of the provisions of the National Education Law no. 1/2011, art. 143, par. 4 and 5 regarding plagiarism, that the bachelor's thesis entitled Url classification through Hidden Markov Models, written under the supervision of Conf. Dr. Gavriluț Dragoș, which I am to defend before the committee, is original, belongs to me, and that I take full responsibility for its entire content.

I also declare that I agree that my bachelor's thesis may be verified by any legal means in order to confirm its originality, consenting also to the introduction of its content into a database for this purpose.

I have taken note of the fact that the sale of scientific works for the purpose of facilitating the falsification by the buyer of the status of author of a bachelor's, diploma, or dissertation thesis is forbidden, and in this sense I declare on my own responsibility that the present work has not been copied but represents the result of the research I have carried out.

Date: ………………………. Signature: ……………………….
Declaration of consent

I hereby declare that I agree that the bachelor's thesis entitled Url classification through Hidden Markov Models, the source code of the programs, and the other contents (graphics, multimedia, test data, etc.) accompanying this thesis may be used within the Faculty of Computer Science.

I also agree that the Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iași may use, modify, reproduce, and distribute, for non-commercial purposes, the computer programs, in executable and source form, produced by me as part of the present bachelor's thesis.

Graduate: [anonimizat]

Date: ………………………. Signature: ……………………….
Contents

Motivation
1 Introduction
2 Related work
3 Preliminary
4 Theoretical background
    4.0.1 Markov Model
    4.0.2 Hidden Markov Model
    4.0.3 Elements of an HMM
    4.0.4 The 3 basic problems for HMMs
    4.0.5 Solutions to the problems
    4.0.6 N-grams
5 Proposed solution
    5.1 An HMM approach for URLs
        5.1.1 Building the model
        5.1.2 Feature extraction
    5.2 Results
Conclusions
Bibliography
Motivation

* Why did you want to research HMMs on URLs?
* Here you can mention the idea of a protection layer at the level of a router/gateway, which does not always have access to the content.
Chapter 1

Introduction

* A kind of introduction to the technical part; say a few things broadly about HMMs.
* Describe it in one page, two at most.
Chapter 2

Related work

* What has been done with HMMs on malware URLs.
* 2-3 examples, and the limitations or challenges you identified.
Chapter 3

Preliminary

* Here you can state the purpose of a URL/URI.
* Then say that even this can be exploited by malware authors:
* phishing – by closely imitating a clean URL;
* malware – by hiding the real URL, e.g. "www.blblabla.com/aaaaaaaaaaaaaaaaaa.doc .exe";
* malware – by generating domains (DGA);
* spam – helps keep e-mails smaller;
* many technologies support remote templates (see Office templates) – the original purpose was legitimate, but it can now be abused.

URL
URI, decomposition of a URL, photo
About fileless attacks
Spam
Chapter 4

Theoretical background

4.0.1 Markov Model

Before defining a Markov Model we must first say what a stochastic process is. A stochastic process is a family of random variables

(X(i))_{i ∈ I}

defined over a probability space, where each variable

X_i = X(i) : Ω → ℝ

represents a step of the process. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
A process satisfies the Markov property (sometimes characterized as "memorylessness") if one can make predictions for the future of the process based solely on its present state, just as well as one could knowing the process's full history, hence independently from such history; i.e., conditional on the present state of the system, its future and past states are independent.
A discrete-time Markov chain with a finite number of states is a stochastic process

(X_n)_{n ≥ 1},  where X_n : Ω → S = {x_1, x_2, ..., x_N}

having the Markov property, namely that the probability of moving to the next state depends only on the present state and not on the previous states:

P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_{n+1} = x | X_n = x_n)     (4.1)

and both sides of the equality are well defined, meaning they satisfy the inequality:

P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) > 0
A Markov chain is called stationary if the following equality holds:

P(X_{n+1} = x_j | X_n = x_i) = P(X_n = x_j | X_{n-1} = x_i),  ∀ n ≥ 2, x_i, x_j ∈ S

In the current paper we will only consider discrete (non-stationary) Markov chains having a finite number of states, as we will see later on.
An example of a Markov Model, or Markov chain, is the following state diagram:

This model can be described using the transition matrix of the system, in this case:

    | 0    2/3  1/3  0   |
    | 0    0    0    1   |
    | 0    0    0    1   |
    | 1/2  0    0    1/2 |

The transition matrix is a stochastic matrix because all of its elements are probabilities and the sum of every row is equal to 1.
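As an illustrative sketch (not part of the system described in this thesis), the chain above can be checked and simulated in a few lines of Python; the states are assumed to be numbered 0–3, indexing the rows of the matrix:

```python
import random

# Transition matrix of the example chain above; rows must each sum to 1.
P = [
    [0,   2/3, 1/3, 0  ],
    [0,   0,   0,   1  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   0,   1/2],
]

# Check the stochastic-matrix property: every entry is a probability
# and every row sums to 1.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)
assert all(0.0 <= p <= 1.0 for row in P for p in row)

def simulate(P, start, steps, rng=random.Random(0)):
    """Walk the chain for `steps` transitions, returning the visited states."""
    state, path = start, [start]
    for _ in range(steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append(state)
    return path

print(simulate(P, start=0, steps=10))
```

Every transition taken by the walk has nonzero probability in P, which is exactly what makes the output sequence a realization of the chain.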
The above model could also be called an observable Markov model, since the output of the process is a series of states, each representing an observable event at a given instant of time. As we will see later on, there exists another type of Markov model, with states that aren't observable, which is of interest to us, this being the core of this paper.
Applications

* applications of Markov models – see the Wikipedia article
4.0.2 Hidden Markov Model

A Hidden Markov Model, or HMM for short, is a statistical Markov model with the power to model a system that we assume to be a Markov process, but with states that are not observable (we will refer to these as hidden states).
The urn model

The best way to present and extend the ideas of the HMM is to consider the following scenario. You have N urns, each containing a large number of balls of M different colors. According to an unknown internal random process, a ball is randomly taken from one of the urns (also randomly chosen), and the color of the ball represents the observation. We record this event and then put the ball back. In the same manner a random urn is again chosen and the ball selection is repeated, obtaining a finite sequence of observations.

The HMM that models such a situation is represented in the following manner: the hidden states are the urns, the colors of the balls are the observations (the output of the system), and the choice of urns is governed by the transition matrix.
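This generative process can be sketched in Python with a toy instance of N = 2 urns and M = 3 colors; the parameters pi, A, B below are made-up illustrative numbers, not values from this thesis:

```python
import random

# Toy urn-model HMM: 2 urns (hidden states), 3 ball colors (observations).
pi = [0.6, 0.4]                      # initial urn choice
A  = [[0.7, 0.3], [0.4, 0.6]]        # urn-to-urn transition matrix
B  = [[0.5, 0.4, 0.1],               # color probabilities inside urn 0
      [0.1, 0.3, 0.6]]               # color probabilities inside urn 1
COLORS = ["red", "green", "blue"]

def generate(pi, A, B, T, rng=random.Random(1)):
    """Draw T observations: pick an urn, draw a ball, record its color."""
    states, obs = [], []
    s = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(s)
        obs.append(rng.choices(range(len(B[s])), weights=B[s])[0])
        s = rng.choices(range(len(A[s])), weights=A[s])[0]
    return states, [COLORS[o] for o in obs]

states, colors = generate(pi, A, B, T=5)
print(colors)   # the observable sequence of colors
print(states)   # the hidden urn sequence -- not visible to an observer
```

Only the color sequence is visible from the outside; the urn sequence is exactly the hidden part that the HMM machinery lets us reason about.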
4.0.3 Elements of an HMM

The above mental exercise gives us an idea of what an HMM is. We will now define the elements of an HMM and explain how the model generates observations.

We denote an HMM as

λ = (A, B, π)

with the following notations:

1) N – the number of hidden states in the model (very often a fixed number when used in practice)
2) M – the number of observation symbols (the alphabet)
3) T – the length of the observation sequence (discrete time)
4) π – 1 × N row vector, the initial state distribution
5) A – N × N matrix, the state transition matrix
6) B – N × M matrix, the observation probability matrix
7) O_i – observation, where i = 1..T

A visual representation of an HMM with its elements is given by the following trellis graph¹:
4.0.4 The 3 basic problems for HMMs

With the basic elements of the HMM presented, we are able to address real-world situations founded on the following 3 problems:

Problem 1: Given a model λ = (A, B, π) and an observation sequence O, we can determine the probability P(O|λ). That is, an observation sequence can be scored to see how well it fits a given model.

¹A trellis is a graph whose nodes are ordered into vertical slices (time), with each node at each time connected to at least one node at an earlier time and at least one node at a later time.
Problem 2: Given a model λ = (A, B, π) and an observation sequence O, we can determine an optimal state sequence for the Markov model.

Problem 3: Given an observation sequence O and the parameter N (the number of hidden states), we can determine a model that maximizes the probability of O. That is, we can train a model to best fit an observation sequence.
4.0.5 Solutions to the problems

Solution to problem 1

Given the model λ, we want to calculate the probability of the observation sequence O; that is, given O_1, O_2, ..., O_T, calculate P(O|λ).

Consider a sequence of hidden states q_i, with i = 1..T. For simplicity we will denote this sequence by Q. We can see that the probability P(O|λ) can be written as:

P(O|λ) = Σ_{all Q} P(O, Q|λ)

Let's see how we can make this equation simpler. The observation probability of the sequence, knowing that the initial state is q_1, will be:

P(O|Q, λ) = P(O_1|q_1, λ) · P(O_2|q_2, λ) · ... · P(O_T|q_T, λ) = Π_{i=1}^{T} P(O_i|q_i, λ)

If we assume that the observations are independent, then using the multiplication rule we have the following:

P(O|Q, λ) = b_{q_1}(O_1) · b_{q_2}(O_2) · ... · b_{q_T}(O_T)

Also, the probability of the hidden state sequence given the model is:

P(Q|λ) = π_{q_1} · a_{q_1 q_2} · a_{q_2 q_3} · ... · a_{q_{T-1} q_T}

It is clear why we introduced these equalities. Using the multiplication rule, the probability P(O|λ) can be written as:

P(O|λ) = Σ_{all Q} P(O|Q, λ) · P(Q|λ) = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(O_1) · a_{q_1 q_2} b_{q_2}(O_2) · ... · a_{q_{T-1} q_T} b_{q_T}(O_T)
Indeed, we arrive at a nicer form of the equation than the initial one, but at a closer look we can also see the downside. Knowing that the alphabet of hidden states has length N, that the observation sequence has length T, and that we iterate over all state sequences, we can see that the complexity of this algorithm for finding P(O|λ) is of order O(T·N^T). That is the drawback: the algorithm is exponential, and in real-world applications it is unfeasible unless very small values for N and T are used. If we want to model bigger systems we must introduce the core of the Hidden Markov Model.
The Forward-Backward Algorithm
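A minimal sketch of the forward pass, which computes P(O|λ) in O(N²·T) time by accumulating α_t(i) = P(O_1 ... O_t, q_t = i | λ). The toy parameters below are illustrative, and the exponential brute-force sum from the previous section is included only to check the result:

```python
import itertools

# Toy HMM parameters (N = 2 hidden states, M = 2 observation symbols);
# all numbers are illustrative, not trained values.
pi = [0.6, 0.4]                      # initial state distribution
A  = [[0.7, 0.3], [0.4, 0.6]]        # state transition matrix
B  = [[0.9, 0.1], [0.2, 0.8]]        # observation probability matrix

def forward(pi, A, B, O):
    """P(O | lambda) via the forward pass: O(N^2 * T) instead of O(T * N^T)."""
    N = len(pi)
    # alpha[i] = P(O_1 ... O_t, q_t = i | lambda)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(1, len(O)):
        alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][O[t]]
                 for i in range(N)]
    return sum(alpha)

def brute_force(pi, A, B, O):
    """The exponential sum over all N^T hidden sequences, for checking only."""
    total = 0.0
    for Q in itertools.product(range(len(pi)), repeat=len(O)):
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

O = [0, 1, 1, 0]
assert abs(forward(pi, A, B, O) - brute_force(pi, A, B, O)) < 1e-12
print(forward(pi, A, B, O))
```

The recursion folds the sum over all hidden paths into one N-dimensional vector per time step, which is exactly what makes the computation tractable for real-world N and T.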
4.0.6 N-grams

An n-gram is a contiguous sequence of n items from a given text. The type of the items can vary between phonemes, letters, words, and others, according to the application in which they are used. Depending on the size of n we meet the following names: unigram for a 1-gram, bigram or digram for a 2-gram, trigram for a 3-gram. In the following paper we are going to build our system using the extracted n-grams, where n ranges from 1 to 3. [Figures: n-gram tables.]
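Character n-gram extraction can be sketched as follows; the padding for words shorter than n (described later in the Normalization section) is included here as an assumption, with '\0' standing in for the NULL symbol:

```python
def ngrams(text, n, pad="\0"):
    """Extract the contiguous n-grams of `text`.

    If the text is shorter than n, it is padded with a NULL symbol
    (here '\\0') up to length n, so at least one n-gram is produced.
    """
    if len(text) < n:
        text = text + pad * (n - len(text))
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("abc", 2))   # ['ab', 'bc']
print(ngrams("a", 3))     # ['a\x00\x00']
```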
Chapter 5
Proposed solution
* Dataset * Methodology * Step-by-step * Results
Blacklist
5.1 An HMM approach for URLs

We will try to solve the problem of malicious URL detection using an HMM approach. We will start by constructing an n-gram model over a hidden Markov model.

An n-gram model is a probabilistic model used in language processing for predicting the next item in a sequence, describable by a Markov model of order (n − 1). The benefits of n-gram models are their simplicity and scalability.
5.1.1 Building the model

The first step in constructing the model is to identify how the system would look for the task we are interested in: the identification of clean and malware URLs.

As we showed in the theoretical background, an HMM can be described as λ = (A, B, π), where:

A – N × N matrix, the state transition matrix, where N is the number of hidden states
B – N × M matrix, the observation probability matrix, where M is the number of observation symbols
π – 1 × N row vector, the initial state distribution

The first step we have to make is to identify what these parameters will be in our case. First we have to identify the numbers N and M, which need to be fixed for our system to work. For the beginning we will define the number of hidden states as N = 2. There are a number of reasons why we want to start from here. Having the number of hidden states equal to 2 will let us clearly categorize a URL into one category or the other; in our case these states will be called Clean and Malware. It is clear why we made this choice: both hidden states refer to complex processes, with very many parameters to consider, which we want to model in a simpler way. So for now we will keep N = 2.
Before establishing what M is, we have to answer the following question: in the case of URLs, what will the observations be? At this point we will test two kinds of observations, out of the many that exist, on which the systems will be built. Because our input consists of raw URLs, we will first normalize them, then obtain from this normalized form the n-grams (substrings of length n).
Normalization

To build the input for the system we proceed as follows. Suppose we have the following URL: "http://www.abc.co.uk/a". From this input we remove all punctuation marks [+-*)(]* and split on each mark, obtaining a series of substrings. For the given example we will have the following sequence of substrings: ['http', 'www', 'abc', 'co', 'uk', 'a']. From this list we eliminate substrings such as 'www', 'http', 'https', 'ftp', since they will not help much in establishing the verdict. So now we have the following list of items, ['abc', 'co', 'uk', 'a'], from which, with the help of n-grams, we obtain the series of observations with which we can train the system. For the present paper we will consider 2 cases for the observations, which will guide us when training the systems. The first system will be trained over the following alphabet, [a-z0-9NULL], where the NULL symbol is used when the length of the word from which we want to extract the n-grams is smaller than n. More explicitly, the series of observations extracted using this alphabet will be:

for 2-grams: [ab, bc, co, uk, a∅]
for 3-grams: [abc, co∅, uk∅, a∅∅], where ∅ denotes NULL

We will also test the alphabet [a-zA-Z0-9]* in order to test a hypothesis: the possibility that, through the newly added symbols, we manage to better describe the hidden states.

For each of the 2 proposed systems we will use these 2 alphabets, but we will train them differently.
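The normalization procedure above can be sketched as follows; the regular expression is an assumption standing in for "all punctuation marks", and the stop-token list mirrors the description above:

```python
import re

# Tokens dropped because they carry little signal for the verdict.
STOPWORDS = {"http", "https", "www", "ftp"}

def normalize(url):
    """Split a raw URL on non-alphanumeric characters and drop
    scheme/prefix tokens, per the procedure described above."""
    tokens = [t for t in re.split(r"[^0-9a-zA-Z]+", url) if t]
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

print(normalize("http://www.abc.co.uk/a"))   # ['abc', 'co', 'uk', 'a']
```

Feeding each resulting token through the n-gram extractor then yields the observation series used to train the system.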
Figure 10: Malicious URL Attack and Subroutine Defense flow
First system

The first system will be a custom HMM.
5.1.2 Feature extraction

The research has been done on a database composed of 1 million URLs, which have been provided by the Bitdefender Cyber Threat Intelligence Lab.

Through this process, the database has been populated with the URLs and the corresponding downloaded files, which have been further used for the training process. Needless to say, the effectiveness of the machine learning approach depends on the quality of the data representation, which is domain-specific.
We first extract only the lexical features of the URL. The initial assumption is that the properties of the URL as a string can represent a clue as to whether it is malicious or benign. The first category of lexical features which we extract is represented by the statistical ones, such as the length of the URL and the length of each of the components of the URL (hostname, top-level domain, primary domain, etc.).

Therefore, the URL is split by the special character "/". [notes: feature hashing, also known as the hashing trick; F2 score]
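A hypothetical sketch of such a lexical feature extractor, combining the statistical length features with the hashing trick mentioned above; the helper name lexical_features, the bucket count, and the crc32 hash are illustrative choices, not the thesis's implementation:

```python
import zlib
from urllib.parse import urlparse

def lexical_features(url, n_buckets=1024):
    """Statistical lexical features plus hashed token features
    (the 'hashing trick'). The bucket count is illustrative."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    feats = {
        "url_len":  len(url),
        "host_len": len(host),
        "path_len": len(parsed.path),
        "tld_len":  len(host.rsplit(".", 1)[-1]) if "." in host else 0,
    }
    # Hashing trick: map each path token into a fixed number of buckets,
    # keeping the feature vector size bounded regardless of vocabulary.
    for tok in filter(None, parsed.path.split("/")):
        feats[f"h{zlib.crc32(tok.encode()) % n_buckets}"] = 1
    return feats

print(lexical_features("http://www.abc.co.uk/a.exe"))
```

The fixed number of buckets keeps memory bounded at the cost of occasional hash collisions, which is the usual trade-off accepted with feature hashing.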
5.2 Results
Conclusions
Lorem ipsum
Bibliography

1. Lawrence R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, 1989
2. Thomas H. Austin, Mark Stamp, Hidden Markov models for malware classification, 2017