University Alexandru Ioan Cuza of Iasi, Romania
Faculty of Computer Science
Improving Detection Of Malicious
Programs
Author:
Mihai-Razvan Benchea
Supervisor:
Prof. Dr. Henri Luchian
Thesis committee:
Prof. PhD Dan Simovici, University of Massachusetts, United States of America
Prof. PhD Denis Enachescu, University of Bucharest, Romania
Prof. PhD Daniela Zahariei, West University of Timisoara, Romania
A thesis submitted in fulfillment of the requirements
for the degree of Philosophiae Doctor (PhD)
in the
Faculty of Computer Science
University Alexandru Ioan Cuza of Iasi, Romania
December 2015
Abstract
The thesis presents different methods that can be used in practice in order to increase
the detection rate of an antivirus product. The two most common types of malware
are analyzed: executable files and document files. A study of the privacy threats
brought by applications for mobile devices, as well as the necessary steps to create a
framework capable of analyzing and detecting them, is also provided.
To increase the detection rate for executable files, the following feature extraction
techniques are used:
Genetic Programming: New features are created from the original ones by evolv-
ing a set of boolean equations using the boolean operators not, and, or.
Neural Network: A Restricted Boltzmann Machine is used in order to find non-
linear relations between features.
The resulting features are used to train a linear classifier called One Side Class Percep-
tron (OSCP). The results are compared to the mapped version of the OSCP algorithm.
For the malicious documents, two types of clustering are tested in order to see which
one is better suited for practical use: hierarchical Bottom-Up clustering and Hash based
clustering. Classification of document based malware is tested using a combination of
a binary decision tree and the OSCP algorithm, as well as a Hidden Markov Model.
Acknowledgements
I want to express my gratitude to my advisor, Professor Henri Luchian, for the great
advice and ideas he provided to me. His experience helped me to look at my work from
a different perspective, while his constructive criticism helped me to realize and repair
my mistakes. I would also like to thank him for his patience and for allowing and
encouraging me to follow my research interests.
A special gratitude goes to my colleague, Dragos Gavrilut, who responded each time
I requested his help. His practical experience in writing algorithms helped me to
overcome any difficulties that I had in testing my ideas.
I would also like to thank the members of the advisory committee and the members
of Professor Luchian's research group for their insightful advice and for showing me
different approaches to my problems.
I am indebted to Bitdefender's management, who offered me the financial support, the
infrastructure, as well as the data necessary to complete my thesis. My thanks also go
to my colleagues from Bitdefender with whom I worked in developing the algorithms
presented in this thesis.
Last, but not least, I want to thank my family and friends for the support and
understanding they offered during my PhD years.
Contents
Abstract i
Acknowledgements ii
Contents iii
List of Figures vi
List of Tables vii
List of Papers ix
1 Malware Detection Problems 1
1.1 Malware Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Mechanisms For Evading Detection . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Defensive Mechanisms. Packers and protectors . . . . . . . . . . 3
1.2.2 Offensive mechanisms . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Non-executable malware files . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Attack Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Vulnerability Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Malware detection requirements . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Security and privacy concerns . . . . . . . . . . . . . . . . . . . . . . . 21
1.7.1 Privacy concerns regarding mobile devices . . . . . . . . . . . . 22
2 State Of The Art. Related Work 26
2.1 Malware Detection using Machine Learning . . . . . . . . . . . . . . . . 26
2.1.1 Malware detection using machine learning . . . . . . . . . . . . 27
2.1.2 Practical Optimizations for Perceptron Algorithms in Large Mal-
ware Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.3 Building a practical and reliable classifier for malware detection 29
2.2 Feature Extraction using Genetic Programming . . . . . . . . . . . . . 31
2.2.1 A Hybrid Approach to Feature Selection and Generation Using
an Evolutionary Algorithm . . . . . . . . . . . . . . . . . . . . . 31
2.2.2 Automatic Feature Generation for Machine Learning Based Op-
timizing Compilation . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.3 Evolving Novel Image Features Using Genetic Programming-based
Image Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.4 Feature Extraction Using Genetic Algorithms . . . . . . . . . . 34
2.2.5 Feature Generation Using Genetic Programming With Applica-
tion to Fault Classification . . . . . . . . . . . . . . . . . . . . . 35
2.2.6 The Evolutionary Pre-Processor Automatic Feature Extraction
for Supervised Classification using Genetic Programming . . . . 37
2.3 Improving Malware Detection through Feature Extraction . . . . . . . 38
2.3.1 Data Mining Methods for Detection of New Malicious Executables 38
2.3.2 The proactivity of perceptron derived algorithms in malware de-
tection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.3 Learning to Detect and Classify Malicious Executables in the Wild 41
2.3.4 Comparison of feature selection and classification algorithms in
identifying malicious executables . . . . . . . . . . . . . . . . . 42
2.3.5 Data mining methods for malware detection . . . . . . . . . . . 45
2.3.6 Detecting unknown malicious code by applying classification tech-
niques on opcodes representation . . . . . . . . . . . . . . . . . 46
2.3.7 A feature extraction method and recognition algorithm for de-
tection unknown worm and variations based on static features . 48
2.3.8 An interpretable string based malware detection system using
svm ensemble with bagging . . . . . . . . . . . . . . . . . . . . 49
2.3.9 Efficient virus detection using dynamic instruction sequences . . 50
2.3.10 A static malware detection system using data mining methods . 52
2.3.11 Large-scale malware classication using random projections and
neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.4 Detection of non-executable les . . . . . . . . . . . . . . . . . . . . . . 55
2.4.1 Caffeine Monkey: Automated Collection, Detection and Analysis
of Malicious JavaScript . . . . . . . . . . . . . . . . . . . . . . . 55
2.4.2 The Rise of PDF Malware . . . . . . . . . . . . . . . . . . . . . 56
2.4.3 Detection and Analysis of Drive-by Download Attacks and Ma-
licious JavaScript Code . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.4 Prophiler: a Fast Filter for the Large-Scale Detection of Mali-
cious Web Pages . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.4.5 Spyproxy: Execution-based Detection of Malicious Web Content 62
2.4.6 Zozzle: Low-overhead Mostly Static JavaScript Malware Detection 63
2.4.7 Pattern Mining for Future Attacks . . . . . . . . . . . . . . . . 64
2.4.8 Fast Plagiarism Detection System . . . . . . . . . . . . . . . . . 66
2.4.9 JPlag: Finding plagiarisms among a set of programs. . . . . . . 67
3 Improving Malware Detection By Feature Extraction 68
3.0.1 Feature Creation Using Mapping . . . . . . . . . . . . . . . . . 71
3.1 Feature Creation Using Genetic Programming . . . . . . . . . . . . . . 73
3.1.1 Using the boolean and operator for crossover . . . . . . . . . . . 77
3.1.2 Using all boolean operators for crossover . . . . . . . . . . . . . 87
3.1.3 Testing a better fitness function . . . . . . . . . . . . . . . . . . 93
3.2 Feature Creation Using Restricted Boltzmann Machine . . . . . . . . . 98
3.2.1 Training Ensemble of Linear Classifier and Restricted Boltzmann
Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.2.2 Testing the Ensemble and Results . . . . . . . . . . . . . . . . . 105
4 Clustering and Classification of Script based Malware 109
4.1 Clustering Document Based Malware . . . . . . . . . . . . . . . . . . . 116
4.1.1 Hierarchical Bottom-up Clustering . . . . . . . . . . . . . . . . 117
4.1.2 Hash Based Clustering . . . . . . . . . . . . . . . . . . . . . . . 127
4.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.1.4 Detecting document based malware . . . . . . . . . . . . . . . . 135
4.2 Classification of Script Based Malware . . . . . . . . . . . . . . . . . . 140
4.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.2 Using the One Side Class Perceptron algorithm . . . . . . . . . 142
4.2.3 Using the Hidden Markov Model . . . . . . . . . . . . . . . . . 145
4.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5 Mobile Privacy Issues 148
5.1 Application Store Protection and Malware . . . . . . . . . . . . . . . . 148
5.1.1 AppStore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.1.2 Google Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.2 Security and Privacy Risks . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.2.1 Privacy risks Of Third Party SDKs . . . . . . . . . . . . . . . . 153
5.2.2 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.2.3 Remote Code Execution via Android SDKs . . . . . . . . . . . 160
5.2.4 Methods for Privacy Risks Detection . . . . . . . . . . . . . . . 164
6 Conclusion and Future Work 167
Bibliography 169
List of Figures
1.1 Malware Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Growth of market share of Mobile Phones OS . . . . . . . . . . . . . . 23
1.3 Growth of application stores . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 OSC training process. Training step (a). Class minimize step (b). . . . 71
3.2 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4 Crossover operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.5 Selection Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Number of original features used on each generation . . . . . . . . . . . 82
3.7 Crossover operation including original features . . . . . . . . . . . . . . 83
3.8 Number of original features used on each generation . . . . . . . . . . . 83
3.9 Selection Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.10 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.11 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.12 Chromosome example evaluation . . . . . . . . . . . . . . . . . . . . . 89
3.13 Evaluation tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.14 Initial Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.15 Evolution of the detection rate and false positives rate . . . . . . . . . 92
3.16 Evolution of the fitness score of each generation . . . . . . . . . . . . . 92
3.17 Different features being activated on the same subset . . . . . . . . . . 94
3.18 Evolution of the detection rate . . . . . . . . . . . . . . . . . . . . . . . 97
3.19 Evolution of the number of features activated on each record . . . . . . 97
3.20 Restricted Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . 98
3.21 Restricted Boltzmann machine and Linear Classifier . . . . . . . . . . 100
4.1 Distribution of the Manhattan distance for all pairs of files . . . . . . . 122
4.2 Distribution of the Manhattan distance for all pairs of files . . . . . . . 126
4.3 Steps for building the classier . . . . . . . . . . . . . . . . . . . . . . 140
4.4 Sequence Classication . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.1 Network traffic of CricketLand . . . . . . . . . . . . . . . . . . . . . . . 156
5.2 Widdit Update Architecture . . . . . . . . . . . . . . . . . . . . . . . . 162
List of Tables
3.1 The results of the 3-fold cross-validation for OSCP and OSCP-MAP
algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2 The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score . . . . . . . . . . . . 76
3.3 The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score plus the original 300
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4 Results of using chromosomes resulting from different generations . . . 82
3.5 Results on different iterations for arbitrary length features . . . . . . . 84
3.6 An example of the bloating problem . . . . . . . . . . . . . . . . . . . . 85
3.7 Results on different iterations for arbitrary length features . . . . . . . 85
3.8 Comparison of the results when using features resulting from mapping
versus genetic programming . . . . . . . . . . . . . . . . . . . . . . . . 86
3.9 Results using different numbers of features . . . . . . . . . . . . . . . . 86
3.10 The results of the 3-fold cross-validation for OSCP, OSCP-MAP, OSCP-
RBM algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.1 Token classes and their value . . . . . . . . . . . . . . . . . . . . . . . . 119
4.2 Token classes and their value . . . . . . . . . . . . . . . . . . . . . . . . 120
4.3 List of tokens after performing simple string operations . . . . . . . . . 120
4.4 Token classes and frequencies . . . . . . . . . . . . . . . . . . . . . . . 121
4.5 Cluster components for Manhattan distance algorithm . . . . . . . . . 123
4.6 Cluster obtained using custom metric distance . . . . . . . . . . . . . 126
4.7 Cluster components for hash-based clustering method . . . . . . . . . . 130
4.8 Nr. of PDF files containing JavaScript code . . . . . . . . . . . . . . . 131
4.9 Time comparison between Hash Based Clustering and Hierarchical Bottom-
up Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.10 Results for different values of the normalization factor . . . . . . . . . 133
4.11 Similarity of files from the top 5 biggest clusters after manual analysis . 133
4.12 Clustering results obtained with and without precomputation . . . . . 134
4.13 Mean detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.14 Top 20 token classes according to score S1. . . . . . . . . . . . . . . . 138
4.15 Top 20 token classes according to score S2. . . . . . . . . . . . . . . . 139
4.16 3-fold cross validation results for OSCP, OSCP-frequency BT-OSCP-xxx
and HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.1 Actions Performed by TOP5 3rd Party SDKs . . . . . . . . . . . . . . 155
5.2 Permissions required to integrate Widdit SDK . . . . . . . . . . . . . . 163
List of Papers
The content of this thesis is based on the work presented in the following papers:
1. "Feature Creation using Genetic Algorithms for Zero False Positive Malware Clas-
sification", Razvan Benchea, Dragos Gavrilut, Henri Luchian, EVOLVE 2015,
2015
2. "Feature Creation Using Genetic Programming with Application in Malware De-
tection", Cristina Vatamanu, Razvan Benchea , Dragos Gavrilut, Henri Luchian,
International Symposium on Symbolic and Numeric Algorithms for Scientific
Computing (SYNASC), 2015
3. "Combining Restricted Boltzmann Machine and One Side Perceptron for Malware
Detection", Razvan Benchea , Gavrilut Dragos, the International Conference on
Conceptual Structures (ICCS), 2014
4. "A Practical Guide for Detecting the Java Script-Based Malware Using Hidden
Markov Models and Linear Classifiers", Doina Cosovan, Razvan Benchea, Dragos
Gavrilut, International Symposium on Symbolic and Numeric Algorithms for
Scientific Computing (SYNASC), 2014
5. "Building a Practical And Reliable Classifier For Malware Detection", Cristina
Vatamanu, Dragos Gavrilut, Razvan Benchea, Journal in Computer Virology,
2013
6. "A Practical Approach on Clustering Malicious PDF documents", Cristina Vata-
manu, Dragos Gavrilut, Razvan Benchea , Journal in Computer Virology, 2013
7. "Practical Optimizations for Perceptron Algorithms in Large Malware Dataset",
Dragos Gavrilut, Razvan Benchea , Cristina Vatamanu, International Sympo-
sium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC),
2012
8. "Optimized Zero False Positives Perceptron Training for Malware Detection",
Dragos Gavrilut, Razvan Benchea , Cristina Vatamanu, the International Sym-
posium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC),
2012
9. "Performance Testing Framework: Evaluating The Impact On The System Speed",
Paul Bot, Cristina Vatamanu, Dragos Gavrilut, Razvan Benchea , the Work-
shop on Antimalware Testing Research (WATER), 2014
10. "Hiding The Network Behind The Network. Botnet Proxy Business Model",
Cristina Vatamanu, Alexandru Maximciuc, Razvan Benchea , Virus Bulletin
(vb2014), 2014
11. "Google and Apple Markets: Are Their Applications Really Secure?!", Vlad Bor-
dianu, Razvan Benchea , Dragos Gavrilut, Virus Bulletin (vb2013), 2013
12. "In-depth Analysis of PushDO Botnet", 16th Association of Anti-Virus Asia Re-
searchers International Conference 2013 (AVAR), Alexandru Maximciuc, Cristina
Vatamanu, Doina Cosovan, Dragos Gavrilut, Octavian Mihai Minea, Razvan
Benchea , 2013
Chapter 1
Malware Detection Problems
1.1 Malware Evolution
With the rise of Internet access, the malware environment began to change. Before
that, most malware samples were spread by removable media such as floppy disks
or CD-ROMs, which of course limited their reach to a few computers. Access to
information was also very limited compared to the present day. This may be the reason
why many types of malware were just proofs of concept or bad jokes. However, access
to the Internet gave malware creators access to many computers. Starting with the year
2007, malware began to increase exponentially: more malware was produced in 2007
than in all previous years put together. The sudden growth was mainly influenced
by the idea that malware could be used to earn money. Almost all possible ways of
earning money through malicious code were exploited: from the most obvious ones, like
stealing access to bank accounts or asking for a ransom to restore access to encrypted
files, to less obvious ones, like simulating user behavior on certain websites in order to
increase traffic and, with it, the income of those websites. According to an article
published in 2003 [1], the annual cost of cyber activity in the United States of America
was between 24 billion and 120 billion dollars. This year, more than 300,000,000 samples
were discovered, 100,000,000 more than in the previous year. Figure 1.1, taken
from the independent anti-virus testing company AV-Test [2], best describes how the
number of malware files has changed from year to year.
Figure 1.1: Malware Evolution
Trying to detect these samples has created an arms race between antivirus products and
malware. As antivirus products and operating systems become more and more complex
and include more technology in order to protect computers, malware adapts and
brings new methods for evading detection. These methods will be described in the
following sections.
1.2 Mechanisms For Evading Detection
In the beginning, almost all anti-viruses were signature based. They worked by
computing a hash function over certain parts of the scanned file and then comparing it
to a list of hashes that was built in the laboratory after analyzing malware samples.
The method worked well since most malware files were static (they never changed
from one infection to another) and there were relatively few of them. There was no
need for heuristics, since malware analysts could gather all the files and analyze them
manually.
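A minimal sketch of this signature scheme, with a hypothetical database and MD5 standing in for whatever hash function a given product actually used:

```python
import hashlib

# Hypothetical signature database: region hashes of known samples.
# The single entry below is the MD5 of an empty region, for illustration only.
SIGNATURES = {"d41d8cd98f00b204e9800998ecf8427e"}

def region_hash(data: bytes, offset: int = 0, size: int = 64) -> str:
    """Hash a fixed region of the scanned file, as early signature scanners did."""
    return hashlib.md5(data[offset:offset + size]).hexdigest()

def is_known_malware(data: bytes) -> bool:
    return region_hash(data) in SIGNATURES
```

Any byte change inside the hashed region produces a different digest, which is exactly what the evasion techniques of the next section exploit.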
When malware creators realized that the amount of money they can earn depends on
how long their malware remains undetected, several methods of protecting the virus
were developed.
1.2.1 Defensive Mechanisms. Packers and protectors
One of the simplest methods a malware creator can use to avoid detection of his
program is a packer. The packer is a program that, given an executable, will
generate another one that has a smaller size than the original but does the same thing.
It works by compressing the original code and inserting a stub that will decompress it in
memory when the program is executed. There are several advantages from the malware
creator's perspective when using a packer. First, the file on disk is smaller, thus it can be
transported more easily. Secondly, the resulting file will look completely different from
the original, thus a signature based anti-virus will not be able to detect it. Packers
are not malicious by themselves. They are also used by many applications, such as
installers, in order to compress their data. This is why detecting the packer can lead
to many false alarms.
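The compress-and-stub idea can be sketched in a few lines (Python and zlib stand in for native code and a packer's real algorithm; this is an illustration, not any particular packer):

```python
import zlib

def pack(code: bytes) -> bytes:
    """'Pack' a payload: the bytes written to disk no longer resemble the original."""
    return zlib.compress(code, level=9)

def run_packed(packed: bytes) -> bytes:
    """What the inserted stub does at run time: decompress the code in memory."""
    return zlib.decompress(packed)

original = b"print('payload')" * 50   # repetitive 'code' compresses well
packed = pack(original)

assert run_packed(packed) == original  # same behavior once unpacked
assert len(packed) < len(original)     # smaller file on disk
assert original not in packed          # a byte signature of the original fails
```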
Another frequently used method is to protect the code from people who want to
reverse engineer it (such as security researchers working for an antivirus company).
The protector is a piece of code that gets executed before the protected code and,
depending on its complexity, provides different methods to revoke access to code and
data unless certain conditions are satisfied. The main idea is to make analysis of the
code very difficult, but allow the malicious code to run on a system where no analysis
is being performed. Protectors are also used by programmers in order to prevent
reverse engineering of their code. They usually work by encrypting the code and data
and decrypting them on execution only if certain conditions are met. These conditions
can be of the following types:
Debugger is not present: if a debugger is attached to the current process, the
process will force itself to crash, exit, or behave abnormally.
Debugging tools are present: if tools that are usually used to analyze the behavior
of a process are running on the system, such as network monitoring tools or
programs that show process components, the process will force itself to crash,
exit, or behave abnormally.
Virtual machine detection: most packers refuse to run inside a virtual machine. This is
mainly because virtual machines are used by malware analysts to study a sample's
behavior. There are different methods to detect if a process is running in a virtual
machine, from simple ones, like detecting types of hardware that are specific
to a certain vendor, to more complex ones, like detecting whether certain
instructions are emulated while others are executed.
Code tampering detection: the method works by calculating checksums over
parts of the code and data to see if they were modified during the execution of
the sample. Tampering usually happens when the sample is being analyzed
in a debugger.
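The last condition reduces to a self-checksum. A sketch (CRC32 over a stand-in byte region; real protectors use their own routines over the mapped code):

```python
import zlib

code_region = bytearray(b"\x90" * 16 + b"\xc3")   # stand-in for protected code bytes
EXPECTED = zlib.crc32(bytes(code_region))         # recorded when protection is applied

def tampered() -> bool:
    """Recompute the checksum over the code region during execution."""
    return zlib.crc32(bytes(code_region)) != EXPECTED

assert not tampered()     # untouched code passes the check
code_region[3] = 0xCC     # a debugger writes an int3 breakpoint into the code
assert tampered()         # the protector notices and can refuse to continue
```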
Besides these methods of validation, protectors usually include different techniques that
make analysis very difficult:
Anti-dumping: the dumping method is usually applied in order to get access to
code that is protected by a packer. Since the code that needs to run has to be
unpacked in memory, a simple way to get access to that code is to dump the
memory after its execution. The anti-dumping method works by overwriting in
memory pieces of code that were executed and are no longer needed.
Virtualizing: the method works by transforming pieces of the original code into
another language, understood only by the packer. The packer then executes the
code using a virtual machine. In order to analyze a sample that is virtualized,
one must understand how the virtual machine works and what each newly created
byte code represents.
Obfuscators: their main purpose is to make analysis and reverse engineering
very difficult. They work by rewriting the original code with instructions that
do the same thing but in a form that is very difficult to read. Examples of methods
used in obfuscation are: random variable names, garbage code, calls to functions
that do not change the behavior of a program, and encryption of strings and function
names.
Polymorphism is usually related to file infectors. Its main purpose is to make
detection based on hashes inefficient. The method usually relies on an encryp-
tion algorithm that takes a key as an argument. Every time a new infection is
generated, it will use a different key, resulting in a file that looks different
but does the same thing.
Server side polymorphism works in the same way as polymorphism, but instead of
the file being generated on the infected computer, it is generated on a server. This
brings several advantages to the attacker, the most important being that he has
total control of the algorithm that generates the file. He can adapt it and change
it at any time. This method is very popular among malware creators because
they can create a new sample, scan it with every known antivirus product, and
send the file into the wild only if it is not detected.
Metamorphism represents the method through which a malware can change itself
without affecting its behavior. This is more complex than polymorphism since,
in polymorphism, the only difference between samples is due to a different encryption
key. In metamorphism the sample is usually disassembled and an instruction or a
group of instructions is replaced with a new instruction or group of instructions
that does the same thing but in a different way. For example, a multiplication
might be replaced with several addition operations.
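The last two techniques can be illustrated together (a one-byte XOR cipher stands in for the polymorphic encryption, and a multiplication-to-addition rewrite for the metamorphic substitution; both are deliberate simplifications):

```python
import os

PAYLOAD = b"invariant malicious behavior"   # stand-in for the body of the infector

def polymorphic_copy() -> tuple[bytes, bytes]:
    """Each new infection picks a fresh random key, so the stored bytes differ."""
    key = os.urandom(1)
    return key, bytes(b ^ key[0] for b in PAYLOAD)

def decrypt(key: bytes, encrypted: bytes) -> bytes:
    return bytes(b ^ key[0] for b in encrypted)

k1, c1 = polymorphic_copy()
k2, c2 = polymorphic_copy()
assert decrypt(k1, c1) == decrypt(k2, c2) == PAYLOAD  # same behavior either way

# Metamorphism: replace an instruction with an equivalent sequence.
def mul(a: int, b: int) -> int:
    return a * b

def mul_metamorphic(a: int, b: int) -> int:
    """Multiplication rewritten as repeated addition: different code, same result."""
    total = 0
    for _ in range(b):
        total += a
    return total

assert mul(7, 6) == mul_metamorphic(7, 6) == 42
```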
1.2.2 Offensive mechanisms
Remaining undetected can also be achieved by tampering with the anti-virus product
or making changes to the operating system. Even though in some cases the operating
system warns the user of changes made to its configuration, many users
don't read the message or just accept it anyway. There are cases when a malware tries
an action so often that the notification message is displayed in an aggressive manner.
To stop the system from showing it, many users simply deactivate the operating system's
protections without looking for the cause of the notifications. The following methods of
offensive protection were encountered when analyzing malware families:
Avoid running when an antivirus is installed: this is the simplest and most
used method of avoiding detection. The malware simply lists the processes and
searches for their names in a list of file names used by antivirus products. If it
finds a match, it will simply uninstall itself. This way, it will take longer until the
malware analysts get access to the malware file, and users who don't run an
antivirus product will stay infected longer.
Terminating the antivirus product: this is harder for the malware to do, since
most anti-malware products have a driver that protects their files from inside
the kernel. However, if the virus manages to get into the kernel (usually
by using an exploit or by installing a digitally signed driver), then this is possible.
Disable updates: usually, the Windows updates are the ones that are disabled.
This way, if the user has the default antivirus product, no updates will be de-
livered. There were also some malware samples that deactivated the update
mechanisms by simulating mouse clicks on the anti-malware product's interface.
Disabling or fooling the firewall: since most malware files try to connect to a command
and control center in order to receive updates or commands, generating network
traffic is something that cannot be avoided. There are several ways to trick the
firewall in order to remain undetected. An obvious one is to simply deactivate it.
Another one is to perform the actions by injecting into a well-known application
that generates network traffic (like a browser). However, when an antivirus is
installed and network traffic is scanned, many malware files simply encrypt their
traffic and make it look genuine. For example, one banker family transfers its
updates as an image file.
Change DNS settings: by changing the domain name server or by editing the hosts
file (a file that maps domain names to the IP addresses they resolve to), a
malware file can block the antivirus product from contacting its update server.
Change permissions: an interesting trick used by one of the most well-known worms,
Conficker, to avoid being scanned by the anti-virus product was to change its
permissions. After being installed, the malware would set the permissions of its
files to allow only execution (no reading or writing). This way, when a scan was
performed, the antivirus would skip the files since it had no access.
Install a rootkit: this is the most efficient way to avoid detection. By installing
a driver, a malware can hook functions used for listing files, registry keys or
processes. Each time one of these functions should return information about a
component of the malware, it would just skip it, the component thus appearing
hidden to the user or even the anti-virus product. Current versions of the Windows
operating system require any installed driver to be digitally signed. This has
significantly reduced the number of rootkits. Most of the malware that still succeeds
in installing a driver does so through the use of a stolen certificate.
Not writing the file using the file system: to achieve this, one method used by
malware is to keep the important malware component in the registry and have a
component on disk that reads a buffer from the registry in order to execute it.
The file on disk will appear benign to analysis, since all the malware components
are in the registry.
The list of methods described above is by no means complete. In recent years there
have been reports of malware that remained hidden for long periods of time. New
methods are being developed as anti-viruses get more complex. For example, this year
malware was discovered that overwrote the firmware of a hard drive in order to make it
look smaller. On the remaining disk storage (not used by the hard drive) the malware
would store its components. To get access to the kernel, the virus exploited a
vulnerability in an already installed driver.
1.3 Non-executable malware files
Most users know that they should not open an executable file received through
e-mail because it may contain a virus. However, not many realize that malware
comes in many forms. The reason behind this is that many commonly used programs
accept as input files that can contain code. Even though the use of code is mainly
meant to increase functionality, it also allows attackers to control the behavior of the
program. All of these programs come with protection mechanisms in order to avoid
running malicious code, but sometimes, through the use of a vulnerability, these
mechanisms fail.
The most used file formats that carry malware are listed below:
.SCR: the .scr files are associated with screen savers. They are usually run when
there is no user interaction for a defined period of time. A thing that most
users don't know is that .scr files are actually executable files. They have the
same format (MZ-PE format) as any executable file on the system. Thus, malware
can come disguised as a screen saver.
.DLL: the dynamic link library file contains a set of functions that are exported
so that other programs can use them. Almost every medium or large program
uses these files in order to separate the functionality of the program into different
parts. Their format is the same as any executable (MZ-PE), which means that
they are sometimes associated with malware. A flag in the header of the DLL
file forbids running them directly; however, they can be run through other means.
For example, an executable from the Windows operating system (rundll32.exe)
loads any dll given as an argument. They can also be loaded by Internet Explorer or
Windows Explorer if they have a certain format (Browser Helper Object format).
.BAT: the batch file is a script file that contains a series of commands to be
executed by the command line interpreter in Windows. Even though malware
could be written in this language, such cases are rare. However,
many malware families rely on these files to perform secondary actions. One of
the most common cases is where a malware process attempts to delete its files after
it finishes executing. Since it cannot delete its files while it is running, it needs
an extra process to do this when it finishes. For this reason, a batch file is
created that attempts to delete the malicious file in a loop.
.VBScript: Visual Basic Script files are also used by malware creators. They are
usually preferred by hackers who have just learned how to write malware, thus Visual
Basic Script viruses usually have a simple structure and a straightforward behavior.
The reason why hackers choose the Visual Basic Script language is that it
is easy to write simple code and, most importantly, the Windows operating system
comes with a Visual Basic Script interpreter already installed (wscript.exe). Most
malware written in this language consists of simple worms that copy themselves
to startup folders, downloaders that are used to get more complex malware
from the Internet, or simple trojans (for example, files that delete all documents
from a computer). Even though this is rare, there have even been file infectors written
in Visual Basic Script; these would infect other Visual Basic Script files.
JavaScript code in HTML files: This has been the main method of infection in
recent years. There are usually two main purposes for the JavaScript
code when it comes to its use in malware behavior.
– In the first case it is used to exploit a vulnerability in the browser. Even
if the vulnerability is caused by how the browser handles an HTML tag, or by
some function of a plugin found in the browser, control is usually taken
using JavaScript through a technique called heap spray. This
works by allocating large regions of memory into which malicious binary code
is copied. The purpose is to increase the probability that execution
reaches a memory region controlled by the malware in case of an
invalid memory access, which is usually the result of an exploit.
– The second case is validation of the victim (checking whether
the victim is running a certain operating system or browser in order to see
if it is vulnerable), redirection (either to a malicious web page or to a well-known
benign web page in case the user is not vulnerable), or drive-by
downloads. Drive-by downloads have become more and more frequent in
recent years. They work by automatically downloading a file to the user's
computer when the victim visits a web page. Most often, the downloaded
file is an adware or spyware program, but downloaders and trojans have
also been installed using this method.
.PDF: Malware embedded in PDF files started to be used in 2007. Malware
found in these files usually targets vulnerabilities in the Adobe Acrobat Reader
software. Since this program is found on the majority of computers that run
the Windows operating system, a remote code execution vulnerability in it could
mean a way of infecting millions of users. The majority of malware that comes
in PDF files makes use of the feature that allows Adobe Acrobat Reader to
run JavaScript code. Another feature of PDF files is the ability to embed
binary files inside them through the use of streams. Upon successful exploitation of the
Adobe Reader program, malicious files found in binary streams can be dropped
to disk and executed. The PDF file format was chosen by malware creators to deliver
malicious content since it is associated with documents, which users consider
to be static content. It is also preferred over other document formats
(i.e. the Office formats) since the format is easy to understand. Another advantage
of PDF files over other document formats is that they can be opened using a
browser; thus they can be delivered by a web server, not only through e-mails or
other targeted methods (i.e. memory sticks). The predominant malware type
associated with PDF files is the downloader, because the PDF format is
usually used as an attack vector.
.Office Files: Since these are the best-known document formats, malware creators
found in them an opportunity to deliver malware. There are usually two directions
through which malware gets delivered as Office documents: through macros and
through vulnerabilities.
– When Microsoft introduced the ability to run macros inside Office documents,
malware associated with documents started to appear. Like the JavaScript
language inside PDF files, Visual Basic for Applications (VBA)
was introduced to increase the usability of Office documents (i.e. to automate
repetitive tasks). However, through its use, malware creators made it
possible to drop files on disk and execute them, or to send e-mails. Starting with
Office 2007, when the Office file format changed and no longer allowed macros
to be embedded (unless a special extension is used), macro malware
decreased rapidly. There are still documents that contain macro malware,
but they are few and usually rely on social engineering to convince the victim
into allowing the macro to run.
– Since macros are disabled in the new Office document format, another
method was needed to deliver malware through documents. The
new format also simplified the Office format, so new vulnerabilities could
be found more easily. Nowadays, most of the malware that gets delivered
through Office documents relies on vulnerabilities. Like PDF files, Office
documents are usually used as attack vectors, so the majority of malware
that gets disguised as documents consists of downloaders.
SWF Files: The .swf extension is associated with files for the Adobe Flash Player.
This is a format that is used to bring interactive and animated content to the
user through a plugin in the browser. It has its own scripting language,
called ActionScript, but it also allows the use of JavaScript. Just as JavaScript is used
in browsers or PDF files to take control of the program rendering them, so is ActionScript
for SWF files. Flash malware usually exploits a vulnerability inside
the Flash Player. Since the same SWF file can be run by browsers on different
operating systems, a malicious Flash file can also target users that don't run Windows
software.
Java Files: Software written in the Java language for the PC is either a stand-alone
program or an applet that requires a browser plugin to run. There are few
malware files written in the Java language that run as stand-alone programs. Most
of them are applets and require a Java plugin installed in the browser in order to
be rendered. As with SWF files, malware written in Java is portable, so it can
affect users running different operating systems. In 2012 and 2013, Java exploits
were the main method of delivering malware through the browser [3].
1.4 Attack Vectors
An attack vector is a collection of steps that an attacker takes in order to infect the
victims. Even though there are many ways a person can get infected, there are some
patterns by which an attack can be classified. Every attack vector needs an entry point
into the victim's computer. There are four main entry points:
1. External devices: This can be anything, like a CD or DVD, but the most used
one is the USB flash drive. Until Windows 7, when the autorun feature was
deactivated, this was one of the most used methods of spreading malware. In
order to spread, malware running on a computer would list every removable
drive and copy itself to that location. In addition to that, it would create an
autorun.inf file containing instructions to automatically execute the copied malware
file when the USB flash drive is inserted. Every user that opened
the USB flash drive using Windows Explorer would get infected. Using this method,
malware could spread to locations that were isolated from the Internet or even from
a local network.
2. Vulnerabilities of known applications: either a vulnerability located in
the browser, one related to a program that opens documents, or even a
vulnerability of a specific application used by a victim who is important to the
attacker. The purpose is the same: to infect the victim's computer. To
increase the probability of infecting a user, hackers have created packages that
contain many exploits, named exploit packs. When a user visits a web page,
the exploit kit tries to take control of the victim's machine by feeding it all kinds of
exploits: from Java exploits to PDF exploits and exploits targeting the browser that
the victim is using.
3. Direct network attacks: in this case, the computer is attacked either through a
vulnerability of a service that the operating system is running or by trying to
copy to shared folders by brute-forcing the victim's passwords. This behavior is
usually associated with worms. A vulnerability in an operating system service accessible
through the network can make the malware spread all over the world. The
Conficker worm [4] used both a vulnerability and a brute-force technique to spread
and succeeded in infecting more than 7 million computers.
4. E-mail and social channels: By sending e-mails with malware attachments or
sending files through instant messaging programs, the malware tries to involve
more computers in the infected network. This method is usually accompanied by
a social engineering component. The malware writes some lines in the name of the user
and sends itself as an image file in order to look genuine. Even though not all
the users that receive the files get infected (since they need to run them), it is still a
frequently used method.
The previous entry points are used together with social engineering methods or other
technologies in order to increase the probability of infection. The following attack vectors
are the most used:
1. Spear phishing: this technique relies on sending e-mails to only a small group
of people that have something in common. For example, victims of the Red
October malware [5] received an e-mail containing a document about a sale of
diplomatic cars. The document contained an exploit for a vulnerability of Office
documents that the attackers used to drop a backdoor. Sending e-mail to only a
small group of people decreases the chance that the malicious file will be sent
to an antivirus company for analysis. Since it relies on information that is of
interest to the victim, it does not raise any suspicion and it can also get past the
antispam filter.
2. Drive-by downloads: using this method, a file is automatically downloaded once
a user visits a web site. If the victim is vulnerable to a certain exploit, then the
file is automatically executed. Many times, this method relies on malicious
ads that are displayed on benign web sites. Thus, without suspecting anything,
while visiting a benign web site, the user is attacked with a collection of
exploits.
3. Black hat SEO: in order to reach as many users as possible, the attacker uses black
hat search engine optimization techniques to make a malicious web page appear
among the first results when certain words are searched using well-known search
engines like Google. Black hat SEO includes, but is not limited to, using many
keywords without any relation to the content of the web page, duplicating content,
or infecting other web pages with a link that points to the malicious web page in
order to increase its ranking. If the user opens such a web page, it will result in
a drive-by download.
4. Watering hole: this method is usually used when trying to infect a certain group
of people. The main idea is to infect a benign website that the targeted users
visit in order to deliver malicious content. It is similar to drive-by download
attacks, the difference being how the malicious content gets to the user (through
infection of the web page and not through malicious ads).
1.5 Vulnerability Types
Nowadays, most users know that they must not execute a file that came attached to
an e-mail. Moreover, e-mail programs allow restrictions to be set such that the
user is not allowed to run an executable file that came as an attachment. This simple
method blocks the attacks where an executable file uses a document icon (i.e. PDF or
DOC).
In order to get past these restrictions, attackers try to find critical vulnerabilities inside
well-known applications. By leveraging these, an attacker can infect a victim by simply
making him visit a web page (through watering hole or black hat SEO techniques). These
vulnerabilities appear due to programmer mistakes such as writing over the
boundaries of a buffer or using a memory space that has already been deallocated. The
attackers try to exploit these vulnerabilities in a way that redirects the execution
pointer to their buffer, thus achieving code execution. All these vulnerabilities
deal with a memory space of a fixed width (called a buffer). Depending on how that
buffer is used, the vulnerabilities can be grouped into several categories.
1. Stack Overflows: This is one of the simplest types of vulnerabilities and it appears
when the programmer writes more data into a buffer declared on
the stack than it can hold. By doing this, the programmer actually overwrites additional data
structures used by the CPU in order to achieve correct execution.
In assembly language, when a function is called, the CPU pushes on the
stack different values like the arguments of the function, the value of a register
called EBP, and the address where it has to return. The EBP register is used
to store the value of the stack register ESP (which indicates the head of the
stack) while the execution pointer is inside the function. When execution
reaches the end of the function, the ESP register receives the value of EBP.
Basically, the EBP register serves as a backup for ESP and makes tracking
of arguments and local variables easier. Another important thing that happens
when the function has finished executing is that the address where the function
needs to return is popped from the stack and execution jumps to that
address.
Standard stack overflow: the problem with this approach is that the local
variables, including buffers, are stored on the same stack as the structures
explained above. When a program fills a buffer with data provided by
the attacker and doesn't correctly check whether it passes the boundaries
of the buffer, it is possible to overwrite a return address used by a function.
Of course, in order to achieve malicious code execution, the return address
must point to an address controlled by the attacker. This is usually an
address on the stack where the buffer sent by the attacker is stored.
Off by one byte: this is similar to the standard stack overflow, but it is more
difficult to achieve code execution. The problem usually appears when a
programmer tries to use a safe function, that checks for buffer boundaries,
but specifies the wrong buffer length. It is a common mistake for novice
programmers to miscount the buffer length, since indexing starts at
0, which is not intuitive. So the length provided is usually off by one byte
from what it should be.
With a single byte to control, the attacker cannot directly control the return
address of a function. However, he can still modify one byte of the register called
EBP. Since on the Intel architecture values are stored in little-endian order, a one-byte
overflow will actually modify the least significant byte of the EBP
register. When the value of EBP is assigned to ESP, at the end of the
function, the attacker will actually control where the stack begins. In
this way he can also control where the function will return, since this value
is also stored on the stack.
Using format strings: Like the rest of the stack vulnerabilities, this one also
tries to control the return address of a function. This time this is done by
exploiting a feature of the printf function and its variants. The function
printf writes a buffer to the console and accepts different
arguments that specify how the output is to be formatted. There are different
variants of it; some of them are for Unicode characters and there is also
a variant that is used for formatting a string instead of outputting to the
console.
The formatting options are specified using the % (percent) character. There
are many different formatting options that control how the printf function
works, like outputting the hexadecimal representation of a number, outputting
a string or a char; there is even an argument that can be used to store the
number of chars output so far at a memory address. However, the function can
also be used without any formatting arguments. A user can simply pass the
buffer and that buffer will be output. This is actually the source of the
problem.
Novice programmers may pass a string that is copied from the input to the
printf function without using any formatting arguments. An attacker may
send a string that contains formatting arguments. Doing so will make the
printf function search for additional arguments on the stack. A very powerful
formatting argument is "%n". This tells the printf function to write the number
of bytes processed to a specified address. Thus, by providing
several formatting arguments, one can skip certain addresses on the stack and
reach the return address of the function.
2. Heap Overflows: buffers on the heap are usually used when the amount of data
to be stored is too big for the stack or when the maximum length
is not known at compile time. These are usually created using functions such
as malloc and new and are freed using the functions free and delete. In contrast
to the stack, there are no structures directly used by the CPU stored
on the heap. For this reason, these kinds of vulnerabilities are more difficult to exploit.
There are different exploitation techniques meant to achieve code execution using
heap overflow vulnerabilities. Most of them try to overwrite sections of memory
that will later be referenced in an execution context. This is possible since, on the
heap, buffers are usually stored in adjacent locations, so writing beyond the
boundaries of a buffer stored on the heap is very likely to write over another buffer.
Overwriting function pointers: A good way to achieve code execution is
to overwrite function pointers. On some occasions these are stored on the
heap. A good example is a C++ class that uses virtual functions. Since
these kinds of functions are resolved only when the object is instantiated,
they are stored as function pointers. More than that, every object that uses
virtual functions also has a pointer to a table, called the virtual function table,
that stores all these pointers. Overwriting the table or the pointer to that
table is a good way of achieving execution of the malicious code.
Use after free: There are situations when an object is allocated on the heap,
freed, and later used again. Normally, this would lead to a crash, since
unallocated memory is being used. However, if an attacker manages to reallocate
the freed memory and copies his code there, the crash will not happen
anymore and the new code will be executed instead. These kinds of vulnerabilities
are often exploited in browsers. This is possible since browsers,
through the JavaScript language, allow heap spraying. Basically, heap spraying
means that an attacker can allocate large chunks of memory on the heap and
fill the bytes with the right values. Using large chunks of memory increases
the probability of overwriting the previously freed locations.
3. Integer Overflows: These vulnerabilities usually happen when the program performs
some arithmetic operations prior to a buffer operation, but doesn't take into
account the boundaries of the arithmetic results or how the compiler treats the
result. Overflowing an integer is only useful when the result is used as an argument
of a buffer processing function, such as strncpy.
Boundary checks: A classic scenario of this kind of vulnerability is the
following: an arithmetic operation is performed on an unsigned integer.
The result of this operation determines the length of a buffer to be
allocated. After that, some user-controlled buffer is copied into the allocated
buffer. If the arithmetic result isn't validated, an attacker can make
the result overflow; thus, instead of obtaining a very large number, a small
number will be obtained. If this happens, the size of the allocated buffer
will be too small, leading to a classic stack or heap overflow.
Signedness errors: These happen when a signed variable is used both as
signed and unsigned in different scenarios. For example, a programmer may
use a signed integer to measure the length of a buffer. The signed integer will
be verified against a maximum length and, if it is lower, a buffer operation
will be performed. Passing a negative value as the length of the buffer
will make it pass the maximum length validation. However, when a strncpy
function (or an equivalent) is used, it will treat the supplied length as unsigned,
thus overwriting other memory zones.
1.6 Malware detection requirements
Given the fact that malware families evolve very fast and use sophisticated techniques
to avoid detection, it is very important for malware classification methods
to be able to handle very large datasets in a reasonable amount of time. The number
of malware files being released in the wild is so big that only an automated or semi-automated
mechanism can handle it at this time. For this reason, a machine learning
algorithm can be used to filter what malware researchers analyze, or it can even be used
as a primary detection method. However, this detection mechanism must also obey
some restrictions in order for it to be used in practice:
1. It has to be able to adapt fast to changes in the malware landscape. By fast adaptation
I mean that it needs to observe the similarities between new malicious files
and old malicious files of the same family, but it also needs to be able to make
clear distinctions between the new malicious files that appear on a daily basis and a
vast set of benign files. Training speed is also important when it comes to fast
adaptability. If the algorithm takes too long to train, by the time it finishes,
the practical results will be far less than expected, since the malware landscape
will have already changed. So, to achieve the best results, the algorithm must be
trainable in a couple of days. Even though today's computing power can
compensate for an increase in the time complexity of the algorithm, we must take into
account that training can be done on millions of records.
2. Testing speed is also very important. If we want the method to be included in
an antivirus product, it must introduce as little overhead as possible. An antivirus
product usually works by intercepting access to all files in the system. So, when
a process is being executed, all files that are accessed by the process will
first be scanned. If no malicious file is detected, the process is allowed to run.
If a process takes too long to execute because of the antivirus product, many
users will prefer to disable the protection mechanism. For this reason, testing whether
a file is malicious or not must be almost instantaneous.
3. Memory restrictions: Even though for the training phase we can use computers
with many gigabytes of RAM, we have to take into account that the average
amount of RAM in many user computers is below 4 gigabytes. This RAM
is shared by the operating system, the antivirus product, and all other processes
run by the user. This is why the model produced by the classification algorithm
must fit in a very small amount of memory.
4. No false positives: This is perhaps one of the most important, but also the most
difficult restrictions. Detecting a benign file, such as an important document or
a critical system file, as malware can make the antivirus product delete it. This
can lead to information loss, system crashes, or corruption of important files. This
is why a critical aspect of the machine learning algorithm is not to produce any
false positives. Dealing with large datasets means that a small false positive
rate translates into many benign files being detected. Even though the
machine learning algorithm will not achieve its best performance when used with
such a restriction, in practice it will not be used alone. Other mechanisms, such
as signatures or hand-written heuristics, will compensate for the loss of detection.
By obeying these restrictions, I will try to present in this thesis several methods that can
be used to help classify malicious and benign files.
1.7 Security and privacy concerns
If the first types of viruses tried to delete files, corrupt data, or encrypt documents
in order to request a ransom, nowadays the landscape is very different. There is no point
in making a system unusable, since this will not provide any benefit to the attacker.
Instead, the victim can be considered an asset for the attacker, on which different types
of techniques can be used to produce money. For example, an infected
system can be used as part of a botnet. The botnet can be used as a denial of service
(DoS) weapon against a certain website, so the botnet can be rented to persons who have
an interest in shutting down websites. Another example is using the computing power
of the infected machine for virtual currency mining (bitcoin mining).
Besides these techniques, which are clearly malicious, there are also other means of
making money through the use of some gray methods. For example, a file can reside
on the victim's computer that monitors all the web searches performed using a web
browser, as well as the products that the user is interested in. Based on these requests, a
behavior model can be created for the victim and targeted ads can be displayed that
can convince the user to buy several things. Whether this is malicious or not is debatable. A
user might consider this useful, since it helped him find products that he is interested
in. On the other hand, sending data that contains behavioral information might be
considered intrusive and could also be classified as a privacy threat.
The example provided above is one where just the browsing behavior is monitored.
However, there are cases when more private data is submitted to a server: for example,
the e-mail addresses of the user's friends or the locations where the user is using his
device for chatting (mobile phone, tablet, or laptop). There are applications that request
access to the social page of the user in order to post ads on his behalf.
Usually, these kinds of applications promise better web search results, access to great
offers and discounts, or even prizes in lotteries. Some users will consider detecting these
applications as denying them the ability to earn different prizes, while others will consider
missing them a huge gap in the detection mechanism of the antivirus product.
Needless to say, money can be made by using the private data of the user, and these mechanisms
are being used more and more as computers are becoming safer.
1.7.1 Privacy concerns regarding mobile devices
During the last decade, the demand for smartphones has rapidly increased and got to
a point where it is very difficult nowadays to find a mobile phone that is not able to
run external applications. There are almost 1.91 billion smartphones worldwide,
and this number is forecast to increase to 2.56 billion by 2018 [6]. Of course, the power
of a smartphone actually resides in the applications it runs. Even though there are
different operating systems that run on mobile devices, two of them stand out: iOS
from Apple and Android from Google, together totaling a 93.8% market share (14.8% for
iOS and 81.5% for Android). A statistic taken from [6] regarding the market share of each
operating system can be viewed in Figure 1.2.
Figure 1.2: Growth of market share of mobile phone operating systems
These two operating systems have also launched the idea of an application store: a
place where programmers can submit their applications and where interested users
can find and buy them. Even though there are many application stores for the
Android OS, the main and largest one is Google Play. The equivalent application
store for iOS is called the App Store. Starting with their launch in 2008, these stores
have constantly grown, surpassing 1 million applications in 2014. A graphical
display, provided by appfigures [7], regarding their growth can be viewed in Figure 1.3.
Since these mobile operating systems were created at a time when malware attacks were frequent,
special attention was paid to security. Each application runs in a separate container;
thus it cannot modify files belonging to other applications. Also, each time the application
wants to access sensitive information (contacts, SMS, photos, etc.) or wants to
perform possibly dangerous actions (make calls, send messages, or even push notifications), it
requests a permission that the user must allow. The model differs slightly for iOS, where
the permission is requested every time the action is performed. In Android, the permissions are
listed in a manifest file and are granted the first time the application runs.
Figure 1.3: Growth of the application stores
Another important improvement regarding security is brought by the validation of each
application that is intended to get into the market. While Apple declares that each application
is validated manually (which involves a lot of human effort), Google has revealed that
Android applications are validated using a semi-automated system called Bouncer.
Given these new security measures, the attackers needed to find new ways through
which they can monetize their victims. For this reason, they turned to issues that
are more related to privacy. This is not something new when it comes to malware found
in the wild, since adware programs have already been doing this for more than 10 years.
As with adware, it is very difficult to tell whether a permission requested by an application is
intended for the normal use of the application or is just part of a malicious behavior
meant to steal sensitive information. For example, it is quite normal for an application
that backs up the contact list into the cloud to access the contacts. However, if a game
accesses the contact list, it should be marked at least as suspicious. And this scenario
applies to almost every permission.
Another problem that affects users of mobile applications is the vulnerabilities that
are embedded inside benign apps. Not all developers think of security, and when we
are talking about more than 350,000 developers that write code for the app stores, it is
not difficult to find an application that has security flaws. Take for example the same
application discussed above that requires access to the contact list in order to make
a backup in the cloud. It is benign, it passes all security checks of both
application markets, and, given its scope, it is normal for it to access the contact list. Yet if it
sends the data to the cloud without using encryption or without properly implementing
security protocols (like HTTPS), an attacker may use a man-in-the-middle attack to
extract the submitted data. The situation becomes worse if the problem does not
reside in the code written by the developer of the app but in a library that many
applications use.
So, in order to protect users from privacy risks, methods are needed that can
test an application from different perspectives and determine the risks associated with
using it.
Chapter 2
State Of The Art. Related Work
2.1 Malware Detection using Machine Learning
A great amount of work regarding malware detection using machine learning was carried
out by Dragos Gavrilut. Even though many of his published papers rely on variations
of the perceptron algorithm, he achieved good results in both accuracy and proactivity.
Since the aim of his research is to provide a reliable method that can be used by
antivirus companies to detect malware, several restrictions are maintained in all of the
papers:
A very low number of false positives: zero false positives is preferable,
but since this can drastically reduce the detection rate, a 0.01% threshold is
acceptable. False positives are very dangerous for antivirus products: classifying
an important document or system file as malware can lead to serious data
loss or system failure. A higher number of false negatives is preferable to an
increase in false positives.
The algorithm needs to be very fast on the testing side: this restriction is
required in order not to affect the user's experience. In most antivirus
products, every time a program is run it is first blocked, then scanned, and only if
it is classified as benign is it allowed to run. The time in which the file is
blocked must be reduced to a minimum.
2.1.1 Malware detection using machine learning
In paper [8] the authors describe the One Side Class Perceptron algorithm and apply
different variations in order to see how it behaves when classifying different datasets. In
order to evaluate the practical applications of the algorithm, three datasets are created. Every
record in the datasets is defined by 308 features. If there are multiple records that
contain the same combination of features, only one of them is kept:
1. Training dataset: consists of 7822 unique malware records (obtained from 27475
files) and 415 clean records (obtained from 273133 files). This is used to train
the perceptron.
2. Test dataset: consists of 506 unique malware records (obtained from 11605 files)
and 130 clean records (obtained from 6522 files). This dataset is used to
evaluate the perceptron.
3. Scale-up dataset: consists of 12817 records (obtained from approximately 3 million samples)
and 16437 records (obtained from approximately 180,000 samples). This dataset is
further split into pieces, each one consisting of 10%, 20% and so on, in order to
evaluate how the algorithm behaves when the dataset is increased.
The one-side class perceptron (OSCP) is introduced as a variation of the original perceptron
algorithm [9]. It works in two steps: first, just like in the original algorithm,
a hyperplane is obtained that best separates the two classes of files. Following this,
another training is performed, this time adjusting the hyperplane such that no element
from one class is misclassified. Using the one-side class perceptron, several variations
are tested in order to see which one brings the best results:
1. Mapped one-side class perceptron: this is the one-side class perceptron, but every
record in the datasets is modified to contain a combination of every two of its
features. The combination is done using the and operator.
2. Mapped one-side class perceptron with feature selection: only the best 30% of
all features resulting after mapping were used, in order to reduce the training time.
3. The dual form of the OSCP, with the training entries mapped into a larger space
via several kernel functions (polynomial kernel function with degree 2, 3 or 4,
and radial basis kernel function).
After cross-validation, the algorithm that provided the best detection rate (sensitivity)
was a cascade of one-side perceptron algorithms with the polynomial kernel function
of degree 4. However, due to its higher number of false positives, the best algorithms
to be used in an antivirus engine are the one-side perceptron algorithm and its mapped
version.
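The two-step OSCP scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the learning rate, the epoch limits, and the choice of the benign class (label -1) as the one that must end up perfectly classified are our assumptions.

```python
def oscp_train(X, y, epochs=100, lr=1.0):
    """One Side Class Perceptron sketch. Step 1 trains a standard
    perceptron; step 2 keeps adjusting the hyperplane only on
    misclassified samples of the protected class (benign, label -1)
    until none remain, so no benign file lands on the malicious side."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0

    def predict(x):
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if s >= 0 else -1

    # Step 1: ordinary perceptron updates on every misclassified sample.
    for _ in range(epochs):
        errors = 0
        for x, label in zip(X, y):
            if predict(x) != label:
                for i in range(n):
                    w[i] += lr * label * x[i]
                b += lr * label
                errors += 1
        if errors == 0:
            break

    # Step 2: one-sided pass, pushing the hyperplane until every
    # protected-class sample is correctly classified (or epochs run out).
    for _ in range(epochs):
        clean = True
        for x, label in zip(X, y):
            if label == -1 and predict(x) != -1:
                for i in range(n):
                    w[i] -= lr * x[i]
                b -= lr
                clean = False
        if clean:
            break
    return w, b
```

On separable data, the second loop terminates immediately once the first loop has already placed every benign sample on the correct side.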
Other contributions of this paper are related to optimizations that can be added to
the OSCP algorithm in order to make it run faster. This is especially needed when
working with very large datasets, such as the scale-up dataset described above. The first
optimization relies on the observation that, in order to correctly classify all the samples
belonging to a certain class, the hyperplane must be adjusted only in a direction that
does not affect the correct classification of the elements belonging to the target class.
Since these elements do not modify the hyperplane (the hyperplane gets modified only
when items are incorrectly classified), it is not necessary to evaluate them on every
iteration. A second optimization relies on a programming tweak. Since float operations
are costly in terms of CPU clocks, it is more advantageous to use operations that work
with numbers represented as integers (the decimals are considered to be the last digits).
By using these tricks, the algorithm can be implemented to be 2.5 times faster.
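The fixed-point trick (decimals stored as the last digits of an integer) can be illustrated as below; the decimal scale factor of 1000 is our assumption, since the precision actually used is not stated.

```python
SCALE = 1000  # keep three decimal digits as the low-order digits of an integer

def to_fixed(x):
    """Encode a float as an integer: the decimals become the last digits."""
    return int(round(x * SCALE))

def fixed_dot(w_fixed, x_fixed):
    """Dot product carried out entirely in integer arithmetic;
    the float result is recovered only once, at the very end."""
    total = sum(wf * xf for wf, xf in zip(w_fixed, x_fixed))
    return total / (SCALE * SCALE)
```

In a compiled implementation, all the per-feature work in the inner loop then uses integer multiply-accumulate instead of floating-point operations.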
2.1.2 Practical Optimizations for Perceptron Algorithms in Large Malware
Datasets
Other optimizations for the One Side Class algorithms are presented in [10]. One of the
first modifications of the algorithm relates to the way it is implemented. Since the features
are boolean values, it is better to replace the multiply instruction with a comparison.
This is due to the fact that a multiply instruction takes more clock cycles than a
comparison. Using this trick, the algorithm's speed was improved by 96%.
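For 0/1 features, the multiply-versus-comparison idea amounts to adding a weight only when its feature is set. A sketch of the two equivalent scoring routines (illustrative, not the paper's implementation):

```python
def score_multiply(w, x):
    # straightforward version: one multiplication per feature
    return sum(wi * xi for wi, xi in zip(w, x))

def score_compare(w, x):
    # boolean-feature version: the multiplication is replaced by a test;
    # the weight is added only when the feature is set
    return sum(wi for wi, xi in zip(w, x) if xi == 1)
```

Both return the same dot product; the second touches only the set features and performs no multiplications at all.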
Another idea was to use a cache for every record. After the training phase has started
to adjust the hyperplane, each record to be evaluated is selected according
to an algorithm. The aim is to test misclassified samples as often as possible and skip
the correctly classified ones (since the perceptron only adjusts the hyperplane when it
finds misclassified samples). The algorithm is described below:
If a record is incorrectly classified, it will be selected for verification in the following
iterations. The number of iterations can be adjusted dynamically.
If a record is correctly classified, the sample will not be verified for the next iterations.
The number of iterations for which the record is skipped can be increased
from one step to the next. For example, if the first time the sample was tested
only after 5 iterations, the next interval should increase, e.g. to
10. In this way, records that are correctly classified are tested less and less.
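The scheduling idea above can be sketched as follows; the doubling of the skip interval and the concrete bookkeeping are our assumptions (the description only requires that the interval grows from one step to the next).

```python
def adaptive_schedule(records, classify, update, iterations=100, base_skip=5):
    """Sample-scheduling cache sketch: misclassified records are revisited
    on the very next iteration; correctly classified ones are skipped for
    a growing number of iterations (5, 10, 20, ...)."""
    next_check = [0] * len(records)    # iteration at which to test each record
    skip = [base_skip] * len(records)  # current skip interval per record
    for it in range(iterations):
        for i, rec in enumerate(records):
            if it < next_check[i]:
                continue               # still inside its skip window
            if classify(rec):          # correctly classified: widen the window
                next_check[i] = it + skip[i]
                skip[i] *= 2
            else:                      # misclassified: adjust and retest soon
                update(rec)
                next_check[i] = it + 1
                skip[i] = base_skip
```

With `base_skip=5` and doubling, a record that stays correctly classified is tested at iterations 0, 5, 15, 35, 75, ... , so the cost of the stable part of the dataset fades quickly.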
2.1.3 Building a practical and reliable classifier for malware detection
While in previous papers the main target was to reduce the number of false positives, in
paper [11] the authors try to increase the detection rate. This is done with several
methods:
1. Mapping features into a higher space: this involves selecting a limited set of
features according to a score (the F-score is chosen by the authors) and combining
them with each other using a binary operator. The authors have found that the
and operator is well suited for this operation. The reason why only a part of the
features is chosen is the large number of new features created by this operation:
if all features were selected, the algorithm would need n(n+1)/2 × sizeof(feature)
memory space (where n is the number of original features), which is too much for a large dataset.
Another variation of this method is to first map all features, and then use a score
function to sort them.
2. Using an ensemble system: a hyperplane obtained by a one-side class algorithm
will first separate the samples into two sets: the first set contains benign and
malicious samples, while the second contains only malicious samples. Since the
records from the second set are well classified, they will be removed from the set.
The next step is to use another one-side class perceptron to classify
the remaining records. Depending on the dataset size and the available time, this
method can be applied in several steps, each time obtaining a better accuracy.
3. Vote system: each model will vote for the elements it classifies, giving a
vote for both positive and negative records. Since the OSCP algorithm is very good
at classifying one class of elements, it will give a higher vote for records belonging to
that class and a smaller one for the others. The final result will be a weighted average
of the votes given by every model.
4. Hybrid method: this method comes from how malware analysis is done in practice.
It is easier to create a model that correctly classifies a malware family than
one that classifies all the malware samples. For this reason, the malware set will first be
split using a binary decision tree according to a certain feature. The feature
can be chosen with different algorithms: either F-score or MedianClose. For each
obtained cluster, an OSCP algorithm will be used.
Using the methods described above, the authors have managed to increase the detection
rate from 53.4% to 69.55% when the mapped algorithm was used, to 54.71% when an ensemble
of five iterations was used, to 61.63% when the voting system was used on clusters of files
created using the Manhattan distance, and to 99.53% when the hybrid method was used with
the MedianClose function for choosing the feature and a decision tree of 8 levels (resulting
in 256 models).
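The pairwise and-mapping of method 1 can be sketched as below. Note that `combinations_with_replacement` also pairs each feature with itself (x and x = x), which preserves the originals and yields exactly n(n+1)/2 mapped features.

```python
from itertools import combinations_with_replacement

def map_features(x):
    """Map a boolean feature vector into the pairwise-AND space:
    n original features yield n(n+1)/2 mapped ones."""
    return [a & b for a, b in combinations_with_replacement(x, 2)]
```

This quadratic blow-up is exactly why [11] selects only a subset of features (by F-score) before mapping.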
2.2 Feature Extraction using Genetic Programming
2.2.1 A Hybrid Approach to Feature Selection and Generation Using an
Evolutionary Algorithm
The authors of [12] have managed to improve the accuracy of detecting certain substances
by combining feature extraction through genetic algorithms with feature selection.
For feature extraction using a genetic algorithm, the authors define a chromosome as
a structure of two elements: the first one is a characteristic vector of features that
determines which ones are activated and which are not, and the second element consists
of mathematical combinations of the features activated in the first element.
The crossover operator is adapted for the defined chromosome. Since the chromosome
is made of two parts, the cutting point has to be selected such that the arithmetic
operations that are going to be passed to the other chromosome can be computed with
the activated features. The mutation operator is very simple: it randomly activates or
deactivates features. Feature generation is performed by randomly selecting a mathematical
function, filtering from the activated features just the ones compatible with the
operator, applying the operator over the features and appending the resulting feature
to the chromosome. The fitness evaluation is based on a support vector machine.
The algorithm is tested on real data used in chromatography. The dataset consists of
200 records of 5000 features each. Feature selection algorithms like forward selection,
backward elimination and genetic algorithms are applied in order to obtain features
well suited for the learning problem.
The genetic algorithm is configured to have a population size of 30 elements and is run
for 50 generations.
The results clearly show that the features obtained through the genetic algorithm
improve the accuracy in comparison to just using a support vector machine
with the standard features. The average absolute error for the target values dropped
from 2.2 to 0.07 for one set of data and from 22 to 0.1 for another.
2.2.2 Automatic Feature Generation for Machine Learning Based Optimizing Compilation
An interesting paper regarding feature extraction is written by the authors of [13]. Their
aim is to automatically produce features that can help a machine learning algorithm
determine the loop unrolling factor used in the compilation process. Doing so will
increase the speed of the compiled program.
Their method is composed of two components:
1. Feature search component: this is used to explore the feature space represented
by the intermediate representation and provide the feature values to the machine
learning algorithm.
2. Machine learning tool: computes how good the feature provided by the feature
search component is, along with the already found features (base feature set). The
feature will be continually improved until this is no longer possible. At that point
it will be added to the base feature set.
The feature space is described using a special grammar derived from the intermediate
representation (parse trees). Using this grammar, the feature search component is
able to produce new features by expanding non-terminals.
The crossover operation is implemented by randomly selecting non-terminals of the
same type and exchanging their subtrees, thus creating new parse trees.
Mutation is performed by replacing a non-terminal with a random expansion of the
same non-terminal.
The algorithm was tested using 57 benchmarks from public sources. The genetic algorithm
was configured to use a population of 100 individuals and run for 200 generations,
or until no improvement was observed in the last 15 generations. The machine learning
tool used is a C4.5 decision tree. Results are validated using ten-fold cross validation.
The authors state that their method achieves a 75% prediction rate for the correct unrolling
factor, while the public gcc compiler only achieves 3%. A support vector machine using
hand-written features achieves only 59%.
2.2.3 Evolving Novel Image Features Using Genetic Programming-based
Image Transforms
Feature creation using genetic programming was also used by the authors of [14] in order to
detect muscular dystrophy by analyzing gray-scale images. The features are created
using the pixels found in a certain square of the image through the use of Cartesian genetic
programming.
Each new feature is a graph that consists of mathematical operations between previous
features or inputs found in the image. The following mathematical functions are used:
max, min, x·y, x+y, x/y, x−y, abs(x), x², x^y. The crossover is implemented
using single-point swap by treating the graph as a list of nodes. Mutation operates on
the connectivity or the node types.
For validating the resulting features, machine learning algorithms are used (decision
trees, Ridor rules or k-NN), all run through the Weka [15] framework.
To test their method, the authors used a public dataset consisting of 386 healthy and
220 sick cell images. From these, 186 of the healthy and 200 of the sick images were
used for training, while the rest were used for validation.
The authors compared the results with the ones obtained by using 16 hand-made
features (thresholded area, variance, mean radius, radius variance, perimeter, compactness,
entropy, fractal dimension). The genetic algorithm was configured to run for
50 generations with a population size of 200.
Throughout their tests, using the 1-NN algorithm and the decision tree algorithm, the
authors show that by using genetically created features an improvement of 38% can be
achieved.
2.2.4 Feature Extraction Using Genetic Algorithms
The authors of [16] and [17] use a weight vector to perform feature extraction or feature
selection. If the weight vector has binary values, then it is used for feature selection.
For feature extraction, the values are usually in the interval 0.0 to 10.0. The idea is
to transform the feature space such that more of the records become linearly separable.
The fitness function used is a custom one, based on the k-NN results. It takes
into account the number of elements left after classification was done using the weight-modified
k-NN and the cardinality of the near-neighbors minority set.
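The weight-modified k-NN at the core of this fitness evaluation can be sketched as below; this is an illustration only, and the authors' exact distance formula and vote-counting details may differ.

```python
def weighted_distance(a, b, w):
    """Euclidean distance with per-feature weights in [0.0, 10.0]:
    a weight of 0 removes the feature (selection), while other values
    stretch or shrink its axis (extraction)."""
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b)) ** 0.5

def knn_predict(x, data, labels, w, k=3):
    # vote among the k nearest neighbours under the weighted metric
    ranked = sorted(range(len(data)),
                    key=lambda i: weighted_distance(x, data[i], w))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)
```

The genetic algorithm then only has to evolve the weight vector `w`; the classifier itself stays fixed.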
For the elements that are not linearly separable a second approach is proposed: a new
feature can be created by combining others through mathematical functions. Since finding
these features can only be done through the use of suboptimal heuristics, the authors
use genetic algorithms.
Their approach is first tested on a dataset generated according to a template. The
algorithm is first trained using 240 examples belonging to 8 classes and tested on
1200 new examples generated according to the same template. The results show that
the weighted GA-kNN achieved a 4.13% error rate, while the standard normalized kNN
achieved 35.93%. The authors state that the error rate can be reduced even further, to
3%, if feature selection is performed before classification by using weights of 0 and 1.
The method is also tested on a real-world dataset with application in biology, more
exactly the classification of soil samples. The dataset consists of 232 samples that must
be grouped into 3 classes. Each record has 96 binary features. Using the k-NN algorithm
for classification with weighted features, the error drops from 8.3% to 2% in
comparison to using the original feature set.
By performing feature selection on the same dataset through the use of binary values
in the weight vector, the number of features was reduced from 96 to 31. This alone helped
drop the error rate to 0.83%. By performing feature extraction using values in
the interval 0.0 to 10.0 for the weight vector, the error rate was further reduced to 0.24%.
In the last part of the paper, the authors test whether the correct features can be found
through the use of genetic algorithms. For this purpose, an artificial dataset is created:
some features are generated, and two hidden features that can be obtained from
the original ones are embedded in order to see if the algorithm manages to find them. To
generate new features, the multiplication operator is used. The authors show that the
genetic algorithm succeeds in finding the new features and, using them, achieves a 0%
error rate.
2.2.5 Feature Generation Using Genetic Programming With Application to Fault Classification
Feature creation using genetic algorithms was also studied by the authors of [18]. In their
approach, which shares some elements with the one presented later in this thesis, the
authors try to improve the detection of bearing faults before the bearing breaks.
To do so, the authors represent the chromosome as a mathematical expression, more
exactly as a binary tree where each node is a mathematical operation and each leaf is an
original feature. The mathematical functions used are: +, −, ×, /, square, sqrt, sin, cos,
asin, acos, tan, tanh, reciprocal, log, abs, negator.
The genetic operators are simple to implement. The crossover operation works by
choosing two chromosomes and exchanging subtrees, while mutation works by randomly
choosing a node from a chromosome, deleting the subtree that corresponds to the chosen
node and randomly generating another one. The fitness function used is a modified
version of the Fisher criterion, which looks very similar to the F-score (Equation 3.1)
used in this thesis.
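The tree-shaped chromosome and its operators can be sketched as follows. This is a toy reconstruction: it uses only three operators and a fixed crossover cut point, whereas the paper uses the larger function set listed above and random cut points.

```python
import random

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def random_tree(n_features, depth, rng):
    """Random expression tree: internal nodes are operators,
    leaves are ('x', i) references to original feature i."""
    if depth <= 0 or rng.random() < 0.3:
        return ('x', rng.randrange(n_features))
    op = rng.choice(list(OPS))
    return (op,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def evaluate(tree, x):
    if tree[0] == 'x':
        return x[tree[1]]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def crossover(t1, t2):
    """Exchange subtrees between two parents (fixed cut at the left
    child for brevity; a real implementation picks random cut points)."""
    if t1[0] == 'x' or t2[0] == 'x':
        return t2, t1
    return (t1[0], t2[1], t1[2]), (t2[0], t1[1], t2[2])

def mutate(tree, n_features, depth, rng):
    # replace a randomly chosen subtree with a freshly generated one
    if tree[0] == 'x' or rng.random() < 0.3:
        return random_tree(n_features, depth, rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, n_features, depth - 1, rng), right)
    return (op, left, mutate(right, n_features, depth - 1, rng))
```

Each evolved tree, evaluated over a record's original features, yields one new derived feature.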
The resulting features are tested using two kinds of classifiers: an artificial neural network
(a multi-layer perceptron with one hidden layer) and a support vector machine.
The authors created two types of features:
The first type is generated by running the genetic algorithm for 1000 generations,
with a maximum tree depth of 5 and a population size of 10. The entire
process took 5 minutes.
For the second type of features, 10000 generations are used, with a maximum
tree depth of 10 and a population size of 14. The process took about an hour.
By comparing the two types of features, the authors conclude that the second type
succeeds in separating different classes of records where the first type fails.
By comparing the results obtained using the standard features with the ones obtained
through the use of genetically evolved features, using the two classifiers, the authors
state that they have obtained 96.5% accuracy for the artificial neural network in
comparison with 90.8%, and 97.1% for the support vector machine in comparison with
91.2%.
2.2.6 The Evolutionary Pre-Processor: Automatic Feature Extraction for Supervised Classification using Genetic Programming
A very interesting approach is provided in [19]. Here, the authors define a framework called
Eprep (Evolutionary Pre-Processor), where each chromosome is actually a solution to
the problem.
The system has many degrees of freedom: it automatically selects the number
of features to extract, uses many non-linear functions in order to find correlations
between features, and determines which classifier is best for the given problem.
Each chromosome is composed of two parts:
The first one is a parse tree for functional expressions. There are many functions
available for use (addition, multiplication, ratio, polynomial expansion,
natural logarithm, equals to, less than or equal to, logical or, logical not, discrete
Fourier transform, cosine, arc-cosine, vector filter), but the authors recommend
limiting to just a few of them in order to reduce the size of the search space. The
terminals of the parse trees are either input variables, user-supplied constants
or randomly generated constants.
The second part is the classifier. There are three available classifiers that the
chromosome can choose from: Generalized Linear Machine (GLIM), k-nearest
neighbors (with k=3) and Maximum Likelihood Classifier.
The initial population is created by randomly selecting functions and terminals with a
uniform distribution. The fitness function is the percentage of misclassified
samples from a validation set.
The authors used 9 datasets to test the algorithms, gathered from different external
sources with applications in medicine, finance, image analysis and engineering design.
The results obtained by a classifier with the best features were compared with the results
of the same classifier on the raw data. For two of the problems (image processing),
the algorithm managed to reduce the error rate by half. For the others, the difference
is not significant. Even so, for almost all of the datasets, Eprep selected only a
subset of the original features, thus reducing the dimensionality.
2.3 Improving Malware Detection through Feature Extraction
2.3.1 Data Mining Methods for Detection of New Malicious Executables
The authors of [20] were among the first who used machine learning in order to achieve
a better detection rate and to be more proactive. Even though their methods were
tested on a relatively small dataset compared to the number of malware samples present today in
the wild (3265 malicious and 1001 benign files), it still provides important information
on how to construct a classifier for malware detection. An important thing to note
is that their dataset was constructed based on the malware landscape present at the
time of writing the paper (5% trojans and 95% viruses). Nowadays, the percentages are
the opposite.
The features used in the learning algorithms are constructed automatically, by statically
processing each file. The information extracted differs depending on the learning
method used. The following methods are used to construct features:
1. By using a third-party library called libBFD, from GNU Bin-Utils, they extract
the DLL names, the functions imported from each DLL and how many times each
function is referenced in a binary. However, this is only suitable for PE files, so
additional methods are needed.
2. By using the GNU strings program, the authors extract each string that appears
in a file. This method can also be used for non-PE files. However, strings are not
that representative for executable files.
3. The final kind of information extracted is represented by sequences of bytes.
There is no explanation of how these sequences are chosen.
Different learning algorithms were used for different kinds of features. For features based
on DLLs and functions, the Ripper algorithm was used (an inductive rule-based learner
that generates boolean rules). For string and byte sequences, a Naive Bayes algorithm
was used. Since the extracted features did not all fit in memory for all the data, the
dataset was split into several chunks. On each of these chunks, a Naive Bayes classifier was
trained. The final results were a combination of the results provided by each of the
Naive Bayes classifiers.
The best results are obtained by running the Naive Bayes classifier on the features
based on strings extracted from binaries (97% detection rate and 3.8% false positives).
Multiple Naive Bayes classifiers, trained using strings and chunks of binary data, also
obtained very good results (97% detection rate and 6% false positives). The other
methods did not achieve the hoped-for results; for example, the Ripper algorithm only
achieved a 71.05% detection rate. However, all of the machine learning methods outperformed
the signature method when it comes to proactivity: the signature method only
achieved a 33.75% detection rate.
2.3.2 The proactivity of perceptron derived algorithms in malware detection
An interesting aspect is approached by the authors in [21]. The aim of the paper is to
test the proactivity of different linear classifiers over a period of 14 weeks. In order to
do so, the authors collected malicious files for a period of one year during 2010
and 2011. Half of the data was used for training (data gathered in the first 44 weeks)
and the other half was used for testing (data gathered in the last 12 weeks).
For each file, approximately 2500 features were obtained. The features are both static
(obtained by parsing the PE header) and dynamic (obtained through the use of an emulator).
The next step was to reduce the number of features: the F-score was
computed for every feature and only the first 550 were kept. The final dataset consisted
of 975882 files, of which 110144 are malicious.
The authors tested many classification algorithms:
1. The standard batch perceptron.
2. The margin perceptron: a batch perceptron that limits the misclassified records
to a maximum number (three different versions were used for the limit: 500,
1000 and 1500).
3. One-side perceptron: similar to the margin perceptron, but this time the limit for
the false positives is set to 0.
4. Mapped version of the one-side perceptron.
5. Mapped versions of the margin perceptron, with the margin set to 1500, 2000
and 3500.
6. Linear voting One Side: a voting ensemble made up of 16 one-side perceptrons.
7. Linear voting M-1600: a voting ensemble made up of 16 margin perceptrons with
the margin set to 1500.
8. Cascade ensemble: a collection of the following algorithms: a one-side perceptron
with mapped features, a clustering algorithm based on centroids and, finally,
another one-side perceptron that classifies centroids.
In order to test their proactivity, a new formula is introduced that decreases the score
if more false positives are present:

ProactivityScore = Tp / (Tp + Fn + (1/k) · Fp), where k should be less than 0.5% (2.1)
Among all the algorithms, the best detection rate was obtained by the cascade ensemble,
followed by the mapped version of the one-side algorithm. However, the cascade ensemble
also had the largest false positive rate. The algorithm that obtained
the fewest false positives was the one-side perceptron.
Regarding proactivity, all of the algorithms proved to be stable for the first 4 weeks.
After this period, the proactivity dropped significantly.
2.3.3 Learning to Detect and Classify Malicious Executables in the Wild
Another approach for detecting malware using machine learning is described in [22].
Using a database that consists of 1971 clean executables and 1651 malicious files, the
authors tested several machine learning algorithms in order to see how well they
perform at malware detection. In order to extract features, each binary file was converted
to hexadecimal form and n-grams of 4 bytes were extracted, resulting in a total of
255,904,403 n-grams. Each n-gram was treated as a boolean feature (present or
not). Since the number of features was too large to work with, the next step was to select
only the most relevant ones. This was done by computing the Information Gain (Equation
2.2) for each n-gram and keeping only the 500 most relevant n-grams. The number of
n-grams to keep and the size of the n-gram were chosen empirically, based on the results
obtained by training different classifiers.
IG(j) = Σ_{vj ∈ {0,1}} Σ_{Ci} P(vj, Ci) · log( P(vj, Ci) / (P(vj) · P(Ci)) ) (2.2)

where Ci is the class of the file, vj is the value of the j-th feature, P(vj, Ci) is the
proportion of elements where the j-th feature has the value vj in the class Ci, P(vj) is the proportion
of elements where the j-th n-gram takes the value vj, and P(Ci) is the proportion of
elements that have the class Ci.
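Equation 2.2 can be computed directly from the dataset. The sketch below is illustrative; in particular, log base 2 is our assumption, since the base is not stated here (any fixed base only rescales the ranking).

```python
from collections import Counter
from math import log2

def information_gain(feature_values, classes):
    """Information gain of one boolean feature (Equation 2.2):
    sum over feature values v and classes C of
    P(v, C) * log( P(v, C) / (P(v) * P(C)) ), with empirical proportions."""
    n = len(classes)
    pv = Counter(feature_values)            # counts of each feature value
    pc = Counter(classes)                   # counts of each class
    pvc = Counter(zip(feature_values, classes))  # joint counts
    ig = 0.0
    for (v, c), count in pvc.items():
        p_joint = count / n
        ig += p_joint * log2(p_joint / ((pv[v] / n) * (pc[c] / n)))
    return ig
```

A feature that perfectly splits the two classes scores 1 bit; one that is independent of the class scores 0, which is exactly the ordering used to keep the top 500 n-grams.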
The training was done using algorithms found in the Weka framework [15]. A total of
four different algorithms were used: k-nearest neighbors, Naive Bayes, support vector
machines and decision trees. Each of these was tested in its standard form as well as in a
boosted form (using the AdaBoost algorithm). Testing was performed using 10-fold cross
validation.
The best results are obtained using boosted decision trees (99% detection rate and 0.05%
false positives). Among the non-boosted classifiers, the best results are obtained by the
support vector machines, which achieved a 99% detection rate and about a 0.07% false
positive rate.
A final contribution of this paper is the proactivity test of the selected classifiers.
The authors gathered another set of 291 samples and tested the previously obtained
models. This time, the boosted decision tree achieved a 94% detection rate with a
0.01% false positive rate. The support vector machine was the second best model, and
the first among the non-boosted algorithms, achieving an 82% detection rate with a
0.01% false positive rate.
Even though the results are very good, it is worth noting an observation stated by
the authors: none of the benign files were obfuscated or packed, while 25% of the
malicious files had this property. This means that it is possible that the machine
learning algorithms simply learned to predict this property.
2.3.4 Comparison of feature selection and classification algorithms in identifying malicious executables
An important contribution to malware detection using machine learning is provided by
the authors of paper [23].
The authors focus on analyzing how well different machine learning algorithms perform
at malware classification and which feature selection algorithms can be used in order to
obtain the best results.
The database used in the paper consists of 1074 benign files and 3680 malicious files
that were collected from an intrusion detection system. The files are transformed into
hexadecimal format and n-grams are extracted that will be used to form the features.
The authors tested different n-gram sizes, from 1 to 10 bytes. For each file, a feature
vector is created, where each feature represents the frequency of the n-gram in the file.
There are 7 feature selection methods tested:
1. Maximal difference: the new set of features consists of the top 50% of features
found in clean files and the top 50% of features found in malicious files.
2. Maximal normalized difference: similar to the maximal difference, but this time
the frequencies are normalized based on their standard deviation.
3. Maximal ratio: instead of using the difference, a ratio is used, or the difference of
the logarithms.
4. Maximal weighted ratio: similar to maximal ratio, but the result is multiplied
by the feature frequency. This way, the frequencies that appear more often will
have a greater importance.
5. Maximal Kullback-Leibler distance: based on the Kullback-Leibler divergence,
which measures the difference between two probability distributions. Since
this is not a symmetric measurement (so it cannot be used as a distance), the
formula is applied twice (once to measure the information lost from one distribution
to the other, and a second time in the opposite direction) and the results are
summed.
6. Bi-normal separation: based on the bi-normal separation algorithm [24], used
to determine the features that contribute the most to the true positive rate
and the least to the false positive rate. To obtain this information, the
naive Bayes algorithm is used, and the features analyzed were the 1-byte n-grams
(so a total of 256 features).
7. Forward stepwise selection and backward stepwise selection: features are added
one at a time to see whether each improves the model. The backward method starts
with all features and eliminates them one at a time to see whether removal improves
the model. Both of these methods are very time consuming.
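The symmetrised Kullback-Leibler measure from method 5 above can be sketched in a few lines; this is an illustrative version (the smoothing constant `eps` is an assumption, added to avoid division by zero and log of zero):

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised Kullback-Leibler distance KL(p||q) + KL(q||p), where p and q
    are a feature's probability distributions in the two classes."""
    def kl(a, b):
        # information lost when b is used to approximate a
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)
```

Features whose distributions differ most between the benign and malicious classes receive the highest scores.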
The classifiers tested are:
1. Naive Bayes classifier
2. Product classifier: multiplies the frequencies of the features that appear more in
clean files than in malicious ones and divides the result by the product of frequencies
of the features that appear more in malicious ones
3. Entropy classifier: the same idea, but the entropy of the features mostly found
in malicious files is subtracted from the entropy of the features mostly
found in benign files
4. Support vector machine classifier
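A toy version of the product classifier described above might look as follows; treating absent n-grams as a neutral factor of 1 is a simplification of mine, not something stated in the paper:

```python
def product_score(freqs, clean_grams, mal_grams):
    """Toy product classifier: multiply frequencies of n-grams more common in
    clean files, divide by the product of frequencies of n-grams more common
    in malware. A score above 1 suggests the file is benign."""
    num = den = 1.0
    for g in clean_grams:
        num *= freqs.get(g, 1.0)   # n-grams characteristic of clean files
    for g in mal_grams:
        den *= freqs.get(g, 1.0)   # n-grams characteristic of malware
    return num / den
```

In practice this is usually done with sums of logarithms to avoid numeric underflow over thousands of features.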
Each classification algorithm was tested using the features extracted through the seven
methods described above. For each combination, different versions were tested by
varying the number of features. Among all combinations except the support vector
machines, the product classifier obtained the best performance when used with
features extracted by the maximal weighted ratio or forward selection. The best feature
selection method (the one that on average gives the best results for all classifiers) is
the forward selection method, but it is also the most time consuming and classifier
dependent.
The authors also tested features that were created based on n-grams of 1, 2, 5 and 10
bytes in length. Only the first 1000 features with the highest frequency were kept. The
accuracy improvement was marginal when the feature size increased. Regarding the
number of features used, the best results were obtained by using at least 20 features,
but no more than 40.
The support vector machines were tested with 10, 20, 40 and 100 variables and 2
different selection methods: maximal distance and bi-normal separation. The best
detection rate using features selected according to the bi-normal separation method
was 94%, with a false positive rate of 1%, when the number of features was fixed at 40.
For the maximal distance, the detection rate improved from 85% to 95% when the
number of features was increased from 40 to 100. According to the authors, the support
vector machines provided the best results among all the algorithms.
2.3.5 Data mining methods for malware detection
The author of this PhD thesis [25] analyzed different methods for malware detection by
using op-codes found in binaries as features. The dataset is made up of 820 files, half
malicious and half benign. There are no encrypted or obfuscated files in the dataset,
since the author considered these not relevant for the classifiers.
Each file is represented as a feature vector where each feature is the frequency of an
opcode. There were a total of 1510421 features extracted, of which 62608 were
unique. Using a simple method for feature selection, called unary variable removal, the
author removed all the features whose frequency was below 10% of
the number of files. In addition to this, using a chi-square test of independence, which
tries to find a relation between a feature and the target class, a significant number
of features that did not prove to be related to the target class were removed. The total
number of features left is 633.
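The chi-square statistic used for this kind of feature/class independence filtering can be computed from a 2x2 contingency table; the sketch below is a generic illustration, not the thesis author's code:

```python
def chi_square(present_mal, absent_mal, present_ben, absent_ben):
    """Chi-square statistic for a 2x2 table: rows are classes (malicious,
    benign), columns are feature presence/absence. Larger values mean the
    feature is more strongly related to the class."""
    table = [[present_mal, absent_mal], [present_ben, absent_ben]]
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

Features whose statistic falls below the critical value of the chi-square distribution (for the chosen significance level) would be discarded.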
Using 70% of the data and the features extracted, three types of algorithms were
trained:
1. logistic regression
2. a multi-layer perceptron with 10 hidden units
3. a decision tree with a maximum depth of 30, where entropy was used as the split criterion
The best detection rate was obtained by the decision tree: 98.4%, with a rate of 4.9%
false positives. Similar results were obtained by the multi-layer perceptron (97.6%),
but with slightly fewer false positives (4.1%). One should note that since the testing
dataset is very small (about 250 files), the difference in percentage between the two
algorithms is equivalent to just one file.
2.3.6 Detecting unknown malicious code by applying classification techniques on opcodes representation
Instead of using n-grams that are based only on bytes, without any meaning, the
authors in [26] used n-grams based on the opcodes present in the binary file. The
paper is a study that tries to answer questions regarding the best machine learning
algorithms for malware detection, the best n-gram length and the number of features that
should be used. The dataset started with 7688 malicious files collected from the virus
heaven website and 22735 benign files. Since some of them failed to be disassembled,
the final database ended up with 5677 malicious files and 20416 benign files.
From each of these files, n-grams of different sizes (1 to 6) were extracted, each of
them made up of several opcodes. Depending on the n-gram size, the number of unique
features was 515, 39011, 443730, 1769641, 5033722 and 11948491, respectively. To find out
the right number of features that an algorithm should use, different sets were
made consisting of the top 50, 100, 200 and 300 features.
To order the features, a feature selection algorithm was needed. For this, the
authors tested three selection methods (document frequency, gain ratio, Fisher score).
Document frequency consists of computing the ratio between the frequency of an n-gram
in a file and the frequency of the n-gram that appears the most in that file.
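The document-frequency normalisation just described can be sketched as follows (an illustrative one-liner, not the paper's code):

```python
def normalized_frequency(counts):
    """Divide each n-gram count by the count of the most frequent n-gram in
    the same file, as in the paper's document-frequency measure. The most
    frequent n-gram then has value 1.0 and all others fall in (0, 1]."""
    peak = max(counts.values())
    return {gram: count / peak for gram, count in counts.items()}
```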
Using the resulting data sets, 8 different machine learning algorithms were tested using
the Weka framework: support vector machines, logistic regression, random forest, artificial
neural networks, decision trees, naive Bayes, boosted decision trees, and boosted
naive Bayes.
The authors performed many tests in order to find the best combination of n-gram
size, number of features, feature selection algorithm and machine learning algorithm.
In total, 1152 combinations were tested using 5-fold cross validation.
According to the results, the best n-gram size is 2, even though this provided
a lower detection rate than n-grams of size 3. The accuracy was better due to its
low rate of false positives. The authors also concluded that document frequency is the
best feature selection algorithm in terms of accuracy. Fisher score came very close to
document frequency, contributing to a higher true positive rate, but yielding more
false positives.
Among the classifiers, the worst were naive Bayes and its boosted version. The other
classifiers performed similarly on all n-gram lengths. When comparing the first 6
classifiers using n-grams of length 2 and a dataset of 300 features selected by document
frequency, the random forest and boosted decision tree yielded the best accuracy and
lowest false positive rate.
In the second part of the paper, the authors tested whether a dataset that consists of
n-grams of different sizes improves the accuracy. For this reason, two additional datasets were
created: the first one was created by selecting 300 features for each of the six n-gram
sizes, and the second by selecting the best 1800 features from all 6
datasets put together. Neither of these proved to be better than the dataset made up
of only n-grams of size 2.
Another interesting problem that the authors tested is the imbalance problem. According
to their hypothesis, training a classifier on a data set whose percentage
of malicious programs differs from what is found in the wild is detrimental. For
this reason, additional datasets were created where the percentage of malware varied
from 5% to 50% in steps of 5. All of the algorithms, except the neural network, had a
better false positive rate when tested on a dataset containing a percentage of malware
between 15% and 30%. This percentage is very similar to what is
found in the wild.
Finally, in order to highlight the importance of retraining the classifiers regularly, the
dataset was split with respect to the years it was collected (from 2000 to 2007). Each
classifier was trained on a specific dataset and tested on the one obtained from the
following year. All of the algorithms yielded a better accuracy when they were retrained.
An interesting observation is that in the final year (2007) the accuracy dropped
significantly, meaning there were many changes in the malware landscape.
2.3.7 A feature extraction method and recognition algorithm for detecting
unknown worms and variations based on static features
A different approach is used by the authors of [27]. The proposed solution for
malware detection (although the authors focus mainly on worms) consists of two steps:
the first one is to process the features in order to give them a score, and the second
one is to use all the scores previously computed to give a verdict (malicious or benign).
The feature scoring algorithm first normalizes each feature frequency by
dividing it by the maximum number of occurrences of that feature in the whole training
set. After that, a mean square deviation is computed for every feature:
D(A_i) = \sqrt{(E(A_i) - E(A_i^v))^2 + (E(A_i) - E(A_i^N))^2} \qquad (2.3)

where E(A_i) is the total mean frequency of the feature A_i (computed over the whole
training set), E(A_i^v) is the average frequency of the feature A_i in the benign files
and E(A_i^N) is the average frequency of the feature A_i in the malicious files.
All of the features are sorted based on this value and only a limited number of them
are used for the classification algorithm.
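The scoring of equation (2.3) translates directly into code; this is a minimal sketch of the formula, with argument names of my own choosing:

```python
import math

def feature_score(total_mean, benign_mean, malicious_mean):
    """Mean square deviation D(A_i) from equation (2.3): features whose
    per-class mean frequencies both differ strongly from the overall mean
    frequency score highest and are kept for classification."""
    return math.sqrt((total_mean - benign_mean) ** 2 +
                     (total_mean - malicious_mean) ** 2)
```

A feature whose frequency is identical in both classes scores 0 and would be discarded first.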
The malware classification part works by computing, for every feature, how probable
it is to be malicious or benign. It does this by comparing its frequency to its mean
frequency across the malicious and benign datasets. Depending on which of the classes
the feature is closer to, its mean square deviation is added to one of two control
variables (one for benign and one for malicious). Based on these control variables and
a counter that is incremented each time a suspected feature is detected, the algorithm
makes a decision.
The decision also takes into consideration a threshold, called the warning level, that can be
adjusted in order to make the algorithm more aggressive or less prone to false positives.
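A toy rendering of this two-control-variable decision rule is given below; the names and the exact way the warning level enters the decision are my assumptions, since the paper only describes the mechanism informally:

```python
def classify(features, stats, warning_level):
    """Toy control-variable decision rule. `features` maps feature name to
    the (normalized) frequency observed in the file; `stats` maps feature
    name to (benign_mean, malicious_mean, deviation)."""
    benign_score = malicious_score = 0.0
    suspected = 0
    for name, value in features.items():
        benign_mean, malicious_mean, deviation = stats[name]
        if abs(value - malicious_mean) < abs(value - benign_mean):
            malicious_score += deviation   # feature looks closer to malware
            suspected += 1                 # counter of suspected features
        else:
            benign_score += deviation
    return malicious_score > benign_score and suspected >= warning_level
```

Raising `warning_level` makes the rule demand more suspected features before flagging a file, trading detection rate for fewer false positives.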
To test the algorithm, a very small dataset was constructed of 120 benign files and 100
malicious files, from which 22 static and dynamic features were extracted. With the
threshold set to high, the true positive rate obtained is 95.71%, while the false positive
rate is 2.86%.
2.3.8 An interpretable string based malware detection system using SVM
ensemble with bagging
A different kind of approach was used by the authors of [28]. Instead of using n-grams
of bytes or opcodes, as other authors did, this time strings were used as features. To
collect these, the system needs a parser that unpacks the executables, parses the PE
header, and extracts function names and every string made of printable characters.
The features obtained are used to train a support vector machine with bagging. Bagging
means that many bootstrap samples are extracted from the training
set and an SVM is trained on each of them. The final result is given by the vote of the
majority of the classifiers.
The authors also propose to use the system for classifying the samples into more than
just 2 classes. This can be achieved by training k(k-1)/2 classifiers, where each one
of them distinguishes two classes.
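The k(k-1)/2 count comes from enumerating every unordered pair of classes (one-vs-one decomposition); a minimal sketch:

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """Enumerate the k(k-1)/2 pairwise classifiers needed to separate
    k classes with two-class models such as SVMs."""
    return list(combinations(classes, 2))

# 5 classes (as in the paper's dataset) -> 10 pairwise classifiers
pairs = one_vs_one_pairs(["benign", "backdoor", "spyware", "trojan", "worm"])
```

At prediction time each pairwise model casts a vote and the class with the most votes wins.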
For the testing phase, the authors gathered 39838 executables, of which 8320 are
benign and 31518 are malicious. From this set, 9838 were selected and used for training.
The classes of the files are: 2320 benign, 1936 backdoors, 1769 spyware, 1911 trojans and
1902 worms. In the first step, 13448 interpretable strings were extracted, but since
most of them had a very low frequency, only the first 3000 were kept according to the
max-relevance algorithm.
Based on the data, a voting system made of 100 support vector machines was trained
using bootstrap samples of 1230 executables. The resulting classifier was tested
against different measurements.
First, the detection rate on 1000 executables not used in training was compared with 4
well known antivirus products. The proposed system detected all of the 1000 malicious
files, while none of the antiviruses achieved this.
To test how well the system would behave on malicious files not related to the
malware families found in the training set, the authors gathered 500 new samples and
again compared the results with 4 well known antivirus products. The detection rate
achieved was 93.8%, while the best antivirus product obtained only 70%.
For the false alarm test, the authors selected 2000 samples (1000 benign files
that are mostly Chinese software and 1000 malicious) and tested the product against
4 known anti-virus products. The false positive rate was 1.6%. It is worth noticing
that the anti-viruses tested do not focus on the Chinese market, so their algorithms
might not have been trained on Chinese software.
A final test provided by the authors reveals that their proposed method behaves much
better than a single support vector machine, naive Bayes with bagging and decision
trees with bagging.
2.3.9 Efficient virus detection using dynamic instruction sequences
The main contribution of the paper [29] to malware detection is the way features
are chosen. Instead of using n-grams of bytes, strings or opcodes, the authors use
relations between assembly instructions. For this reason, the process of selecting features
starts with monitoring the dynamic execution of the file. This can be done in an efficient
manner by using virtual machines or hooking mechanisms. The authors report that the
method they propose can extract almost 6000 instructions per second. The assembly
instructions extracted will be only those executed and, most of the time, just a fraction
of all the instructions in the file. This is what the authors call logic assembly.
Doing so also allows the system to capture self modifying code.
Based on the logic assembly, an abstract assembly is extracted that consists of a collection
of associations between instructions. There are three types of associations
extracted, each one more restrictive than the previous. The type1 relation does
not consider instruction order; the type2 relation considers instruction order, but instructions do
not need to be consecutive (garbage instructions can appear between them); and type3 is
very similar to opcode n-grams, in which instruction order is considered and instructions
must be consecutive.
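The three relation types can be sketched for instruction pairs as follows (an illustrative reconstruction of the definitions above, not the authors' code):

```python
from itertools import combinations

def type1_relations(ops):
    """Unordered co-occurrence: instruction order is ignored entirely."""
    return {frozenset(p) for p in combinations(set(ops), 2)}

def type2_relations(ops):
    """Ordered but not necessarily consecutive pairs: tolerates garbage
    instructions inserted between the two opcodes."""
    return {(ops[i], ops[j])
            for i in range(len(ops)) for j in range(i + 1, len(ops))}

def type3_relations(ops, n=2):
    """Consecutive opcode n-grams: the most restrictive form."""
    return {tuple(ops[i:i + n]) for i in range(len(ops) - n + 1)}
```

For the sequence push, nop, call, the pair (push, call) is a type2 relation (the nop between them is tolerated) but not a type3 one, which illustrates why type2 survives simple junk-instruction obfuscation.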
The models trained use the abstract assembly (relations) as features. In order to select
only features that are important to classifiers (those that are frequent and representative
of a class), the Apriori algorithm was used. The minimum support parameter
of the algorithm (the minimum percentage of basic blocks that contain the instruction
sequence) was set to different values in the training phase. To detect whether an instruction
block is representative, the frequency of that block is measured among the benign files
and the malicious files. If the ratio of the frequencies found in those classes is large, then the
feature is chosen to be used in classification.
The classification algorithms used are support vector machines and a decision tree (where
entropy is used as the split criterion). The dataset consists of 267 malware files and 368
benign files. All tests were carried out using 5-fold cross validation.
The parameters that were tested are:
- the number of features to be used: the algorithms were tested using 10, 20, 50, 100, 200 and 300 features.
- the support level for the Apriori algorithm: the tested values were 0.003, 0.005 and 0.01.
- the number of instructions to capture: the tested values were 1000, 2000, 4000, 6000 and 8000.
- the type of relations to be used as features: all 3 types of relations were tested.
Of the two models tested, the SVM obtained on average a slightly better accuracy
than the binary decision tree (0.911 accuracy). An increase in the number of features
translated into an increase in accuracy; however, beyond 200 features, the increase was
insignificant. The best number of instructions to capture seemed to be 2000.
Any value greater than that translated into a slight decrease in accuracy. This may
be due to the fact that any increase in the number of instructions captured translates to
more noise. For the same reason, the support level value of 0.01 provided the best
results. Regarding the feature types, the type2 features (where instructions need
to be ordered, but not consecutive) proved to be the best suited across almost all
configurations. The reason for this might be that they are restrictive enough to avoid being
confused with noise, yet can still bypass obfuscation techniques. This is in contrast to
the n-grams that other researchers are using.
2.3.10 A static malware detection system using data mining methods
In order to detect malicious files, the authors in [30] extracted features that are mainly
found in the PE header. For this reason, they designed an application, named pe-miner,
that parses the pe-header and extracts different values. The file is also parsed in order
to extract imported dlls, function names and how frequently these are called.
Every feature that was gathered is sorted using the information gain and only a fraction
of them are kept. To further transform and reduce the feature set, the authors apply
a Principal Component Analysis algorithm.
The authors test three classifiers (a support vector machine, a naive Bayes and a binary
decision tree) using the Weka framework. The dataset used for training and testing is
made up of 247348 windows executable files, of which 236756 are malicious and 10592
benign. For the executables that come in compressed or protected form, an unpacked
version is obtained. The tests are performed using 10-fold cross validation. There were
several tests carried out with each kind of features (pe-header information, dlls and api
calls) and, of course, tests that involved using all features. The best detection rate was
obtained by the binary decision tree (99.5% with 2.7% false positive rate) when using
features extracted from the pe header. Using all features proved to decrease the number
of false positives for the support vector machine, but was detrimental for the decision
tree, as the detection rate dropped by 2 percentage points. The same behavior was
observed for the naive Bayes algorithm.
In order to see if the principal component analysis brings an advantage, the authors
tested the algorithms with the features before and after PCA was applied. Each time,
the true positive rate increased, but on average only by 0.7%.
2.3.11 Large-scale malware classification using random projections and
neural networks
A very interesting paper that deals with large datasets is described in [31]. The aim of
the authors is to be able to classify files as malicious or benign. A second purpose is
to classify many of the malicious files into different families. A set of 2.6 million files,
of which 1843359 malicious and 817485 benign, was gathered. Among these, there are
also samples that belong to 134 important malware families.
The feature extraction method uses a lightweight virtual machine that collects different
information, such as:
- null terminated patterns (which usually translate to strings)
- n-grams of size 3 of consecutive api calls (the authors call these trigrams)
- calls to api functions and their arguments
Extracting these data from the dataset yielded 50 million features, a number too big to be
processed by current hardware. The next step is to apply a feature selection algorithm.
The feature selection works in two steps. The first step was to compute the mutual
information [32] of each feature with respect to all 136 classes. Based on the results,
3000 features were selected that were representative for every malware family class and
another 120000 features for each of the two generic classes (malicious and benign). By
eliminating the common ones, approximately 179 thousand features resulted. Again,
this number was too big to work with.
To further reduce this number, the authors used a technique called Random Projection
[33] that transforms a vector space into a smaller one while preserving sparsity. After
this step, the number of features was 4000.
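A dense toy version of a random projection is sketched below; the random sign matrix is one common choice, my assumption rather than the paper's exact construction (which uses a sparsity-preserving variant, and typically a 1/sqrt(out_dim) scaling):

```python
import random

def random_projection(vector, out_dim, seed=0):
    """Project a high-dimensional feature vector down to out_dim dimensions
    by multiplying it with a deterministic random +/-1 matrix (generated
    on the fly from the seed instead of being stored)."""
    rng = random.Random(seed)
    out = [0.0] * out_dim
    for value in vector:
        for j in range(out_dim):
            out[j] += value * rng.choice((-1.0, 1.0))
    return out
```

Because the matrix is regenerated from the seed, every sample is projected consistently without keeping the 179000 x 4000 matrix in memory.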
For classification, the authors proposed an ensemble of machine learning algorithms.
First, in order to find non-linear correlations between features, a feed forward neural
network is used. Secondly, to classify the samples, a logistic regression algorithm that
uses a softmax function is applied.
The test was carried out using an additional set of 1.1 million files. 10% of the original
2.6 million records were used for validating the neural network. The system was trained
for 40 epochs with different variations in the number of layers (1, 2 or 3). The learning
rate was set to 0.03 and, to decrease training time, a momentum of 0.9 was used.
The numbers of hidden units in the hidden layers were 1536, 2048 and 1024,
respectively.
Different versions of the ensemble are trained: one in which no neural network or
random projection is used, one in which random projection is used and, finally, one in
which all of the algorithms are used, but with different numbers of layers for the neural
network.
Surprisingly, the best results were obtained by the combination of the random projection
and a neural network of just 1 layer: a false positive rate of 2.41% and a test error
of 12.37%; the two-class error rate (if malware files that were labeled by the family
they belong to were labeled just as malicious) was 0.5%.
The authors claim that they succeeded in reducing the two-class error rate to 0.42%
by using a voted ensemble of 9 neural networks.
Different configurations of the random projection algorithm were also tested. The results
show that if 8000 output features were used instead of 4000, the logistic regression
alone would have behaved better. However, the system as a whole behaves slightly
worse, yielding a higher two-class error rate.
2.4 Detection of non-executable files
Most of the papers that deal with malware classification problems take into consideration
only executable files. Malicious JavaScript code detection, which can be found in
both webpages and pdf files, has started to be addressed only recently.
Since these kinds of malware usually come in obfuscated forms, most of the papers
address the problem by using an emulator or something similar that can run the code
and deobfuscate it. However, this tends to limit the system to only small datasets, as
running the file can take considerable time.
An important contribution to this field are the papers [34] and [35], which explain how
the wepawet service [36], a well known system used to identify whether a file is malicious
or not, works. The authors deal with very large datasets and provide the architecture
of a system that successfully identifies malware using both static and dynamic features.
2.4.1 Caffeine Monkey: Automated Collection, Detection and Analysis of
Malicious JavaScript
A good classification of JavaScript obfuscation techniques can be found in [37]. The
authors present them in order of complexity, starting with the ones that are simplest
to detect. The techniques discussed, in order of appearance, are:
1. white space randomization
2. addition of random comments
3. string obfuscation using simple encryption techniques, like a xor operator or
a Caesar cipher, or more complex ones like RC4, string splitting, or writing strings
in hexadecimal
4. variable name randomization and function pointer reassignment (a new name is
assigned to a standard function, e.g. unescape)
5. integer obfuscation: writing numbers as the result of mathematical operations in
order to hide certain numbers with a special meaning
6. block randomization: some statements are changed to equivalent ones (e.g. a
for loop is written as a while loop)
By analyzing both malicious and benign scripts, the authors observed that the ratio
between object instantiations, use of eval, string instantiations and document writes
is very different for malicious files versus benign ones. Benign scripts use more
document.write() calls, while malicious ones rely more on string instantiations,
mainly as a result of string obfuscation. The authors also provide a tool, called Caffeine
Monkey, that hooks certain functions in order to help analysts deobfuscate malicious
scripts.
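Measuring these call ratios statically can be approximated with simple pattern counting; the marker list below is illustrative (the actual study instrumented the interpreter rather than matching text):

```python
import re

def call_counts(script):
    """Count occurrences of a few call markers whose relative frequencies
    separated benign from malicious scripts in the study."""
    markers = ("eval", "document.write", "String.fromCharCode")
    return {m: len(re.findall(re.escape(m), script)) for m in markers}
```

The per-marker counts can then be turned into ratios and compared against the profiles of known-benign and known-malicious scripts.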
2.4.2 The Rise of PDF Malware
An analysis of the use of PDF files as attack vectors is provided by the authors of [38].
They make the observation that even though the number of vulnerabilities found in the
popular pdf reader, Adobe Reader, is close to the number reported for the Microsoft
Office products, pdf documents seem to be more frequently used in attacks. While
at the beginning of 2007 the attacks that used documents were 100% based on office
documents, at the beginning of 2010 these accounted for only 5%, while the rest are
mostly attributed to pdf documents. The main reason for this change is that the pdf
format is simpler and using it can target more products (browsers).
The authors point out that the distribution of these attacks can be split into 3 categories:
1. mass-mailing: the pdf documents come as an attachment to an e-mail. Usually
the attacker also uses social engineering to increase the chances of the document
being opened
2. drive-by downloads: the pdf is silently downloaded when the user browses a web
page. This is usually accomplished by the use of exploit kits
3. targeted attacks: multiple vulnerabilities or zero-day vulnerabilities are
used in order to infect important targets
Even though most of the vulnerabilities used rely on JavaScript (so deactivating the
use of JavaScript will make the browser more secure), there are also some exploits that
incorporate other elements in the document, like a flash file. This way, by the use of
action script in the flash file, the attackers have almost the same flexibility as using
JavaScript.
Another contribution of the paper is the analysis of different obfuscation techniques
used in pdf malware. Besides the obfuscation methods employed in the
JavaScript, there are additional techniques that make use of the architecture of the pdf
format:
- the pdf format is not that restrictive, so malware authors can insert special
buffers at different locations in the file without making it invalid
- the payload can be split into different objects that are scattered inside the document
and, when needed, are referenced by name
- the pdf format supports compression and encoding, and the attackers make use of
these features by creating streams that use many nested encoders
- malware authors encrypt the payload using the AES or RC4 algorithms in order to
restrict anti-virus access to the payload
2.4.3 Detection and Analysis of Drive-by Download Attacks and Malicious
JavaScript Code
In the paper [34], the authors propose a system, called JSAND, that extracts dynamic
features from JavaScript files in order to learn normal behavior and detect any
anomalies that are specific to malicious JavaScript code. To do so, the authors observe
the steps that a drive-by attack usually takes and create features that can detect them.
The features are split into 4 categories:
1. Redirection and cloaking: many exploits redirect victims to different domains in
order to hide their traces. The following features are extracted in order to detect
this behavior: number of redirections, targets of redirections (domain names),
browser based differences (the suspect domain is visited twice with different
browsers to see if it changes its behavior)
2. De-obfuscation: most of the attacks come in an obfuscated form. To detect this,
the following features are extracted: number of calls to functions that process
strings (e.g. fromCharCode), number of calls to functions used for dynamic code
execution (e.g. eval), length of the code that is dynamically executed
3. Environment preparation: most of the exploits rely on memory corruption. To
be successful, in most cases large chunks of memory need to be allocated.
To detect this behavior, the following features are extracted: number of bytes
allocated, number of strings that are longer than 256 bytes and contain unprintable
characters
4. Exploitation: many exploits target vulnerabilities in ActiveX objects or other
browser plugins. To monitor this behavior, the following features are extracted:
number of instantiated components (plugins or ActiveX), values of attributes and
arguments in method calls (large strings or large integers), sequence of method
calls by instantiated objects
By analyzing hundreds of exploits the authors concluded that these features are robust
and every working exploit uses at least two of them.
To build a model that will be used for anomaly detection, the authors use several
algorithms from the libAnomaly library:
1. Token finder: tests whether the values of a feature fall outside an enumeration.
This is especially useful for detecting abnormal values for arguments
2. String length and occurrence counting: this helps identify the usual length of
strings or events that happen more often than normal (for example: memory
allocation)
3. Character distributions: identifies the normal distribution of characters in strings.
Since most strings in JavaScript are human readable, any encoded string will be
detected
4. Type learner: this model was developed by the authors. It tries to identify the
normal type of an argument. For example, this will detect a script that uses
a very large integer instead of a boolean value
The scores of all these models are summed and compared to a threshold.
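The final decision step is a simple score aggregation; a minimal sketch, with the individual models represented as placeholder callables:

```python
def anomaly_score(sample, models):
    """Sum the per-model anomaly scores (models are placeholder callables
    standing in for the libAnomaly models described above)."""
    return sum(model(sample) for model in models)

def is_malicious(sample, models, threshold):
    """Flag the sample when the combined anomaly score exceeds the threshold."""
    return anomaly_score(sample, models) > threshold
```

The threshold plays the same role as in the earlier systems: raising it trades detection rate for fewer false positives.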
To test the system, the authors gathered 140000 files that were grouped into different
sets:
1. a dataset consisting of 11215 benign files gathered from top ranked web pages
2. four known malware datasets:
- urls found in spam e-mails that deliver drive-by downloads
- websites that were infected through sql injection
- malware found on forums or specialized exploit websites
- malware submitted by users
3. two uncategorized datasets:
- pages obtained by crawling
- unrecognized data submitted by users
To test the false positive rate, the authors split the benign dataset into three parts, trained
the model on one portion, used the second to adjust the threshold and the third
for testing. No false positives were detected. However, when the authors tested
the model on data obtained from crawling, a false positive rate of 0.0135 was obtained.
The tests on the known malware dataset reported a detection rate of 98.9%.
2.4.4 Prophiler: a Fast Filter for the Large-Scale Detection of Malicious
Web Pages
One of the disadvantages of the previously discussed system is that it needs to run the
file in order to tell whether it is malicious or not. However, when dealing with large
datasets, this tends to take a lot of time. For this reason, the authors proposed another
classifier that works as a prefilter for the previous one. The system, called Prophiler
and presented in [35], works by statically analyzing samples and classifying them using
different machine learning algorithms.
In order to do so, the authors extracted 77 different static features that describe the
sample file from different angles:
- HTML features: mainly statistical information that can be extracted from the
HTML code. This totals 19 features, among them the number of script tags, the
number of hidden elements or of objects that occupy a very small surface of the
page, the number of elements that contain suspicious content (more than 128
characters in length with less than 5% whitespace), or the number of characters.
- JavaScript features: 25 features containing statistical information extracted from
code written in the JavaScript language. Among them are the number of
occurrences of keywords found in much script-based malware (eval, settimeout,
setinterval), the entropy of strings and scripts, or the whitespace percentage.
- URL and host features: 33 features extracted from the URL or from the
information associated with it. For example, there are syntactical features, like
the domain length, the TLD, or the presence of suspicious patterns or port numbers,
and features extracted from the who-is and DNS information, like the IP,
the TTL of the A record, or its GeoIP information.
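A few of these statistical features can be sketched as follows; the keyword list and the exact feature set shown are an illustrative subset, not Prophiler's actual implementation:

```python
import math

def shannon_entropy(s):
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def extract_features(script):
    """A small, Prophiler-style set of statistical JavaScript features."""
    keywords = ("eval", "settimeout", "setinterval")
    lowered = script.lower()
    return {
        "length": len(script),
        "entropy": shannon_entropy(script),
        "whitespace_pct": sum(ch.isspace() for ch in script) / max(len(script), 1),
        **{f"count_{k}": lowered.count(k) for k in keywords},
    }

features = extract_features("eval(unescape('%41%42'));setTimeout(f,10)")
print(features["count_eval"], features["count_settimeout"])  # 1 1
```

High entropy combined with long, low-whitespace strings is exactly the "suspicious content" signal described above, since obfuscated payloads tend to exhibit both.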
To test how the system behaves, the authors gathered a database consisting of 787
malicious web pages that deliver drive-by download attacks and 51,171 benign web
pages obtained by crawling the first two levels of the top 100 Alexa web sites.
The machine learning algorithms tested were Naive Bayes, random forest, decision trees
and logistic regression. The algorithms were first tested separately with each set of
features previously created in order to see which one brings the best results. Testing
was done using 10-fold cross validation. For the HTML features, the best model was
the binary decision tree, which yielded a true positive rate of 63% and a false positive
rate of 0.8%, while on the JavaScript features the binary decision tree obtained a true
positive rate of 78.6% and 0.4% false positives. The random forest algorithm also had
good results, obtaining a higher detection rate (81.9%) but also a slightly higher false
positive rate (0.5%). When the algorithms were used with the URL features (which were
the most numerous), the best algorithm, the binary decision tree, obtained a surprising
90.4% detection rate and only a 0.6% false positive rate.
By analyzing the results for each case, the authors concluded that the best features
are the shellcode presence probability, the presence of a decoding routine, the maximum
string length and the entropy of the strings. For the JavaScript features, the best seemed
to be the number of URLs, the number of iframes and the whitespace percentage. Finally,
for the URL features, the TLD of the domain, the absence of a subdomain in the URL
and the TTL of the host's DNS record provided the best separation.
Using all the features, the system was also tested on a validation dataset containing
153,115 pages, of which 139,321 were benign and 13,794 malicious. This test yielded a
false positive rate of 10.4% and a false negative rate of 0.54%. The analysis shows that
these errors are primarily due to the HTML features, followed by the JavaScript features.
Finally, a large-scale evaluation was carried out. The system was tested over a 60-day
period on an unlabeled dataset consisting of 18,939,908 web pages. Prophiler
labeled 14.3% of them as malicious, of which 13.7% were false positives. For the false
negative rate, 1% (162,315 pages) of the dataset was analyzed using dynamic analysis and
the results were compared to those of Prophiler. In this case, only 3 samples were missed
(0.0018%).
2.4.5 Spyproxy: Execution-based Detection of Malicious Web Content
An idea proposed in [39] is to use a proxy that filters all traffic through a virtual
machine. The virtual machine runs any code and, if it is malicious, stops it before
it reaches the user. In order to minimize the performance impact the user sees, the
page is first statically analyzed, and only if it is marked as suspicious will it be run.
The static analysis simply searches for any content that might change dynamically, such
as ActiveX. Non-HTML objects are also treated as dangerous since they might contain
vulnerabilities. By analyzing a 17-hour trace of web requests, the authors concluded
that 54.8% of the HTML pages requested do not contain dynamic content. The case in
which there is a pure HTML exploit can only be detected if static processing is disabled.
The virtual machine that processes web pages monitors the effects that visiting a ma-
licious web page can produce. This can be a crash of the browser, a modification of
registry keys, or the creation of a file on disk in an unusual location.
To test how well it behaves in a real-life situation, the authors collected 100
malicious web pages from 45 distinct websites. Spyproxy achieved a 100% detection
rate. No information is given about false positives.
The system's purpose is to protect the user, and it cannot be used for large-scale dataset
classification since it must run many of the samples. Also, Spyproxy can be bypassed by
using sleep calls or other non-deterministic methods.
2.4.6 Zozzle: Low-overhead Mostly Static JavaScript Malware Detection
The authors of [40] propose a malware detection system that runs in a browser
and classifies all rendered HTML content. To do so, the system uses the Detours binary
instrumentation library to intercept the "Compile" function from jscript.dll. This
function is used by the Internet Explorer browser each time a new script is executed
through an iframe or script tag, or even through the JavaScript keyword eval.
In order to perform classification, the authors propose a method to automatically extract
features. Each feature consists of two components:
- the context in which it appeared: in a loop, a try/catch block, etc. Different levels of
context are used: for flat features, no context is used, while for n-level features,
the whole abstract syntax tree is used.
- a string from the abstract syntax tree (only expressions and variable declarations
were selected)
The features are boolean (present in or absent from the file).
Since parsing a JavaScript file can yield many features, only the ones that are repre-
sentative of their class are selected, using the χ² algorithm.
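The χ² selection step can be illustrated with the standard 2x2 contingency-table statistic; the counts below are invented for the example:

```python
def chi_squared(n_pos_with, n_pos, n_neg_with, n_neg):
    """Chi-squared statistic for a boolean feature against a binary class,
    computed from a 2x2 contingency table (presence/absence vs class)."""
    table = [
        [n_pos_with, n_pos - n_pos_with],   # malicious: with / without feature
        [n_neg_with, n_neg - n_neg_with],   # benign:    with / without feature
    ]
    total = n_pos + n_neg
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    row_totals = [n_pos, n_neg]
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            if expected > 0:
                stat += (table[r][c] - expected) ** 2 / expected
    return stat

# A feature present in 90/100 malicious files but only 5/100 benign ones
# scores far higher than one present in roughly half of each class.
print(chi_squared(90, 100, 5, 100) > chi_squared(55, 100, 45, 100))  # True
```

Ranking features by this statistic and keeping the top scorers is the selection idea; the thresholds Zozzle actually uses are not detailed here.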
For classification, the authors use the naive Bayes algorithm. Even though this produces
an equation whose evaluation is linear with respect to the features used, the authors
suggest that string comparison should be done using an automaton in order to reduce
processing time.
In order to find out how the algorithm behaves with different settings, the authors test
it using a dataset made up of 919 malicious (deobfuscated) scripts and 7,976 benign
files gathered from the top 50 Alexa websites. Three sets of automatic features are created:
one containing 948 flat features, one containing 1,589 1-level features, and another one
made up of 2,187 n-level features. A set of 89 manual features was also created in order
to highlight the benefit of using automatic features.
In terms of accuracy, the system scored better with automatic features than with manual
features in all training cases. An observation made by the authors is that the manual
features seem to be dependent on the size of the context, while the automatic ones are
not. The false positive rate when trained with automatic features is also significantly
lower (between 0.01% and 0.02%) than when trained with manual features (between
1.5% and 4.5%). Practically, when using automatic features, false positives are almost
nonexistent. However, the results are the opposite when comparing the false negative
rate (between 5.84% and 11.08% for the automatic features versus between 1.26% and
5.14% for the manual ones). Among the three context sizes, the 1-level features (in
which only one level of the syntax tree is kept) obtained the best results.
The authors also tested how much of the training set needs to be used in order to
obtain a good classifier. For this purpose, they split the dataset into chunks of different
sizes, from 1% to 25%, and carried out different tests. The results show that a training
dataset of 10% of the original one is sufficient.
The same question was asked about the number of features to be used. The authors
performed different tests using various numbers of features, from 50 to 1,300. Using
fewer than 200 features led to poor results, while using more than 600 does not improve
the accuracy.
To highlight the low overhead the system produces, the processing time of the scripts
was also measured. Depending on the number of features, the system classifies a sample
in from 1.3 milliseconds (for 30 features) to about 7 milliseconds (for 1,300 features).
2.4.7 Pattern Mining for Future Attacks
Using data mining techniques, the authors of [41] try to build a statistical method for
identifying significant groups of tokens that occur together in malicious and benign
scripts.
Each labeled file is parsed and a set of transactions is extracted. Depending on
the class of the file, these can be positive or negative. A transaction is a collection of
items over a finite alphabet; in this situation, keywords that are relevant to JavaScript
classification.
The authors define an itemset as a collection of at most n features, where n is usually
set to 5 or 6. Each itemset has a frequency within the dataset. The first step is to select
only frequent itemsets. This is done using the Apriori algorithm with a threshold of
S/2^n, where S is the number of transactions of the itemset's class. The next step
is to reject all itemsets that have a frequency lower than the noise in the dataset. To
do so, a probability distribution is assigned to each itemset, where the probability is
computed by dividing the frequency by the total number of transactions. The authors
consider an itemset to be frequent if its frequency is greater than K/2^n, where K is
the number of transactions. In order to make the threshold more flexible, the authors
also provide a more complex formula that can be tuned using an input value. The
probability and the transaction form an item generation model (IGM).
After the transaction dataset is filtered, the next step is to create a mixture of IGMs
for each of the positive and negative transaction sets, where each IGM is assigned a
coefficient. The learning of these coefficients is done using the Expectation Maximization
algorithm.
To classify a transaction, its likelihood is computed under the two obtained models,
and the transaction is assigned a class accordingly.
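The likelihood comparison can be sketched as follows, assuming transactions are sets of items and each IGM is represented as a (coefficient, item-probability) pair; the smoothing constant and toy mixtures are assumptions for illustration:

```python
import math

def mixture_log_likelihood(transaction, mixture):
    """Log-likelihood of a transaction (a set of items) under a mixture of
    item-generation models, each given as (coefficient, item -> probability)."""
    likelihood = 0.0
    for coeff, item_probs in mixture:
        component = 1.0
        for item in transaction:
            component *= item_probs.get(item, 1e-6)  # smoothing for unseen items
        likelihood += coeff * component
    return math.log(likelihood)

def classify(transaction, malicious_mixture, benign_mixture):
    """Assign the class under whose mixture the transaction is more likely."""
    lm = mixture_log_likelihood(transaction, malicious_mixture)
    lb = mixture_log_likelihood(transaction, benign_mixture)
    return "malicious" if lm > lb else "benign"

# Toy single-component mixtures over two JavaScript-related items.
mal = [(1.0, {"eval": 0.9, "unescape": 0.8})]
ben = [(1.0, {"eval": 0.1, "unescape": 0.05})]
print(classify({"eval", "unescape"}, mal, ben))  # malicious
```

In the actual system the mixtures have several components whose coefficients are fit with Expectation Maximization; the fixed coefficients here only illustrate the final comparison step.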
To test the models, a training set of 13,500 pieces of JavaScript snippets is collected, of
which 12,704 are benign and 839 are malicious. An additional dataset, for testing, is
created that consists of 26,900 pieces of JavaScript code, of which 453 are malicious and
26,479 are benign. Among this dataset there are also two unknown zero-day vulnerabil-
ities. The set of features used consists of 34 important strings that are usually found
in malicious and benign JavaScript code. Using these features, the average transaction
size for the training dataset is 11, while for the testing dataset it is 6. After computing
the two models, the framework achieved a 90% precision rate and a recall of 70%. It
also succeeded in identifying the two unknown vulnerabilities.
2.4.8 Fast Plagiarism Detection System
Clustering files based on similar code has also been addressed by other authors in order
to detect plagiarism. To speed up plagiarism detection, the authors of [42] use a
different approach in order to reduce the time needed to compare every pair of files.
The first step is to tokenize every file and transform it into a string of tokens. The
next is to build a suffix array that contains all token suffixes found in the collection of
documents. Since the array is sorted, searching for a string in this structure requires
only a binary search.
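A minimal sketch of the token suffix array and its binary search; the tokenizer is assumed to have already run, and the naive sort-by-slice construction stands in for a proper suffix-array algorithm:

```python
def build_suffix_array(tokens):
    """All suffixes of the token sequence, sorted lexicographically,
    stored as start indices (naive O(n^2 log n) construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, pattern):
    """Binary-search the sorted suffixes for those that start with `pattern`."""
    m = len(pattern)
    def prefix(i):
        return tokens[i:i + m]
    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # lower bound
        mid = (lo + hi) // 2
        if prefix(suffix_array[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    lo, hi = start, len(suffix_array)
    while lo < hi:                      # upper bound
        mid = (lo + hi) // 2
        if prefix(suffix_array[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return [suffix_array[i] for i in range(start, lo)]

tokens = ["for", "i", "=", "0", ";", "i", "<", "n"]
sa = build_suffix_array(tokens)
print(sorted(find_occurrences(tokens, sa, ["i"])))  # [1, 5]
```

Because the suffixes are sorted, all occurrences of a pattern form a contiguous range, so each lookup costs two binary searches rather than a scan over every file.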
For every file given as input, the algorithm extracts non-overlapping strings containing
different numbers of tokens and searches for them in the suffix array in order to obtain
the files that contain each substring. A similarity score is computed as the number of
matched tokens divided by the total number of tokens in the file. If, during a search, a
file matches the same substring multiple times, the size of the matched string is extended
and only the largest one is kept. On the other hand, if multiple strings in the input file
are matched by the same string in the test file, only the largest string is kept.
Running the algorithm for each file yields sets of similar files. Since a different score
can be obtained when comparing a file "a" to a file "b" than when comparing the same
two files in reverse order, the similarity between two files is given by the average of their
two similarity scores.
In order to evaluate the system, the authors gathered 220 Java programs from
undergraduate students. By running different well-known plagiarism detection systems
(MOSS, JPlag and Sherlock), the authors considered a file to be plagiarism if at least
two out of the three mark it. Of the 155 files marked, the system successfully flagged
115 of them, a result that is better than any of the three systems run alone.
2.4.9 JPlag: Finding plagiarisms among a set of programs.
Plagiarism detection for programming code was also addressed by the authors of [43].
Their system, called JPlag, finds pairs of files, as well as zones of code, that are considered
to be similar.
The first step of the algorithm is to tokenize every file. The next step is to try to
cover the set of tokens from one file with as many tokens as possible from the other.
The percentage achieved is the similarity value. This process is carried out using an
algorithm called greedy string tiling, which works in two phases:
1. Find the longest match of tokens by iterating over every token from the first file
and comparing it to every token in the second file. If a match is found, it is extended
in order to get a match as big as possible.
2. All matches from phase 1 are marked (tokens from the source that matched will
not be used again).
The algorithm runs until no match is found. In order to decrease false positives,
a minimum match length is used.
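The two phases can be sketched as follows; this is the naive cubic-time reference version, with the min_match default chosen only for the toy example:

```python
def greedy_string_tiling(a, b, min_match=3):
    """Greedy string tiling over token lists a and b: repeatedly take the
    longest common run of unmarked tokens of length >= min_match, mark it,
    and stop when no such run remains. Returns the number of matched tokens."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    total = 0
    while True:
        best = None  # (length, i, j)
        for i in range(len(a)):          # phase 1: find the longest match
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and not marked_a[i + k] and not marked_b[j + k]
                       and a[i + k] == b[j + k]):
                    k += 1
                if k >= min_match and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            break
        length, i, j = best
        for k in range(length):          # phase 2: mark the matched tokens
            marked_a[i + k] = True
            marked_b[j + k] = True
        total += length
    return total

a = ["if", "(", "x", ")", "return", "x", ";"]
b = ["while", "(", "x", ")", "return", "x", ";"]
print(greedy_string_tiling(a, b))  # 6
```

The similarity value is then the matched-token count relative to the file sizes (here 6 of 7 tokens covered); marking prevents one token from participating in two tiles.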
The algorithm has a complexity of O(n³), which makes it unusable for large datasets.
In order to reduce the complexity, the authors propose the use of the Rabin-Karp
algorithm. To implement it, every list of consecutive tokens whose size is greater than
the minimum match length is hashed by a function. The results are saved in a
hash table specific to every file. Comparing two files can then be reduced (if collisions
are ignored) to lookups in these hash tables.
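The hash-table comparison can be sketched as follows; Python's built-in tuple hashing stands in for the Rabin-Karp rolling hash, which is enough to illustrate the lookup idea (collisions ignored, as above):

```python
def kgram_hashes(tokens, k):
    """Hash every window of k consecutive tokens; map hash -> start positions."""
    table = {}
    for i in range(len(tokens) - k + 1):
        table.setdefault(hash(tuple(tokens[i:i + k])), []).append(i)
    return table

def common_kgrams(a, b, k):
    """Count k-gram windows of `a` whose hash also appears in `b`'s table."""
    table_b = kgram_hashes(b, k)
    return sum(1 for i in range(len(a) - k + 1)
               if hash(tuple(a[i:i + k])) in table_b)

a = ["int", "x", "=", "0", ";"]
b = ["int", "y", "=", "0", ";", "int", "x", "=", "0", ";"]
print(common_kgrams(a, b, 3))  # 3
```

With a true rolling hash, each window's value is derived from the previous one in O(1), so building the table is linear; the comparison itself is then a set of constant-time lookups instead of the quadratic token-by-token scan.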
The minimum match length suggested by the authors is 9, while a cutoff criterion (the
amount of similarity between two files) of 50% yields very good results.
Chapter 3
Improving Malware Detection By Feature Extraction
Since the number of malware samples began to increase exponentially in 2006, it has been
clear that, in order to achieve a good detection rate, something had to change in the way
malware detection was being done. Traditional detection methods based on hashes were
considered obsolete even before that, so a new method was needed. A temporary
solution was the writing of heuristics that could detect clusters of samples, but even
this proved to be inefficient once the number of unique malware samples appearing
daily grew beyond what a team of malware researchers could cope with.
Nowadays, when over 5,000,000 samples appear each day, the only viable
solution for achieving a good detection rate is to use a combination of automatic
methods, like machine learning techniques, and traditional ones, like signatures based
on hashes and human-written heuristics.
However, if a machine learning algorithm is going to be used in an antivirus product,
several restrictions must be considered:
- The algorithm must provide a high detection rate. The main benefit of these
algorithms is to decrease the number of samples a malware researcher has to an-
alyze. Also, some machine learning algorithms may need a lot of resources
at training time, so the results should justify the investment.
- It should also be able to work with big data. This means that the algo-
rithm should be able to process millions of records, should finish in a reasonable
amount of time and, of course, provide good results. This restriction becomes more
evident when we take into consideration that training a machine learning al-
gorithm requires both malware and benign samples, and there are more than
300 million malware samples in the wild. Coping with big data is also the main
reason why machine learning is needed for malware detection.
- It should produce a very low number of false positives. In the antivirus industry,
false positives are a very dangerous thing. It is usually preferable to miss a sample
on an infected computer than to detect a clean file. This is because an antivirus
will delete a detected file, and removing an important document or even a system
file could bring more harm. Since restricting the false positive rate to 0 would
greatly impact detection, we will work with a false positive rate of 0.01.
Malware detection through the use of machine learning is not new; it has been used
by other researchers and is currently employed in many AV products. However, the results
obtained can be improved, either in terms of lower false positives or a higher
detection rate. The way I chose to solve this problem is not by modifying existing al-
gorithms, but rather by extracting new features from existing ones so that the accuracy
of the original algorithm increases. I chose this method because there is already
a good algorithm, named One Side Class Perceptron (OSCP) [8], which was
tested for malware detection with a low false positive rate and achieved good results.
The One Side Class Perceptron works by creating a hyperplane that separates the two
classes of samples (malware and benign) such that no samples from one class will be
misclassified. It is very well suited for our problem and, by the way it is constructed,
it guarantees that, in the training phase, there will be no false positives.
In order to make the algorithm easy to read, we will use the following notations:
1. Records → R
2. Features → F
3. Label → L
4. maxIterations → maxI
5. SingleSetClass → S (all records belonging to a specific class)
Algorithm 1 One Side Class Perceptron (OSCP)
1: function OSCPTrain(R)
2:   w ← [w1, w2, ..., wm]; wi ← 0, i = 1..m
3:   b ← 0; α ← learning rate
4:   iteration ← 0
5:   S ← ∪_{i=1}^{|R|} Ri, where Ri.L = benign files
6:   repeat
7:     TrainFunction(R, w, b)
8:     repeat
9:       TrainFunction(S, w, b)
10:      errors ← 0
11:      for i = 1 → |S| do
12:        if not IsCorrectClassified(Si, w, b) then
13:          errors ← errors + 1
14:        end if
15:      end for
16:    until (errors = 0)
17:    iteration ← iteration + 1
18:  until (iteration ≥ maxI) or (every record is correctly classified)
19: end function
Algorithm 2 Train Function
1: function Train(Records, w, b)
2:   for i = 1 → |Records| do
3:     if Records_i.L * ((Σ_{j=1}^{m} Records_i.F_j * w_j) + b) ≤ 0 then
4:       for j = 1 → m do
5:         w_j ← w_j + Records_i.L * Records_i.F_j
6:       end for
7:       b ← b + Records_i.L
8:     end if
9:   end for
10: end function
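A Python sketch of the two procedures; the representation of records as (features, label) pairs with labels ±1, and the choice of benign (label −1) as the protected class, are assumptions for illustration:

```python
def train_pass(records, w, b):
    """One perceptron pass: update w and b on every misclassified record."""
    for features, label in records:
        if label * (sum(f * wj for f, wj in zip(features, w)) + b) <= 0:
            for j in range(len(w)):
                w[j] += label * features[j]
            b += label
    return b

def oscp_train(records, n_features, max_iterations=100):
    """One Side Class Perceptron: after each ordinary training pass, keep
    training on the benign class alone until none of its records is
    misclassified (the class-minimize step)."""
    w, b = [0.0] * n_features, 0.0
    benign = [r for r in records if r[1] == -1]   # the "one side" class
    for _ in range(max_iterations):
        b = train_pass(records, w, b)
        while True:                                # class-minimize step
            b = train_pass(benign, w, b)
            errors = sum(1 for f, l in benign
                         if l * (sum(x * wj for x, wj in zip(f, w)) + b) <= 0)
            if errors == 0:
                break
        if all(l * (sum(x * wj for x, wj in zip(f, w)) + b) > 0
               for f, l in records):
            break                                  # every record classified
    return w, b

# Two boolean features; label +1 = malicious, -1 = benign.
data = [([1, 0], 1), ([1, 1], 1), ([0, 0], -1), ([0, 1], -1)]
w, b = oscp_train(data, 2)
print(all(l * (sum(x * wj for x, wj in zip(f, w)) + b) > 0 for f, l in data))
```

The inner loop is what enforces the zero-false-positive guarantee on the training set: it refuses to leave an iteration while any benign record sits on the wrong side of the hyperplane.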
A representation of how the OSCP algorithm works can be seen in Figure 3.1. The first
image describes a classic training step of the perceptron algorithm, in which a hyperplane
(in this case, a simple line) that best separates the red squares from the green circles is
found. In the second image, the hyperplane is adjusted so that no elements belonging
to one class (in this case, the green circles) are misclassified.

Figure 3.1: OSCP training process. Training step (a). Class minimize step (b).
The algorithm ends when all the elements have been correctly classified, or when a
maximum number of iterations has been reached. Even though this algorithm solves the
problem of false positives, it also dramatically reduces the detection rate. Nevertheless,
the idea of deterministically controlling the number of false positives, given the features
of each sample, is a very important component in creating a machine learning algorithm
that can be used in practice for malware detection.
So, given the OSCP algorithm (Algorithm 1), the next important question is how the
detection could be improved. In order to answer this, I have concentrated on creating
new features from existing ones, such that a better linear separability can be obtained.
3.0.1 Feature Creation Using Mapping
One simple way in which this can be achieved is called "mapping", and works by using
the result of the combination of every two features as a new feature. The operator
used in this combination can be any mathematical one, but the AND operator seems to
yield the best results. The mapping method provides, theoretically, a better detection
rate because records that were inseparable using the initial features have a higher
probability of becoming separable once the number of features has increased quadrati-
cally. However, this method also has its drawbacks. The quadratic increase in the memory
needed for training can make the algorithm impractical if the number of initial
features is too big. Even though this can be solved by generating each feature only
when it is needed, that still doesn't solve the most important problem: the time complexity.
A quadratic increase in the number of features also means that the algorithm will take
considerably more time to finish. The best way to address these problems is to start with
a number of features that allows the algorithm to finish in a reasonable time frame. If the
number of initial features is greater than that, then only the most important ones should
be selected, using feature selection algorithms like the F-score [44].
The F-score is a statistical method that measures how discriminative a feature
can be. It is based on the ratio of two other scores: the F1-score, which measures
the variability between groups of files, and the F2-score, which also measures variability,
but this time inside each group. The F-score for a feature, given the sets of clean and
malicious files, is described in Equation 3.1.
F(i) = \frac{(\bar{x}^{(+)}_i - \bar{x}_i)^2 + (\bar{x}^{(-)}_i - \bar{x}_i)^2}{\frac{1}{n_{+}-1}\sum_{k=1}^{n_{+}} \left(x^{(+)}_{k,i} - \bar{x}^{(+)}_i\right)^2 + \frac{1}{n_{-}-1}\sum_{k=1}^{n_{-}} \left(x^{(-)}_{k,i} - \bar{x}^{(-)}_i\right)^2}    (3.1)
where \bar{x}_i, \bar{x}^{(+)}_i and \bar{x}^{(-)}_i are the averages of the i-th feature over the whole, benign and
malicious data sets, respectively; x^{(+)}_{k,i} is the i-th feature of the k-th benign file, while
x^{(-)}_{k,i} is the i-th feature of the k-th malicious one.
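Equation 3.1 translates directly into code; the toy data below is invented for illustration:

```python
def f_score(benign, malicious, i):
    """F-score of feature i (Equation 3.1), given benign and malicious
    samples as lists of feature vectors."""
    pos = [x[i] for x in benign]
    neg = [x[i] for x in malicious]
    both = pos + neg
    mean = lambda v: sum(v) / len(v)
    m_all, m_pos, m_neg = mean(both), mean(pos), mean(neg)
    # numerator: between-class variability (the F1-score part)
    between = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    # denominator: within-class variability (the F2-score part)
    within = (sum((x - m_pos) ** 2 for x in pos) / (len(pos) - 1)
              + sum((x - m_neg) ** 2 for x in neg) / (len(neg) - 1))
    return between / within if within else float("inf")

# Feature 0 separates the two classes well; feature 1 is mostly noise.
benign = [[1, 0], [1, 1], [1, 0]]
malicious = [[0, 1], [0, 0], [1, 1]]
print(f_score(benign, malicious, 0) > f_score(benign, malicious, 1))  # True
```

A large F-score means the feature's class means are far apart relative to its spread inside each class, which is exactly the property feature selection is looking for.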
The procedure for feature mapping is presented in Algorithm 3:
1. sort the entire set of existing features based on a specific score (usually the F-score)
2. choose the first "n"
3. combine every selected feature with every other one using the AND operator
Algorithm 3 MapAllFeatures algorithm
1: function MapAllFeatures(R)
2:   newFeatures ← ∅
3:   for i = 1 → |R.F| do
4:     for j = 1 → |R.F| do
5:       newFeatures ← newFeatures ∪ (R.F_i ⊗ R.F_j)
6:     end for
7:   end for
8: end function
where ⊗ can be any operator; using AND we obtained the best results.
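Since AND is commutative, the mapping can be sketched with combinations-with-repetition, which yields exactly n(n+1)/2 attributes (the i = j pairs reproduce the original features, because x AND x = x):

```python
from itertools import combinations_with_replacement

def map_features(record):
    """Map a boolean feature vector to the AND of every unordered pair of
    features (with repetition): n(n+1)/2 new attributes for n originals."""
    return [a & b for a, b in combinations_with_replacement(record, 2)]

r = [1, 0, 1]
mapped = map_features(r)
print(len(mapped), mapped)  # 6 [1, 0, 1, 0, 0, 1]
```

For the 300 features used here this gives 300 * 301 / 2 = 45,150 attributes per record, matching the figure discussed in the next section.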
3.1 Feature Creation Using Genetic Programming
The dataset on which the algorithms will be tested consists of 358,144 records, of which
42,940 are malicious and 315,204 are benign. Even though the dataset is not as large
as it would be when training for real-life malware classification, I consider it big enough
to give an idea of how the algorithms will perform. Also, the proportion of malicious
and benign files is close to the one found in the wild. The files in the dataset are all
PE files, mainly executables and DLLs, which were gathered over a period of two months
(April and May 2014). The files were statically scanned and different features were
extracted, like the size of the file, the number of imports and exports, the geometry of
the file and so on. Dynamic features were also extracted, including things like whether
the file connects to the Internet, whether it writes files in certain directories, whether it
modifies the registry and so on. In total, 300 features were gathered from each file.
I prefer working with boolean attributes since they occupy less memory (just 1 bit),
are more resistant to malware changes, and every continuous value can be transformed
into a boolean one. Even though a fixed number could better identify a sample, due
to the many changes that happen within a malware family, I consider boolean features
more robust. For example, it helps more to know that a file size is between 1 and 2
kilobytes than to know it is exactly 1,327 bytes.
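Such a transformation can be sketched as follows; the bucket boundaries are illustrative assumptions:

```python
def binarize_size(size_bytes, buckets=(1024, 2048, 4096, 8192)):
    """Turn a continuous file size into boolean range features: one bit per
    boundary, set if the size falls below it. The resulting bit pattern
    encodes which range the size lies in, without keeping the exact value."""
    return [int(size_bytes < bound) for bound in buckets]

# 1327 bytes falls in the 1-2 KB range: below 2048, 4096 and 8192, not 1024.
print(binarize_size(1327))  # [0, 1, 1, 1]
```

Any file in the 1-2 KB range produces the same pattern, which is what makes the boolean encoding resistant to small per-sample variations within a malware family.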
All the files included in this dataset were filtered. For example, file infectors, a kind of
malware that adds its code to other executable files, often don't produce features
different from those of the original file. For this reason, records with the same
features but of different classes are highly likely to appear. Since this is detrimental to
how the OSCP algorithm works, it is better to eliminate them from the start. Another
type of file that was removed is adware. These resemble benign files very closely, and
there are also people who don't consider them malicious. A similar case holds for keygens,
patchers and cracks.
Mapping seems to provide good results. It increases the detection rate while the number
of false positives stays low enough to permit the algorithm to be used in practical
applications. However, a big disadvantage of this method is the number of features
it needs to create. If we store each feature in one bit, then each record takes
approximately 38 bytes of memory. Since the dataset consists of 358,144 elements,
it all takes approximately 14 MBytes. Considering the size of the database, 14
MBytes of data is quite small. However, when the mapping procedure is performed, the
number of features increases to n(n+1)/2, where n is the number of original features
(the number of all features should be divided by 2 only if the mathematical operator
used for mapping is commutative; this eliminates the duplicate features that result after
mapping). In our case, the number of attributes resulting for each record after mapping
is 45,150, which occupies almost 5.5 KBytes. For 358,144 records, the dataset occupies
1,927 MBytes. Even though 2 gigabytes of data is not a problem for today's hardware,
we must take into consideration that a dataset of 358,144 records is small compared
to those used in training machine learning algorithms for malware detection, where the
number of elements in the database can reach millions. If more features are used, the
problem gets even worse, since the size of a mapped record grows quadratically with the
number of original features (for example, if one additional feature were used, the size
of a mapped record would increase by about 37 bytes).
In the following subsection I will present a method, based on genetic programming, that
allows us to increase the detection rate of the algorithm. Even though achieving the same
results as those obtained through mapping while using just the square root of the number
of features is very hard (probably impossible for many datasets), we can still increase
the detection rate of the OSCP algorithm. The idea is to create features from existing
ones using genetic programming. To validate the results, they will be compared with
those obtained through mapping, but a more realistic measurement will also be used:
the best features obtained through the mapping process will be selected and an OSCP
algorithm will be trained on them. If the results obtained using genetically created
features are better than the ones using the best mapped features, when the number of
attributes is the same, we can say that our method has managed to create better features.
One should note that it is not always possible to select the best features that mapping
creates, since those are obtained by applying a feature selection algorithm over the
whole new dataset, which, as previously explained, could be too big to fit in memory.
In order to find the best features, we first need to see which boolean operator brings
the best results when mapping is used. For this reason, the and, or and xor operators
were tested. The results were computed by first creating the new dataset using each
boolean operator in turn and then performing a 3-fold cross validation.
Algorithm     | Detection | False Positives | Accuracy
OSCP          | 58.12%    | 17              | 94.96%
OSCP-MAP-AND  | 85.35%    | 44              | 98.21%
OSCP-MAP-OR   | 75.20%    | 20              | 97.01%
OSCP-MAP-XOR  | 85.38%    | 32              | 98.22%

Table 3.1: The results of the 3-fold cross-validation for the OSCP and OSCP-MAP
algorithms
It can easily be observed in Table 3.1 that by using mapping, with any boolean operator,
the detection rate is improved by at least 17%. This mainly happens because the increase
in the number of features improves the linear separability of the data. However, the
mapping performed with the and and xor operators increased the detection rate by
more than 27%.
The next step is to select the best 300 features and compare the results. The feature
selection is performed using the F-score (Equation 3.1), previously described. By using
this function we filter for those attributes that are very representative of one class of
files but not of the other. The results of testing the OSCP algorithm with the best 300
features selected after mapping with each boolean operator can be viewed in Table 3.2.
Again, these are obtained using 3-fold cross validation on our dataset.
Algorithm             | Detection | False Positives | Accuracy | F Used
OSCP-MAP-AND-best300  | 8.54%     | 1               | 89.03%   | 67
OSCP-MAP-XOR-best300  | 1.37%     | 4               | 88.17%   | 54
OSCP-MAP-OR-best300   | 2.41%     | 3               | 88.3%    | 92

"F Used" represents the number of unique original features being used in the new
attributes.

Table 3.2: The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score
Contrary to what we expected, the detection rate dropped seriously, even below the
one obtained using the OSCP algorithm with the original features. Even though it
appears that these new features are worse than the original ones, this is not the case.
Analyzing the created features shows that many of them are based on the same original
features. This is highlighted in the last column of Table 3.2. As can be seen from the
results of the and operator, even though we are using 300 features for training the OSCP,
we are actually using only 67 original features. This reduces diversity and increases the
probability that records belonging to different classes will have the same set of features.
This did not happen when mapping was used, since there, among the newly created
features, the original ones are also present. The easiest way to solve this problem is
to test the OSCP algorithm with 600 features instead of 300: the newly created features
plus the original ones. The results can be viewed in Table 3.3.
The best results are the ones obtained using the andoperator. For this reason we will
use this as a threshold to test our new features.
Algorithm                           Detection   False Positives   Accuracy
OSCP-MAP-AND-best300+300 original   65.53%      15                95.85%
OSCP-MAP-XOR-best300+300 original   58.87%      16                95.05%
OSCP-MAP-OR-best300+300 original    62.13%      15                95.45%
Table 3.3: The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score, plus the 300 original features
3.1.1 Using the boolean and operator for crossover
To produce new features from existing ones we need a function. Since the search space
is too big, we use genetic programming to create, from the existing features, new ones
that are very good at separating the records of the dataset. We start by defining the
chromosome. A chromosome is a series of original features of arbitrary length, each
feature linked to the rest through a boolean operator. Since the best results after
mapping were obtained with the and operator, we use it to link the features. More
precisely, if the original features are F = {F1, F2, F3, ..., Fn}, a chromosome is defined
as C = {G1, G2, G3, ..., Gm}, where each gene Gi is a non-zero integer (a signed
feature index) that satisfies 1 <= |Gi| <= n, for i = 1..m. What a chromosome looks
like is depicted in Figure 3.2.
Figure 3.2: Chromosome
The initial population is made up of 300 chromosomes, each initialized to one original
feature, according to Algorithm 4. This simple step is also depicted in Figure 3.3.
Algorithm 4 Initial Population
function InitialPopulation(F)
    P ← {}
    for all Fi ∈ F do
        C ← {i + 1}
        P ← P ∪ {C}
    end for
    return P
end function
Figure 3.3: Chromosome
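As an illustration of the representation used here (a sketch of my own, not code from the thesis), a chromosome can be stored as a list of signed 1-based feature indexes, and the initialization of Algorithm 4 becomes:

```python
def initial_population(num_features):
    """Build the initial population: one single-gene chromosome per
    original feature. Genes are signed 1-based indexes so that a
    later mutation can store a negated gene as a negative number."""
    return [[i + 1] for i in range(num_features)]

population = initial_population(300)
print(len(population))  # 300 single-gene chromosomes
print(population[0])    # [1], i.e. the first original feature
```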
For the algorithm to work, the fitness function, the mutation operator and the crossover
operator must be defined.
Mutation is implemented using the boolean not operator: a gene is randomly selected
from each chromosome and, with a probability of 3%, its state is negated. A chromosome
is therefore a boolean function over a set of features, each linked to the rest through
the boolean and operator, some of them negated.
Crossover is performed by randomly selecting two chromosomes from the population,
randomly choosing a splitting point on each chromosome and exchanging the genes from
one end up to the splitting point. The crossover point can fall on any gene, from 0 to
the length of the chromosome; this way, crossover is possible from the start, when there
is only one gene per chromosome. Some problems can appear when using these operators.
Before crossover it is possible that one chromosome contains a certain gene while the
other contains the same gene negated, so one of their offspring may end up with both
versions of the gene. Since the boolean and operator links the genes, that chromosome
would always evaluate to false. To prevent this from happening, if a gene exists in both
forms, negated and non-negated, it is simply removed. Another thing that can happen after
crossover is that the same gene can appear multiple times. Even though this has no
immediate impact on the result of the fitness function, it can produce very long
chromosomes without any increase in fitness score. For this reason, if a gene is
found multiple times, only one copy is retained.
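A sketch of this crossover step, together with the two repairs just described (dropping a gene that appears in both negated and plain form, and keeping a single copy of duplicates), could look like this; the function names are mine:

```python
import random

def normalize(chromosome):
    """Repair an and-chain chromosome: drop genes present in both
    forms (the chain would always evaluate to false) and keep only
    one copy of repeated genes, preserving order."""
    genes = set(chromosome)
    result = []
    for gene in chromosome:
        if -gene in genes:      # contradictory pair: remove both forms
            continue
        if gene in result:      # duplicate gene: keep a single copy
            continue
        result.append(gene)
    return result

def crossover(c1, c2):
    """Exchange the tails of two chromosomes at random cut points;
    a cut may fall anywhere from 0 to the chromosome length."""
    cut1 = random.randint(0, len(c1))
    cut2 = random.randint(0, len(c2))
    return normalize(c1[:cut1] + c2[cut2:]), normalize(c2[:cut2] + c1[cut1:])

print(normalize([3, -3, 5]))   # [5]
print(normalize([2, 2, 7]))    # [2, 7]
```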
On each generation, the chromosomes that undergo crossover are randomly selected and
1000 new offspring are created. Generating 1000 new chromosomes, instead of choosing
pairs and deciding whether to create offspring or not, allows us to explore more of the
search space. The number 1000 is chosen arbitrarily; any larger value can be used,
provided that selection does not take too long. This is highlighted in Figure 3.4.
Figure 3.4: Crossover operation
The fitness function used for selecting the best chromosomes is again the F-score
(Equation 3.1). Of all the offspring created (1000), only the best 300 are kept. This
is shown in Figure 3.5.
The steps taken to evolve the population are described in more detail in Algorithm 5.
The transition from a chromosome to a feature usable by the OSCP algorithm is
straightforward. When evaluating the chromosome against a record, each gene, which
is actually the index of an original feature, is tested against the record. For genes
represented by negative indexes (due to mutation), the result is negated. At the end,
a boolean and operation is performed over the results obtained from evaluating each
gene. The evaluation process is described in Algorithm 6.
Algorithm 5 Evolve Population
function EvolvePopulation(P)
    combinations ← 0
    newPop ← {}
    while combinations < combinationsThreshold do
        C1 ← P[Random(|P|)]
        C2 ← P[Random(|P|)]
        cut1 ← Random(|C1|)
        cut2 ← Random(|C2|)
        C1left ← C1_1 ∪ ... ∪ C1_cut1
        C1right ← C1_{cut1+1} ∪ ... ∪ C1_{|C1|}
        C2left ← C2_1 ∪ ... ∪ C2_cut2
        C2right ← C2_{cut2+1} ∪ ... ∪ C2_{|C2|}
        CLL ← {}; CLR ← {}; CRL ← {}; CRR ← {}
        if Random(100) > 50 then
            CLL ← C1left ∪ C2left
            CRR ← C1right ∪ C2right
            if Random(100) > mutationThreshold then
                index1 ← Random(|CLL|)
                CLL_{index1} ← ¬CLL_{index1}
                index2 ← Random(|CRR|)
                CRR_{index2} ← ¬CRR_{index2}
            end if
        else
            CLR ← C1left ∪ C2right
            CRL ← C1right ∪ C2left
            if Random(100) > mutationThreshold then
                index1 ← Random(|CLR|)
                CLR_{index1} ← ¬CLR_{index1}
                index2 ← Random(|CRL|)
                CRL_{index2} ← ¬CRL_{index2}
            end if
        end if
        if CLL not empty then
            newPop ← newPop ∪ {CLL}
        end if
        if CRR not empty then
            newPop ← newPop ∪ {CRR}
        end if
        if CLR not empty then
            newPop ← newPop ∪ {CLR}
        end if
        if CRL not empty then
            newPop ← newPop ∪ {CRL}
        end if
        combinations ← combinations + 1
    end while
    return newPop
end function
Figure 3.5: Selection Operation
Algorithm 6 Convert Chromosome to Feature
function ConvertChromosomeToFeature(F, C)
    F' ← {}
    for all G ∈ C do
        if G > 0 then
            F' ← F' ∪ {F_{G-1}}
        else
            F' ← F' ∪ {¬F_{-G-1}}
        end if
    end for
    return F'
end function
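The evaluation described by Algorithm 6 can be sketched in Python (my own sketch; a record is assumed to be a list of 0/1 feature values):

```python
def evaluate(chromosome, record):
    """Evaluate an and-linked chromosome on one record. Each gene is
    a signed 1-based feature index; a negative gene means the
    feature value is negated (the effect of mutation)."""
    result = True
    for gene in chromosome:
        value = bool(record[abs(gene) - 1])
        if gene < 0:
            value = not value
        result = result and value
    return int(result)

record = [1, 0, 1, 1]
print(evaluate([1, 3], record))   # F1 AND F3 -> 1
print(evaluate([1, -2], record))  # F1 AND (NOT F2) -> 1
print(evaluate([2, 3], record))   # F2 AND F3 -> 0
```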
Testing these algorithms on the dataset did not provide the expected results. In fact,
with each generation, the detection rate kept getting worse. The results after different
numbers of iterations can be seen in Table 3.4.
After analyzing the chromosomes, an interesting behavior can be observed: with each
generation, the number of original features used in the combinations decreases,
considerably shrinking the search space. Even though this could be considered a good
thing (feature selection), what actually happens is that the solution found is a local
optimum.
A graphical display of the number of features used after each generation can be
consulted in Figure 3.6.
To solve this problem the crossover operation was altered. On every generation, the
Iteration Number   Detection   False Positives   Accuracy
1                  59.16%      17                95.09%
2                  58.97%      11                95.07%
5                  45.92%      13                93.50%
10                 28.75%      7                 91.45%
15                 27.39%      2                 91.29%
20                 21.24%      3                 90.55%
25                 18.44%      0                 90.22%
50                 12.22%      1                 89.47%
75                 1.70%       0                 88.21%
90                 0.01%       0                 88.01%
100                0.00%       0                 88.01%
Table 3.4: Results of using the chromosomes resulting from different generations
Figure 3.6: Number of original features used on each generation
original features are added as chromosomes. This way, they can always be used and
allow the population to escape from a local optimum. Another reason why the detection
kept getting worse is that, by using only the resulting features, it is possible to end
up with records belonging to different classes that have the same features, which is
very detrimental to how the OSCP algorithm works. Adding the original features to the
ones resulting from the run of the genetic algorithm solves this problem. Thus, the
crossover operation becomes the one described in Figure 3.7.
Figure 3.7: Crossover operation including original features
After running the algorithm again, this time with the aforementioned modifications,
the number of original features used by the chromosomes varied between 200 and 245.
This can be seen in Figure 3.8.
Figure 3.8: Number of original features used on each generation
Another modification brought to the algorithm is to also include the parents in the
selection phase. This is done because the resulting offspring may be less fit than
their parents. For this reason, the selection is performed on 1300 chromosomes instead
of 1000. The selection phase is described in more detail in Figure 3.9.
The results after running the algorithm again are presented in Table 3.5.
Even though the results have greatly improved, the detection rate still decreases,
although more slowly. After studying the chromosomes from different generations, an
Figure 3.9: Selection Operation
Iteration Number   Detection
1                  64.67%
2                  64.83%
5                  66.15%
10                 61.83%
20                 58.40%
40                 59.60%
Table 3.5: Results at different iterations for arbitrary-length features
interesting problem was observed: the chromosomes' length kept increasing without any
benefit in fitness score. This is due to the evaluation function, which treats each
chromosome separately. Even though each of these chromosomes is very good in terms of
fitness score, they all separate the same malicious and benign files, so they are
basically one and the same feature. Using many such features is detrimental, since it
translates into using fewer distinct features for classification. An example of such
chromosomes can be seen in Table 3.6.
In Table 3.6, each line corresponds to one chromosome. The first two columns reflect
the number of benign and malicious records that each chromosome separates, while the
last column shows the indexes of the original features that make up the chromosome.
Benign Records   Malicious Records   Feature Chain
26023            583                 -237, -52, 239, 245, 246, 254
26023            583                 -237, -52, 239, 245, 246, 252, 266
26023            583                 -237, -52, 239, 245, 246, 251, 252, 254
26023            583                 -237, -52, 239, 245, 246, 251, 252, 254, 255, 266
Table 3.6: An example of the bloating problem
The indexes run from 0 to 299, since we are using 300 features. A negative number
means that mutation was performed at that position of the chromosome, so the feature
is used in negated form. As can be seen, all four chromosomes presented in Table 3.6
separate the same number of malicious and benign records. Also, the first feature chain
is a prefix of the other ones, the second is a prefix of the following two, and so on.
In fact, half of the chromosomes were created in this way. This phenomenon, called
bloating, has different fixes. In this case, we simply limited the length of the
chromosome to a constant. A slightly better approach can be consulted in the next
section. The results obtained after applying the length limitation can be seen in
Table 3.7. As can be seen, the detection rate reaches almost 70%, after which it
slowly drops.
Iteration Number   Detection
1                  66.61%
2                  66.13%
5                  66.21%
10                 66.99%
15                 69.09%
20                 68.60%
25                 67.65%
50                 67.11%
75                 65.38%
90                 65.18%
100                64.66%
Table 3.7: Results at different iterations for length-limited features
Table 3.8 presents our best results compared to the ones obtained by the algorithm we
use as a baseline (OSCP with the best 300 features obtained after mapping using the
and operator, plus the 300 original features).
Algorithm                           Detection   False Positives   Accuracy
OSCP-MAP-AND-best300+300 original   65.53%      15                95.85%
OSCP-GA-best300+300 original        66.99%      17                96.39%
Table 3.8: Comparison of the results when using features resulting from mapping
versus genetic programming
As can be observed, the features obtained through genetic programming bring an increase
of almost 4.5% in detection rate. Another advantage of this method over features
obtained through mapping is its low memory usage: obtaining the best features from
mapping requires first computing all the mapped features, which in some cases may not
be possible due to the exponential growth in memory usage.
Since these results confirm that genetic programming can obtain better results than
mapping, the next question is how many genetically obtained features are needed to
achieve the same detection rate as the mapping method. For this reason, we gradually
increase the number of features as well as the population size. The results are
provided in Table 3.9.
Number of Features   Population Size   Detection Rate   False Positives   Accuracy
1000                 3000              73.01%           21                96.74%
2000                 5000              77.16%           30                97.24%
3000                 6000              80.66%           31                97.65%
4000                 8000              81.16%           34                97.71%
Table 3.9: Results using different numbers of features
As can be seen, with 4000 features the detection rate is 81.16%, close to the 85.35%
obtained through mapping, which produces 15 times more features (45,150).
3.1.2 Using all boolean operators for crossover
Using the boolean and operator we obtained satisfying results. However, we would like
to see whether using more boolean operators translates into better results. Even though
the boolean set {and, not} is functionally complete (every other boolean operation can
be obtained using just these operators), in practice it takes too many iterations to
achieve the same results as using a single operator like or or xor. For this reason,
in this section the boolean operators {and, or, xor} are used to create features
through genetic programming.
For this, we have to change how a chromosome is implemented. A chromosome is a set of
features (1 to 300), each one connected to the others through a boolean operator. Since
the order in which the expression is evaluated changes the result, a priority is also
assigned to each boolean operator. The structure can be seen in Figure 3.10.
Figure 3.10: Chromosome
Notice that in Figure 3.10 each Fi, where i = 1..300, refers to one of the original
features, each OPi refers to one of the boolean operations {and, xor, or}, and each Pi
is a number that reflects the priority of the boolean operation within the expression.
The last operator and priority are used only when the crossover point falls at the end
of the chromosome and a new piece from another chromosome is attached.
To understand how such a chromosome is evaluated, an example is provided below.
We start with the chromosome illustrated in Figure 3.11.
Figure 3.11: Chromosome
For evaluating this expression, an ordering of the boolean operations is needed. Since
doing this for every record would be wasteful, an intermediate expression is computed
in which the operations needed for evaluation are already sorted. The intermediate
expression is built only for the first record. The steps to build such an expression
are the following:
1. ignore the last priority and operator
2. choose the maximum priority
3. evaluate the operation given by the operator corresponding to the chosen priority
and the two corresponding operands: the one at the same index as the operator and
the one at the next index
4. assign the result to a temporary variable
5. replace the two operands used with the temporary variable
6. repeat from step 2 until all the operands are the same temporary value
For the example provided, the temporary values are computed as follows:
1. T1 = F10 XOR F12
2. T2 = F8 AND F9
3. T3 = F1 AND F5
4. T4 = T3 XOR T2
5. T5 = T4 OR T1
A more detailed view of these steps is illustrated in Figure 3.12. Notice that, at the
final step, all of the features are replaced by the same temporary operand, T5.
Of course, this is just one way of evaluating a mathematical expression. If a tree were
built for the provided example, it would look like the one in Figure 3.13. The
chromosomes could also have been implemented using binary trees; however, I consider
this method more practical, since it allows mutation on every part of the chromosome
(priority, operator and feature).
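The six evaluation steps above can be sketched directly on boolean values (a hypothetical sketch of mine; in the thesis the operands are first obtained by testing the indexed features against a record):

```python
def evaluate_expression(values, ops, priorities):
    """Evaluate a chain of boolean values linked by operators with
    priorities. ops[i] / priorities[i] bind values[i] and
    values[i + 1]; the operator and priority attached to the last
    gene are ignored, as described in step 1."""
    apply_op = {
        'and': lambda a, b: a and b,
        'or':  lambda a, b: a or b,
        'xor': lambda a, b: a != b,
    }
    vals = list(values)
    ops = list(ops[:len(vals) - 1])          # step 1: drop the last operator
    prio = list(priorities[:len(vals) - 1])  # and its priority
    while prio:
        i = max(range(len(prio)), key=lambda k: prio[k])  # step 2
        temp = apply_op[ops[i]](vals[i], vals[i + 1])     # steps 3-4
        vals[i:i + 2] = [temp]                            # step 5
        del ops[i], prio[i]
    return vals[0]                                        # step 6

# (True AND False) is evaluated first (priority 2), then OR with True
print(evaluate_expression([True, False, True], ['and', 'or'], [2, 1]))  # True
```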
Figure 3.12: Chromosome example evaluation
Figure 3.13: Evaluation tree
As in the previous subsection, we first need to define the initial population. The
first generation is constructed by generating 300 chromosomes of length 1. Each
chromosome holds one of the original features, each operation is randomly chosen from
the set of boolean operators {and, or, xor} and each priority is initialized to 0.
This step is described in Algorithm 7.
After the initialization process, the population will look as illustrated in Figure 3.14.
The crossover operation is similar to the one described when using just the and operator:
Algorithm 7 Initial Population
function InitialPopulation(F)
    P ← {}
    for all Fi ∈ F do
        C.feature ← {i + 1}
        C.operation ← Random({xor, and, or})
        C.priority ← 0
        P ← P ∪ {C}
    end for
    return P
end function
Figure 3.14: Initial Population
1000 new chromosomes are generated by randomly selecting two parents and performing
crossover at a randomly selected point. The mutation operation is different, since the
chromosome is more complex. A mutation probability of 3% is considered. From the newly
generated set of 1000 chromosomes, three sets of 30 chromosomes are chosen. For the
elements of the first set, a randomly selected feature is negated. For the elements of
the second set, one of the operators is switched to a randomly chosen one from the
boolean set {and, or, xor}. Finally, for each chromosome in the final set, a priority
is chosen and, based on a randomly generated number, it is incremented or decremented.
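The three mutation kinds can be sketched as follows (my own sketch; storing the chromosome as a dict of parallel lists is an assumption, not the thesis layout):

```python
import random

OPERATORS = ['and', 'or', 'xor']

def mutate_feature(chrom, rng):
    """Negate a randomly selected feature (stored as a signed index)."""
    i = rng.randrange(len(chrom['features']))
    chrom['features'][i] = -chrom['features'][i]

def mutate_operator(chrom, rng):
    """Replace a randomly selected operator with a random one."""
    i = rng.randrange(len(chrom['ops']))
    chrom['ops'][i] = rng.choice(OPERATORS)

def mutate_priority(chrom, rng):
    """Increment or decrement a randomly selected priority."""
    i = rng.randrange(len(chrom['prios']))
    chrom['prios'][i] += rng.choice([-1, 1])

rng = random.Random(0)
c = {'features': [5], 'ops': ['and'], 'prios': [0]}
mutate_feature(c, rng)
print(c['features'])  # [-5]
```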
Based on the F-score (Equation 3.1), the best 300 chromosomes are selected. The parents
are also included in the selection process, in order not to lose the best chromosomes
found so far. The genetic steps can also be consulted in Algorithm 8.
As described in the previous subsection, when using just the boolean and operator, the
bloat problem also appears in the current approach. While there the problem was solved
by simply limiting the length of the chromosome, I consider the
Algorithm 8 Evolution process
P ← initial population
Pnew ← {}
for it ← 1, 100 do
    // Crossover
    while |Pnew| < 1000 do
        Offspring1, Offspring2 ← Crossover(P)
        Pnew ← Pnew ∪ {Offspring1, Offspring2}
    end while
    // Mutation
    Pnew ← MutateFeature(Pnew, 3%)
    Pnew ← MutateOperator(Pnew, 3%)
    Pnew ← MutatePriority(Pnew, 3%)
    // Selection
    Pnew ← Pnew ∪ P
    P ← {}
    while |P| < 300 do
        maxScore ← 0
        bestChromosome ← ∅
        for all chromosome ∈ Pnew do
            if Fitness(chromosome) > maxScore then
                maxScore ← Fitness(chromosome)
                bestChromosome ← chromosome
            end if
        end for
        Pnew ← Pnew \ {bestChromosome}
        P ← P ∪ {bestChromosome}
    end while
end for
Ptraining ← P
approach ill suited for the current case, since more operations are involved. For this
reason, the following method is used: if two different chromosomes classify the same
number of malicious and benign records, then only the shorter one survives into the
next generation. Even though relying just on these counts to decide whether two
chromosomes separate the same records is not the same as actually inspecting the files,
in practice the two measures seem closely related. Of course, if these counts are very
small, the probability of having such chromosomes that nevertheless separate different
records increases. For this reason, in order not to limit the exploration power of the
algorithm, the bloating solution is applied only if the number of malicious and benign
records exceeds a certain threshold (100 seems to work well in practice).
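The bloat-control rule above can be sketched like this (my own sketch; interpreting the threshold as a bound on the combined record count is an assumption):

```python
def prune_bloat(population, counts, threshold=100):
    """Among chromosomes that separate the same numbers of malicious
    and benign records, keep only the shortest one -- but only when
    those counts exceed the threshold, so that chromosomes active on
    few records keep their diversity."""
    shortest = {}
    survivors = []
    for chrom, (c_mal, c_clean) in zip(population, counts):
        if c_mal + c_clean <= threshold:
            survivors.append(chrom)       # too few records: keep as-is
            continue
        key = (c_mal, c_clean)
        if key not in shortest or len(chrom) < len(shortest[key]):
            shortest[key] = chrom
    return survivors + list(shortest.values())

pop = [[1, 2, 3], [1, 2], [7]]
counts = [(583, 26023), (583, 26023), (10, 5)]
print(prune_bloat(pop, counts))  # [[7], [1, 2]]
```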
In order to see how the evolved features impact the classification algorithm, at each
generation the features are extracted and used to train an OSCP. The fitness of the
features is also monitored, in order to see how the genetic algorithm is doing. The
figures below (Figure 3.15 and Figure 3.16) show how the detection rate and the false
positive rate vary while the fitness score of the population increases.
Figure 3.15: Evolution of the detection rate and false positives rate
Figure 3.16: Evolution of the tness score of each generation
As can be seen, the detection rate slowly increases at the beginning, but drops
suddenly, by almost 10%, after generation 20. From that point, the detection rate slowly
decreases until generation 60, where it remains constant at 58%. The graph of false
positives follows the one of the detection rate: the number of false alarms drops after
generation 20 and remains in the interval [11, 12] until the end of the algorithm.
Regarding the variation of the population fitness, it is not correlated with the
variation in detection rate. As observed in Figure 3.16, the population score keeps
increasing until generation 60, where it reaches a plateau and stabilizes at 6400.
Since the variation in detection rate does not follow the variation of the fitness
score, it means that the fitness function is not well correlated with the
classification algorithm.
3.1.3 Testing a better fitness function
All the chromosomes that reach the final generation have a high F-score. Since the
F-score is computed based on how many benign and malicious records each feature is
activated on, the chromosomes with a higher score are the ones that get activated on
most of the records. However, since the records are not uniformly distributed, most of
the chromosomes (if not all) end up being activated on the same subset (the largest
one) of the records. This in turn decreases the detection rate of the OSCP algorithm,
since the records belonging to other subsets become very difficult to classify. This
problem can be better explained through the image presented in Figure 3.17. As can be
seen, the features from 1 to 300 have the highest F-score, since they get activated on
the subset with the largest number of records. Those are the features that will survive
into the next generation. Features 301 and 302 get activated on a lower number of
records, so they will have a lower F-score. This means they will not pass into the next
generation. This is detrimental, since all of the first 300 features get activated on
the same subset. It would be better if two of the first 300 features were replaced with
features 301 and 302, since the OSCP algorithm would then have more distinct features
(in terms of which records a feature gets activated on) to work with.
Figure 3.17: Dierent features being activated on the same subset
There are different approaches to this problem (called a multimodal optimization
problem), but the one that was tested and provides good results is called sharing. The
idea of sharing in Genetic Algorithms was first presented in [45] and, like most
Genetic Algorithm approaches, comes from nature. The idea behind it is that if there
are too many individuals consuming a certain type of resource, then it is more
favorable for some individuals to move to another type of resource. The name sharing
comes from the way the fitness is computed: the payoff of an individual is degraded
based on how many individuals are consuming the same resource.
In our case, the resource is represented by each record. Thus, in each generation, for
every chromosome (feature) we count the number of clean and malicious records where
the feature is activated; these counts are denoted CMal and CClean. Another observation
we must take into account is that on some records many features get activated, while
on others only a few do. Since many individuals can choose to consume a versatile
record, we must prioritize those that get activated on records with a scarce number of
features. For each feature we therefore compute a second pair of scores, SMal and
SClean, that also take into account how many other features are activated on each
record. For a feature Fi, these scores are computed according to the following
equations, where Rj.Fi denotes the value of feature Fi on record Rj.
\[
C_{Mal} = \sum_{j=1}^{nr\_mal} R_j.F_i
\qquad
S_{Mal} = \sum_{j=1}^{nr\_mal} R_j.F_i \cdot \frac{1}{\sum_{k=1}^{|R_j|} R_j.F_k}
\tag{3.2}
\]

\[
C_{Clean} = \sum_{j=1}^{nr\_clean} R_j.F_i
\qquad
S_{Clean} = \sum_{j=1}^{nr\_clean} R_j.F_i \cdot \frac{1}{\sum_{k=1}^{|R_j|} R_j.F_k}
\tag{3.3}
\]
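Reading Rj.Fi as the 0/1 value of feature i on record j, the pair (C, S) for one class of records can be computed as follows (a sketch; the function name is mine):

```python
def sharing_scores(records, i):
    """records: 0/1 feature vectors of a single class (malicious or
    clean). Returns (C, S) for feature i: the plain activation count
    and the sharing score, where an activated record contributes
    1 / (number of features active on that record)."""
    c = 0
    s = 0.0
    for rec in records:
        if rec[i]:
            c += 1
            s += 1.0 / sum(rec)  # degrade the payoff of crowded records
    return c, s

mal = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
print(sharing_scores(mal, 0))  # (2, 1.5): 1/2 from record 1, 1/1 from record 2
```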
The first approach was to rewrite the F-score using SMal and SClean. For this, the
averages used in the F-score were computed using SMal instead of CMal and SClean
instead of CClean. The value of the i-th feature in the F-score was also computed by
dividing it by the number of features with value 1 for that particular record. Using
this method we were only able to reach a 65.34% detection rate with 15 false positives.
The second approach was to simply multiply the F-score by a value computed from SClean
and SMal: (SMal + SClean) / (TotalMal + TotalClean). This proved even worse, since the
number of false positives remained the same while the detection rate dropped to 61.5%.
Seeing this, we decided to try another fitness function, not dependent on the F-score.
The function defined in Equation 3.4 gives a very low (negative) fitness score to the
individuals that get activated more on the clean records than on the malicious ones;
for the others, it normalizes the difference between the number of malicious and clean
records where the feature is activated, based on how important that feature is for all
the malicious records.
\[
Fitness_2 =
\begin{cases}
-1 & \text{if } C_{Mal} = 0 \text{ and } C_{Clean} = 0 \\
C_{Mal} - C_{Clean} & \text{if } C_{Mal} < C_{Clean} \\
(C_{Mal} - C_{Clean}) \cdot S_{Mal} & \text{if } C_{Mal} \geq C_{Clean}
\end{cases}
\tag{3.4}
\]
Using this function, the detection rate increases to 69.96% while maintaining a low
number of false positives (17). However, the function presented in Equation 3.4 can be
further optimized. Since it is based on the difference between the number of malicious
and benign records where a feature is activated, it makes no distinction between a
feature activated on 1000 malicious records and 10 benign ones and a feature activated
on 5000 malicious records and 4010 benign ones. It is clear that in the first case the
feature is more representative for the malicious class. For this reason, the function
from Equation 3.4 is updated by dividing the difference by the number of malicious
records, obtaining the new fitness function in Equation 3.5.
\[
Fitness_3 =
\begin{cases}
-1 & \text{if } C_{Mal} = 0 \text{ and } C_{Clean} = 0 \\
C_{Mal} - C_{Clean} & \text{if } C_{Mal} < C_{Clean} \\
\dfrac{(C_{Mal} - C_{Clean}) \cdot S_{Mal}}{C_{Mal}} & \text{if } C_{Mal} \geq C_{Clean}
\end{cases}
\tag{3.5}
\]
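Equation 3.5 translates directly into code (my own sketch; the value -1 for a feature activated on no record follows my reading of the piecewise definition):

```python
def fitness3(c_mal, c_clean, s_mal):
    """Fitness function of Equation 3.5 (sketch)."""
    if c_mal == 0 and c_clean == 0:
        return -1.0                    # idle feature, penalized
    if c_mal < c_clean:
        return float(c_mal - c_clean)  # mostly clean: negative score
    return (c_mal - c_clean) * s_mal / c_mal

# The two contrasting cases from the text, with an illustrative s_mal = 2.0:
print(fitness3(1000, 10, 2.0))    # 1.98: strongly representative for malware
print(fitness3(5000, 4010, 2.0))  # 0.396: same difference, much weaker evidence
```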
Using this function, the detection rate increases to 73.23%, while the number of false
positives increases by just 2 units, reaching 19. The graph in Figure 3.18 presents how
the detection rate changes with each generation. The graph in Figure 3.19 presents
information regarding how each generation evolves: the yellow line represents the
minimum number of features activated on any record in each generation, while the
vertical bars represent the standard deviation of the number of features activated per
record.
As can be seen, the detection rate increases as more features are being set on the
records with a low number of activated features.
An important conclusion of this work is that adding more boolean operations did not
improve the detection rate. In fact, after analyzing the features obtained in the last
generation when using the Fitness3 function, 23 out of the best 30 features were
obtained using just the boolean not and the boolean and operations. The same thing
Figure 3.18: Evolution of the detection rate
Figure 3.19: Evolution of the number of features activated on each record
can be said about the bloating solution: even though no limit was set for the maximum
length of a feature, the best features were made of just two operands. This also
explains why the detection rate sharply increases to almost 70% in the first
generations.
3.2 Feature Creation Using Restricted Boltzmann Machine
The idea of creating features using a mathematical operator can be extended. First of
all, a mathematical operator captures only a limited kind of relation between features.
Secondly, features could be created from the combination of more than two initial
features. In this way, the newly created features may contain more information and
could prove more useful in separating the records. To address these ideas, the
mathematical operator should be replaced with a function that takes all the existing
features as arguments and produces a new feature as its result. Since the newly created
features will be used as input for a linear classifier, it is important that the
function used to create them be non-linear. Finding a function optimized for this task
may seem difficult; however, these requirements are very similar to the ones satisfied
by a restricted Boltzmann machine [46].
A restricted Boltzmann machine is a stochastic neural network composed of two layers:
one made up of visible units (the data) and another made up of hidden units (the newly
created features). Every unit from the hidden layer has a symmetrical connection with
every unit from the visible layer, but there are no connections between units of the
same layer. It is called restricted because of this lack of intra-layer connectivity.
Each neuron in the network has a binary state: 0 or 1. Figure 3.20 shows an example of
a restricted Boltzmann machine composed of 4 visible units and 3 hidden units, with
their corresponding biases.
Figure 3.20: A Restricted Boltzmann Machine
The value of each unit is generated according to a probability function. Considering
that the weights of the connections are stored in a matrix w (shared by both directions,
hence the symmetry), and that vi and hj represent the states of visible unit i and
hidden unit j, their probabilities are:

\[
p(h_j = 1 \mid v) = \frac{1}{1 + e^{-\sum_{v_i \in v} v_i w_{ij}}}
\tag{3.6}
\]

\[
p(v_i = 1 \mid h) = \frac{1}{1 + e^{-\sum_{h_j \in h} h_j w_{ij}}}
\tag{3.7}
\]
Learning in this network consists of adapting the weights such that the hidden units
are able to find relations between the visible units. This is very similar to how an
encoder works. The weights of the connections are adjusted such that, given data on
the visible layer, the hidden units receive values from which new data, very similar
to the original, can be generated. Thus the hidden units contain an encoded version of
the visible units, and the weight matrix is the means to encode and decode it. The
best-known algorithm for learning a restricted Boltzmann machine is called contrastive
divergence [47] and works in the following way. The visible units are first initialized
with data from the training set (vi) and the weight matrix is randomized. Using
Equation 3.6, the probability of every neuron from the hidden layer is computed. The
next step is to sample each hidden unit according to this probability, thus obtaining
the values of the hidden layer (hj). The values of the hidden units are the encoded
values of the visible units with respect to the current encoding configuration (the
weight matrix). The next step is to find out how the decoded data looks and to adjust
the weight matrix so that the decoded data becomes similar to the original one. So,
probabilities are generated for the visible units, given the hidden units and the
weight matrix, using Equation 3.7. By sampling according to the probabilities obtained,
a new set of data is generated (v̂i). The final step is to generate another set of
hidden units in a similar way as before, but this time starting from the newly
generated data, thus obtaining encoded values for the new data (ĥj).
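The steps above can be sketched in plain Python (a minimal CD-1 sketch under the assumptions of Equations 3.6 and 3.7: no bias terms, and the hidden probabilities are used in the negative phase, a common implementation choice):

```python
import math
import random

def cd1_step(v, w, lr, rng):
    """One contrastive-divergence update. v: list of 0/1 visible
    states; w: weight matrix indexed as w[i][j]; lr: learning rate.
    Returns a new weight matrix."""
    nv, nh = len(w), len(w[0])
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # positive phase: sample hidden units from the data (Equation 3.6)
    ph = [sigmoid(sum(v[i] * w[i][j] for i in range(nv))) for j in range(nh)]
    h = [1 if rng.random() < p else 0 for p in ph]
    # reconstruction: sample visible units from the hidden ones (Equation 3.7)
    pv = [sigmoid(sum(h[j] * w[i][j] for j in range(nh))) for i in range(nv)]
    v_hat = [1 if rng.random() < p else 0 for p in pv]
    # negative phase: re-encode the reconstruction
    ph_hat = [sigmoid(sum(v_hat[i] * w[i][j] for i in range(nv))) for j in range(nh)]
    # update: lr * (<v h> - <v_hat h_hat>), here for a single example
    return [[w[i][j] + lr * (v[i] * h[j] - v_hat[i] * ph_hat[j])
             for j in range(nh)] for i in range(nv)]

rng = random.Random(1)
w = [[0.0, 0.1], [0.2, -0.1], [0.0, 0.0]]  # 3 visible units, 2 hidden units
w = cd1_step([1, 0, 1], w, lr=0.05, rng=rng)
```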
The steps previously described, in which a new set of data and hidden values is
generated, constitute one iteration. The weight matrix is updated according to
Equation 3.8, where ε is the learning rate, n refers to the number of iterations and
⟨·⟩ denotes the mean over the training data.
Δw_ij = ε(⟨v_i h_j⟩