University Alexandru Ioan Cuza of Iasi, Romania
Faculty of Computer Science
Improving Detection Of Malicious
Programs
Author:
Mihai-Razvan Benchea
Supervisor:
Prof. Dr. Henri Luchian
Thesis committee:
Prof. PhD Dan Simovici, University of Massachusetts, United States of America
Prof. PhD Denis Enachescu, University of Bucharest, Romania
Prof. PhD Daniela Zahariei, West University of Timisoara, Romania
A thesis submitted in fulfillment of the requirements
for the degree of Philosophiae Doctor (PhD)
in the
Faculty of Computer Science
University Alexandru Ioan Cuza of Iasi, Romania
December 2015
Abstract
The thesis presents different methods that can be used in practice in order to increase
the detection rate of an antivirus product. The two most common types of malware
are analyzed: executable files and document files. A study of the privacy threats
brought by applications for mobile devices, as well as the necessary steps to create a
framework capable of analyzing and detecting them, is also provided.
To increase the detection rate for executable files, the following feature extraction
techniques are used:
Genetic Programming: New features are created from the original ones by evolv-
ing a set of boolean equations using the boolean operators not, and, or.
Neural Network: A Restricted Boltzmann Machine is used in order to find non-
linear relations between features.
The resulting features are used to train a linear classifier called One Side Class Percep-
tron (OSCP). The results are compared to the mapped version of the OSCP algorithm.
For the malicious documents, two types of clustering are tested in order to see which
one is better suited for practical use: hierarchical Bottom-Up clustering and Hash based
clustering. Classification of document based malware is tested using a combination of
a binary decision tree and the OSCP algorithm, as well as a Hidden Markov Model.
Acknowledgements
I want to express my gratitude to my advisor, Professor Henri Luchian, for the great
advice and ideas he provided to me. His experience helped me to look at my work from
a different perspective, while his constructive criticism helped me to realize and repair
my mistakes. I would also like to thank him for his patience and for allowing and
encouraging me to follow my research interests.
A special gratitude goes to my colleague, Dragos Gavrilut, who responded each time
I requested his help. His practical experience in writing algorithms helped me to
overcome any difficulties that I had in testing my ideas.
I would also like to thank the members of the advisory committee and the members
of Professor Luchian's research group for their insightful advice and for showing me
different approaches to my problems.
I am indebted to Bitdefender's management, who offered me the financial support, the
infrastructure, as well as the data necessary to complete my thesis. My thanks also go
to my colleagues from Bitdefender with whom I worked in developing the algorithms
presented in this thesis.
Last, but not least, I want to thank my family and friends for the support and
understanding they offered during my PhD years.
Contents
Abstract i
Acknowledgements ii
Contents iii
List of Figures vi
List of Tables vii
List of Papers ix
1 Malware Detection Problems 1
1.1 Malware Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Mechanisms For Evading Detection . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Defensive Mechanisms. Packers and protectors . . . . . . . . . . 3
1.2.2 Offensive mechanisms . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Non-executable malware files . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Attack Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Vulnerability Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Malware detection requirements . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Security and privacy concerns . . . . . . . . . . . . . . . . . . . . . . . 21
1.7.1 Privacy concerns regarding mobile devices . . . . . . . . . . . . 22
2 State Of The Art. Related Work 26
2.1 Malware Detection using Machine Learning . . . . . . . . . . . . . . . . 26
2.1.1 Malware detection using machine learning . . . . . . . . . . . . 27
2.1.2 Practical Optimizations for Perceptron Algorithms in Large Mal-
ware Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.3 Building a practical and reliable classifier for malware detection 29
2.2 Feature Extraction using Genetic Programming . . . . . . . . . . . . . 31
2.2.1 A Hybrid Approach to Feature Selection and Generation Using
an Evolutionary Algorithm . . . . . . . . . . . . . . . . . . . . . 31
2.2.2 Automatic Feature Generation for Machine Learning Based Op-
timizing Compilation . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.3 Evolving Novel Image Features Using Genetic Programming-based
Image Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.4 Feature Extraction Using Genetic Algorithms . . . . . . . . . . 34
2.2.5 Feature Generation Using Genetic Programming With Applica-
tion to Fault Classification . . . . . . . . . . . . . . . . . . . . . 35
2.2.6 The Evolutionary Pre-Processor Automatic Feature Extraction
for Supervised Classification using Genetic Programming . . . . 37
2.3 Improving Malware Detection through Feature Extraction . . . . . . . 38
2.3.1 Data Mining Methods for Detection of New Malicious Executables 38
2.3.2 The proactivity of perceptron derived algorithms in malware de-
tection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.3 Learning to Detect and Classify Malicious Executables in the Wild 41
2.3.4 Comparison of feature selection and classification algorithms in
identifying malicious executables . . . . . . . . . . . . . . . . . 42
2.3.5 Data mining methods for malware detection . . . . . . . . . . . 45
2.3.6 Detecting unknown malicious code by applying classification tech-
niques on opcodes representation . . . . . . . . . . . . . . . . . 46
2.3.7 A feature extraction method and recognition algorithm for de-
tection unknown worm and variations based on static features . 48
2.3.8 An interpretable string based malware detection system using
svm ensemble with bagging . . . . . . . . . . . . . . . . . . . . 49
2.3.9 Efficient virus detection using dynamic instruction sequences . . 50
2.3.10 A static malware detection system using data mining methods . 52
2.3.11 Large-scale malware classication using random projections and
neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.4 Detection of non-executable les . . . . . . . . . . . . . . . . . . . . . . 55
2.4.1 Caffeine Monkey: Automated Collection, Detection and Analysis
of Malicious JavaScript . . . . . . . . . . . . . . . . . . . . . . . 55
2.4.2 The Rise of PDF Malware . . . . . . . . . . . . . . . . . . . . . 56
2.4.3 Detection and Analysis of Drive-by Download Attacks and Ma-
licious JavaScript Code . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.4 Prophiler: a Fast Filter for the Large-Scale Detection of Mali-
cious Web Pages . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.4.5 Spyproxy: Execution-based Detection of Malicious Web Content 62
2.4.6 Zozzle: Low-overhead Mostly Static JavaScript Malware Detection 63
2.4.7 Pattern Mining for Future Attacks . . . . . . . . . . . . . . . . 64
2.4.8 Fast Plagiarism Detection System . . . . . . . . . . . . . . . . . 66
2.4.9 JPlag: Finding plagiarisms among a set of programs. . . . . . . 67
3 Improving Malware Detection By Feature Extraction 68
3.0.1 Feature Creation Using Mapping . . . . . . . . . . . . . . . . . 71
3.1 Feature Creation Using Genetic Programming . . . . . . . . . . . . . . 73
3.1.1 Using the boolean and operator for crossover . . . . . . . . . . . 77
3.1.2 Using all boolean operators for crossover . . . . . . . . . . . . . 87
3.1.3 Testing a better fitness function . . . . . . . . . . . . . . . . . . 93
3.2 Feature Creation Using Restricted Boltzmann Machine . . . . . . . . . 98
3.2.1 Training Ensemble of Linear Classifier and Restricted Boltzmann
Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.2.2 Testing the Ensemble and Results . . . . . . . . . . . . . . . . . 105
4 Clustering and Classification of Script based Malware 109
4.1 Clustering Document Based Malware . . . . . . . . . . . . . . . . . . . 116
4.1.1 Hierarchical Bottom-up Clustering . . . . . . . . . . . . . . . . 117
4.1.2 Hash Based Clustering . . . . . . . . . . . . . . . . . . . . . . . 127
4.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.1.4 Detecting document based malware . . . . . . . . . . . . . . . . 135
4.2 Classification of Script Based Malware . . . . . . . . . . . . . . . . . . 140
4.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.2 Using the One Side Class Perceptron algorithm . . . . . . . . . 142
4.2.3 Using the Hidden Markov Model . . . . . . . . . . . . . . . . . 145
4.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5 Mobile Privacy Issues 148
5.1 Application Store Protection and Malware . . . . . . . . . . . . . . . . 148
5.1.1 AppStore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.1.2 Google Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.2 Security and Privacy Risks . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.2.1 Privacy risks Of Third Party SDKs . . . . . . . . . . . . . . . . 153
5.2.2 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.2.3 Remote Code Execution via Android SDKs . . . . . . . . . . . 160
5.2.4 Methods for Privacy Risks Detection . . . . . . . . . . . . . . . 164
6 Conclusion and Future Work 167
Bibliography 169
List of Figures
1.1 Malware Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Growth of market share of Mobile Phones OS . . . . . . . . . . . . . . 23
1.3 Growth of application stores . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 OSC training process. Training step (a). Class minimize step (b). . . . 71
3.2 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4 Crossover operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.5 Selection Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Number of original features used on each generation . . . . . . . . . . . 82
3.7 Crossover operation including original features . . . . . . . . . . . . . . 83
3.8 Number of original features used on each generation . . . . . . . . . . . 83
3.9 Selection Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.10 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.11 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.12 Chromosome example evaluation . . . . . . . . . . . . . . . . . . . . . 89
3.13 Evaluation tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.14 Initial Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.15 Evolution of the detection rate and false positives rate . . . . . . . . . 92
3.16 Evolution of the fitness score of each generation . . . . . . . . . . . . . 92
3.17 Different features being activated on the same subset . . . . . . . . . . 94
3.18 Evolution of the detection rate . . . . . . . . . . . . . . . . . . . . . . . 97
3.19 Evolution of the number of features activated on each record . . . . . . 97
3.20 Restricted Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . 98
3.21 Restricted Boltzmann machine and Linear Classifier . . . . . . . . . . 100
4.1 Distribution of the Manhattan distance for all pairs of files . . . . . . . 122
4.2 Distribution of the Manhattan distance for all pairs of files . . . . . . . 126
4.3 Steps for building the classier . . . . . . . . . . . . . . . . . . . . . . 140
4.4 Sequence Classication . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.1 Network traffic of CricketLand . . . . . . . . . . . . . . . . . . . . . . . 156
5.2 Widdit Update Architecture . . . . . . . . . . . . . . . . . . . . . . . . 162
List of Tables
3.1 The results of the 3-fold cross-validation for OSCP and OSCP-MAP
algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2 The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score . . . . . . . . . . . . 76
3.3 The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score plus the original 300
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4 Results of using chromosomes resulting from different generations . . . 82
3.5 Results on different iterations for arbitrary length features . . . . . . . 84
3.6 An example of the bloating problem . . . . . . . . . . . . . . . . . . . . 85
3.7 Results on different iterations for arbitrary length features . . . . . . . 85
3.8 Comparison of the results when using features resulting from mapping
versus genetic programming . . . . . . . . . . . . . . . . . . . . . . . . 86
3.9 Results using different numbers of features . . . . . . . . . . . . . . . . 86
3.10 The results of the 3-fold cross-validation for OSCP, OSCP-MAP, OSCP-
RBM algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.1 Token classes and their value . . . . . . . . . . . . . . . . . . . . . . . . 119
4.2 Token classes and their value . . . . . . . . . . . . . . . . . . . . . . . . 120
4.3 List of tokens after performing simple string operations . . . . . . . . . 120
4.4 Token classes and frequencies . . . . . . . . . . . . . . . . . . . . . . . 121
4.5 Cluster components for Manhattan distance algorithm . . . . . . . . . 123
4.6 Cluster obtained using custom metric distance . . . . . . . . . . . . . 126
4.7 Cluster components for hash-based clustering method . . . . . . . . . . 130
4.8 Nr. of PDF files containing JavaScript code . . . . . . . . . . . . . . . 131
4.9 Time comparison between Hash Based Clustering and Hierarchical Bottom-
up Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.10 Results for different values of the normalization factor . . . . . . . . . 133
4.11 Similarity of files from the top 5 biggest clusters after manual analysis . 133
4.12 Clustering results obtained with and without precomputation . . . . . 134
4.13 Mean detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.14 Top 20 token classes according to score S1. . . . . . . . . . . . . . . . 138
4.15 Top 20 token classes according to score S2. . . . . . . . . . . . . . . . 139
4.16 3-fold cross validation results for OSCP, OSCP-frequency BT-OSCP-xxx
and HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.1 Actions Performed by TOP5 3rd Party SDKs . . . . . . . . . . . . . . 155
5.2 Permissions required to integrate Widdit SDK . . . . . . . . . . . . . . 163
List of Papers
The content of this thesis is based on the work presented in the following papers:
1. "Feature Creation using Genetic Algorithms for Zero False Positive Malware Clas-
sification", Razvan Benchea, Dragos Gavrilut, Henri Luchian, EVOLVE 2015,
2015
2. "Feature Creation Using Genetic Programming with Application in Malware De-
tection", Cristina Vatamanu, Razvan Benchea , Dragos Gavrilut, Henri Luchian,
International Symposium on Symbolic and Numeric Algorithms for Scientific
Computing (SYNASC), 2015
3. "Combining Restricted Boltzmann Machine and One Side Perceptron for Malware
Detection", Razvan Benchea , Gavrilut Dragos, the International Conference on
Conceptual Structures (ICCS), 2014
4. "A Practical Guide for Detecting the Java Script-Based Malware Using Hidden
Markov Models and Linear Classifiers", Doina Cosovan, Razvan Benchea, Dragos
Gavrilut, International Symposium on Symbolic and Numeric Algorithms for
Scientific Computing (SYNASC), 2014
5. "Building a Practical And Reliable Classifier For Malware Detection", Cristina
Vatamanu, Dragos Gavrilut, Razvan Benchea, Journal in Computer Virology,
2013
6. "A Practical Approach on Clustering Malicious PDF documents", Cristina Vata-
manu, Dragos Gavrilut, Razvan Benchea , Journal in Computer Virology, 2013
7. "Practical Optimizations for Perceptron Algorithms in Large Malware Dataset",
Dragos Gavrilut, Razvan Benchea , Cristina Vatamanu, International Sympo-
sium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC),
2012
8. "Optimized Zero False Positives Perceptron Training for Malware Detection",
Dragos Gavrilut, Razvan Benchea , Cristina Vatamanu, the International Sym-
posium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC),
2012
9. "Performance Testing Framework: Evaluating The Impact On The System Speed",
Paul Bot, Cristina Vatamanu, Dragos Gavrilut, Razvan Benchea , the Work-
shop on Antimalware Testing Research (WATER), 2014
10. "Hiding The Network Behind The Network. Botnet Proxy Business Model",
Cristina Vatamanu, Alexandru Maximciuc, Razvan Benchea , Virus Bulletin
(vb2014), 2014
11. "Google and Apple Markets: Are Their Applications Really Secure?!", Vlad Bor-
dianu, Razvan Benchea , Dragos Gavrilut, Virus Bulletin (vb2013), 2013
12. "In-depth Analysis of PushDO Botnet", 16th Association of Anti-Virus Asia Re-
searchers International Conference 2013 (AVAR), Alexandru Maximciuc, Cristina
Vatamanu, Doina Cosovan, Dragos Gavrilut, Octavian Mihai Minea, Razvan
Benchea , 2013
Chapter 1
Malware Detection Problems
1.1 Malware Evolution
With the rise of Internet access, the malware environment began to change. Before
that, most malware samples were spread by removable media such as floppy disks
or CD-ROMs, which of course limited their reach to a few computers. Access to
information was also very limited compared to the present day. This may be the reason
why many types of malware were just proofs of concept or bad jokes. However, access
to the Internet gave malware creators access to many computers. Starting with the year
2007, malware began to increase exponentially: more malware was produced in 2007
than in all previous years put together. The sudden growth was mainly influenced
by the idea that malware could be used to earn money. Almost all possible ways of
earning money through malicious code were exploited: from the most obvious ones, like
stealing access to bank accounts or asking for a ransom to restore access to encrypted
files, to less obvious ones, like simulating user behavior on certain websites in order to
increase traffic and, with it, the income of those websites. According to an article
published in 2003 [1], the annual cost of cyber activity in the United States of America
was between 24 billion and 120 billion dollars. This year, more than 300,000,000 samples
were discovered, 100,000,000 more than in the previous year. Figure 1.1, taken
from the independent anti-virus testing company AV-Test [2], best describes how the
number of malware files has changed from year to year.
Figure 1.1: Malware Evolution
Trying to detect these samples has created an arms race between antivirus products and
malware. As antivirus products and operating systems become more and more complex
and include more technology in order to protect computers, malware adapts and
brings new methods for evading detection. These methods will be described in the
following sections.
1.2 Mechanisms For Evading Detection
In the beginning, almost all anti-viruses were signature based. They worked by
computing a hash function over certain parts of the scanned file and then comparing it
to a list of hashes that was built in the laboratory after analyzing malware samples.
The method worked well since most malware files were static (they never changed
from one infection to another) and there were relatively few of them. There was no
need for heuristics, since malware analysts could gather all the files and analyze them
manually.
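A minimal sketch of this signature scheme, with a hypothetical database and MD5 standing in for whatever hash function a given product actually used:

```python
import hashlib

# Hypothetical signature database: region hashes of known samples.
# The single entry below is the MD5 of an empty region, for illustration only.
SIGNATURES = {"d41d8cd98f00b204e9800998ecf8427e"}

def region_hash(data: bytes, offset: int = 0, size: int = 64) -> str:
    """Hash a fixed region of the scanned file, as early signature scanners did."""
    return hashlib.md5(data[offset:offset + size]).hexdigest()

def is_known_malware(data: bytes) -> bool:
    return region_hash(data) in SIGNATURES
```

Any byte change inside the hashed region produces a different digest, which is exactly what the evasion techniques of the next section exploit.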
When malware creators realized that the amount of money they can earn depends on
how long their malware remains undetected, several methods of protecting the virus
were developed.
1.2.1 Defensive Mechanisms. Packers and protectors
One of the simplest methods a malware creator can use to avoid detection of his
program is a packer. The packer is a program that, given an executable, will
generate another one that has a smaller size than the original but does the same thing.
It works by compressing the original code and inserting a stub that will decompress it in
memory when the program is executed. There are several advantages from the malware
creator's perspective when using a packer. First, the file on disk is smaller, thus it can be
transported more easily. Secondly, the resulting file will look completely different from
the original, thus a signature based anti-virus will not be able to detect it. Packers
are not malicious by themselves. They are also used by many applications, such as
installers, in order to compress their data. This is why detecting the packer can lead
to many false alarms.
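The compress-and-stub idea can be sketched in a few lines (Python and zlib stand in for native code and a packer's real algorithm; this is an illustration, not any particular packer):

```python
import zlib

def pack(code: bytes) -> bytes:
    """'Pack' a payload: the bytes written to disk no longer resemble the original."""
    return zlib.compress(code, level=9)

def run_packed(packed: bytes) -> bytes:
    """What the inserted stub does at run time: decompress the code in memory."""
    return zlib.decompress(packed)

original = b"print('payload')" * 50   # repetitive 'code' compresses well
packed = pack(original)

assert run_packed(packed) == original  # same behavior once unpacked
assert len(packed) < len(original)     # smaller file on disk
assert original not in packed          # a byte signature of the original fails
```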
Another frequently used method is to protect the code from people who want to
reverse engineer it (such as security researchers working for an antivirus company).
The protector is a piece of code that gets executed before the protected code and,
depending on its complexity, provides different methods to revoke access to code and
data unless certain conditions are satisfied. The main idea is to make analysis of the
code very difficult, but allow the malicious code to run on a system where no analysis
is being performed. Protectors are also used by programmers in order to prevent
reverse engineering of their code. They usually work by encrypting the code and data
and decrypting them on execution only if certain conditions are met. These conditions
can be of the following types:
Debugger is not present: if a debugger is attached to the current process, the
process will force itself to crash, exit, or behave abnormally.
Debugging tools are present: if tools that are usually used to analyze the behavior
of a process are running on the system, such as network monitoring tools or
programs that show process components, the process will force itself to crash,
exit, or behave abnormally.
Virtual machine detection: most packers refuse to run inside a virtual machine. This is
mainly because virtual machines are used by malware analysts to study a sample's
behavior. There are different methods to detect if a process is running in a virtual
machine, from simple ones, like detecting types of hardware that are specific
to a certain vendor, to more complex ones, like detecting whether certain
instructions are emulated while others are executed.
Code tampering detection: the method works by calculating checksums over
parts of the code and data to see if they were modified during the execution of
the sample. Tampering usually happens when the sample is being analyzed
in a debugger.
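The last condition reduces to a self-checksum. A sketch (CRC32 over a stand-in byte region; real protectors use their own routines over the mapped code):

```python
import zlib

code_region = bytearray(b"\x90" * 16 + b"\xc3")   # stand-in for protected code bytes
EXPECTED = zlib.crc32(bytes(code_region))         # recorded when protection is applied

def tampered() -> bool:
    """Recompute the checksum over the code region during execution."""
    return zlib.crc32(bytes(code_region)) != EXPECTED

assert not tampered()     # untouched code passes the check
code_region[3] = 0xCC     # a debugger writes an int3 breakpoint into the code
assert tampered()         # the protector notices and can refuse to continue
```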
Besides these methods of validation, protectors usually include different techniques that
make analysis very difficult:
Anti-dumping: the dumping method is usually applied in order to get access to
code that is protected by a packer. Since the code that needs to run has to be
unpacked in memory, a simple way to get access to that code is to dump the
memory after its execution. The anti-dumping method works by overwriting in
memory pieces of code that were executed and are no longer needed.
Virtualizing: the method works by transforming pieces of the original code into
another language, understood only by the packer. The packer then executes the
code using a virtual machine. In order to analyze a sample that is virtualized,
one must understand how the virtual machine works and what each newly created
byte code represents.
Obfuscators: their main purpose is to make analysis and reverse engineering
very difficult. They work by rewriting the original code with instructions that
do the same thing but in a form that is very difficult to read. Examples of methods
used in obfuscation are: random variable names, garbage code, calls to functions
that do not change the behavior of a program, and encryption of strings and function
names.
Polymorphism is usually related to file infectors. Its main purpose is to make
detection based on hashes inefficient. The method usually relies on an encryp-
tion algorithm that takes a key as an argument. Every time a new infection is
generated, it will use a different key, resulting in a file that looks different
but does the same thing.
Server side polymorphism works in the same way as polymorphism, but instead of
the file being generated on the infected computer, it is generated on a server. This
brings several advantages to the attacker, the most important being that he has
total control of the algorithm that generates the file. He can adapt it and change
it at any time. This method is very popular among malware creators because
they can create a new sample, scan it with every known antivirus product, and
send the file into the wild only if it is not detected.
Metamorphism represents the method through which a malware can change itself
without affecting its behavior. This is more complex than polymorphism since,
in polymorphism, the only difference between samples is due to a different encryption
key. In metamorphism the sample is usually disassembled and an instruction or a
group of instructions is replaced with a new instruction or group of instructions
that does the same thing but in a different way. For example, a multiplication
might be replaced with several addition operations.
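The last two techniques can be illustrated together (a one-byte XOR cipher stands in for the polymorphic encryption, and a multiplication-to-addition rewrite for the metamorphic substitution; both are deliberate simplifications):

```python
import os

PAYLOAD = b"invariant malicious behavior"   # stand-in for the body of the infector

def polymorphic_copy() -> tuple[bytes, bytes]:
    """Each new infection picks a fresh random key, so the stored bytes differ."""
    key = os.urandom(1)
    return key, bytes(b ^ key[0] for b in PAYLOAD)

def decrypt(key: bytes, encrypted: bytes) -> bytes:
    return bytes(b ^ key[0] for b in encrypted)

k1, c1 = polymorphic_copy()
k2, c2 = polymorphic_copy()
assert decrypt(k1, c1) == decrypt(k2, c2) == PAYLOAD  # same behavior either way

# Metamorphism: replace an instruction with an equivalent sequence.
def mul(a: int, b: int) -> int:
    return a * b

def mul_metamorphic(a: int, b: int) -> int:
    """Multiplication rewritten as repeated addition: different code, same result."""
    total = 0
    for _ in range(b):
        total += a
    return total

assert mul(7, 6) == mul_metamorphic(7, 6) == 42
```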
1.2.2 Offensive mechanisms
Remaining undetected can also be achieved by tampering with the anti-virus product
or making changes to the operating system. Even though in some cases the operating
system warns the user of changes made to its configuration, many users
don't read the message or just accept it anyway. There are cases when a malware tries
an action so often that the notification message is displayed in an aggressive manner.
To stop the system from showing it, many users simply deactivate the operating system's
protections without looking for the cause of the notifications. The following methods of
offensive protection were encountered when analyzing malware families:
Avoid running when an antivirus is installed: this is the simplest and most
used method of avoiding detection. The malware simply lists the processes and
searches for their names in a list of file names used by antivirus products. If it
finds a match, it will simply uninstall itself. This way, it will take longer until the
malware analysts get access to the malware file, and users who don't run an
antivirus product will stay infected longer.
Terminating the antivirus product: this is harder for the malware to do, since
most anti-malware products have a driver that protects their files from inside
the kernel. However, if the virus manages to get into the kernel (usually
by using an exploit or by installing a digitally signed driver), then this is possible.
Disable updates: usually, the Windows updates are the ones that are disabled.
This way, if the user has the default antivirus product, no updates will be de-
livered. There were also some malware samples that deactivated the update
mechanisms by simulating mouse clicks on the anti-malware product's interface.
Disabling or fooling the firewall: since most malware files try to connect to a command
and control center in order to receive updates or commands, generating network
traffic is something that cannot be avoided. There are several ways to trick the
firewall in order to remain undetected. An obvious one is to simply deactivate it.
Another one is to perform the actions by injecting into a well-known application
that generates network traffic (like a browser). However, when an antivirus is
installed and network traffic is scanned, many malware files simply encrypt their
traffic and make it look genuine. For example, one banker family transfers its
updates as an image file.
Change DNS settings: by changing the domain name server or by editing the hosts
file (a file that maps domain names to the IP addresses they resolve to), a
malware file can block the antivirus product from contacting its update server.
Change permissions: an interesting trick used by one of the most well-known worms,
Conficker, to avoid being scanned by the anti-virus product was to change its
permissions. After being installed, the malware would set the permissions of its
files to allow only execution (no reading or writing). This way, when a scan was
performed, the antivirus would skip the files since it had no access.
Install a rootkit: this is the most efficient way to avoid detection. By installing
a driver, a malware can hook functions used for listing files, registry keys or
processes. Each time one of these functions should return information about a
component of the malware, it would just skip it, the component thus appearing
hidden to the user or even the anti-virus product. Current versions of the Windows
operating system require any installed driver to be digitally signed. This has
significantly reduced the number of rootkits. Most of the malware that still succeeds
in installing a driver does so through the use of a stolen certificate.
Not writing the file using the file system: to achieve this, one method used by
malware is to keep the important malware component in the registry and have a
component on disk that reads a buffer from the registry in order to execute it.
The file on disk will appear benign to analysis, since all the malware components
are in the registry.
The list of methods described above is by no means complete. In recent years there
have been reports of malware that remained hidden for long periods of time. New
methods are being developed as anti-viruses get more complex. For example, this year
malware was discovered that overwrote the firmware of a hard drive in order to make it
look smaller. On the remaining disk storage (not used by the hard drive) the malware
would store its components. To get access to the kernel, the virus exploited a
vulnerability in an already installed driver.
1.3 Non-executable malware files
Most users know that they should not open an executable file received through
e-mail because it may contain a virus. However, not many realize that malware
comes in many forms. The reason behind this is that many commonly used programs
accept as input files that can contain code. Even though the use of code is mainly
meant to increase functionality, it also allows attackers to control the behavior of the
program. All of these programs come with protection mechanisms in order to avoid
running malicious code, but sometimes, through the use of a vulnerability, these
mechanisms fail.
The most used file formats that carry malware are listed below:
.SCR: the .scr files are associated with screen savers. They are usually run when
there is no user interaction for a defined period of time. A thing that most
users don't know is that .scr files are actually executable files. They have the
same format (MZ-PE format) as any executable file on the system. Thus, malware
can come disguised as a screen saver.
.DLL: the dynamic link library file contains a set of functions that are exported
so that other programs can use them. Almost every medium or large program
uses these files in order to separate the functionality of the program into different
parts. Their format is the same as any executable (MZ-PE), which means that
they are sometimes associated with malware. A flag in the header of the DLL
file forbids running them directly; however, they can be run through other means.
For example, an executable from the Windows operating system (rundll32.exe)
loads any dll given as an argument. They can also be loaded by Internet Explorer or
Windows Explorer if they have a certain format (Browser Helper Object format).
.BAT: the batch file is a script file that contains a series of commands to be
executed by the command line interpreter in Windows. Even though malware
could be written in this language, such cases are rare. However,
many malware families rely on these files to perform secondary actions. One of
the most common cases is where a malware process attempts to delete its files after
it finishes executing. Since it cannot delete its files while it is running, it needs
an extra process to do this when it finishes. For this reason, a batch file is
created that attempts to delete the malicious file in a loop.
.VBScript: Visual Basic Script files are also used by malware creators. They are
usually preferred by hackers who have just learned how to write malware, thus Visual
Basic Script viruses usually have a simple structure and a straightforward behavior.
The reason why hackers choose the Visual Basic Script language is that it
is easy to write simple code and, most importantly, the Windows operating system
comes with a Visual Basic Script interpreter already installed (wscript.exe). Most
malware written in this language consists of simple worms that copy themselves
to startup folders, downloaders that are used to get more complex malware
from the Internet, or simple trojans (for example, files that delete all documents
from a computer). Even though this is rare, there have even been file infectors written
in Visual Basic Script; these would infect other Visual Basic Script files.
JavaScript code in HTML files: This has been the main method of infection in
recent years. There are usually two main purposes for the JavaScript
code when it comes to its use in malware behavior.
– In the first case it is used to exploit a vulnerability in the browser. Even
if the vulnerability is caused by how the browser handles an HTML tag, or by
some function of a plugin found in the browser, control is usually taken
using JavaScript through a technique called heap spray. This
works by allocating large regions of memory into which malicious binary code
is copied. The purpose is to increase the probability that execution
reaches a memory region controlled by the malware in case of an
invalid memory access, which is usually the result of an exploit.
– The second case is validation of the victim (checking whether
the victim is running a certain operating system or browser in order to see
if it is vulnerable), redirection (either to a malicious web page or to a well-known
benign web page in case the user is not vulnerable), or drive-by
downloads. Drive-by downloads have become more and more frequent in
recent years. They work by automatically downloading a file to the user's
computer when the victim visits a web page. Most often, the downloaded
file is an adware or spyware program, but downloaders and trojans have
also been installed using this method.
.PDF: Malware embedded in PDF files started to be used in 2007. Malware
found in these files usually targets vulnerabilities in the Adobe Acrobat Reader
software. Since this program is found on the majority of computers that run
the Windows operating system, a remote code execution vulnerability in it could
mean a way of infecting millions of users. The majority of malware that comes
in PDF files makes use of the feature that allows Adobe Acrobat Reader to
run JavaScript code. Another feature of PDF files is the ability to embed
binary files inside them through the use of streams. Upon successful exploitation of the
Adobe Reader program, malicious files found in binary streams can be dropped
to disk and executed. The PDF file format was chosen by malware creators to deliver
malicious content since it is associated with documents, which users consider
to be static content. It is also preferred over other document formats
(i.e. the Office formats) since the format is easy to understand. Another advantage
of PDF files over other document formats is that they can be opened using a
browser; thus they can be delivered by a web server, not only through e-mails or
other targeted methods (i.e. memory sticks). The predominant malware type
associated with PDF files is the downloader, because the PDF format is
usually used as an attack vector.
.Office Files: Since these are the best-known document formats, malware creators
found in them an opportunity to deliver malware. There are usually two directions
through which malware gets delivered as Office documents: through macros and
through vulnerabilities.
– When Microsoft introduced the ability to run macros inside Office documents,
malware associated with documents started to appear. Like the JavaScript
language inside PDF files, Visual Basic for Applications (VBA)
was introduced to increase the usability of Office documents (i.e. to automate
repetitive tasks). However, through its use, malware creators made it
possible to drop files on disk and execute them, or to send e-mails. Starting with
Office 2007, when the Office file format changed and no longer allowed macros
to be embedded (unless a special extension is used), macro malware
decreased rapidly. There are still documents that contain macro malware,
but they are few and usually rely on social engineering to convince the victim
into allowing the macro to run.
– Since macros are disabled in the new Office document format, another
method was needed to deliver malware through documents. The
new format also simplified the Office format, so new vulnerabilities could
be found more easily. Nowadays, most of the malware that gets delivered
through Office documents relies on vulnerabilities. Like PDF files, Office
documents are usually used as attack vectors, so the majority of malware
that gets disguised as documents consists of downloaders.
SWF Files: The .swf extension is associated with files for the Adobe Flash Player.
This is a format that is used to bring interactive and animated content to the
user through a plugin in the browser. It has its own scripting language,
called ActionScript, but it also allows the use of JavaScript. Just as JavaScript is used
in browsers or PDF files to take control of the program rendering them, so is ActionScript
for SWF files. Flash malware usually exploits a vulnerability inside
the Flash Player. Since the same SWF file can be run by browsers on different
operating systems, a malicious Flash file can also target users that don't run Windows
software.
Java Files: Software written in the Java language for the PC is either a stand-alone
program or an applet that requires a browser plugin to run. There are few
malware files written in the Java language that run as stand-alone programs. Most
of them are applets and require a Java plugin installed in the browser in order to
be rendered. As with SWF files, malware written in Java is portable, so it can
affect users running different operating systems. In 2012 and 2013, Java exploits
were the main method of delivering malware through the browser [3].
1.4 Attack Vectors
An attack vector is a collection of steps that an attacker takes in order to infect the
victims. Even though there are many ways a person can get infected, there are some
patterns by which an attack can be classified. Every attack vector needs an entry point
into the victim's computer. There are four main entry points:
1. External devices: This can be anything, like a CD or DVD, but the most used
one is the USB flash drive. Until Windows 7, when the autorun feature was
deactivated, this was one of the most used methods of spreading malware. In
order to spread, malware running on a computer would list every removable
drive and copy itself to that location. In addition to that, it would create an
autorun.inf file containing instructions to automatically execute the copied malware
file when the USB flash drive is inserted. Every user that opened
the USB flash drive using Windows Explorer would get infected. Using this method,
malware could spread to locations that were isolated from the Internet or even from
a local network.
2. Vulnerabilities of known applications: either a vulnerability located in
the browser, one related to a program that opens documents, or even a
vulnerability of a specific application used by a victim who is important to the
attacker. The purpose is the same: to infect the victim's computer. To
increase the probability of infecting a user, hackers have created packages that
contain many exploits, named exploit packs. When a user visits a web page,
the exploit kit tries to take control of the victim's machine by feeding it all kinds of
exploits: from Java exploits to PDF exploits and exploits targeting the browser that
the victim is using.
3. Direct network attacks: in this case, the computer is attacked either through a
vulnerability of a service that the operating system is running or by trying to
copy to shared folders by brute-forcing the victim's passwords. This behavior is
usually associated with worms. A vulnerability in an operating system service accessible
through the network can make the malware spread all over the world. The
Conficker worm [4] used both a vulnerability and a brute-force technique to spread
and succeeded in infecting more than 7 million computers.
4. E-mail and social channels: By sending e-mails with malware attachments or
sending files through instant messaging programs, the malware tries to involve
more computers in the infected network. This method is usually accompanied by
a social engineering component. The malware writes some lines in the name of the user
and sends itself as an image file in order to look genuine. Even though not all
the users that receive the files get infected (since they need to run them), it is still a
frequently used method.
The previous entry points are used together with social engineering methods or other
technologies in order to increase the probability of infection. The following attack vectors
are the most used:
1. Spear phishing: this technique relies on sending e-mails to only a small group
of people that have something in common. For example, victims of the Red
October malware [5] received an e-mail containing a document about a sale of
diplomatic cars. The document contained an exploit for a vulnerability of Office
documents that the attackers used to drop a backdoor. Sending e-mail to only a
small group of people decreases the chance that the malicious file will be sent
to an antivirus company for analysis. Since it relies on information that is of
interest to the victim, it does not raise any suspicion and it can also get past the
antispam filter.
2. Drive-by downloads: using this method, a file is automatically downloaded once
a user visits a web site. If the victim is vulnerable to a certain exploit, then the
file is automatically executed. Many times, this method relies on malicious
ads that are displayed on benign web sites. Thus, without suspecting anything,
while visiting a benign web site, the user is attacked with a collection of
exploits.
3. Black hat SEO: in order to reach as many users as possible, the attacker uses black
hat search engine optimization techniques to make a malicious web page appear
among the first results when certain words are searched using well-known search
engines like Google. Black hat SEO includes, but is not limited to, using many
keywords without any relation to the content of the web page, duplicating content,
or infecting other web pages with a link that points to the malicious web page in
order to increase its ranking. If the user opens such a web page, it will result in
a drive-by download.
4. Watering hole: this method is usually used when trying to infect a certain group
of people. The main idea is to infect a benign website that the targeted users
visit in order to deliver malicious content. It is similar to drive-by download
attacks, the difference being how the malicious content gets to the user (through
infection of the web page and not through malicious ads).
1.5 Vulnerability Types
Nowadays, most users know that they must not execute a file that came attached to
an e-mail. Moreover, e-mail programs allow restrictions to be set such that the
user is not allowed to run an executable file that came as an attachment. This simple
method blocks the attacks where an executable file uses a document icon (i.e. PDF or
DOC).
In order to get past these restrictions, attackers try to find critical vulnerabilities inside
well-known applications. By leveraging these, an attacker can infect a victim by simply
making him visit a web page (through watering hole or black hat SEO techniques). These
vulnerabilities appear due to programmer mistakes such as writing over the
boundaries of a buffer or using a memory space that has already been deallocated. The
attackers try to exploit these vulnerabilities in a way that redirects the execution
pointer to their buffer, thus achieving code execution. All these vulnerabilities
deal with a memory space of a fixed width (called a buffer). Depending on how that
buffer is used, the vulnerabilities can be grouped into several categories.
1. Stack Overflows: This is one of the simplest types of vulnerabilities and it appears
when the programmer writes more data into a buffer declared on
the stack than it can hold. By doing this, the programmer actually overwrites additional data
structures used by the CPU in order to achieve correct execution.
In assembly language, when a function is called, the CPU pushes on the
stack different values like the arguments of the function, the value of a register
called EBP, and the address where it has to return. The EBP register is used
to store the value of the stack register ESP (which indicates the head of the
stack) while the execution pointer is inside the function. When execution
reaches the end of the function, the ESP register receives the value of EBP.
Basically, the EBP register serves as a backup for ESP and makes tracking
of arguments and local variables easier. Another important thing that happens
when the function has finished executing is that the address where the function
needs to return is popped from the stack and execution jumps to that
address.
Standard stack overflow: the problem with this approach is that the local
variables, including buffers, are stored on the same stack as the structures
explained above. When a program fills a buffer with data provided by
the attacker and doesn't correctly check whether it passes the boundaries
of the buffer, it is possible to overwrite a return address used by a function.
Of course, in order to achieve malicious code execution, the return address
must point to an address controlled by the attacker. This is usually an
address on the stack where the buffer sent by the attacker is stored.
Off by one byte: this is similar to the standard stack overflow, but it is more
difficult to achieve code execution. The problem usually appears when a
programmer tries to use a safe function, that checks for buffer boundaries,
but specifies the wrong buffer length. It is a common mistake for novice
programmers to miscount the buffer length, since indexing starts at
0, which is not intuitive. So the length provided is usually off by one byte
from what it should be.
With a single byte to control, the attacker cannot directly control the return
address of a function. However, he can still modify one byte of the register called
EBP. Since on the Intel architecture values are stored in little-endian order, a one-byte
overflow will actually modify the least significant byte of the EBP
register. When the value of EBP is assigned to ESP, at the end of the
function, the attacker will actually control where the stack begins. In
this way he can also control where the function will return, since this value
is also stored on the stack.
Using format strings: Like the rest of the stack vulnerabilities, this one also
tries to control the return address of a function. This time this is done by
exploiting a feature of the printf function and its variants. The function
printf writes a buffer to the console and accepts different
arguments that specify how the output is to be formatted. There are different
variants of it; some of them are for Unicode characters and there is also
a variant that is used for formatting a string instead of outputting to the
console.
The formatting options are specified using the % (percent) character. There
are many different formatting options that control how the printf function
works, like outputting the hexadecimal representation of a number, outputting
a string or a char; there is even an argument that can be used to store the
number of chars output so far at a memory address. However, the function can
also be used without any formatting arguments. A user can simply pass the
buffer and that buffer will be output. This is actually the source of the
problem.
Novice programmers may pass a string that is copied from the input to the
printf function without using any formatting arguments. An attacker may
send a string that contains formatting arguments. Doing so will make the
printf function search for additional arguments on the stack. A very powerful
formatting argument is "%n". This tells the printf function to write the number
of bytes processed to a specified address. Thus, by providing
several formatting arguments, one can skip certain addresses on the stack and
reach the return address of the function.
2. Heap Overflows: buffers on the heap are usually used when the amount of data
to be stored is too big for the stack or when the maximum length
is not known at compile time. These are usually created using functions such
as malloc and new and are freed using the functions free and delete. In contrast
to the stack, there are no structures directly used by the CPU stored
on the heap. For this reason, these kinds of vulnerabilities are more difficult to exploit.
There are different exploitation techniques meant to achieve code execution using
heap overflow vulnerabilities. Most of them try to overwrite sections of memory
that will later be referenced in an execution context. This is possible since, on the
heap, buffers are usually stored in adjacent locations, so writing beyond the
boundaries of a buffer stored on the heap is very likely to write over another buffer.
Overwriting function pointers: A good way to achieve code execution is
to overwrite function pointers. On some occasions these are stored on the
heap. A good example is a C++ class that uses virtual functions. Since
these kinds of functions are resolved only when the object is instantiated,
they are stored as function pointers. More than that, every object that uses
virtual functions also has a pointer to a table, called the virtual function table,
that stores all these pointers. Overwriting the table or the pointer to that
table is a good way of achieving execution of the malicious code.
Use after free: There are situations when an object is allocated on the heap,
freed, and later used again. Normally, this would lead to a crash, since
unallocated memory is being used. However, if an attacker manages to reallocate
the freed memory and copies his code there, the crash will not happen
anymore and the new code will be executed instead. These kinds of vulnerabilities
are often exploited in browsers. This is possible since browsers,
through the JavaScript language, allow heap spraying. Basically, heap spraying
means that an attacker can allocate large chunks of memory on the heap and
fill the bytes with the right values. Using large chunks of memory increases
the probability of overwriting the previously freed locations.
3. Integer Overflows: These vulnerabilities usually happen when the program performs
some arithmetic operations prior to a buffer operation, but doesn't take into
account the boundaries of the arithmetic results or how the compiler treats the
result. Overflowing an integer is only useful when the result is used as an argument
of a buffer processing function, such as strncpy.
Boundary checks: A classic scenario of this kind of vulnerability is the
following: an arithmetic operation is performed on an unsigned integer.
The result of this operation determines the length of a buffer to be
allocated. After that, some user-controlled buffer is copied into the allocated
buffer. If the arithmetic result isn't validated, an attacker can make
the result overflow; thus, instead of obtaining a very large number, a small
number will be obtained. If this happens, the size of the allocated buffer
will be too small, leading to a classic stack or heap overflow.
Signedness errors: These happen when a signed variable is used both as
signed and unsigned in different scenarios. For example, a programmer may
use a signed integer to measure the length of a buffer. The signed integer will
be verified against a maximum length and, if it is lower, a buffer operation
will be performed. Passing a negative value as the length of the buffer
will make it pass the maximum length validation. However, when a strncpy
function (or an equivalent) is used, it will treat the supplied length as unsigned,
thus overwriting other memory zones.
1.6 Malware detection requirements
Given the fact that malware families evolve very fast and use sophisticated techniques
to avoid detection, it is very important for malware classification methods
to be able to handle very large datasets in a reasonable amount of time. The number
of malware files being released in the wild is so big that only an automated or semi-automated
mechanism can handle it at this time. For this reason, a machine learning
algorithm can be used to filter what malware researchers analyze, or it can even be used
as a primary detection method. However, this detection mechanism must also obey
some restrictions in order for it to be used in practice:
1. It has to be able to adapt fast to changes in the malware landscape. By fast adaptation
I mean that it needs to observe the similarities between new malicious files
and old malicious files of the same family, but it also needs to be able to make
clear distinctions between the new malicious files that appear on a daily basis and a
vast set of benign files. Training speed is also important when it comes to fast
adaptability. If the algorithm takes too long to train, by the time it finishes,
the practical results will be far less than expected, since the malware landscape
will have already changed. So, to achieve the best results, the algorithm must be
trainable in a couple of days. Even though today's computing power can
compensate for an increase in the time complexity of the algorithm, we must take into
account that training can be done on millions of records.
2. Testing speed is also very important. If we want the method to be included in
an antivirus product, it must introduce as little overhead as possible. An antivirus
product usually works by intercepting access to all files in the system. So, when
a process is being executed, all files that are accessed by the process will
first be scanned. If no malicious file is detected, the process is allowed to run.
If a process takes too long to execute because of the antivirus product, many
users will prefer to disable the protection mechanism. For this reason, testing whether
a file is malicious or not must be almost instantaneous.
3. Memory restrictions: Even though for the training phase we can use computers
with many gigabytes of RAM, we have to take into account that the average
amount of RAM in many user computers is below 4 gigabytes. This RAM
is shared by the operating system, the antivirus product, and all other processes
run by the user. This is why the model produced by the classification algorithm
must fit in a very small amount of memory.
4. No false positives: This is perhaps one of the most important, but also the most
difficult restrictions. Detecting a benign file, such as an important document or
a critical system file, as malware can make the antivirus product delete it. This
can lead to information loss, system crashes, or corruption of important files. This
is why a critical aspect of the machine learning algorithm is not to produce any
false positives. Dealing with large datasets means that a small false positive
rate translates into many benign files being detected. Even though the
machine learning algorithm will not achieve its best performance when used with
such a restriction, in practice it will not be used alone. Other mechanisms, such
as signatures or hand-written heuristics, will compensate for the loss of detection.
By obeying these restrictions, I will try to present in this thesis several methods that can
be used to help classify malicious and benign files.
1.7 Security and privacy concerns
If the first types of viruses tried to delete files, corrupt data, or encrypt documents
in order to request a ransom, nowadays the landscape is very different. There is no point
in making a system unusable, since this will not provide any benefit to the attacker.
Instead, the victim can be considered an asset for the attacker, on which different types
of techniques can be used to produce money. For example, an infected
system can be used as part of a botnet. The botnet can be used as a denial of service
(DoS) weapon against a certain website, so the botnet can be rented to persons who have
an interest in shutting down websites. Another example is using the computing power
of the infected machine for virtual currency mining (bitcoin mining).
Besides these techniques, which are clearly malicious, there are also other means of
making money through the use of some gray methods. For example, a file can reside
on the victim's computer that monitors all the web searches performed using a web
browser, as well as the products that the user is interested in. Based on these requests, a
behavior model can be created for the victim and targeted ads can be displayed that
can convince the user to buy several things. Whether this is malicious or not is debatable. A
user might consider this useful, since it helped him find products that he is interested
in. On the other hand, sending data that contains behavioral information might be
considered intrusive and could also be classified as a privacy threat.
The example provided above is one where just the browsing behavior is monitored.
However, there are cases when more private data is submitted to a server: for example,
the e-mail addresses of the user's friends or the locations where the user is using his
device for chatting (mobile phone, tablet, or laptop). There are applications that request
access to the social page of the user in order to post ads on his behalf.
Usually, these kinds of applications promise better web search results, access to great
offers and discounts, or even prizes in lotteries. Some users will consider detecting these
applications as denying them the ability to earn different prizes, while others will consider
missing them a huge gap in the detection mechanism of the antivirus product.
Needless to say, money can be made by using the private data of the user, and these mechanisms
are being used more and more as computers are becoming safer.
1.7.1 Privacy concerns regarding mobile devices
During the last decade, the demand for smartphones has rapidly increased and got to
a point where it is very difficult nowadays to find a mobile phone that is not able to
run external applications. There are almost 1.91 billion smartphones worldwide,
and this number is forecast to increase to 2.56 billion by 2018 [6]. Of course, the power
of a smartphone actually resides in the applications it runs. Even though there are
different operating systems that run on mobile devices, two of them stand out: iOS
from Apple and Android from Google, together totaling a 93.8% market share (14.8% for
iOS and 81.5% for Android). A statistic taken from [6] regarding the market share of each
operating system can be viewed in Figure 1.2.
Figure 1.2: Growth of market share of mobile phone operating systems
These two operating systems have also launched the idea of an application store: a
place where programmers can submit their applications and where interested users
can find and buy them. Even though there are many application stores for the
Android OS, the main and largest one is Google Play. The equivalent application
store for iOS is called the App Store. Starting with their launch in 2008, these stores
have constantly grown, surpassing 1 million applications in 2014. A graphical
display, provided by appfigures [7], regarding their growth can be viewed in Figure 1.3.
Since these mobile operating systems were created at a time when malware attacks were frequent,
special attention was paid to security. Each application runs in a separate container;
thus it cannot modify files belonging to other applications. Also, each time the application
wants to access sensitive information (contacts, SMS, photos, etc.) or wants to
perform possibly dangerous actions (make calls, send messages, or even push notifications), it
requests a permission that the user must allow. The model differs slightly for iOS, where
the permission is requested every time the action is performed. In Android, the permissions are
listed in a manifest file and are granted the first time the application runs.
Figure 1.3: Growth of the application stores
Another important improvement regarding security is brought by the validation of each
application that is intended to get into the market. While Apple declares that each application
is validated manually (which involves a lot of human effort), Google has revealed that
Android applications are validated using a semi-automated system called Bouncer.
Given these new security measures, the attackers needed to find new ways through
which they can monetize their victims. For this reason, they turned to issues that
are more related to privacy. This is not something new when it comes to malware found
in the wild, since adware programs have already been doing this for more than 10 years.
As with adware, it is very difficult to tell whether a permission requested by an application is
intended for the normal use of the application or is just part of a malicious behavior
meant to steal sensitive information. For example, it is quite normal for an application
that backs up the contact list into the cloud to access the contacts. However, if a game
accesses the contact list, it should be marked at least as suspicious. And this scenario
applies to almost every permission.
Another problem that affects users of mobile applications is the vulnerabilities that
are embedded inside benign apps. Not all developers think of security, and when we
are talking about more than 350,000 developers that write code for the app stores, it is
not difficult to find an application that has security flaws. Take for example the same
application discussed above that requires access to the contact list in order to make
a backup in the cloud. It is benign, it passes all security checks of both
application markets, and, given its scope, it is normal for it to access the contact list. Yet if it
sends the data to the cloud without using encryption or without properly implementing
security protocols (like HTTPS), an attacker may use a man-in-the-middle attack to
extract the submitted data. The situation becomes worse if the problem does not
reside in the code written by the developer of the app but in a library that many
applications use.
So, in order to protect users from privacy risks, methods are needed that can
test an application from different perspectives and determine the risks associated with
using it.
Chapter 2
State Of The Art. Related Work
2.1 Malware Detection using Machine Learning
A great amount of work regarding malware detection using machine learning was carried
out by Dragos Gavrilut. Even though many of his published papers rely on variations
of the perceptron algorithm, he achieved good results in both accuracy and proactivity.
Since the aim of his research is to provide a reliable method that can be used by
antivirus companies to detect malware, several restrictions are maintained in all of the
papers:
A very low number of false positives: zero false positives is preferable,
but since this can drastically reduce the detection rate, a 0.01% threshold is
acceptable. False positives are very dangerous for antivirus products: classifying
an important document or system file as malware can lead to serious data
loss or system failure. A higher number of false negatives is preferable to an
increase in false positives.
The algorithm needs to be very fast on the testing side: this restriction is
required in order not to affect the user's experience. In most antivirus
products, every time a program is run it is first blocked, then scanned, and only if
it is classified as benign is it allowed to run. The time in which the file is
blocked must be reduced to a minimum.
2.1.1 Malware detection using machine learning
In paper [8] the authors describe the One Side Class Perceptron algorithm and apply
different variations in order to see how it behaves when classifying different datasets. In
order to evaluate the practical applications of the algorithm, three datasets are created. Every
record in the datasets is defined by 308 features. If there are multiple records that
contain the same combination of features, only one of them is kept:
1. Training dataset: consists of 7822 unique malware records (obtained from 27475
files) and 415 clean records (obtained from 273133 files). This is used to train
the perceptron.
2. Test dataset: consists of 506 unique malware records (obtained from 11605 files)
and 130 clean records (obtained from 6522 files). This dataset is used to
evaluate the perceptron.
3. Scale-up dataset: consists of 12817 records (obtained from approximately 3 million samples)
and 16437 records (obtained from approximately 180,000 samples). This dataset is
further split into pieces, each one consisting of 10%, 20% and so on, in order to
evaluate how the algorithm behaves when the dataset is increased.
The one-side class perceptron (OSCP) is introduced as a variation of the original perceptron
algorithm [9]. It works in two steps: first, just like in the original algorithm,
a hyperplane is obtained that best separates the two classes of files. Following this,
another training is performed, this time adjusting the hyperplane such that no element
from one class is misclassified. Using the one-side class perceptron, several variations
are tested in order to see which one brings the best results:
1. Mapped one-side class perceptron: this is the one-side class perceptron, but every
record in the datasets is modified to contain a combination of every two of its
features. The combination is done using the and operator.
2. Mapped one-side class perceptron with feature selection: only the best 30% of
all features resulting after mapping were used, in order to reduce the training time.
3. The dual form of the OSCP, with the training entries mapped into a larger space
via several kernel functions (polynomial kernel function with degree 2, 3 or 4,
and radial basis kernel function).
After cross-validation, the algorithm that provided the best detection rate (sensitivity)
was a cascade of one-side perceptron algorithms with the polynomial kernel function
of degree 4. However, due to its higher number of false positives, the best algorithms
to be used in an antivirus engine are the one-side perceptron algorithm and its mapped
version.
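The two-step OSCP scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the learning rate, the epoch limits, and the choice of the benign class (label -1) as the one that must end up perfectly classified are our assumptions.

```python
def oscp_train(X, y, epochs=100, lr=1.0):
    """One Side Class Perceptron sketch. Step 1 trains a standard
    perceptron; step 2 keeps adjusting the hyperplane only on
    misclassified samples of the protected class (benign, label -1)
    until none remain, so no benign file lands on the malicious side."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0

    def predict(x):
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if s >= 0 else -1

    # Step 1: ordinary perceptron updates on every misclassified sample.
    for _ in range(epochs):
        errors = 0
        for x, label in zip(X, y):
            if predict(x) != label:
                for i in range(n):
                    w[i] += lr * label * x[i]
                b += lr * label
                errors += 1
        if errors == 0:
            break

    # Step 2: one-sided pass, pushing the hyperplane until every
    # protected-class sample is correctly classified (or epochs run out).
    for _ in range(epochs):
        clean = True
        for x, label in zip(X, y):
            if label == -1 and predict(x) != -1:
                for i in range(n):
                    w[i] -= lr * x[i]
                b -= lr
                clean = False
        if clean:
            break
    return w, b
```

On separable data, the second loop terminates immediately once the first loop has already placed every benign sample on the correct side.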
Other contributions of this paper are related to optimizations that can be added to
the OSCP algorithm in order to make it run faster. This is especially needed when
working with very large datasets, such as the scale-up dataset described above. The first
optimization relies on the observation that, in order to correctly classify all the samples
belonging to a certain class, the hyperplane must be adjusted only in a direction that
does not affect the correct classification of the elements belonging to the target class.
Since these elements do not modify the hyperplane (the hyperplane gets modified only
when items are incorrectly classified), it is not necessary to evaluate them on every
iteration. A second optimization relies on a programming tweak. Since float operations
are costly in terms of CPU clocks, it is more advantageous to use operations that work
with numbers represented as integers (the decimals are considered to be the last digits).
By using these tricks, the algorithm can be implemented to be 2.5 times faster.
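The fixed-point trick (decimals stored as the last digits of an integer) can be illustrated as below; the decimal scale factor of 1000 is our assumption, since the precision actually used is not stated.

```python
SCALE = 1000  # keep three decimal digits as the low-order digits of an integer

def to_fixed(x):
    """Encode a float as an integer: the decimals become the last digits."""
    return int(round(x * SCALE))

def fixed_dot(w_fixed, x_fixed):
    """Dot product carried out entirely in integer arithmetic;
    the float result is recovered only once, at the very end."""
    total = sum(wf * xf for wf, xf in zip(w_fixed, x_fixed))
    return total / (SCALE * SCALE)
```

In a compiled implementation, all the per-feature work in the inner loop then uses integer multiply-accumulate instead of floating-point operations.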
2.1.2 Practical Optimizations for Perceptron Algorithms in Large Malware
Datasets
Other optimizations for the One Side Class algorithms are presented in [10]. One of the
first modifications of the algorithm relates to the way it is implemented. Since the features
are boolean values, it is better to replace the multiply instruction with a comparison.
This is due to the fact that a multiply instruction takes more clock cycles than a
comparison. Using this trick, the algorithm's speed was improved by 96%.
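For 0/1 features, the multiply-versus-comparison idea amounts to adding a weight only when its feature is set. A sketch of the two equivalent scoring routines (illustrative, not the paper's implementation):

```python
def score_multiply(w, x):
    # straightforward version: one multiplication per feature
    return sum(wi * xi for wi, xi in zip(w, x))

def score_compare(w, x):
    # boolean-feature version: the multiplication is replaced by a test;
    # the weight is added only when the feature is set
    return sum(wi for wi, xi in zip(w, x) if xi == 1)
```

Both return the same dot product; the second touches only the set features and performs no multiplications at all.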
Another idea was to use a cache for every record. After the training phase has started
to adjust the hyperplane, each record to be evaluated is selected according
to an algorithm. The aim is to test misclassified samples as often as possible and skip
the correctly classified ones (since the perceptron only adjusts the hyperplane when it
finds misclassified samples). The algorithm is described below:
If a record is incorrectly classified, it will be selected for verification in the following
iterations. The number of iterations can be adjusted dynamically.
If a record is correctly classified, the sample will not be verified for the next iterations.
The number of iterations for which the record is skipped can be increased
from one step to the next. For example, if the first time the sample was tested
only after 5 iterations, the next interval should increase, e.g. to
10. In this way, records that are correctly classified are tested less and less.
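The scheduling idea above can be sketched as follows; the doubling of the skip interval and the concrete bookkeeping are our assumptions (the description only requires that the interval grows from one step to the next).

```python
def adaptive_schedule(records, classify, update, iterations=100, base_skip=5):
    """Sample-scheduling cache sketch: misclassified records are revisited
    on the very next iteration; correctly classified ones are skipped for
    a growing number of iterations (5, 10, 20, ...)."""
    next_check = [0] * len(records)    # iteration at which to test each record
    skip = [base_skip] * len(records)  # current skip interval per record
    for it in range(iterations):
        for i, rec in enumerate(records):
            if it < next_check[i]:
                continue               # still inside its skip window
            if classify(rec):          # correctly classified: widen the window
                next_check[i] = it + skip[i]
                skip[i] *= 2
            else:                      # misclassified: adjust and retest soon
                update(rec)
                next_check[i] = it + 1
                skip[i] = base_skip
```

With `base_skip=5` and doubling, a record that stays correctly classified is tested at iterations 0, 5, 15, 35, 75, ... , so the cost of the stable part of the dataset fades quickly.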
2.1.3 Building a practical and reliable classifier for malware detection
While in previous papers the main target was to reduce the number of false positives, in
paper [11] the authors try to increase the detection rate. This is done with several
methods:
1. Mapping features into a higher space: this involves selecting a limited set of
features according to a score (the F-score is chosen by the authors) and combining
them with each other using a binary operator. The authors have found that the
and operator is well suited for this operation. The reason why only a part of the
features is chosen is the large number of new features created by this operation:
if all features were selected, the algorithm would need n(n+1)/2 × sizeof(feature)
memory space (where n is the number of original features), which is too much for a large dataset.
Another variation of this method is to first map all features, and then use a score
function to sort them.
2. Using an ensemble system: a hyperplane obtained by a one-side class algorithm
will first separate the samples into two sets: the first set contains benign and
malicious samples, while the second contains only malicious samples. Since the
records from the second set are well classified, they will be removed from the set.
The next step is to use another one-side class perceptron to classify
the remaining records. Depending on the dataset size and the available time, this
method can be applied in several steps, each time obtaining a better accuracy.
3. Vote system: each model will vote for the elements it classifies, giving a
vote for both positive and negative records. Since the OSCP algorithm is very good
at classifying one class of elements, it will give a higher vote for records belonging to
that class and a smaller one for the others. The final result will be a weighted average
of the votes given by every model.
4. Hybrid method: this method comes from how malware analysis is done in practice.
It is easier to create a model that correctly classifies a malware family than
one that classifies all the malware samples. For this reason, the malware set will first be
split using a binary decision tree according to a certain feature. The feature
can be chosen with different algorithms: either F-score or MedianClose. For each
obtained cluster, an OSCP algorithm will be used.
Using the methods described above, the authors have managed to increase the detection
rate from 53.4% to 69.55% when the mapped algorithm was used, to 54.71% when an ensemble
of five iterations was used, to 61.63% when the voting system was used on clusters of files
created using the Manhattan distance, and to 99.53% when the hybrid method was used with
the MedianClose function for choosing the feature and a decision tree of 8 levels (resulting
in 256 models).
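The pairwise and-mapping of method 1 can be sketched as below. Note that `combinations_with_replacement` also pairs each feature with itself (x and x = x), which preserves the originals and yields exactly n(n+1)/2 mapped features.

```python
from itertools import combinations_with_replacement

def map_features(x):
    """Map a boolean feature vector into the pairwise-AND space:
    n original features yield n(n+1)/2 mapped ones."""
    return [a & b for a, b in combinations_with_replacement(x, 2)]
```

This quadratic blow-up is exactly why [11] selects only a subset of features (by F-score) before mapping.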
2.2 Feature Extraction using Genetic Programming
2.2.1 A Hybrid Approach to Feature Selection and Generation Using an
Evolutionary Algorithm
The authors of [12] have managed to improve the accuracy of detecting certain substances
by combining feature extraction through genetic algorithms with feature selection.
For feature extraction using a genetic algorithm, the authors define a chromosome as
a structure of two elements: the first one is a characteristic vector of features that
determines which ones are activated and which are not, and the second element consists
of mathematical combinations of the features activated in the first element.
The crossover operator is adapted for the defined chromosome. Since the chromosome
is made of two parts, the cutting point has to be selected such that the arithmetic
operations that are going to be passed to the other chromosome can be computed with
the activated features. The mutation operator is very simple: it randomly activates or
deactivates features. Feature generation is performed by randomly selecting a mathematical
function, filtering from the activated features just the ones compatible with the
operator, applying the operator over the features and appending the resulting feature
to the chromosome. The fitness evaluation is based on a support vector machine.
The algorithm is tested on real data used in chromatography. The dataset consists of
200 records of 5000 features each. Feature selection algorithms like forward selection,
backward elimination and genetic algorithms are applied in order to obtain features
well suited for the learning problem.
The genetic algorithm is configured to have a population size of 30 elements and is run
for 50 generations.
The results clearly show that the features obtained through the genetic algorithm
improve the accuracy in comparison to just using a support vector machine
with the standard features. The average absolute error for the target values dropped
from 2.2 to 0.07 for one set of data and from 22 to 0.1 for another.
2.2.2 Automatic Feature Generation for Machine Learning Based Optimizing Compilation
An interesting paper regarding feature extraction is written by the authors of [13]. Their
aim is to automatically produce features that can help a machine learning algorithm
determine the loop unrolling factor used in the compilation process. Doing so will
increase the speed of the compiled program.
Their method is composed of two components:
1. Feature search component: this is used to explore the feature space represented
by the intermediate representation and provide the feature values to the machine
learning algorithm.
2. Machine learning tool: computes how good the feature provided by the feature
search component is, along with the already found features (base feature set). The
feature will be continually improved until this is no longer possible. At that point
it will be added to the base feature set.
The feature space is described using a special grammar derived from the intermediate
representation (parse trees). Using this grammar, the feature search component is
able to produce new features by expanding non-terminals.
The crossover operation is implemented by randomly selecting non-terminals of the
same type and exchanging their subtrees, thus creating new parse trees.
Mutation is performed by replacing a non-terminal with a random expansion of the
same non-terminal.
The algorithm was tested using 57 benchmarks from public sources. The genetic algorithm
was configured to use a population of 100 individuals and run for 200 generations,
or until no improvement was observed in the last 15 generations. The machine learning
tool used is a C4.5 decision tree. Results are validated using ten-fold cross validation.
The authors state that their method achieves a 75% prediction rate for the correct unrolling
factor, while the public gcc compiler only achieves 3%. A support vector machine using
hand-written features achieves only 59%.
2.2.3 Evolving Novel Image Features Using Genetic Programming-based
Image Transforms
Feature creation using genetic programming was also used by the authors of [14] in order to
detect muscular dystrophy by analyzing gray-scale images. The features are created
using the pixels found in a certain square of the image through the use of Cartesian genetic
programming.
Each new feature is a graph that consists of mathematical operations between previous
features or inputs found in the image. The following mathematical functions are used:
max, min, x·y, x+y, x/y, x−y, abs(x), x², x^y. The crossover is implemented
using single-point swap by treating the graph as a list of nodes. Mutation operates on
the connectivity or the node types.
For validating the resulting features, machine learning algorithms are used (decision
trees, Ridor rules or k-NN), all run through the Weka [15] framework.
To test their method, the authors used a public dataset consisting of 386 healthy and
220 sick cell images. From these, 186 of the healthy and 200 of the sick images were
used for training, while the rest were used for validation.
The authors compared the results with the ones obtained by using 16 hand-made
features (thresholded area, variance, mean radius, radius variance, perimeter, compactness,
entropy, fractal dimension). The genetic algorithm was configured to run for
50 generations with a population size of 200.
Throughout their tests, using the 1-NN algorithm and the decision tree algorithm, the
authors show that by using genetically created features an improvement of 38% can be
achieved.
2.2.4 Feature Extraction Using Genetic Algorithms
The authors of [16] and [17] use a weight vector to perform feature extraction or feature
selection. If the weight vector has binary values, then it is used for feature selection.
For feature extraction, the values are usually in the interval 0.0 to 10.0. The idea is
to transform the feature space such that more of the records become linearly separable.
The fitness function used is a custom one, based on the k-NN results. It takes
into account the number of elements left after classification was done using the weight-modified
k-NN and the cardinality of the near-neighbors minority set.
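The weight-modified k-NN at the core of this fitness evaluation can be sketched as below; this is an illustration only, and the authors' exact distance formula and vote-counting details may differ.

```python
def weighted_distance(a, b, w):
    """Euclidean distance with per-feature weights in [0.0, 10.0]:
    a weight of 0 removes the feature (selection), while other values
    stretch or shrink its axis (extraction)."""
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b)) ** 0.5

def knn_predict(x, data, labels, w, k=3):
    # vote among the k nearest neighbours under the weighted metric
    ranked = sorted(range(len(data)),
                    key=lambda i: weighted_distance(x, data[i], w))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)
```

The genetic algorithm then only has to evolve the weight vector `w`; the classifier itself stays fixed.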
For the elements that are not linearly separable a second approach is proposed: a new
feature can be created by combining others through mathematical functions. Since finding
these features can only be done through the use of suboptimal heuristics, the authors
use genetic algorithms.
Their approach is first tested on a dataset generated according to a template. The
algorithm is first trained using 240 examples belonging to 8 classes and tested on
1200 new examples generated according to the same template. The results show that
the weighted GA-kNN achieved a 4.13% error rate, while the standard normalized kNN
achieved 35.93%. The authors state that the error rate can be reduced even further, to
3%, if feature selection is performed before classification by using weights of 0 and 1.
The method is also tested on a real-world dataset with application in biology, more
exactly the classification of soil samples. The dataset consists of 232 samples that must
be grouped into 3 classes. Each record has 96 binary features. Using the k-NN algorithm
for classification with weighted features, the error drops from 8.3% to 2% in
comparison to using the original feature set.
By performing feature selection on the same dataset through the use of binary values
in the weight vector, the number of features was reduced from 96 to 31. This alone helped
drop the error rate to 0.83%. By performing feature extraction using values in
the interval 0.0 to 10.0 for the weight vector, the error rate was further reduced to 0.24%.
In the last part of the paper, the authors test whether the correct features can be found
through the use of genetic algorithms. For this purpose, an artificial dataset is created:
some features are generated, and two hidden features that can be obtained from
the original ones are embedded in order to see if the algorithm manages to find them. To
generate new features, the multiplication operator is used. The authors show that the
genetic algorithm succeeds in finding the new features and, using them, achieves a 0%
error rate.
2.2.5 Feature Generation Using Genetic Programming With Application to Fault Classification
Feature creation using genetic algorithms was also studied by the authors of [18]. In their
approach, which shares some elements with the one presented later in this thesis, the
authors try to improve the detection of bearing faults before the bearing breaks.
To do so, the authors represent the chromosome as a mathematical expression, more
exactly as a binary tree where each node is a mathematical operation and each leaf is an
original feature. The mathematical functions used are: +, −, ×, /, square, sqrt, sin, cos,
asin, acos, tan, tanh, reciprocal, log, abs, negator.
The genetic operators are simple to implement. The crossover operation works by
choosing two chromosomes and exchanging subtrees, while mutation works by randomly
choosing a node from a chromosome, deleting the subtree that corresponds to the chosen
node and randomly generating another one. The fitness function used is a modified
version of the Fisher criterion, which looks very similar to the F-score (Equation 3.1)
used in this thesis.
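The tree-shaped chromosome and its operators can be sketched as follows. This is a toy reconstruction: it uses only three operators and a fixed crossover cut point, whereas the paper uses the larger function set listed above and random cut points.

```python
import random

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def random_tree(n_features, depth, rng):
    """Random expression tree: internal nodes are operators,
    leaves are ('x', i) references to original feature i."""
    if depth <= 0 or rng.random() < 0.3:
        return ('x', rng.randrange(n_features))
    op = rng.choice(list(OPS))
    return (op,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def evaluate(tree, x):
    if tree[0] == 'x':
        return x[tree[1]]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def crossover(t1, t2):
    """Exchange subtrees between two parents (fixed cut at the left
    child for brevity; a real implementation picks random cut points)."""
    if t1[0] == 'x' or t2[0] == 'x':
        return t2, t1
    return (t1[0], t2[1], t1[2]), (t2[0], t1[1], t2[2])

def mutate(tree, n_features, depth, rng):
    # replace a randomly chosen subtree with a freshly generated one
    if tree[0] == 'x' or rng.random() < 0.3:
        return random_tree(n_features, depth, rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, n_features, depth - 1, rng), right)
    return (op, left, mutate(right, n_features, depth - 1, rng))
```

Each evolved tree, evaluated over a record's original features, yields one new derived feature.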
The resulting features are tested using two kinds of classifiers: an artificial neural network
(a multi-layer perceptron with one hidden layer) and a support vector machine.
The authors created two types of features:
The first type is generated by running the genetic algorithm for 1000 generations,
with a maximum tree depth of 5 and a population size of 10. The entire
process took 5 minutes.
For the second type of features, 10000 generations are used, with a maximum
tree depth of 10 and a population size of 14. The process took about an hour.
By comparing the two types of features, the authors conclude that the second type
succeeds in separating different classes of records where the first type fails.
By comparing the results obtained using the standard features with the ones obtained
through the use of genetically evolved features, using the two classifiers, the authors
state that they have obtained 96.5% accuracy for the artificial neural network in
comparison with 90.8%, and 97.1% for the support vector machine in comparison with
91.2%.
2.2.6 The Evolutionary Pre-Processor: Automatic Feature Extraction for Supervised Classification using Genetic Programming
A very interesting approach is provided in [19]. Here, the authors define a framework called
Eprep (Evolutionary Pre-Processor), where each chromosome is actually a solution to
the problem.
The system has many degrees of freedom: it automatically selects the number
of features to extract, uses many non-linear functions in order to find correlations
between features, and determines which classifier is best for the given problem.
Each chromosome is composed of two parts:
The first one is a parse tree for functional expressions. There are many functions
available for use (addition, multiplication, ratio, polynomial expansion,
natural logarithm, equals to, less than or equal to, logical or, logical not, discrete
Fourier transform, cosine, arc-cosine, vector filter), but the authors recommend
limiting to just a few of them in order to reduce the size of the search space. The
terminals of the parse trees are either input variables, user-supplied constants
or randomly generated constants.
The second part is the classifier. There are three available classifiers that the
chromosome can choose from: Generalized Linear Machine (GLIM), k-nearest
neighbors (with k=3) and Maximum Likelihood Classifier.
The initial population is created by randomly selecting functions and terminals with a
uniform distribution. The fitness function is the percentage of misclassified
samples from a validation set.
The authors used 9 datasets to test the algorithms, gathered from different external
sources with applications in medicine, finance, image analysis and engineering design.
The results obtained by a classifier with the best features were compared with the results
of the same classifier on the raw data. For two of the problems (image processing),
the algorithm managed to reduce the error rate by half. For the others, the difference
is not significant. Even so, for almost all of the datasets, Eprep selected only a
subset of the original features, thus reducing the dimensionality.
2.3 Improving Malware Detection through Feature Extraction
2.3.1 Data Mining Methods for Detection of New Malicious Executables
The authors of [20] were among the first who used machine learning in order to achieve
a better detection rate and to be more proactive. Even though their methods were
tested on a relatively small dataset compared to the number of malware samples present today in
the wild (3265 malicious and 1001 benign files), it still provides important information
on how to construct a classifier for malware detection. An important thing to note
is that their dataset was constructed based on the malware landscape present at the
time of writing the paper (5% trojans and 95% viruses). Nowadays, the percentages are
the opposite.
The features used in the learning algorithms are constructed automatically, by statically
processing each file. The information extracted differs depending on the learning
method used. The following methods are used to construct features:
1. By using a third-party library called libBFD, from GNU Bin-Utils, they extract
the DLL names, the functions imported from each DLL and how many times each
function is referenced in a binary. However, this is only suitable for PE files, so
additional methods are needed.
2. By using the GNU strings program, the authors extract each string that appears
in a file. This method can also be used for non-PE files. However, strings are not
that representative for executable files.
3. The final kind of information extracted is represented by sequences of bytes.
There is no explanation of how these sequences are chosen.
Different learning algorithms were used for different kinds of features. For features based
on DLLs and functions, the Ripper algorithm was used (an inductive rule-based learner
that generates boolean rules). For string and byte sequences, a Naive Bayes algorithm
was used. Since the extracted features did not all fit in memory for all the data, the
dataset was split into several chunks. On each of these chunks, a Naive Bayes classifier was
trained. The final results were a combination of the results provided by each of the
Naive Bayes classifiers.
The best results are obtained by running the Naive Bayes classifier on the features
based on strings extracted from binaries (97% detection rate and 3.8% false positives).
Multiple Naive Bayes classifiers, trained using strings and chunks of binary data, also
obtained very good results (97% detection rate and 6% false positives). The other
methods did not achieve the hoped-for results; for example, the Ripper algorithm only
achieved a 71.05% detection rate. However, all of the machine learning methods outperformed
the signature method when it comes to proactivity: the signature method only
achieved a 33.75% detection rate.
2.3.2 The proactivity of perceptron derived algorithms in malware detection
An interesting aspect is approached by the authors in [21]. The aim of the paper is to
test the proactivity of different linear classifiers over a period of 14 weeks. In order to
do so, the authors collected malicious files for a period of one year during 2010
and 2011. Half of the data was used for training (data gathered in the first 44 weeks)
and the other half was used for testing (data gathered in the last 12 weeks).
For each file, approximately 2500 features were obtained. The features are both static
(obtained by parsing the PE header) and dynamic (obtained through the use of an emulator).
The next step was to reduce the number of features: the F-score was
computed for every feature and only the first 550 were kept. The final dataset consisted
of 975882 files, of which 110144 are malicious.
The authors tested many classification algorithms:
1. The standard batch perceptron.
2. The margin perceptron: a batch perceptron that limits the misclassified records
to a maximum number (three different versions were used for the limit: 500,
1000 and 1500).
3. One-side perceptron: similar to the margin perceptron, but this time the limit for
the false positives is set to 0.
4. Mapped version of the one-side perceptron.
5. Mapped versions of the margin perceptron, with the margin set to 1500, 2000
and 3500.
6. Linear voting One Side: a voting ensemble made up of 16 one-side perceptrons.
7. Linear voting M-1600: a voting ensemble made up of 16 margin perceptrons with
the margin set to 1500.
8. Cascade ensemble: a collection of the following algorithms: a one-side perceptron
with mapped features, a clustering algorithm based on centroids and, finally,
another one-side perceptron that classifies centroids.
In order to test their proactivity, a new formula is introduced that decreases the score
if more false positives are present:

ProactivityScore = Tp / (Tp + Fn + (1/k) · Fp), where k should be less than 0.5% (2.1)
Among all the algorithms, the best detection rate was obtained by the cascade ensemble,
followed by the mapped version of the one-side algorithm. However, the cascade ensemble
also had the largest false positive rate. The algorithm that obtained
the fewest false positives was the one-side perceptron.
Regarding proactivity, all of the algorithms proved to be stable for the first 4 weeks.
After this period, the proactivity dropped significantly.
2.3.3 Learning to Detect and Classify Malicious Executables in the Wild
Another approach for detecting malware using machine learning is described in [22].
Using a database that consists of 1971 clean executables and 1651 malicious files, the
authors tested several machine learning algorithms in order to see how well they
perform at malware detection. In order to extract features, each binary file was converted
to hexadecimal form and n-grams of 4 bytes were extracted, resulting in a total of
255,904,403 n-grams. Each n-gram was treated as a boolean feature (present or
not). Since the number of features was too large to work with, the next step was to select
only the most relevant ones. This was done by computing the Information Gain (Equation
2.2) for each n-gram and keeping only the 500 most relevant n-grams. The number of
n-grams to keep and the size of the n-gram were chosen empirically, based on the results
obtained by training different classifiers.
IG(j) = Σ_{vj ∈ {0,1}} Σ_{Ci} P(vj, Ci) · log( P(vj, Ci) / (P(vj) · P(Ci)) ) (2.2)

where Ci is the class of the file, vj is the value of the j-th feature, P(vj, Ci) is the
proportion of elements where the j-th feature has the value vj in the class Ci, P(vj) is the proportion
of elements where the j-th n-gram takes the value vj, and P(Ci) is the proportion of
elements that have the class Ci.
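Equation 2.2 can be computed directly from the dataset. The sketch below is illustrative; in particular, log base 2 is our assumption, since the base is not stated here (any fixed base only rescales the ranking).

```python
from collections import Counter
from math import log2

def information_gain(feature_values, classes):
    """Information gain of one boolean feature (Equation 2.2):
    sum over feature values v and classes C of
    P(v, C) * log( P(v, C) / (P(v) * P(C)) ), with empirical proportions."""
    n = len(classes)
    pv = Counter(feature_values)            # counts of each feature value
    pc = Counter(classes)                   # counts of each class
    pvc = Counter(zip(feature_values, classes))  # joint counts
    ig = 0.0
    for (v, c), count in pvc.items():
        p_joint = count / n
        ig += p_joint * log2(p_joint / ((pv[v] / n) * (pc[c] / n)))
    return ig
```

A feature that perfectly splits the two classes scores 1 bit; one that is independent of the class scores 0, which is exactly the ordering used to keep the top 500 n-grams.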
The training was done using algorithms found in the Weka framework [15]. A total of
four different algorithms were used: k-nearest neighbors, Naive Bayes, support vector
machines and decision trees. Each of these was tested in its standard form as well as in a
boosted form (using the AdaBoost algorithm). Testing was performed using 10-fold cross
validation.
The best results are obtained using boosted decision trees (99% detection rate and 0.05%
false positives). Among the non-boosted classifiers, the best results are obtained by the
support vector machines, which achieved a 99% detection rate and about a 0.07% false
positive rate.
A final contribution of this paper is the proactivity test of the selected classifiers.
The authors gathered another set of 291 samples and tested the previously obtained
models. This time, the boosted decision tree achieved a 94% detection rate with a
0.01% false positive rate. The support vector machine was the second best model, and
the first among the non-boosted algorithms, achieving an 82% detection rate with a
0.01% false positive rate.
Even though the results are very good, it is worth noting an observation stated by
the authors: none of the benign files were obfuscated or packed, while 25% of the
malicious files had this property. This means that it is possible that the machine
learning algorithms simply learned to predict this property.
2.3.4 Comparison of feature selection and classification algorithms in identifying malicious executables
An important contribution to malware detection using machine learning is provided by
the authors of paper [23].
The authors focus on analyzing how well different machine learning algorithms perform
at malware classification and which feature selection algorithms can be used in order to
obtain the best results.
The database used in the paper consists of 1074 benign files and 3680 malicious files
that were collected from an intrusion detection system. The files are transformed into
hexadecimal format and n-grams are extracted that will be used to form the features.
The authors tested different n-gram sizes, from 1 to 10 bytes. For each file, a feature
vector is created, where each feature represents the frequency of the n-gram in the file.
There are 7 feature selection methods tested:
1. Maximal difference: the new set of features consists of the top 50% of features
found in clean files and the top 50% of features found in malicious files.
2. Maximal normalized difference: similar to the maximal difference, but this time
the frequencies are normalized based on their standard deviation.
3. Maximal ratio: instead of using the difference, a ratio is used, or the difference of
the logarithms.
4. Maximal weighted ratio: similar to maximal ratio, but the result is multiplied
by the feature frequency. This way, the frequencies that appear more often will
have a greater importance.
5. Maximal Kullback-Leibler distance: based on the Kullback-Leibler divergence,
which measures the difference between two probability distributions. Since
this is not a symmetric measurement (so it cannot be used as a distance), the
formula is applied twice (once to measure the information lost from one distribution
to the other, and a second time in the opposite direction) and the results are
summed.
6. Bi-normal separation: based on the bi-normal separation algorithm [24], used
to determine the features that contribute the most to the true positive rate
and the least to the false positive rate. To obtain this information, the
naive Bayes algorithm is used, and the features analyzed were the 1-byte n-grams
(so a total of 256 features).
7. Forward stepwise selection and backward stepwise selection: features are added
one at a time to see whether each improves the model. The backward method starts
with all features and eliminates them one at a time to see whether removal improves
the model. Both of these methods are very time consuming.
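The symmetrised Kullback-Leibler measure from method 5 above can be sketched in a few lines; this is an illustrative version (the smoothing constant `eps` is an assumption, added to avoid division by zero and log of zero):

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised Kullback-Leibler distance KL(p||q) + KL(q||p), where p and q
    are a feature's probability distributions in the two classes."""
    def kl(a, b):
        # information lost when b is used to approximate a
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)
```

Features whose distributions differ most between the benign and malicious classes receive the highest scores.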
The classifiers tested are:
1. Naive Bayes classifier
2. Product classifier: multiplies the frequencies of the features that appear more in
clean files than in malicious ones and divides the result by the product of frequencies
of the features that appear more in malicious ones
3. Entropy classifier: the same idea, but the entropy of the features mostly found
in malicious files is subtracted from the entropy of the features mostly
found in benign files
4. Support vector machine classifier
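A toy version of the product classifier described above might look as follows; treating absent n-grams as a neutral factor of 1 is a simplification of mine, not something stated in the paper:

```python
def product_score(freqs, clean_grams, mal_grams):
    """Toy product classifier: multiply frequencies of n-grams more common in
    clean files, divide by the product of frequencies of n-grams more common
    in malware. A score above 1 suggests the file is benign."""
    num = den = 1.0
    for g in clean_grams:
        num *= freqs.get(g, 1.0)   # n-grams characteristic of clean files
    for g in mal_grams:
        den *= freqs.get(g, 1.0)   # n-grams characteristic of malware
    return num / den
```

In practice this is usually done with sums of logarithms to avoid numeric underflow over thousands of features.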
Each classification algorithm was tested using the features extracted through the seven
methods described above. For each combination, different versions were tested by
varying the number of features. Among all combinations except the support vector
machines, the product classifier obtained the best performance when used with
features extracted by the maximal weighted ratio or forward selection. The best feature
selection method (the one that on average gives the best results for all classifiers) is
the forward selection method, but it is also the most time consuming and classifier
dependent.
The authors also tested features that were created based on n-grams of 1, 2, 5 and 10
bytes in length. Only the first 1000 features with the highest frequency were kept. The
accuracy improvement was marginal when the feature size increased. Regarding the
number of features used, the best results were obtained by using at least 20 features,
but no more than 40.
The support vector machines were tested with 10, 20, 40 and 100 variables and 2
different selection methods: maximal distance and bi-normal separation. The best
detection rate using features selected according to the bi-normal separation method
was 94%, with a false positive rate of 1%, when the number of features was fixed at 40.
For the maximal distance, the detection rate improved from 85% to 95% when the
number of features was increased from 40 to 100. According to the authors, the support
vector machines provided the best results among all the algorithms.
2.3.5 Data mining methods for malware detection
The author of this PhD thesis [25] analyzed different methods for malware detection by
using op-codes found in binaries as features. The dataset is made up of 820 files, half
malicious and half benign. There are no encrypted or obfuscated files in the dataset,
since the author considered these not relevant for the classifiers.
Each file is represented as a feature vector where each feature is the frequency of an
opcode. There were a total of 1510421 features extracted, of which 62608 were
unique. Using a simple method for feature selection, called unary variable removal, the
author removed all the features whose frequency was below 10% of
the number of files. In addition to this, using a chi-square test of independence, which
tries to find a relation between a feature and the target class, a significant number
of features that did not prove to be related to the target class were removed. The total
number of features left is 633.
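The chi-square statistic used for this kind of feature/class independence filtering can be computed from a 2x2 contingency table; the sketch below is a generic illustration, not the thesis author's code:

```python
def chi_square(present_mal, absent_mal, present_ben, absent_ben):
    """Chi-square statistic for a 2x2 table: rows are classes (malicious,
    benign), columns are feature presence/absence. Larger values mean the
    feature is more strongly related to the class."""
    table = [[present_mal, absent_mal], [present_ben, absent_ben]]
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

Features whose statistic falls below the critical value of the chi-square distribution (for the chosen significance level) would be discarded.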
Using 70% of the data and the features extracted, three types of algorithms were
trained:
1. logistic regression
2. a multi-layer perceptron with 10 hidden units
3. a decision tree with a maximum depth of 30, where entropy was used as the split criterion
The best detection rate was obtained by the decision tree: 98.4%, with a rate of 4.9%
false positives. Similar results were obtained by the multi-layer perceptron (97.6%),
but with slightly fewer false positives (4.1%). One should note that since the testing
dataset is very small (about 250 files), the difference in percentage between the two
algorithms is equivalent to just one file.
2.3.6 Detecting unknown malicious code by applying classification techniques on opcodes representation
Instead of using n-grams that are based only on bytes, without any meaning, the
authors in [26] used n-grams based on the opcodes present in the binary file. The
paper is a study that tries to answer questions regarding the best machine learning
algorithms for malware detection, the best n-gram length and the number of features that
should be used. The dataset started with 7688 malicious files collected from the virus
heaven website and 22735 benign files. Since some of them failed to be disassembled,
the final database ended up with 5677 malicious files and 20416 benign files.
From each of these files, n-grams of different sizes (1 to 6) were extracted, each of
them made up of several opcodes. Depending on the n-gram size, the number of unique
features was 515, 39011, 443730, 1769641, 5033722 and 11948491, respectively. To find out
the right number of features that an algorithm should use, different sets were
made consisting of the top 50, 100, 200 and 300 features.
To order the features, a feature selection algorithm was needed. For this, the
authors tested three selection methods (document frequency, gain ratio, Fisher score).
Document frequency consists of computing the ratio between the frequency of an n-gram
in a file and the frequency of the n-gram that appears the most in that file.
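The document-frequency normalisation just described can be sketched as follows (an illustrative one-liner, not the paper's code):

```python
def normalized_frequency(counts):
    """Divide each n-gram count by the count of the most frequent n-gram in
    the same file, as in the paper's document-frequency measure. The most
    frequent n-gram then has value 1.0 and all others fall in (0, 1]."""
    peak = max(counts.values())
    return {gram: count / peak for gram, count in counts.items()}
```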
Using the resulting data sets, 8 different machine learning algorithms were tested using
the Weka framework: support vector machines, logistic regression, random forest, artificial
neural networks, decision trees, naive Bayes, boosted decision trees, and boosted
naive Bayes.
The authors performed many tests in order to find the best combination of n-gram
size, number of features, feature selection algorithm and machine learning algorithm.
In total, 1152 combinations were tested using 5-fold cross validation.
According to the results, the best n-gram size is 2, even though this provided
a lower detection rate than n-grams of size 3. The accuracy was better due to its
low rate of false positives. The authors also concluded that document frequency is the
best feature selection algorithm in terms of accuracy. Fisher score came very close to
document frequency, contributing to a higher true positive rate, but yielding more
false positives.
Among the classifiers, the worst were naive Bayes and its boosted version. The other
classifiers performed similarly on all n-gram lengths. When comparing the first 6
classifiers using n-grams of length 2 and a dataset of 300 features selected by document
frequency, the random forest and boosted decision tree yielded the best accuracy and
lowest false positive rate.
In the second part of the paper, the authors tested whether a dataset that consists of
n-grams of different sizes improves the accuracy. For this reason, two additional datasets were
created: the first one was created by selecting 300 features for each of the six n-gram
sizes, and the second by selecting the best 1800 features from all 6
datasets put together. Neither of these proved to be better than the dataset made up
of only n-grams of size 2.
Another interesting problem that the authors tested is the imbalance problem. According
to their hypothesis, training a classifier on a data set whose percentage
of malicious programs differs from what is found in the wild is detrimental. For
this reason, additional datasets were created where the percentage of malware varied
from 5% to 50% in steps of 5. All of the algorithms, except the neural network, had a
better false positive rate when tested on a dataset containing a percentage of malware
between 15% and 30%. This percentage is very similar to what is
found in the wild.
Finally, in order to highlight the importance of retraining the classifiers regularly, the
dataset was split with respect to the years it was collected (from 2000 to 2007). Each
classifier was trained on a specific dataset and tested on the one obtained from the
following year. All of the algorithms yielded a better accuracy when they were retrained.
An interesting observation is that in the final year (2007) the accuracy dropped
significantly, meaning there were many changes in the malware landscape.
2.3.7 A feature extraction method and recognition algorithm for detecting
unknown worms and variations based on static features
A different approach is used by the authors of [27]. The proposed solution for
malware detection (although the authors focus mainly on worms) consists of two steps:
the first one is to process the features in order to give them a score, and the second
one is to use all the scores previously computed to give a verdict (malicious or benign).
The feature scoring algorithm first normalizes each feature frequency by
dividing it by the maximum number of occurrences of that feature in the whole training
set. After that, a mean square deviation is computed for every feature:
D(A_i) = \sqrt{(E(A_i) - E(A_i^v))^2 + (E(A_i) - E(A_i^N))^2} \qquad (2.3)

where E(A_i) is the total mean frequency of the feature A_i (computed over the whole
training set), E(A_i^v) is the average frequency of the feature A_i in the benign files
and E(A_i^N) is the average frequency of the feature A_i in the malicious files.
All of the features are sorted based on this value and only a limited number of them
are used for the classification algorithm.
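The scoring of equation (2.3) translates directly into code; this is a minimal sketch of the formula, with argument names of my own choosing:

```python
import math

def feature_score(total_mean, benign_mean, malicious_mean):
    """Mean square deviation D(A_i) from equation (2.3): features whose
    per-class mean frequencies both differ strongly from the overall mean
    frequency score highest and are kept for classification."""
    return math.sqrt((total_mean - benign_mean) ** 2 +
                     (total_mean - malicious_mean) ** 2)
```

A feature whose frequency is identical in both classes scores 0 and would be discarded first.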
The malware classification part works by computing, for every feature, how probable
it is to be malicious or benign. It does this by comparing its frequency to its mean
frequency across the malicious and benign datasets. Depending on which of the classes
the feature is closer to, its mean square deviation is added to one of two control
variables (one for benign and one for malicious). Based on these control variables and
a counter that is incremented each time a suspected feature is detected, the algorithm
makes a decision.
The decision also takes into consideration a threshold, called the warning level, that can be
adjusted in order to make the algorithm more aggressive or less prone to false positives.
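A toy rendering of this two-control-variable decision rule is given below; the names and the exact way the warning level enters the decision are my assumptions, since the paper only describes the mechanism informally:

```python
def classify(features, stats, warning_level):
    """Toy control-variable decision rule. `features` maps feature name to
    the (normalized) frequency observed in the file; `stats` maps feature
    name to (benign_mean, malicious_mean, deviation)."""
    benign_score = malicious_score = 0.0
    suspected = 0
    for name, value in features.items():
        benign_mean, malicious_mean, deviation = stats[name]
        if abs(value - malicious_mean) < abs(value - benign_mean):
            malicious_score += deviation   # feature looks closer to malware
            suspected += 1                 # counter of suspected features
        else:
            benign_score += deviation
    return malicious_score > benign_score and suspected >= warning_level
```

Raising `warning_level` makes the rule demand more suspected features before flagging a file, trading detection rate for fewer false positives.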
To test the algorithm, a very small dataset was constructed of 120 benign files and 100
malicious files, from which 22 static and dynamic features were extracted. With the
threshold set to high, the true positive rate obtained is 95.71%, while the false positive
rate is 2.86%.
2.3.8 An interpretable string based malware detection system using SVM
ensemble with bagging
A different kind of approach was used by the authors of [28]. Instead of using n-grams
of bytes or opcodes, as other authors did, this time strings were used as features. To
collect these, the system needs a parser that unpacks the executables, parses the PE
header, and extracts function names and every string made of printable characters.
The features obtained are used to train a support vector machine with bagging. Bagging
means that many bootstrap samples are extracted from the training
set and an SVM is trained on each of them. The final result is given by the vote of the
majority of the classifiers.
The authors also propose to use the system for classifying the samples into more than
just 2 classes. This can be achieved by training k(k-1)/2 classifiers, where each one
of them distinguishes two classes.
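The k(k-1)/2 count comes from enumerating every unordered pair of classes (one-vs-one decomposition); a minimal sketch:

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """Enumerate the k(k-1)/2 pairwise classifiers needed to separate
    k classes with two-class models such as SVMs."""
    return list(combinations(classes, 2))

# 5 classes (as in the paper's dataset) -> 10 pairwise classifiers
pairs = one_vs_one_pairs(["benign", "backdoor", "spyware", "trojan", "worm"])
```

At prediction time each pairwise model casts a vote and the class with the most votes wins.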
For the testing phase, the authors gathered 39838 executables, of which 8320 are
benign and 31518 are malicious. From this set, 9838 were selected and used for training.
The classes of the files are: 2320 benign, 1936 backdoors, 1769 spyware, 1911 trojans and
1902 worms. In the first step, 13448 interpretable strings were extracted, but since
most of them had a very low frequency, only the first 3000 were kept according to the
max-relevance algorithm.
Based on the data, a voting system made of 100 support vector machines was trained
using bootstrap samples of 1230 executables. The resulting classifier was tested
against different measurements.
First, the detection rate on 1000 executables not used in training was compared with 4
well known antivirus products. The proposed system detected all of the 1000 malicious
files, while none of the antiviruses achieved this.
To test how well the system would behave on malicious files not related to the
malware families found in the training set, the authors gathered 500 new samples and
again compared the results with 4 well known antivirus products. The detection rate
achieved was 93.8%, while the best antivirus product obtained only 70%.
For the false alarm test, the authors selected 2000 samples (1000 benign files
that are mostly Chinese software and 1000 malicious) and tested the product against
4 known anti-virus products. The false positive rate was 1.6%. It is worth noticing
that the anti-viruses tested do not focus on the Chinese market, so their algorithms
might not have been trained on Chinese software.
A final test provided by the authors reveals that their proposed method behaves much
better than a single support vector machine, naive Bayes with bagging and decision
trees with bagging.
2.3.9 Efficient virus detection using dynamic instruction sequences
The main contribution of the paper [29] to malware detection is the way features
are chosen. Instead of using n-grams of bytes, strings or opcodes, the authors use
relations between assembly instructions. For this reason, the process of selecting features
starts with monitoring the dynamic execution of the file. This can be done in an efficient
manner by using virtual machines or hooking mechanisms. The authors report that the
method they propose can extract almost 6000 instructions per second. The assembly
instructions extracted will be only those executed and, most of the time, just a fraction
of all the instructions in the file. This is what the authors call logic assembly.
Doing so also allows the system to capture self modifying code.
Based on the logic assembly, an abstract assembly is extracted that consists of a collection
of associations between instructions. There are three types of associations
extracted, each one more restrictive than the previous. The type1 relation does
not consider instruction order; the type2 relation considers instruction order, but instructions do
not need to be consecutive (garbage instructions can appear between them); and type3 is
very similar to opcode n-grams, in which instruction order is considered and instructions
must be consecutive.
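The three relation types can be sketched for instruction pairs as follows (an illustrative reconstruction of the definitions above, not the authors' code):

```python
from itertools import combinations

def type1_relations(ops):
    """Unordered co-occurrence: instruction order is ignored entirely."""
    return {frozenset(p) for p in combinations(set(ops), 2)}

def type2_relations(ops):
    """Ordered but not necessarily consecutive pairs: tolerates garbage
    instructions inserted between the two opcodes."""
    return {(ops[i], ops[j])
            for i in range(len(ops)) for j in range(i + 1, len(ops))}

def type3_relations(ops, n=2):
    """Consecutive opcode n-grams: the most restrictive form."""
    return {tuple(ops[i:i + n]) for i in range(len(ops) - n + 1)}
```

For the sequence push, nop, call, the pair (push, call) is a type2 relation (the nop between them is tolerated) but not a type3 one, which illustrates why type2 survives simple junk-instruction obfuscation.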
The models trained use the abstract assembly (relations) as features. In order to select
only features that are important to classifiers (those that are frequent and representative
of a class), the Apriori algorithm was used. The minimum support parameter
of the algorithm (the minimum percentage of basic blocks that contain the instruction
sequence) was set to different values in the training phase. To detect whether an instruction
block is representative, the frequency of that block is measured among the benign files
and the malicious files. If the ratio of the frequencies found in those classes is large, then the
feature is chosen to be used in classification.
The classification algorithms used are support vector machines and a decision tree (where
entropy is used as the split criterion). The dataset consists of 267 malware files and 368
benign files. All tests were carried out using 5-fold cross validation.
The parameters that were tested are:
- the number of features to be used: the algorithms were tested using 10, 20, 50, 100, 200 and 300 features.
- the support level for the Apriori algorithm: the tested values were 0.003, 0.005 and 0.01.
- the number of instructions to capture: the tested values were 1000, 2000, 4000, 6000 and 8000.
- the type of relations to be used as features: all 3 types of relations were tested.
Of the two models tested, the SVM obtained on average a slightly better accuracy
than the binary decision tree (0.911 accuracy). An increase in the number of features
translated into an increase in accuracy; however, beyond 200 features, the increase was
insignificant. The best number of instructions to capture seemed to be 2000.
Any value greater than that translated into a slight decrease in accuracy. This may
be due to the fact that any increase in the number of instructions captured translates to
more noise. For the same reason, the support level value of 0.01 provided the best
results. Regarding the feature types, the type2 features (where instructions need
to be ordered, but not consecutive) proved to be the best suited across almost all
configurations. The reason for this might be that they are restrictive enough to avoid being
confused with noise, yet can still bypass obfuscation techniques. This is in contrast to
the n-grams that other researchers are using.
2.3.10 A static malware detection system using data mining methods
In order to detect malicious files, the authors in [30] extracted features that are mainly
found in the PE header. For this reason, they designed an application, named pe-miner,
that parses the pe-header and extracts different values. The file is also parsed in order
to extract imported dlls, function names and how frequently these are called.
Every feature that was gathered is sorted using the information gain and only a fraction
of them are kept. To further transform and reduce the feature set, the authors apply
a Principal Component Analysis algorithm.
The authors test three classifiers (a support vector machine, a naive Bayes and a binary
decision tree) using the Weka framework. The dataset used for training and testing is
made up of 247348 windows executable files, of which 236756 are malicious and 10592
benign. For the executables that come in compressed or protected form, an unpacked
version is obtained. The tests are performed using 10-fold cross validation. There were
several tests carried out with each kind of features (pe-header information, dlls and api
calls) and, of course, tests that involved using all features. The best detection rate was
obtained by the binary decision tree (99.5% with 2.7% false positive rate) when using
features extracted from the pe header. Using all features proved to decrease the number
of false positives for the support vector machine, but was detrimental for the decision
tree, as the detection rate dropped by 2 percentage points. The same behavior was
observed for the naive Bayes algorithm.
In order to see if the principal component analysis brings an advantage, the authors
tested the algorithms with the features before and after PCA was applied. Each time,
the true positive rate increased, but on average only by 0.7%.
2.3.11 Large-scale malware classification using random projections and
neural networks
A very interesting paper that deals with large datasets is described in [31]. The aim of
the authors is to be able to classify files as malicious or benign. A second purpose is
to classify many of the malicious files into different families. A set of 2.6 million files,
of which 1843359 malicious and 817485 benign, was gathered. Among these, there are
also samples that belong to 134 important malware families.
The feature extraction method uses a lightweight virtual machine that collects different
information, such as:
- null terminated patterns (which usually translate to strings)
- n-grams of size 3 of consecutive api calls (the authors call these trigrams)
- calls to api functions and their arguments
Extracting these data from the dataset yielded 50 million features, a number too big to be
processed by current hardware. The next step is to apply a feature selection algorithm.
The feature selection works in two steps. The first step was to compute the mutual
information [32] of each feature with respect to all 136 classes. Based on the results,
3000 features were selected that were representative for every malware family class and
another 120000 features for each of the two generic classes (malicious and benign). By
eliminating the common ones, approximately 179 thousand features resulted. Again,
this number was too big to work with.
To further reduce this number, the authors used a technique called Random Projection
[33] that transforms a vector space into a smaller one while preserving sparsity. After
this step, the number of features was 4000.
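A dense toy version of a random projection is sketched below; the random sign matrix is one common choice, my assumption rather than the paper's exact construction (which uses a sparsity-preserving variant, and typically a 1/sqrt(out_dim) scaling):

```python
import random

def random_projection(vector, out_dim, seed=0):
    """Project a high-dimensional feature vector down to out_dim dimensions
    by multiplying it with a deterministic random +/-1 matrix (generated
    on the fly from the seed instead of being stored)."""
    rng = random.Random(seed)
    out = [0.0] * out_dim
    for value in vector:
        for j in range(out_dim):
            out[j] += value * rng.choice((-1.0, 1.0))
    return out
```

Because the matrix is regenerated from the seed, every sample is projected consistently without keeping the 179000 x 4000 matrix in memory.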
For classification, the authors proposed an ensemble of machine learning algorithms.
First, in order to find non-linear correlations between features, a feed forward neural
network is used. Secondly, to classify the samples, a logistic regression algorithm that
uses a softmax function is applied.
The test was carried out using an additional set of 1.1 million files. 10% of the original
2.6 million records were used for validating the neural network. The system was trained
for 40 epochs with different variations in the number of layers (1, 2 or 3). The learning
rate was set to 0.03 and, to decrease training time, a momentum of 0.9 was used.
The numbers of hidden units in the hidden layers were 1536, 2048 and 1024,
respectively.
Different versions of the ensemble are trained: one in which no neural network or
random projection is used, one in which random projection is used and, finally, one in
which all of the algorithms are used, but with different numbers of layers for the neural
network.
Surprisingly, the best results were obtained by the combination of the random projection
and a neural network of just 1 layer: a false positive rate of 2.41% and a test error
of 12.37%; the two-class error rate (if malware files that were labeled by the family
they belong to were labeled just as malicious) was 0.5%.
The authors claim that they succeeded in reducing the two-class error rate to 0.42%
by using a voted ensemble of 9 neural networks.
Different configurations of the random projection algorithm were also tested. The results
show that if 8000 output features were used instead of 4000, the logistic regression
alone would have behaved better. However, the system as a whole behaves slightly
worse, yielding a higher two-class error rate.
2.4 Detection of non-executable files
Most of the papers that deal with malware classification problems take into consideration
only executable files. Malicious JavaScript code detection, which can be found in
both webpages and pdf files, has started to be addressed only recently.
Since these kinds of malware usually come in obfuscated forms, most of the papers
address the problem by using an emulator or something similar that can run the code
and deobfuscate it. However, this tends to limit the system to only small datasets, as
running the file can take considerable time.
An important contribution to this field are the papers [34] and [35], which explain how
the wepawet service [36], a well known system used to identify whether a file is malicious
or not, works. The authors deal with very large datasets and provide the architecture
of a system that successfully identifies malware using both static and dynamic features.
2.4.1 Caffeine Monkey: Automated Collection, Detection and Analysis of
Malicious JavaScript
A good classification of JavaScript obfuscation techniques can be found in [37]. The
authors present them in order of complexity, starting with the ones that are simplest
to detect. The techniques discussed, in order of appearance, are:
1. white space randomization
2. addition of random comments
3. string obfuscation using simple encryption techniques, like a xor operator or
a Caesar cipher, or more complex ones like RC4, string splitting, or writing strings
in hexadecimal
4. variable name randomization and function pointer reassignment (a new name is
assigned to a standard function, e.g. unescape)
5. integer obfuscation: writing numbers as the result of mathematical operations in
order to hide certain numbers with a special meaning
6. block randomization: some statements are changed to equivalent ones (e.g. a
for loop is written as a while loop)
By analyzing both malicious and benign scripts, the authors observed that the ratio
between object instantiations, use of eval, string instantiations and document writes
is very different for malicious files versus benign ones. Benign scripts use more
document.write() calls, while malicious ones rely more on string instantiations,
mainly as a result of string obfuscation. The authors also provide a tool, called Caffeine
Monkey, that hooks certain functions in order to help analysts deobfuscate malicious
scripts.
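Measuring these call ratios statically can be approximated with simple pattern counting; the marker list below is illustrative (the actual study instrumented the interpreter rather than matching text):

```python
import re

def call_counts(script):
    """Count occurrences of a few call markers whose relative frequencies
    separated benign from malicious scripts in the study."""
    markers = ("eval", "document.write", "String.fromCharCode")
    return {m: len(re.findall(re.escape(m), script)) for m in markers}
```

The per-marker counts can then be turned into ratios and compared against the profiles of known-benign and known-malicious scripts.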
2.4.2 The Rise of PDF Malware
An analysis of the use of PDF files as attack vectors is provided by the authors of [38].
They make the observation that even though the number of vulnerabilities found in the
popular pdf reader, Adobe Reader, is close to the number reported for the Microsoft
Office products, pdf documents seem to be more frequently used in attacks. While
at the beginning of 2007 the attacks that used documents were 100% based on office
documents, at the beginning of 2010 these accounted for only 5%, while the rest are
mostly attributed to pdf documents. The main reason for this change is that the pdf
format is simpler and using it can target more products (browsers).
The authors point out that the distribution of these attacks can be split into 3 categories:
1. mass-mailing: the pdf documents come as an attachment to an e-mail. Usually
the attacker also uses social engineering to increase the chances of the document
being opened
2. drive-by downloads: the pdf is silently downloaded when the user browses a web
page. This is usually accomplished by the use of exploit kits
3. targeted attacks: multiple vulnerabilities or zero-day vulnerabilities are
used in order to infect important targets
Even though most of the vulnerabilities used rely on JavaScript (so deactivating the
use of JavaScript will make the browser more secure), there are also some exploits that
incorporate other elements in the document, like a flash file. This way, by the use of
action script in the flash file, the attackers have almost the same flexibility as using
JavaScript.
Another contribution of the paper is the analysis of different obfuscation techniques
used in pdf malware. Besides the obfuscation methods employed in the
JavaScript, there are additional techniques that make use of the architecture of the pdf
format:
- the pdf format is not that restrictive, so malware authors can insert special
buffers at different locations in the file without making it invalid
- the payload can be split into different objects that are scattered inside the document
and, when needed, are referenced by name
- the pdf format supports compression and encoding, and the attackers make use of
these features by creating streams that use many nested encoders
- malware authors encrypt the payload using the AES or RC4 algorithms in order to
restrict anti-virus access to the payload
2.4.3 Detection and Analysis of Drive-by Download Attacks and Malicious
JavaScript Code
In the paper [34], the authors propose a system, called JSAND, that extracts dynamic
features from JavaScript files in order to learn normal behavior and detect any
anomalies that are specific to malicious JavaScript code. To do so, the authors observe
the steps that a drive-by attack usually takes and create features that can detect them.
The features are split into 4 categories:
1. Redirection and cloaking: many exploits redirect victims to different domains in
order to hide their traces. The following features are extracted in order to detect
this behavior: number of redirections, targets of redirections (domain names),
browser based differences (the suspect domain is visited twice with different
browsers to see if it changes its behavior)
2. De-obfuscation: most of the attacks come in an obfuscated form. To detect this,
the following features are extracted: number of calls to functions that process
strings (e.g. fromCharCode), number of calls to functions used for dynamic code
execution (e.g. eval), length of the code that is dynamically executed
3. Environment preparation: most of the exploits rely on memory corruption. To
be successful, in most cases large chunks of memory need to be allocated.
To detect this behavior, the following features are extracted: number of bytes
allocated, number of strings that are longer than 256 bytes and contain unprintable
characters
4. Exploitation: many exploits target vulnerabilities in ActiveX objects or other
browser plugins. To monitor this behavior, the following features are extracted:
number of instantiated components (plugins or ActiveX), values of attributes and
arguments in method calls (large strings or large integers), sequence of method
calls by instantiated objects
By analyzing hundreds of exploits the authors concluded that these features are robust
and every working exploit uses at least two of them.
To build a model that will be used for anomaly detection, the authors use several
algorithms from the libAnomaly library:
1. Token finder: tests whether the values of a feature fall outside an enumeration.
This is especially useful for detecting abnormal values for arguments
2. String length and occurrence counting: this helps identify the usual length of
strings or events that happen more often than normal (for example: memory
allocation)
3. Character distributions: identifies the normal distribution of characters in strings.
Since most strings in JavaScript are human readable, any encoded string will be
detected
4. Type learner: this model was developed by the authors. It tries to identify the
normal type of an argument. For example, this will detect a script that uses
a very large integer instead of a boolean value
The scores of all these models are summed and compared to a threshold.
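The final decision step is a simple score aggregation; a minimal sketch, with the individual models represented as placeholder callables:

```python
def anomaly_score(sample, models):
    """Sum the per-model anomaly scores (models are placeholder callables
    standing in for the libAnomaly models described above)."""
    return sum(model(sample) for model in models)

def is_malicious(sample, models, threshold):
    """Flag the sample when the combined anomaly score exceeds the threshold."""
    return anomaly_score(sample, models) > threshold
```

The threshold plays the same role as in the earlier systems: raising it trades detection rate for fewer false positives.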
To test the system, the authors gathered 140000 files that were grouped into different
sets:
1. a dataset consisting of 11215 benign files gathered from top ranked web pages
2. four known malware datasets:
- urls found in spam e-mails that deliver drive-by downloads
- websites that were infected through sql injection
- malware found on forums or specialized exploit websites
- malware submitted by users
3. two uncategorized datasets:
- pages obtained by crawling
- unrecognized data submitted by users
To test the false positive rate, the authors split the benign dataset into three parts, trained
the model on one portion, used the second to adjust the threshold and the third
for testing. No false positives were detected. However, when the authors tested
the model on data obtained from crawling, a false positive rate of 0.0135 was obtained.
The tests on the known malware dataset reported a detection rate of 98.9%.
2.4.4 Prophiler: a Fast Filter for the Large-Scale Detection of Malicious
Web Pages
One of the disadvantages of the previously discussed system is that it needs to run the
file in order to tell whether it is malicious or not. However, when dealing with large
datasets, this tends to take a lot of time. For this reason, the authors proposed another
classifier that works as a prefilter for the previous one. The system, called Prophiler
and presented in [35], works by statically analyzing samples and classifying them using
different machine learning algorithms.
In order to do so, the authors extracted 77 different static features that describe the
sample file from different angles:
- HTML features: mainly statistical information that can be extracted from the
HTML code. This totals 19 features, among them the number of script tags, the
number of hidden elements or of objects that occupy a very small surface of the
page, the number of elements that contain suspicious content (more than 128
characters in length with less than 5% whitespace), or the number of characters.
- JavaScript features: 25 features containing statistical information extracted from
code written in the JavaScript language. Among them are the number of
occurrences of keywords found in much script-based malware (eval, settimeout,
setinterval), the entropy of strings and scripts, or the whitespace percentage.
- URL and host features: 33 features extracted from the URL or from the
information associated with it. For example, there are syntactical features, like
the domain length, the TLD, or the presence of suspicious patterns or port numbers,
and features extracted from the who-is and DNS information, like the IP,
the TTL of the A record, or its GeoIP information.
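A few of these statistical features can be sketched as follows; the keyword list and the exact feature set shown are an illustrative subset, not Prophiler's actual implementation:

```python
import math

def shannon_entropy(s):
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def extract_features(script):
    """A small, Prophiler-style set of statistical JavaScript features."""
    keywords = ("eval", "settimeout", "setinterval")
    lowered = script.lower()
    return {
        "length": len(script),
        "entropy": shannon_entropy(script),
        "whitespace_pct": sum(ch.isspace() for ch in script) / max(len(script), 1),
        **{f"count_{k}": lowered.count(k) for k in keywords},
    }

features = extract_features("eval(unescape('%41%42'));setTimeout(f,10)")
print(features["count_eval"], features["count_settimeout"])  # 1 1
```

High entropy combined with long, low-whitespace strings is exactly the "suspicious content" signal described above, since obfuscated payloads tend to exhibit both.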
To test how the system behaves, the authors gathered a database consisting of 787
malicious web pages that deliver drive-by download attacks and 51,171 benign web
pages obtained by crawling the first two levels of the top 100 Alexa web sites.
The machine learning algorithms tested were Naive Bayes, random forest, decision trees
and logistic regression. The algorithms were first tested separately with each set of
features previously created in order to see which one brings the best results. Testing
was done using 10-fold cross validation. For the HTML features, the best model was
the binary decision tree, which yielded a true positive rate of 63% and a false positive
rate of 0.8%, while on the JavaScript features the binary decision tree obtained a true
positive rate of 78.6% and 0.4% false positives. The random forest algorithm also had
good results, obtaining a higher detection rate (81.9%) but also a slightly higher false
positive rate (0.5%). When the algorithms were used with the URL features (which were
the most numerous), the best algorithm, the binary decision tree, obtained a surprising
90.4% detection rate and only a 0.6% false positive rate.
By analyzing the results for each case, the authors concluded that the best features
are the shellcode presence probability, the presence of a decoding routine, the maximum
string length and the entropy of the strings. For the JavaScript features, the best seemed
to be the number of URLs, the number of iframes and the whitespace percentage. Finally,
for the URL features, the TLD of the domain, the absence of a subdomain in the URL
and the TTL of the host's DNS record provided the best separation.
Using all the features, the system was also tested on a validation dataset containing
153,115 pages, of which 139,321 were benign and 13,794 malicious. This test yielded a
false positive rate of 10.4% and a false negative rate of 0.54%. The analysis shows that
these errors are primarily due to the HTML features, followed by the JavaScript features.
Finally, a large-scale evaluation was carried out. The system was tested over a 60-day
period on an unlabeled dataset consisting of 18,939,908 web pages. Prophiler
labeled 14.3% of them as malicious, of which 13.7% were false positives. For the false
negative rate, 1% (162,315 pages) of the dataset was analyzed using dynamic analysis and
the results were compared to those of Prophiler. In this case, only 3 samples were missed
(0.0018%).
2.4.5 Spyproxy: Execution-based Detection of Malicious Web Content
An idea proposed in [39] is to use a proxy that filters all traffic through a virtual
machine. The virtual machine runs any code and, if it is malicious, stops it before
it reaches the user. In order to minimize the performance impact the user sees, the
page is first statically analyzed, and only if it is marked as suspicious will it be run.
The static analysis simply searches for any content that might change dynamically, such
as ActiveX. Non-HTML objects are also treated as dangerous since they might contain
vulnerabilities. By analyzing a 17-hour trace of web requests, the authors concluded
that 54.8% of the HTML pages requested do not contain dynamic content. The case in
which there is a pure HTML exploit can only be detected if static processing is disabled.
The virtual machine that processes web pages monitors the effects that visiting a ma-
licious web page can produce. This can be a crash of the browser, a modification of
registry keys, or the creation of a file on disk in an unusual location.
To test how well it behaves in a real-life situation, the authors collected 100
malicious web pages from 45 distinct websites. Spyproxy achieved a 100% detection
rate. No information is given about false positives.
The system's purpose is to protect the user, and it cannot be used for large-scale dataset
classification since it must run many of the samples. Also, Spyproxy can be bypassed by
using sleep calls or other non-deterministic methods.
2.4.6 Zozzle: Low-overhead Mostly Static JavaScript Malware Detection
The authors of [40] propose a malware detection system that runs in a browser
and classifies all rendered HTML content. To do so, the system uses the Detours binary
instrumentation library to intercept the "Compile" function from jscript.dll. This
function is used by the Internet Explorer browser each time a new script is executed
through an iframe or script tag, or even through the JavaScript keyword eval.
In order to perform classification, the authors propose a method to automatically extract
features. Each feature consists of two components:
- the context in which it appeared: in a loop, a try/catch block, etc. Different levels of
context are used: for flat features, no context is used, while for n-level features,
the whole abstract syntax tree is used.
- a string from the abstract syntax tree (only expressions and variable declarations
were selected)
The features are boolean (present in or absent from the file).
Since parsing a JavaScript file can yield many features, only the ones that are repre-
sentative of their class are selected, using the χ² algorithm.
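The χ² selection step can be illustrated with the standard 2x2 contingency-table statistic; the counts below are invented for the example:

```python
def chi_squared(n_pos_with, n_pos, n_neg_with, n_neg):
    """Chi-squared statistic for a boolean feature against a binary class,
    computed from a 2x2 contingency table (presence/absence vs class)."""
    table = [
        [n_pos_with, n_pos - n_pos_with],   # malicious: with / without feature
        [n_neg_with, n_neg - n_neg_with],   # benign:    with / without feature
    ]
    total = n_pos + n_neg
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    row_totals = [n_pos, n_neg]
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            if expected > 0:
                stat += (table[r][c] - expected) ** 2 / expected
    return stat

# A feature present in 90/100 malicious files but only 5/100 benign ones
# scores far higher than one present in roughly half of each class.
print(chi_squared(90, 100, 5, 100) > chi_squared(55, 100, 45, 100))  # True
```

Ranking features by this statistic and keeping the top scorers is the selection idea; the thresholds Zozzle actually uses are not detailed here.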
For classification, the authors use the naive Bayes algorithm. Even though this produces
an equation whose evaluation is linear with respect to the features used, the authors
suggest that string comparison should be done using an automaton in order to reduce
processing time.
In order to find out how the algorithm behaves with different settings, the authors test
it using a dataset made up of 919 malicious (deobfuscated) scripts and 7,976 benign
files gathered from the top 50 Alexa websites. Three sets of automatic features are created:
one containing 948 flat features, one containing 1,589 1-level features, and another one
made up of 2,187 n-level features. A set of 89 manual features was also created in order
to highlight the benefit of using automatic features.
In terms of accuracy, the system scored better with automatic features than with manual
features in all training cases. An observation made by the authors is that the manual
features seem to be dependent on the size of the context, while the automatic ones are
not. The false positive rate when trained with automatic features is also significantly
lower (between 0.01% and 0.02%) than when trained with manual features (between
1.5% and 4.5%). Practically, when using automatic features, false positives are almost
nonexistent. However, the results are the opposite when comparing the false negative
rate (between 5.84% and 11.08% for the automatic features versus between 1.26% and
5.14% for the manual ones). Among the three context sizes, the 1-level features (in
which only one level of the syntax tree is kept) obtained the best results.
The authors also tested how much of the training set needs to be used in order to
obtain a good classifier. For this purpose, they split the dataset into chunks of different
sizes, from 1% to 25%, and carried out different tests. The results show that a training
dataset of 10% of the original one is sufficient.
The same question was asked about the number of features to be used. The authors
performed different tests using various numbers of features, from 50 to 1,300. Using
fewer than 200 features led to poor results, while using more than 600 does not improve
the accuracy.
To highlight the low overhead the system produces, the processing time of the scripts
was also measured. Depending on the number of features, the system classifies a sample
in from 1.3 milliseconds (for 30 features) to about 7 milliseconds (for 1,300 features).
2.4.7 Pattern Mining for Future Attacks
Using data mining techniques, the authors of [41] try to build a statistical method for
identifying significant groups of tokens that occur together in malicious and benign
scripts.
Each labeled file is parsed and a set of transactions is extracted. Depending on
the class of the file, these can be positive or negative. A transaction is a collection of
items over a finite alphabet; in this situation, keywords that are relevant to JavaScript
classification.
The authors define an itemset as a collection of at most n features, where n is usually
set to 5 or 6. Each itemset has a frequency within the dataset. The first step is to select
only frequent itemsets. This is done using the Apriori algorithm with a threshold of
S/2^n, where S is the number of transactions of the itemset's class. The next step
is to reject all itemsets that have a frequency lower than the noise in the dataset. To
do so, a probability distribution is assigned to each itemset, where the probability is
computed by dividing the frequency by the total number of transactions. The authors
consider an itemset to be frequent if its frequency is greater than K/2^n, where K is
the number of transactions. In order to make the threshold more flexible, the authors
also provide a more complex formula that can be tuned using an input value. The
probability and the transaction form an item generation model (IGM).
After the transaction dataset is filtered, the next step is to create a mixture of IGMs
for each of the positive and negative transaction sets, where each IGM is assigned a
coefficient. The learning of these coefficients is done using the Expectation Maximization
algorithm.
To classify a transaction, its likelihood is computed under the two obtained models,
and the transaction is assigned a class accordingly.
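The likelihood comparison can be sketched as follows, assuming transactions are sets of items and each IGM is represented as a (coefficient, item-probability) pair; the smoothing constant and toy mixtures are assumptions for illustration:

```python
import math

def mixture_log_likelihood(transaction, mixture):
    """Log-likelihood of a transaction (a set of items) under a mixture of
    item-generation models, each given as (coefficient, item -> probability)."""
    likelihood = 0.0
    for coeff, item_probs in mixture:
        component = 1.0
        for item in transaction:
            component *= item_probs.get(item, 1e-6)  # smoothing for unseen items
        likelihood += coeff * component
    return math.log(likelihood)

def classify(transaction, malicious_mixture, benign_mixture):
    """Assign the class under whose mixture the transaction is more likely."""
    lm = mixture_log_likelihood(transaction, malicious_mixture)
    lb = mixture_log_likelihood(transaction, benign_mixture)
    return "malicious" if lm > lb else "benign"

# Toy single-component mixtures over two JavaScript-related items.
mal = [(1.0, {"eval": 0.9, "unescape": 0.8})]
ben = [(1.0, {"eval": 0.1, "unescape": 0.05})]
print(classify({"eval", "unescape"}, mal, ben))  # malicious
```

In the actual system the mixtures have several components whose coefficients are fit with Expectation Maximization; the fixed coefficients here only illustrate the final comparison step.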
To test the models, a training set of 13,500 pieces of JavaScript snippets is collected, of
which 12,704 are benign and 839 are malicious. An additional dataset, for testing, is
created that consists of 26,900 pieces of JavaScript code, of which 453 are malicious and
26,479 are benign. Among this dataset there are also two unknown zero-day vulnerabil-
ities. The set of features used consists of 34 important strings that are usually found
in malicious and benign JavaScript code. Using these features, the average transaction
size for the training dataset is 11, while for the testing dataset it is 6. After computing
the two models, the framework achieved a 90% precision rate and a recall of 70%. It
also succeeded in identifying the two unknown vulnerabilities.
2.4.8 Fast Plagiarism Detection System
Clustering files based on similar code has also been addressed by other authors in order
to detect plagiarism. To speed up plagiarism detection, the authors of [42] use a
different approach in order to reduce the time needed to compare every pair of files.
The first step is to tokenize every file and transform it into a string of tokens. The
next is to build a suffix array that contains all token suffixes found in the collection of
documents. Since the array is sorted, searching for a string in this structure requires
only a binary search.
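A minimal sketch of the token suffix array and its binary search; the tokenizer is assumed to have already run, and the naive sort-by-slice construction stands in for a proper suffix-array algorithm:

```python
def build_suffix_array(tokens):
    """All suffixes of the token sequence, sorted lexicographically,
    stored as start indices (naive O(n^2 log n) construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, pattern):
    """Binary-search the sorted suffixes for those that start with `pattern`."""
    m = len(pattern)
    def prefix(i):
        return tokens[i:i + m]
    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # lower bound
        mid = (lo + hi) // 2
        if prefix(suffix_array[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    lo, hi = start, len(suffix_array)
    while lo < hi:                      # upper bound
        mid = (lo + hi) // 2
        if prefix(suffix_array[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return [suffix_array[i] for i in range(start, lo)]

tokens = ["for", "i", "=", "0", ";", "i", "<", "n"]
sa = build_suffix_array(tokens)
print(sorted(find_occurrences(tokens, sa, ["i"])))  # [1, 5]
```

Because the suffixes are sorted, all occurrences of a pattern form a contiguous range, so each lookup costs two binary searches rather than a scan over every file.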
For every file given as input, the algorithm extracts non-overlapping strings containing
different numbers of tokens and searches for them in the suffix array in order to obtain
the files that contain each substring. A similarity score is computed as the number of
matched tokens divided by the total number of tokens in the file. If, during a search, a
file matches the same substring multiple times, the size of the matched string is extended
and only the largest one is kept. On the other hand, if multiple strings in the input file
are matched by the same string in the test file, only the largest string is kept.
Running the algorithm for each file yields sets of similar files. Since a different score
can be obtained when comparing a file "a" to a file "b" than when comparing the same
two files in reverse order, the similarity between two files is given by the average of their
two similarity scores.
In order to evaluate the system, the authors gathered 220 Java programs from
undergraduate students. By running different well-known plagiarism detection systems
(MOSS, JPlag and Sherlock), the authors considered a file to be plagiarism if at least
two out of the three mark it. Of the 155 files marked, the system successfully flagged
115 of them, a result that is better than any of the three systems run alone.
2.4.9 JPlag: Finding plagiarisms among a set of programs.
Plagiarism detection for programming code was also addressed by the authors of [43].
Their system, called JPlag, finds pairs of files, as well as zones of code, that are considered
to be similar.
The first step of the algorithm is to tokenize every file. The next step is to try to
cover the set of tokens from one file with as many tokens as possible from the other.
The percentage achieved is the similarity value. This process is carried out using an
algorithm called greedy string tiling, which works in two phases:
1. Find the longest match of tokens by iterating over every token from the first file
and comparing it to every token in the second file. If a match is found, it is extended
in order to get a match as big as possible.
2. All matches from phase 1 are marked (tokens from the source that matched will
not be used again).
The algorithm runs until no match is found. In order to decrease false positives,
a minimum match length is used.
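The two phases can be sketched as follows; this is the naive cubic-time reference version, with the min_match default chosen only for the toy example:

```python
def greedy_string_tiling(a, b, min_match=3):
    """Greedy string tiling over token lists a and b: repeatedly take the
    longest common run of unmarked tokens of length >= min_match, mark it,
    and stop when no such run remains. Returns the number of matched tokens."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    total = 0
    while True:
        best = None  # (length, i, j)
        for i in range(len(a)):          # phase 1: find the longest match
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and not marked_a[i + k] and not marked_b[j + k]
                       and a[i + k] == b[j + k]):
                    k += 1
                if k >= min_match and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            break
        length, i, j = best
        for k in range(length):          # phase 2: mark the matched tokens
            marked_a[i + k] = True
            marked_b[j + k] = True
        total += length
    return total

a = ["if", "(", "x", ")", "return", "x", ";"]
b = ["while", "(", "x", ")", "return", "x", ";"]
print(greedy_string_tiling(a, b))  # 6
```

The similarity value is then the matched-token count relative to the file sizes (here 6 of 7 tokens covered); marking prevents one token from participating in two tiles.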
The algorithm has a complexity of O(n³), which makes it unusable for large datasets.
In order to reduce the complexity, the authors propose the use of the Rabin-Karp
algorithm. To implement it, every list of consecutive tokens whose size is greater than
the minimum match length is hashed by a function. The results are saved in a
hash table specific to every file. Comparing two files can then be reduced (if collisions
are ignored) to lookups in these hash tables.
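The hash-table comparison can be sketched as follows; Python's built-in tuple hashing stands in for the Rabin-Karp rolling hash, which is enough to illustrate the lookup idea (collisions ignored, as above):

```python
def kgram_hashes(tokens, k):
    """Hash every window of k consecutive tokens; map hash -> start positions."""
    table = {}
    for i in range(len(tokens) - k + 1):
        table.setdefault(hash(tuple(tokens[i:i + k])), []).append(i)
    return table

def common_kgrams(a, b, k):
    """Count k-gram windows of `a` whose hash also appears in `b`'s table."""
    table_b = kgram_hashes(b, k)
    return sum(1 for i in range(len(a) - k + 1)
               if hash(tuple(a[i:i + k])) in table_b)

a = ["int", "x", "=", "0", ";"]
b = ["int", "y", "=", "0", ";", "int", "x", "=", "0", ";"]
print(common_kgrams(a, b, 3))  # 3
```

With a true rolling hash, each window's value is derived from the previous one in O(1), so building the table is linear; the comparison itself is then a set of constant-time lookups instead of the quadratic token-by-token scan.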
The minimum match length suggested by the authors is 9, while a cutoff criterion (the
amount of similarity between two files) of 50% yields very good results.
Chapter 3
Improving Malware Detection By Feature Extraction
Since the number of malware samples began to increase exponentially in 2006, it has been
clear that, in order to achieve a good detection rate, something had to change in the way
malware detection was being done. Traditional detection methods based on hashes were
considered obsolete even before that, so a new method was needed. A temporary
solution was the writing of heuristics that could detect clusters of samples, but even
this proved to be inefficient once the number of unique malware samples appearing
daily grew beyond what a team of malware researchers could cope with.
Nowadays, when over 5,000,000 samples appear each day, the only viable
solution for achieving a good detection rate is to use a combination of automatic
methods, like machine learning techniques, and traditional ones, like signatures based
on hashes and human-written heuristics.
However, if a machine learning algorithm is going to be used in an antivirus product,
several restrictions must be considered:
- The algorithm must provide a high detection rate. The main benefit of these
algorithms is to decrease the number of samples a malware researcher has to an-
alyze. Also, some machine learning algorithms may need a lot of resources
at training time, so the results should justify the investment.
- It should also be able to work with big data. This means that the algo-
rithm should be able to process millions of records, should finish in a reasonable
amount of time and, of course, provide good results. This restriction becomes more
evident when we take into consideration that training a machine learning al-
gorithm requires both malware and benign samples, and there are more than
300 million malware samples in the wild. Coping with big data is also the main
reason why machine learning is needed for malware detection.
- It should produce a very low number of false positives. In the antivirus industry,
false positives are a very dangerous thing. It is usually preferable to miss a sample
on an infected computer than to detect a clean file. This is because an antivirus
will delete a detected file, and removing an important document or even a system
file could bring more harm. Since restricting the false positive rate to 0 would
greatly impact detection, we will work with a false positive rate of 0.01.
Malware detection through the use of machine learning is not new; it has been used
by other researchers and is currently employed in many AV products. However, the results
obtained can be improved, either in terms of lower false positives or a higher
detection rate. The way I chose to solve this problem is not by modifying existing al-
gorithms, but rather by extracting new features from existing ones so that the accuracy
of the original algorithm increases. I chose this method because there is already
a good algorithm, named One Side Class Perceptron (OSCP) [8], which was
tested for malware detection with a low false positive rate and achieved good results.
The One Side Class Perceptron works by creating a hyperplane that separates the two
classes of samples (malware and benign) such that no samples from one class will be
misclassified. It is very well suited for our problem and, by the way it is constructed,
it guarantees that, in the training phase, there will be no false positives.
In order to make the algorithm easy to read, we will use the following notations:
1. Records → R
2. Features → F
3. Label → L
4. maxIterations → maxI
5. SingleSetClass → S (all records belonging to a specific class)
Algorithm 1 One Side Class Perceptron (OSCP)
1: function OSCPTrain(R)
2:   w ← [w1, w2, ..., wm]; wi ← 0, i = 1..m
3:   b ← 0; α ← learning rate
4:   iteration ← 0
5:   S ← ∪_{i=1}^{|R|} Ri, where Ri.L = benign files
6:   repeat
7:     TrainFunction(R, w, b)
8:     repeat
9:       TrainFunction(S, w, b)
10:      errors ← 0
11:      for i = 1 → |S| do
12:        if not IsCorrectClassified(Si, w, b) then
13:          errors ← errors + 1
14:        end if
15:      end for
16:    until (errors = 0)
17:    iteration ← iteration + 1
18:  until (iteration ≥ maxI) or (every record is correctly classified)
19: end function
Algorithm 2 Train Function
1: function Train(Records, w, b)
2:   for i = 1 → |Records| do
3:     if Records_i.L * ((Σ_{j=1}^{m} Records_i.F_j * w_j) + b) ≤ 0 then
4:       for j = 1 → m do
5:         w_j ← w_j + Records_i.L * Records_i.F_j
6:       end for
7:       b ← b + Records_i.L
8:     end if
9:   end for
10: end function
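A Python sketch of the two procedures; the representation of records as (features, label) pairs with labels ±1, and the choice of benign (label −1) as the protected class, are assumptions for illustration:

```python
def train_pass(records, w, b):
    """One perceptron pass: update w and b on every misclassified record."""
    for features, label in records:
        if label * (sum(f * wj for f, wj in zip(features, w)) + b) <= 0:
            for j in range(len(w)):
                w[j] += label * features[j]
            b += label
    return b

def oscp_train(records, n_features, max_iterations=100):
    """One Side Class Perceptron: after each ordinary training pass, keep
    training on the benign class alone until none of its records is
    misclassified (the class-minimize step)."""
    w, b = [0.0] * n_features, 0.0
    benign = [r for r in records if r[1] == -1]   # the "one side" class
    for _ in range(max_iterations):
        b = train_pass(records, w, b)
        while True:                                # class-minimize step
            b = train_pass(benign, w, b)
            errors = sum(1 for f, l in benign
                         if l * (sum(x * wj for x, wj in zip(f, w)) + b) <= 0)
            if errors == 0:
                break
        if all(l * (sum(x * wj for x, wj in zip(f, w)) + b) > 0
               for f, l in records):
            break                                  # every record classified
    return w, b

# Two boolean features; label +1 = malicious, -1 = benign.
data = [([1, 0], 1), ([1, 1], 1), ([0, 0], -1), ([0, 1], -1)]
w, b = oscp_train(data, 2)
print(all(l * (sum(x * wj for x, wj in zip(f, w)) + b) > 0 for f, l in data))
```

The inner loop is what enforces the zero-false-positive guarantee on the training set: it refuses to leave an iteration while any benign record sits on the wrong side of the hyperplane.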
A representation of how the OSCP algorithm works can be seen in Figure 3.1. The first
image describes a classic training step of the perceptron algorithm, in which a hyperplane
(in this case, a simple line) that best separates the red squares from the green circles is
found. In the second image, the hyperplane is adjusted so that no elements belonging
to one class (in this case, the green circles) are misclassified.

Figure 3.1: OSCP training process. Training step (a). Class minimize step (b).
The algorithm ends when all the elements have been correctly classified, or when a
maximum number of iterations has been reached. Even though this algorithm solves the
problem of false positives, it also dramatically reduces the detection rate. Nevertheless,
the idea of deterministically controlling the number of false positives, given the features
of each sample, is a very important component in creating a machine learning algorithm
that can be used in practice for malware detection.
So, given the OSCP algorithm (Algorithm 1), the next important question is how the
detection could be improved. In order to answer this, I have concentrated on creating
new features from existing ones, such that a better linear separability can be obtained.
3.0.1 Feature Creation Using Mapping
One simple way in which this can be achieved is called "mapping", and works by using
the result of the combination of every two features as a new feature. The operator
used in this combination can be any mathematical one, but the AND operator seems to
yield the best results. The mapping method provides, theoretically, a better detection
rate because records that were inseparable using the initial features have a higher
probability of becoming separable once the number of features has increased quadrati-
cally. However, this method also has its drawbacks. The quadratic increase in the memory
needed for training can make the algorithm impractical if the number of initial
features is too big. Even though this can be solved by generating each feature only
when it is needed, that still doesn't solve the most important problem: the time complexity.
A quadratic increase in the number of features also means that the algorithm will take
considerably more time to finish. The best way to address these problems is to start with
a number of features that allows the algorithm to finish in a reasonable time frame. If the
number of initial features is greater than that, then only the most important ones should
be selected, using feature selection algorithms like the F-score [44].
The F-score is a statistical method that measures how discriminative a feature
can be. It is based on the ratio of two other scores: the F1-score, which measures
the variability between groups of files, and the F2-score, which also measures variability,
but this time inside each group. The F-score for a feature, given the sets of clean and
malicious files, is described in Equation 3.1.
F(i) = \frac{(\bar{x}^{(+)}_i - \bar{x}_i)^2 + (\bar{x}^{(-)}_i - \bar{x}_i)^2}{\frac{1}{n_{+}-1}\sum_{k=1}^{n_{+}} \left(x^{(+)}_{k,i} - \bar{x}^{(+)}_i\right)^2 + \frac{1}{n_{-}-1}\sum_{k=1}^{n_{-}} \left(x^{(-)}_{k,i} - \bar{x}^{(-)}_i\right)^2}    (3.1)
where \bar{x}_i, \bar{x}^{(+)}_i and \bar{x}^{(-)}_i are the averages of the i-th feature over the whole, benign and
malicious data sets, respectively; x^{(+)}_{k,i} is the i-th feature of the k-th benign file, while
x^{(-)}_{k,i} is the i-th feature of the k-th malicious one.
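Equation 3.1 translates directly into code; the toy data below is invented for illustration:

```python
def f_score(benign, malicious, i):
    """F-score of feature i (Equation 3.1), given benign and malicious
    samples as lists of feature vectors."""
    pos = [x[i] for x in benign]
    neg = [x[i] for x in malicious]
    both = pos + neg
    mean = lambda v: sum(v) / len(v)
    m_all, m_pos, m_neg = mean(both), mean(pos), mean(neg)
    # numerator: between-class variability (the F1-score part)
    between = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    # denominator: within-class variability (the F2-score part)
    within = (sum((x - m_pos) ** 2 for x in pos) / (len(pos) - 1)
              + sum((x - m_neg) ** 2 for x in neg) / (len(neg) - 1))
    return between / within if within else float("inf")

# Feature 0 separates the two classes well; feature 1 is mostly noise.
benign = [[1, 0], [1, 1], [1, 0]]
malicious = [[0, 1], [0, 0], [1, 1]]
print(f_score(benign, malicious, 0) > f_score(benign, malicious, 1))  # True
```

A large F-score means the feature's class means are far apart relative to its spread inside each class, which is exactly the property feature selection is looking for.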
The procedure for feature mapping is presented in Algorithm 3:
1. sort the entire set of existing features based on a specific score (usually the F-score)
2. choose the first "n"
3. combine every selected feature with every other one using the AND operator
Algorithm 3 MapAllFeatures algorithm
1: function MapAllFeatures(R)
2:   newFeatures ← ∅
3:   for i = 1 → |R.F| do
4:     for j = 1 → |R.F| do
5:       newFeatures ← newFeatures ∪ (R.F_i ⊗ R.F_j)
6:     end for
7:   end for
8: end function
where ⊗ can be any operator; using AND we obtained the best results.
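Since AND is commutative, the mapping can be sketched with combinations-with-repetition, which yields exactly n(n+1)/2 attributes (the i = j pairs reproduce the original features, because x AND x = x):

```python
from itertools import combinations_with_replacement

def map_features(record):
    """Map a boolean feature vector to the AND of every unordered pair of
    features (with repetition): n(n+1)/2 new attributes for n originals."""
    return [a & b for a, b in combinations_with_replacement(record, 2)]

r = [1, 0, 1]
mapped = map_features(r)
print(len(mapped), mapped)  # 6 [1, 0, 1, 0, 0, 1]
```

For the 300 features used here this gives 300 * 301 / 2 = 45,150 attributes per record, matching the figure discussed in the next section.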
3.1 Feature Creation Using Genetic Programming
The dataset on which the algorithms will be tested consists of 358,144 records, of which
42,940 are malicious and 315,204 are benign. Even though the dataset is not as large
as it would be when training for real-life malware classification, I consider it big enough
to give an idea of how the algorithms will perform. Also, the proportion of malicious
and benign files is close to the one found in the wild. The files in the dataset are all
PE files, mainly executables and DLLs, which were gathered over a period of two months
(April and May 2014). The files were statically scanned and different features were
extracted, like the size of the file, the number of imports and exports, the geometry of
the file and so on. Dynamic features were also extracted, including things like whether
the file connects to the Internet, whether it writes files in certain directories, whether it
modifies the registry and so on. In total, 300 features were gathered from each file.
I prefer working with boolean attributes since they occupy less memory (just 1 bit),
are more resistant to malware changes, and every continuous value can be transformed
into a boolean one. Even though a fixed number could better identify a sample, due
to the many changes that happen within a malware family, I consider boolean features
more robust. For example, it helps more to know that a file size is between 1 and 2
kilobytes than to know it is exactly 1,327 bytes.
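Such a transformation can be sketched as follows; the bucket boundaries are illustrative assumptions:

```python
def binarize_size(size_bytes, buckets=(1024, 2048, 4096, 8192)):
    """Turn a continuous file size into boolean range features: one bit per
    boundary, set if the size falls below it. The resulting bit pattern
    encodes which range the size lies in, without keeping the exact value."""
    return [int(size_bytes < bound) for bound in buckets]

# 1327 bytes falls in the 1-2 KB range: below 2048, 4096 and 8192, not 1024.
print(binarize_size(1327))  # [0, 1, 1, 1]
```

Any file in the 1-2 KB range produces the same pattern, which is what makes the boolean encoding resistant to small per-sample variations within a malware family.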
All the files included in this dataset were filtered. For example, file infectors, a kind of
malware that adds its code to other executable files, often don't produce features
different from those of the original file. For this reason, records with the same
features but of different classes are highly likely to appear. Since this is detrimental to
how the OSCP algorithm works, it is better to eliminate them from the start. Another
type of file that was removed is adware. These resemble benign files very closely, and
there are also people who don't consider them malicious. A similar case holds for keygens,
patchers and cracks.
Mapping seems to provide good results. It increases the detection rate while the number
of false positives stays low enough to permit the algorithm to be used in practical
applications. However, a big disadvantage of this method is the number of features
it needs to create. If we store each feature in one bit, then each record takes
approximately 38 bytes of memory. Since the dataset consists of 358,144 elements,
it all takes approximately 14 MBytes. Considering the size of the database, 14
MBytes of data is quite small. However, when the mapping procedure is performed, the
number of features increases to n(n+1)/2, where n is the number of original features
(the number of all features should be divided by 2 only if the mathematical operator
used for mapping is commutative; this eliminates the duplicate features that result after
mapping). In our case, the number of attributes resulting for each record after mapping
is 45,150, which occupies almost 5.5 KBytes. For 358,144 records, the dataset occupies
1,927 MBytes. Even though 2 gigabytes of data is not a problem for today's hardware,
we must take into consideration that a dataset of 358,144 records is small compared
to those used in training machine learning algorithms for malware detection, where the
number of elements in the database can reach millions. If more features are used, the
problem gets even worse, since the size of a mapped record grows quadratically with the
number of original features (for example, if one additional feature were used, the size
of a mapped record would increase by about 37 bytes).
In the following subsection I will present a method, based on genetic programming, that
allows us to increase the detection rate of the algorithm. Even though achieving the same
results as those obtained through mapping while using just the square root of the number
of features is very hard (probably impossible for many datasets), we can still increase
the detection rate of the OSCP algorithm. The idea is to create features from existing
ones using genetic programming. To validate the results, they will be compared with
those obtained through mapping, but a more realistic measurement will also be used:
the best features obtained through the mapping process will be selected and an OSCP
algorithm will be trained on them. If the results obtained using genetically created
features are better than the ones using the best mapped features, when the number of
attributes is the same, we can say that our method has managed to create better features.
One should note that it is not always possible to select the best features that mapping
creates, since those are obtained by applying a feature selection algorithm over the
whole new dataset, which, as previously explained, could be too big to fit in memory.
In order to find the best features, we first need to see which boolean operator brings
the best results when mapping is used. For this reason, the and, or and xor operators
were tested. The results were computed by first creating the new dataset using each
boolean operator in turn and then performing a 3-fold cross validation.
Algorithm     | Detection | False Positives | Accuracy
OSCP          | 58.12%    | 17              | 94.96%
OSCP-MAP-AND  | 85.35%    | 44              | 98.21%
OSCP-MAP-OR   | 75.20%    | 20              | 97.01%
OSCP-MAP-XOR  | 85.38%    | 32              | 98.22%

Table 3.1: The results of the 3-fold cross-validation for the OSCP and OSCP-MAP
algorithms
It can easily be observed in Table 3.1 that by using mapping, with any boolean operator,
the detection rate is improved by at least 17%. This mainly happens because the increase
in the number of features improves the linear separability of the data. However, the
mapping performed with the and and xor operators increased the detection rate by
more than 27%.
The next step is to select the best 300 features and compare the results. The feature
selection is performed using the F-score (Equation 3.1), previously described. By using
this function we filter for those attributes that are very representative of one class of
files but not of the other. The results of testing the OSCP algorithm with the best 300
features selected after mapping with each boolean operator can be viewed in Table 3.2.
Again, these are obtained using 3-fold cross validation on our dataset.
Algorithm             | Detection | False Positives | Accuracy | F Used
OSCP-MAP-AND-best300  | 8.54%     | 1               | 89.03%   | 67
OSCP-MAP-XOR-best300  | 1.37%     | 4               | 88.17%   | 54
OSCP-MAP-OR-best300   | 2.41%     | 3               | 88.3%    | 92

"F Used" represents the number of unique original features being used in the new
attributes.

Table 3.2: The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score
Contrary to what we expected, the detection rate dropped seriously, even below the
one obtained using the OSCP algorithm with the original features. Even though it
appears that these new features are worse than the original ones, this is not the case.
Analyzing the created features shows that many of them are based on the same original
features. This is highlighted in the last column of Table 3.2. As can be seen from the
results of the and operator, even though we are using 300 features for training the OSCP,
we are actually using only 67 original features. This reduces diversity and increases the
probability that records belonging to different classes will have the same set of features.
This did not happen when mapping was used, since there, among the newly created
features, the original ones are also present. The easiest way to solve this problem is
to test the OSCP algorithm with 600 features instead of 300: the newly created features
plus the original ones. The results can be viewed in Table 3.3.
The best results are the ones obtained using the andoperator. For this reason we will
use this as a threshold to test our new features.
Algorithm                           Detection   False Positives   Accuracy
OSCP-MAP-AND-best300+300 original   65.53%      15                95.85%
OSCP-MAP-XOR-best300+300 original   58.87%      16                95.05%
OSCP-MAP-OR-best300+300 original    62.13%      15                95.45%
Table 3.3: The results of the 3-fold cross-validation for OSCP when used with the
best 300 features selected according to the F-score, plus the 300 original features
3.1.1 Using the boolean and operator for crossover
To produce new features from existing ones we need a function. Since the search space
is too big, we use genetic programming to create, from the existing features, new ones
that are very good at separating the records of the dataset. We start by defining the
chromosome. A chromosome is a series of original features of arbitrary length, each
feature linked to the rest through a boolean operator. Since the best results after
mapping were obtained with the and operator, we use it to link the features. More
precisely, if the original features are F = {F1, F2, F3, ..., Fn}, a chromosome is defined
as C = {G1, G2, G3, ..., Gm}, where each gene Gi is a non-zero integer (a signed
feature index) that satisfies 1 <= |Gi| <= n, for i = 1..m. What a chromosome looks
like is depicted in Figure 3.2.
Figure 3.2: Chromosome
The initial population is made up of 300 chromosomes, each initialized to one original
feature, according to Algorithm 4. This simple step is also depicted in Figure 3.3.
Algorithm 4 Initial Population
function InitialPopulation(F)
    P ← {}
    for all Fi ∈ F do
        C ← {i + 1}
        P ← P ∪ {C}
    end for
    return P
end function
Figure 3.3: Chromosome
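As an illustration of the representation used here (a sketch of my own, not code from the thesis), a chromosome can be stored as a list of signed 1-based feature indexes, and the initialization of Algorithm 4 becomes:

```python
def initial_population(num_features):
    """Build the initial population: one single-gene chromosome per
    original feature. Genes are signed 1-based indexes so that a
    later mutation can store a negated gene as a negative number."""
    return [[i + 1] for i in range(num_features)]

population = initial_population(300)
print(len(population))  # 300 single-gene chromosomes
print(population[0])    # [1], i.e. the first original feature
```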
For the algorithm to work, the fitness function, the mutation operator and the crossover
operator must be defined.
Mutation is implemented using the boolean not operator: a gene is randomly selected
from each chromosome and, with a probability of 3%, its state is negated. A chromosome
is therefore a boolean function over a set of features, each linked to the rest through
the boolean and operator, some of them negated.
Crossover is performed by randomly selecting two chromosomes from the population,
randomly choosing a splitting point on each chromosome and exchanging the genes from
one end up to the splitting point. The crossover point can fall on any gene, from 0 to
the length of the chromosome; this way, crossover is possible from the start, when there
is only one gene per chromosome. Some problems can appear when using these operators.
Before crossover it is possible that one chromosome contains a certain gene while the
other contains the same gene negated, so one of their offspring may end up with both
versions of the gene. Since the boolean and operator links the genes, that chromosome
would always evaluate to false. To prevent this from happening, if a gene exists in both
forms, negated and non-negated, it is simply removed. Another thing that can happen after
crossover is that the same gene can appear multiple times. Even though this has no
immediate impact on the result of the fitness function, it can produce very long
chromosomes without any increase in fitness score. For this reason, if a gene is
found multiple times, only one copy is retained.
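A sketch of this crossover step, together with the two repairs just described (dropping a gene that appears in both negated and plain form, and keeping a single copy of duplicates), could look like this; the function names are mine:

```python
import random

def normalize(chromosome):
    """Repair an and-chain chromosome: drop genes present in both
    forms (the chain would always evaluate to false) and keep only
    one copy of repeated genes, preserving order."""
    genes = set(chromosome)
    result = []
    for gene in chromosome:
        if -gene in genes:      # contradictory pair: remove both forms
            continue
        if gene in result:      # duplicate gene: keep a single copy
            continue
        result.append(gene)
    return result

def crossover(c1, c2):
    """Exchange the tails of two chromosomes at random cut points;
    a cut may fall anywhere from 0 to the chromosome length."""
    cut1 = random.randint(0, len(c1))
    cut2 = random.randint(0, len(c2))
    return normalize(c1[:cut1] + c2[cut2:]), normalize(c2[:cut2] + c1[cut1:])

print(normalize([3, -3, 5]))   # [5]
print(normalize([2, 2, 7]))    # [2, 7]
```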
On each generation, the chromosomes that undergo crossover are randomly selected and
1000 new offspring are created. Generating 1000 new chromosomes, instead of choosing
pairs and deciding whether to create offspring or not, allows us to explore more of the
search space. The number 1000 is chosen arbitrarily; any larger value can be used,
provided that selection does not take too long. This is highlighted in Figure 3.4.
Figure 3.4: Crossover operation
The fitness function used for selecting the best chromosomes is again the F-score
(Equation 3.1). Of all the offspring created (1000), only the best 300 are kept. This
is shown in Figure 3.5.
The steps taken to evolve the population are described in more detail in Algorithm 5.
The transition from a chromosome to a feature usable by the OSCP algorithm is
straightforward. When evaluating the chromosome against a record, each gene, which
is actually the index of an original feature, is tested against the record. For genes
represented by negative indexes (due to mutation), the result is negated. At the end,
a boolean and operation is performed over the results obtained from evaluating each
gene. The evaluation process is described in Algorithm 6.
Algorithm 5 Evolve Population
function EvolvePopulation(P)
    combinations ← 0
    newPop ← {}
    while combinations < combinationsThreshold do
        C1 ← P[Random(|P|)]
        C2 ← P[Random(|P|)]
        cut1 ← Random(|C1|)
        cut2 ← Random(|C2|)
        C1left ← C1_1 ∪ ... ∪ C1_cut1
        C1right ← C1_{cut1+1} ∪ ... ∪ C1_{|C1|}
        C2left ← C2_1 ∪ ... ∪ C2_cut2
        C2right ← C2_{cut2+1} ∪ ... ∪ C2_{|C2|}
        CLL ← {}; CLR ← {}; CRL ← {}; CRR ← {}
        if Random(100) > 50 then
            CLL ← C1left ∪ C2left
            CRR ← C1right ∪ C2right
            if Random(100) > mutationThreshold then
                index1 ← Random(|CLL|)
                CLL_{index1} ← ¬CLL_{index1}
                index2 ← Random(|CRR|)
                CRR_{index2} ← ¬CRR_{index2}
            end if
        else
            CLR ← C1left ∪ C2right
            CRL ← C1right ∪ C2left
            if Random(100) > mutationThreshold then
                index1 ← Random(|CLR|)
                CLR_{index1} ← ¬CLR_{index1}
                index2 ← Random(|CRL|)
                CRL_{index2} ← ¬CRL_{index2}
            end if
        end if
        if CLL not empty then
            newPop ← newPop ∪ {CLL}
        end if
        if CRR not empty then
            newPop ← newPop ∪ {CRR}
        end if
        if CLR not empty then
            newPop ← newPop ∪ {CLR}
        end if
        if CRL not empty then
            newPop ← newPop ∪ {CRL}
        end if
        combinations ← combinations + 1
    end while
    return newPop
end function
Figure 3.5: Selection Operation
Algorithm 6 Convert Chromosome to Feature
function ConvertChromosomeToFeature(F, C)
    F' ← {}
    for all G ∈ C do
        if G > 0 then
            F' ← F' ∪ {F_{G-1}}
        else
            F' ← F' ∪ {¬F_{-G-1}}
        end if
    end for
    return F'
end function
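The evaluation described by Algorithm 6 can be sketched in Python (my own sketch; a record is assumed to be a list of 0/1 feature values):

```python
def evaluate(chromosome, record):
    """Evaluate an and-linked chromosome on one record. Each gene is
    a signed 1-based feature index; a negative gene means the
    feature value is negated (the effect of mutation)."""
    result = True
    for gene in chromosome:
        value = bool(record[abs(gene) - 1])
        if gene < 0:
            value = not value
        result = result and value
    return int(result)

record = [1, 0, 1, 1]
print(evaluate([1, 3], record))   # F1 AND F3 -> 1
print(evaluate([1, -2], record))  # F1 AND (NOT F2) -> 1
print(evaluate([2, 3], record))   # F2 AND F3 -> 0
```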
Testing these algorithms on the dataset did not provide the expected results. In fact,
with each generation, the detection rate kept getting worse. The results after different
numbers of iterations can be seen in Table 3.4.
After analyzing the chromosomes, an interesting behavior can be observed: with each
generation, the number of original features used in the combinations decreases,
considerably shrinking the search space. Even though this could be considered a good
thing (feature selection), what actually happens is that the solution found is a local
optimum.
A graphical display of the number of features used after each generation can be
consulted in Figure 3.6.
To solve this problem the crossover operation was altered. On every generation, the
Iteration Number   Detection   False Positives   Accuracy
1                  59.16%      17                95.09%
2                  58.97%      11                95.07%
5                  45.92%      13                93.50%
10                 28.75%      7                 91.45%
15                 27.39%      2                 91.29%
20                 21.24%      3                 90.55%
25                 18.44%      0                 90.22%
50                 12.22%      1                 89.47%
75                 1.70%       0                 88.21%
90                 0.01%       0                 88.01%
100                0.00%       0                 88.01%
Table 3.4: Results of using the chromosomes resulting from different generations
Figure 3.6: Number of original features used on each generation
original features are added as chromosomes. This way, they can always be used and
allow the population to escape from a local optimum. Another reason why the detection
kept getting worse is that, by using only the resulting features, it is possible to end
up with records belonging to different classes that have the same features, which is
very detrimental to how the OSCP algorithm works. Adding the original features to the
ones resulting from the run of the genetic algorithm solves this problem. Thus, the
crossover operation becomes the one described in Figure 3.7.
Figure 3.7: Crossover operation including original features
After running the algorithm again, this time with the aforementioned modifications,
the number of original features used by the chromosomes varied between 200 and 245.
This can be seen in Figure 3.8.
Figure 3.8: Number of original features used on each generation
Another modification brought to the algorithm is to also include the parents in the
selection phase. This is done because the resulting offspring may be less fit than
their parents. For this reason, the selection is performed on 1300 chromosomes instead
of 1000. The selection phase is described in more detail in Figure 3.9.
The results after running the algorithm again are presented in Table 3.5.
Even though the results have greatly improved, the detection rate still decreases,
although more slowly. After studying the chromosomes from different generations, an
Figure 3.9: Selection Operation
Iteration Number   Detection
1                  64.67%
2                  64.83%
5                  66.15%
10                 61.83%
20                 58.40%
40                 59.60%
Table 3.5: Results at different iterations for arbitrary-length features
interesting problem was observed: the chromosomes' length kept increasing without any
benefit in fitness score. This is due to the evaluation function, which treats each
chromosome separately. Even though each of these chromosomes is very good in terms of
fitness score, they all separate the same malicious and benign files, so they are
basically one and the same feature. Using many such features is detrimental, since it
translates into using fewer distinct features for classification. An example of such
chromosomes can be seen in Table 3.6.
In Table 3.6, each line corresponds to one chromosome. The first two columns reflect
the number of benign and malicious records that each chromosome separates, while the
last column shows the indexes of the original features that make up the chromosome.
Benign Records   Malicious Records   Feature Chain
26023            583                 -237, -52, 239, 245, 246, 254
26023            583                 -237, -52, 239, 245, 246, 252, 266
26023            583                 -237, -52, 239, 245, 246, 251, 252, 254
26023            583                 -237, -52, 239, 245, 246, 251, 252, 254, 255, 266
Table 3.6: An example of the bloating problem
The indexes run from 0 to 299, since we are using 300 features. A negative number
means that mutation was performed at that position of the chromosome, so the feature
is used in negated form. As can be seen, all four chromosomes presented in Table 3.6
separate the same number of malicious and benign records. Also, the first feature chain
is a prefix of the other ones, the second is a prefix of the following two, and so on.
In fact, half of the chromosomes were created in this way. This phenomenon, called
bloating, has different fixes. In this case, we simply limited the length of the
chromosome to a constant. A slightly better approach can be consulted in the next
section. The results obtained after applying the length limitation can be seen in
Table 3.7. As can be seen, the detection rate reaches almost 70%, after which it
slowly drops.
Iteration Number   Detection
1                  66.61%
2                  66.13%
5                  66.21%
10                 66.99%
15                 69.09%
20                 68.60%
25                 67.65%
50                 67.11%
75                 65.38%
90                 65.18%
100                64.66%
Table 3.7: Results at different iterations for length-limited features
Table 3.8 presents our best results compared to the ones obtained by the algorithm we
use as a baseline (OSCP with the best 300 features obtained after mapping using the
and operator, plus the 300 original features).
Algorithm                           Detection   False Positives   Accuracy
OSCP-MAP-AND-best300+300 original   65.53%      15                95.85%
OSCP-GA-best300+300 original        66.99%      17                96.39%
Table 3.8: Comparison of the results when using features resulting from mapping
versus genetic programming
As can be observed, the features obtained through genetic programming bring an increase
of almost 4.5% in detection rate. Another advantage of this method over features
obtained through mapping is its low memory usage: obtaining the best features from
mapping requires first computing all the mapped features, which in some cases may not
be possible due to the exponential growth in memory usage.
Since these results confirm that genetic programming can obtain better results than
mapping, the next question is how many genetically obtained features are needed to
achieve the same detection rate as the mapping method. For this reason, we gradually
increase the number of features as well as the population size. The results are
provided in Table 3.9.
Number of Features   Population Size   Detection Rate   False Positives   Accuracy
1000                 3000              73.01%           21                96.74%
2000                 5000              77.16%           30                97.24%
3000                 6000              80.66%           31                97.65%
4000                 8000              81.16%           34                97.71%
Table 3.9: Results using different numbers of features
As can be seen, with 4000 features the detection rate is 81.16%, close to the 85.35%
obtained through mapping, which produces 15 times more features (45,150).
3.1.2 Using all boolean operators for crossover
Using the boolean and operator we obtained satisfying results. However, we would like
to see whether using more boolean operators translates into better results. Even though
the boolean set {and, not} is functionally complete (every other boolean operation can
be obtained using just these operators), in practice it takes too many iterations to
achieve the same results as using a single operator like or or xor. For this reason,
in this section the boolean operators {and, or, xor} are used to create features
through genetic programming.
For this, we have to change how a chromosome is implemented. A chromosome is a set of
features (1 to 300), each one connected to the others through a boolean operator. Since
the order in which the expression is evaluated changes the result, a priority is also
assigned to each boolean operator. The structure can be seen in Figure 3.10.
Figure 3.10: Chromosome
Notice that in Figure 3.10 each Fi, where i = 1..300, refers to one of the original
features, each OPi refers to one of the boolean operations {and, xor, or}, and each Pi
is a number that reflects the priority of the boolean operation within the expression.
The last operator and priority are used only when the crossover point falls at the end
of the chromosome and a new piece from another chromosome is attached.
To understand how such a chromosome is evaluated, an example is provided below.
We start with the chromosome illustrated in Figure 3.11.
Figure 3.11: Chromosome
For evaluating this expression, an ordering of the boolean operations is needed. Since
doing this for every record would be wasteful, an intermediate expression is computed
in which the operations needed for evaluation are already sorted. The intermediate
expression is built only for the first record. The steps to build such an expression
are the following:
1. ignore the last priority and operator
2. choose the maximum priority
3. evaluate the operation given by the operator corresponding to the chosen priority
and the two corresponding operands: the one at the same index as the operator and
the one at the next index
4. assign the result to a temporary variable
5. replace the two operands used with the temporary variable
6. repeat from step 2 until all the operands are the same temporary value
For the example provided, the temporary values are computed as follows:
1. T1 = F10 XOR F12
2. T2 = F8 AND F9
3. T3 = F1 AND F5
4. T4 = T3 XOR T2
5. T5 = T4 OR T1
A more detailed view of these steps is illustrated in Figure 3.12. Notice that, at the
final step, all of the features are replaced by the same temporary operand, T5.
Of course, this is just one way of evaluating a mathematical expression. If a tree were
built for the provided example, it would look like the one in Figure 3.13. The
chromosomes could also have been implemented using binary trees; however, I consider
this method more practical, since it allows mutation on every part of the chromosome
(priority, operator and feature).
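The six evaluation steps above can be sketched directly on boolean values (a hypothetical sketch of mine; in the thesis the operands are first obtained by testing the indexed features against a record):

```python
def evaluate_expression(values, ops, priorities):
    """Evaluate a chain of boolean values linked by operators with
    priorities. ops[i] / priorities[i] bind values[i] and
    values[i + 1]; the operator and priority attached to the last
    gene are ignored, as described in step 1."""
    apply_op = {
        'and': lambda a, b: a and b,
        'or':  lambda a, b: a or b,
        'xor': lambda a, b: a != b,
    }
    vals = list(values)
    ops = list(ops[:len(vals) - 1])          # step 1: drop the last operator
    prio = list(priorities[:len(vals) - 1])  # and its priority
    while prio:
        i = max(range(len(prio)), key=lambda k: prio[k])  # step 2
        temp = apply_op[ops[i]](vals[i], vals[i + 1])     # steps 3-4
        vals[i:i + 2] = [temp]                            # step 5
        del ops[i], prio[i]
    return vals[0]                                        # step 6

# (True AND False) is evaluated first (priority 2), then OR with True
print(evaluate_expression([True, False, True], ['and', 'or'], [2, 1]))  # True
```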
Figure 3.12: Chromosome example evaluation
Figure 3.13: Evaluation tree
As in the previous subsection, we first need to define the initial population. The
first generation is constructed by generating 300 chromosomes of length 1. Each
chromosome holds one of the original features, each operation is randomly chosen from
the set of boolean operators {and, or, xor} and each priority is initialized to 0.
This step is described in Algorithm 7.
After the initialization process, the population will look as illustrated in Figure 3.14.
The crossover operation is similar to the one described when using just the and operator:
Algorithm 7 Initial Population
function InitialPopulation(F)
    P ← {}
    for all Fi ∈ F do
        C.feature ← {i + 1}
        C.operation ← Random({xor, and, or})
        C.priority ← 0
        P ← P ∪ {C}
    end for
    return P
end function
Figure 3.14: Initial Population
1000 new chromosomes are generated by randomly selecting two parents and performing
crossover at a randomly selected point. The mutation operation is different, since the
chromosome is more complex. A mutation probability of 3% is considered. From the newly
generated set of 1000 chromosomes, three sets of 30 chromosomes are chosen. For the
elements of the first set, a randomly selected feature is negated. For the elements of
the second set, one of the operators is switched to a randomly chosen one from the
boolean set {and, or, xor}. Finally, for each chromosome in the final set, a priority
is chosen and, based on a randomly generated number, it is incremented or decremented.
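The three mutation kinds can be sketched as follows (my own sketch; storing the chromosome as a dict of parallel lists is an assumption, not the thesis layout):

```python
import random

OPERATORS = ['and', 'or', 'xor']

def mutate_feature(chrom, rng):
    """Negate a randomly selected feature (stored as a signed index)."""
    i = rng.randrange(len(chrom['features']))
    chrom['features'][i] = -chrom['features'][i]

def mutate_operator(chrom, rng):
    """Replace a randomly selected operator with a random one."""
    i = rng.randrange(len(chrom['ops']))
    chrom['ops'][i] = rng.choice(OPERATORS)

def mutate_priority(chrom, rng):
    """Increment or decrement a randomly selected priority."""
    i = rng.randrange(len(chrom['prios']))
    chrom['prios'][i] += rng.choice([-1, 1])

rng = random.Random(0)
c = {'features': [5], 'ops': ['and'], 'prios': [0]}
mutate_feature(c, rng)
print(c['features'])  # [-5]
```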
Based on the F-score (Equation 3.1), the best 300 chromosomes are selected. The parents
are also included in the selection process, in order not to lose the best chromosomes
found so far. The genetic steps can also be consulted in Algorithm 8.
As described in the previous subsection, when using just the boolean and operator, the
bloat problem also appears in the current approach. While there the problem was solved
by simply limiting the length of the chromosome, I consider the
Algorithm 8 Evolution process
P ← initial population
Pnew ← {}
for it ← 1, 100 do
    // Crossover
    while |Pnew| < 1000 do
        Offspring1, Offspring2 ← Crossover(P)
        Pnew ← Pnew ∪ {Offspring1, Offspring2}
    end while
    // Mutation
    Pnew ← MutateFeature(Pnew, 3%)
    Pnew ← MutateOperator(Pnew, 3%)
    Pnew ← MutatePriority(Pnew, 3%)
    // Selection
    Pnew ← Pnew ∪ P
    P ← {}
    while |P| < 300 do
        maxScore ← 0
        bestChromosome ← ∅
        for all chromosome ∈ Pnew do
            if Fitness(chromosome) > maxScore then
                maxScore ← Fitness(chromosome)
                bestChromosome ← chromosome
            end if
        end for
        Pnew ← Pnew \ {bestChromosome}
        P ← P ∪ {bestChromosome}
    end while
end for
Ptraining ← P
approach ill suited for the current case, since more operations are involved. For this
reason, the following method is used: if two different chromosomes classify the same
number of malicious and benign records, then only the shorter one survives into the
next generation. Even though relying just on these counts to decide whether two
chromosomes separate the same records is not the same as actually inspecting the files,
in practice the two measures seem closely related. Of course, if these counts are very
small, the probability of having such chromosomes that nevertheless separate different
records increases. For this reason, in order not to limit the exploration power of the
algorithm, the bloating solution is applied only if the number of malicious and benign
records exceeds a certain threshold (100 seems to work well in practice).
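The bloat-control rule above can be sketched like this (my own sketch; interpreting the threshold as a bound on the combined record count is an assumption):

```python
def prune_bloat(population, counts, threshold=100):
    """Among chromosomes that separate the same numbers of malicious
    and benign records, keep only the shortest one -- but only when
    those counts exceed the threshold, so that chromosomes active on
    few records keep their diversity."""
    shortest = {}
    survivors = []
    for chrom, (c_mal, c_clean) in zip(population, counts):
        if c_mal + c_clean <= threshold:
            survivors.append(chrom)       # too few records: keep as-is
            continue
        key = (c_mal, c_clean)
        if key not in shortest or len(chrom) < len(shortest[key]):
            shortest[key] = chrom
    return survivors + list(shortest.values())

pop = [[1, 2, 3], [1, 2], [7]]
counts = [(583, 26023), (583, 26023), (10, 5)]
print(prune_bloat(pop, counts))  # [[7], [1, 2]]
```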
In order to see how the evolved features impact the classification algorithm, at each
generation the features are extracted and used to train an OSCP. The fitness of the
features is also monitored, in order to see how the genetic algorithm is doing. The
figures below (Figure 3.15 and Figure 3.16) show how the detection rate and the false
positive rate vary while the fitness score of the population increases.
Figure 3.15: Evolution of the detection rate and false positives rate
Figure 3.16: Evolution of the tness score of each generation
As can be seen, the detection rate slowly increases at the beginning, but drops
suddenly, by almost 10%, after generation 20. From that point, the detection rate slowly
decreases until generation 60, where it remains constant at 58%. The graph of false
positives follows the one of the detection rate: the number of false alarms drops after
generation 20 and remains in the interval [11, 12] until the end of the algorithm.
Regarding the variation of the population fitness, it is not correlated with the
variation in detection rate. As observed in Figure 3.16, the population score keeps
increasing until generation 60, where it reaches a plateau and stabilizes at 6400.
Since the variation in detection rate does not follow the variation of the fitness
score, it means that the fitness function is not well correlated with the
classification algorithm.
3.1.3 Testing a better fitness function
All the chromosomes that reach the final generation have a high F-score. Since the
F-score is computed based on how many benign and malicious records each feature is
activated on, the chromosomes with a higher score are the ones that get activated on
most of the records. However, since the records are not uniformly distributed, most of
the chromosomes (if not all) end up being activated on the same subset (the largest
one) of the records. This in turn decreases the detection rate of the OSCP algorithm,
since the records belonging to other subsets become very difficult to classify. This
problem can be better explained through the image presented in Figure 3.17. As can be
seen, the features from 1 to 300 have the highest F-score, since they get activated on
the subset with the largest number of records. Those are the features that will survive
into the next generation. Features 301 and 302 get activated on a lower number of
records, so they will have a lower F-score. This means they will not pass into the next
generation. This is detrimental, since all of the first 300 features get activated on
the same subset. It would be better if two of the first 300 features were replaced with
features 301 and 302, since the OSCP algorithm would then have more distinct features
(in terms of which records a feature gets activated on) to work with.
Figure 3.17: Dierent features being activated on the same subset
There are different approaches to this problem (called a multimodal optimization
problem), but the one that was tested and provides good results is called sharing. The
idea of sharing in Genetic Algorithms was first presented in [45] and, like most
Genetic Algorithm approaches, comes from nature. The idea behind it is that if there
are too many individuals consuming a certain type of resource, then it is more
favorable for some individuals to move to another type of resource. The name sharing
comes from the way the fitness is computed: the payoff of an individual is degraded
based on how many individuals are consuming the same resource.
In our case, the resource is represented by each record. Thus, in each generation, for
every chromosome (feature) we count the number of clean and malicious records where
the feature is activated; these counts are denoted CMal and CClean. Another observation
we must take into account is that on some records many features get activated, while
on others only a few do. Since many individuals can choose to consume a versatile
record, we must prioritize those that get activated on records with a scarce number of
features. For each feature we therefore compute a second pair of scores, SMal and
SClean, that also take into account how many other features are activated on each
record. For a feature Fi, these scores are computed according to the following
equations, where Rj.Fi denotes the value of feature Fi on record Rj.
\[
C_{Mal} = \sum_{j=1}^{nr\_mal} R_j.F_i
\qquad
S_{Mal} = \sum_{j=1}^{nr\_mal} R_j.F_i \cdot \frac{1}{\sum_{k=1}^{|R_j|} R_j.F_k}
\tag{3.2}
\]

\[
C_{Clean} = \sum_{j=1}^{nr\_clean} R_j.F_i
\qquad
S_{Clean} = \sum_{j=1}^{nr\_clean} R_j.F_i \cdot \frac{1}{\sum_{k=1}^{|R_j|} R_j.F_k}
\tag{3.3}
\]
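Reading Rj.Fi as the 0/1 value of feature i on record j, the pair (C, S) for one class of records can be computed as follows (a sketch; the function name is mine):

```python
def sharing_scores(records, i):
    """records: 0/1 feature vectors of a single class (malicious or
    clean). Returns (C, S) for feature i: the plain activation count
    and the sharing score, where an activated record contributes
    1 / (number of features active on that record)."""
    c = 0
    s = 0.0
    for rec in records:
        if rec[i]:
            c += 1
            s += 1.0 / sum(rec)  # degrade the payoff of crowded records
    return c, s

mal = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
print(sharing_scores(mal, 0))  # (2, 1.5): 1/2 from record 1, 1/1 from record 2
```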
The first approach was to rewrite the F-score using SMal and SClean. For this, the
averages used in the F-score were computed using SMal instead of CMal and SClean
instead of CClean. The value of the i-th feature in the F-score was also computed by
dividing it by the number of features with value 1 for that particular record. Using
this method we were only able to reach a 65.34% detection rate with 15 false positives.
The second approach was to simply multiply the F-score by a value computed from SClean
and SMal: (SMal + SClean) / (TotalMal + TotalClean). This proved even worse, since the
number of false positives remained the same while the detection rate dropped to 61.5%.
Seeing this, we decided to try another fitness function, not dependent on the F-score.
The function defined in Equation 3.4 gives a very low (negative) fitness score to the
individuals that get activated more on the clean records than on the malicious ones;
for the others, it normalizes the difference between the number of malicious and clean
records where the feature is activated, based on how important that feature is for all
the malicious records.
\[
Fitness_2 =
\begin{cases}
-1 & \text{if } C_{Mal} = 0 \text{ and } C_{Clean} = 0 \\
C_{Mal} - C_{Clean} & \text{if } C_{Mal} < C_{Clean} \\
(C_{Mal} - C_{Clean}) \cdot S_{Mal} & \text{if } C_{Mal} \geq C_{Clean}
\end{cases}
\tag{3.4}
\]
Using this function, the detection rate increases to 69.96% while maintaining a low
number of false positives (17). However, the function presented in Equation 3.4 can be
further optimized. Since it is based on the difference between the number of malicious
and benign records where a feature is activated, it makes no distinction between a
feature activated on 1000 malicious records and 10 benign ones and a feature activated
on 5000 malicious records and 4010 benign ones. It is clear that in the first case the
feature is more representative for the malicious class. For this reason, the function
from Equation 3.4 is updated by dividing the difference by the number of malicious
records, obtaining the new fitness function in Equation 3.5.
\[
Fitness_3 =
\begin{cases}
-1 & \text{if } C_{Mal} = 0 \text{ and } C_{Clean} = 0 \\
C_{Mal} - C_{Clean} & \text{if } C_{Mal} < C_{Clean} \\
\dfrac{(C_{Mal} - C_{Clean}) \cdot S_{Mal}}{C_{Mal}} & \text{if } C_{Mal} \geq C_{Clean}
\end{cases}
\tag{3.5}
\]
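Equation 3.5 translates directly into code (my own sketch; the value -1 for a feature activated on no record follows my reading of the piecewise definition):

```python
def fitness3(c_mal, c_clean, s_mal):
    """Fitness function of Equation 3.5 (sketch)."""
    if c_mal == 0 and c_clean == 0:
        return -1.0                    # idle feature, penalized
    if c_mal < c_clean:
        return float(c_mal - c_clean)  # mostly clean: negative score
    return (c_mal - c_clean) * s_mal / c_mal

# The two contrasting cases from the text, with an illustrative s_mal = 2.0:
print(fitness3(1000, 10, 2.0))    # 1.98: strongly representative for malware
print(fitness3(5000, 4010, 2.0))  # 0.396: same difference, much weaker evidence
```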
Using this function, the detection rate increases to 73.23%, while the number of false
positives increases by just 2 units, reaching 19. The graph in Figure 3.18 presents how
the detection rate changes with each generation. The graph in Figure 3.19 presents
information regarding how each generation evolves: the yellow line represents the
minimum number of features activated on any record in each generation, while the
vertical bars represent the standard deviation of the number of features activated per
record.
As can be seen, the detection rate increases as more features are being set on the
records with a low number of activated features.
An important conclusion of this work is that adding more boolean operations did not
improve the detection rate. In fact, after analyzing the features obtained in the last
generation when using the Fitness3 function, 23 out of the best 30 features were
obtained using just the boolean not and the boolean and operations. The same thing
Figure 3.18: Evolution of the detection rate
Figure 3.19: Evolution of the number of features activated on each record
can be said about the bloating solution: even though no limit was set for the maximum
length of a feature, the best features were made of just two operands. This also
explains why the detection rate sharply increases to almost 70% in the first
generations.
3.2 Feature Creation Using Restricted Boltzmann Machine
The idea of creating features using a mathematical operator can be extended. First of
all, a mathematical operator captures only a limited kind of relation between features.
Secondly, features could be created from the combination of more than two initial
features. In this way, the newly created features may contain more information and
could prove more useful in separating the records. To address these ideas, the
mathematical operator should be replaced with a function that takes all the existing
features as arguments and produces a new feature as its result. Since the newly created
features will be used as input for a linear classifier, it is important that the
function used to create them be non-linear. Finding a function optimized for this task
may seem difficult; however, these requirements are very similar to the ones satisfied
by a restricted Boltzmann machine [46].
A restricted Boltzmann machine is a stochastic neural network composed of two layers:
one made up of visible units (the data) and another made up of hidden units (the newly
created features). Every unit from the hidden layer has a symmetrical connection with
every unit from the visible layer, but there are no connections between units of the
same layer. It is called restricted because of this lack of intra-layer connectivity.
Each neuron in the network has a binary state: 0 or 1. Figure 3.20 shows an example of
a restricted Boltzmann machine composed of 4 visible units and 3 hidden units, with
their corresponding biases.
Figure 3.20: A Restricted Boltzmann Machine
The value of each unit is generated according to a probability function. Considering
that the weights of the connections are stored in a matrix w (shared by both directions,
hence the symmetry), and that vi and hj represent the states of visible unit i and
hidden unit j, their probabilities are:

\[
p(h_j = 1 \mid v) = \frac{1}{1 + e^{-\sum_{v_i \in v} v_i w_{ij}}}
\tag{3.6}
\]

\[
p(v_i = 1 \mid h) = \frac{1}{1 + e^{-\sum_{h_j \in h} h_j w_{ij}}}
\tag{3.7}
\]
Learning in this network consists of adapting the weights such that the hidden units
are able to find relations between the visible units. This is very similar to how an
encoder works. The weights of the connections are adjusted such that, given data on
the visible layer, the hidden units receive values from which new data, very similar
to the original, can be generated. Thus the hidden units contain an encoded version of
the visible units, and the weight matrix is the means to encode and decode it. The
best-known algorithm for learning a restricted Boltzmann machine is called contrastive
divergence [47] and works in the following way. The visible units are first initialized
with data from the training set (vi) and the weight matrix is randomized. Using
Equation 3.6, the probability of every neuron from the hidden layer is computed. The
next step is to sample each hidden unit according to this probability, thus obtaining
the values of the hidden layer (hj). The values of the hidden units are the encoded
values of the visible units with respect to the current encoding configuration (the
weight matrix). The next step is to find out how the decoded data looks and to adjust
the weight matrix so that the decoded data becomes similar to the original one. So,
probabilities are generated for the visible units, given the hidden units and the
weight matrix, using Equation 3.7. By sampling according to the probabilities obtained,
a new set of data is generated (v̂i). The final step is to generate another set of
hidden units in a similar way as before, but this time starting from the newly
generated data, thus obtaining encoded values for the new data (ĥj).
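The steps above can be sketched in plain Python (a minimal CD-1 sketch under the assumptions of Equations 3.6 and 3.7: no bias terms, and the hidden probabilities are used in the negative phase, a common implementation choice):

```python
import math
import random

def cd1_step(v, w, lr, rng):
    """One contrastive-divergence update. v: list of 0/1 visible
    states; w: weight matrix indexed as w[i][j]; lr: learning rate.
    Returns a new weight matrix."""
    nv, nh = len(w), len(w[0])
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # positive phase: sample hidden units from the data (Equation 3.6)
    ph = [sigmoid(sum(v[i] * w[i][j] for i in range(nv))) for j in range(nh)]
    h = [1 if rng.random() < p else 0 for p in ph]
    # reconstruction: sample visible units from the hidden ones (Equation 3.7)
    pv = [sigmoid(sum(h[j] * w[i][j] for j in range(nh))) for i in range(nv)]
    v_hat = [1 if rng.random() < p else 0 for p in pv]
    # negative phase: re-encode the reconstruction
    ph_hat = [sigmoid(sum(v_hat[i] * w[i][j] for i in range(nv))) for j in range(nh)]
    # update: lr * (<v h> - <v_hat h_hat>), here for a single example
    return [[w[i][j] + lr * (v[i] * h[j] - v_hat[i] * ph_hat[j])
             for j in range(nh)] for i in range(nv)]

rng = random.Random(1)
w = [[0.0, 0.1], [0.2, -0.1], [0.0, 0.0]]  # 3 visible units, 2 hidden units
w = cd1_step([1, 0, 1], w, lr=0.05, rng=rng)
```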
The steps previously described, in which a new set of data and hidden values is
generated, constitute one iteration. The weight matrix is updated according to
Equation 3.8, where ε is the learning rate, n refers to the number of iterations and
⟨·⟩ denotes the mean over the training data.
Δw_ij = ε(⟨v_i h_j⟩