University of Bucharest
Faculty of Mathematics and Computer Science
Department of Computer Science
Bachelor Final Project
An Overview of Deep Learning
Algorithms for Static Malware Detection
Scientific coordinator: Andrei Pătrașcu
Graduate: Teodor-Paul Tonghioiu
1st of July 2020
Contents
1 Introduction
1.1 Deep learning
1.2 Motivation
1.3 Static vs dynamic detection
1.4 Related work
1.5 State of the art
2 Preliminaries
2.1 Concepts
2.2 Metrics
3 The framework
3.1 Building an input pipeline
3.2 Tensorboard
3.3 What-If Tool
3.4 Fairness Indicators Dashboard
3.5 HParams Dashboard
4 The dataset
4.1 Inputs
5 Preprocessing
6 Convolutional neural networks
6.1 Basic operations
6.2 A simple CNN
6.3 A ResNeXt inspired architecture
7 Artificial neural networks
7.1 Dropout
7.2 Activation functions
7.3 Learning rate
7.4 Loss and optimizer
7.5 Results
7.6 Attempts for improvement
List of Figures
3.1 Viewing scalars inside Tensorboard
3.2 What-If Tool
3.3 Fairness Indicators
3.4 HParams Dashboard
6.1 Convolution operation 1D
6.2 1D convolutional layer with valid padding and a stride of 2
6.3 Max-pooling over one channel of 1D data with a stride of one and valid padding
6.4 Legend and proposed CNN building block
6.5 A simple CNN architecture
6.6 ResNeXt inspired block
6.7 ResNeXt inspired architecture
7.1 Grid searches, each search is an epoch
7.2 Resulting ANN architecture
7.3 Predictions on part of test dataset
7.4 Approximate ROC curve
Abstract
Malware poses a threat to businesses and governments with regard to data privacy and integrity, sometimes even disrupting their normal activity. Keeping up with the newest malware variants and obfuscation techniques is a necessity for protecting one's assets and preventing large-scale attacks. Some viruses reached so many hosts that they received considerable media attention; one such virus is the WannaCry worm, which caused an outbreak back in 2017. This ransomware attack affected around 200k computers across 150 countries and lasted 4 days until it was mitigated.
Not so long ago, malware scanning was mainly heuristic and signature-based, but this approach had a lot of trouble adapting to malware changes (e.g. polymorphic malware), so an approach that generalizes better was needed. Fortunately, in recent years bigger malware datasets have been gathered and machine learning has risen in popularity, so new ways to tackle malware detection have emerged.
When machine learning is used for static malware detection, tree ensembles are usually the method of choice, but as more and more deep learning architectures are published there is an increasingly greater chance that some might fit this kind of problem better than their predecessors. The contribution this work brings to the field consists of the unique architectures presented and the way they were benchmarked.
In this document, two convolutional neural network architectures and one artificial neural network architecture are presented, along with their benchmarks and the reasons for choosing them. Out of those three, the artificial neural network performed the best, with an accuracy of 94.4%.
Abstract (translated from Romanian)
Malware infections are a serious problem for businesses and government institutions, often leading to data deletion or loss of confidentiality, or even to the interruption of normal activity. Even more worrying are large-scale cyber attacks, which affect society as a whole. One ransomware attack worth mentioning is WannaCry, which appeared in 2017 and infected approximately 200,000 computer systems across 150 countries, and for which finding a mitigation took 4 days.
Five years ago one could say that most malware detection systems used heuristic or file-signature-based methods, but at present efforts are being made to move towards more effective solutions. These methods are not satisfactory because they can be evaded by various types of malware. Fortunately, as the field of machine learning has developed considerably in recent years and datasets have grown ever larger, a transition to new methods based on this type of algorithm became possible.
In general, current static malware detection solutions rely on machine learning algorithms based on decision trees. However, since training deep networks is increasingly popular and ever more effective architectures appear every year in various areas of expertise, it is possible that some of these ideas can also be applied successfully to malware detection.
The contribution of this work consists in running experiments with the goal of analysing, developing and tuning certain machine learning models that can be used in this domain. In what follows, three deep learning networks will be presented, two convolutional and one neural, together with the results obtained and the efforts made to improve them. Of these methods, the one that gave the best results was the neural network, with an accuracy of 94.4%.
1 Introduction
1.1 Deep learning
Deep learning [4] is a field of machine learning where predictions are made using a model that has more than one layer of mathematical operations dependent on some weights. The goal of the algorithm is to minimize a loss function that evaluates the accuracy of the predictions made on a dataset. Fortunately, those operations are differentiable, and thanks to this fact there is an algorithm that can compute the gradient of the loss function with respect to the weights of the model for a sample in the dataset. This algorithm is called backpropagation [3], and the gradient it computes can be passed to an optimizer that adjusts the model weights.
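As a minimal illustration of this loop in TensorFlow (the toy weights, data shapes and learning rate below are my own, not taken from any cited work), backpropagation is exposed through tf.GradientTape:

```python
import tensorflow as tf

# Toy single-layer model: predictions depend on learnable weights.
w = tf.Variable(tf.random.normal([8, 1]))
b = tf.Variable(tf.zeros([1]))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = tf.matmul(x, w) + b   # forward pass
        loss = loss_fn(y, logits)      # evaluate prediction quality
    # Backpropagation: gradient of the loss w.r.t. the weights...
    grads = tape.gradient(loss, [w, b])
    # ...handed to the optimizer, which adjusts the weights.
    optimizer.apply_gradients(zip(grads, [w, b]))
    return loss

x = tf.random.normal([32, 8])
y = tf.cast(tf.random.uniform([32, 1]) > 0.5, tf.float32)
print(float(train_step(x, y)))
```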
1.2 Motivation
As so many new deep network architectures emerge in research, there is a chance that some might fit this problem very well, so the exploration of new detection methods is an open field. Also, on a personal level, I find it a useful skill to be able to design deep learning models, adjust parameters, analyse data, and monitor and benchmark those models.
1.3 Static vs dynamic detection
When analyzing a binary statically, we look at its contents or its metadata (also called PE features). Analyzing it dynamically requires running it in a sandbox environment, observing its behaviour and usually assigning a score of "maliciousness" to each action it performs.
A dynamic scan usually performs better and shines especially when analysing encrypted content, code based on high-level scripting languages, or code that uses advanced obfuscation techniques in general. The disadvantage of dynamic analysis is higher resource utilization.
Sometimes both techniques are used in malware scanning.
1.4 Related work
Artificial neural networks have been used before for malware detection, one such network achieving 95.2% accuracy [7] on data gathered from customers and another 98.9% [11] on the Ember dataset [2]. The paper published by Vinayakumar and Soman [11] also shows the results of 9 machine learning algorithms, out of which the neural network performed best. Satisfying results were obtained when trying to detect rare malware attacks with this kind of network [1].
I am not aware of any attempt to use a convolutional neural network over PE features.
1.5 State of the art
Tree ensemble learning algorithms are the way to go when analysing files statically (a tree ensemble is also the reference model for the dataset I used). Each antivirus has a different algorithm at its core, but, at some level, there is a good chance that it uses some sort of tree ensemble.
As a reference for the current state of the industry, we can look at Kaspersky's two-stage pre-execution file scanning [1]. They extract some lightweight PE features from the binary and then hash them with a locality-sensitive hash (similarity hash). That hash assigns the file a position in a multi-dimensional space, and that position can be used to classify the file as malware, benign, or maybe malware. This classification is based on dividing the space into buckets that may contain only malware, only benign files, or both. If the file falls into a bucket with both malware and benign files, a more cautious examination is required: the full PE features are extracted and hashed, and for that specific bucket there is a tree ensemble that classifies the file as malware or benign. This is their method for analysing malware efficiently in two stages. They also use a deep learning model (an artificial neural network) for very rare attacks with perhaps just a few samples.
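To make the control flow of such a two-stage scheme concrete, here is a minimal sketch; the hash function, bucket labels and per-bucket classifiers are placeholders I invented for illustration, not Kaspersky's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Bucket:
    contents: str                              # "only_malware", "only_benign" or "mixed"
    tree_ensemble: Optional[Callable] = None   # per-bucket classifier for mixed buckets

def lsh(features) -> int:
    # Placeholder similarity hash: nearby feature vectors share a bucket id.
    return int(sum(features) * 10) % 3

buckets: Dict[int, Bucket] = {
    0: Bucket("only_benign"),
    1: Bucket("only_malware"),
    2: Bucket("mixed", tree_ensemble=lambda f: "malware" if f[0] > 0.5 else "benign"),
}

def classify(lightweight_features, full_features) -> str:
    # Stage 1: cheap features + locality-sensitive hash -> bucket lookup.
    bucket = buckets[lsh(lightweight_features)]
    if bucket.contents == "only_malware":
        return "malware"
    if bucket.contents == "only_benign":
        return "benign"
    # Stage 2: mixed bucket, so run its dedicated tree ensemble
    # on the full (more expensive) PE feature vector.
    return bucket.tree_ensemble(full_features)

print(classify([0.07, 0.12], [0.9, 0.1]))
```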
2 Preliminaries
2.1 Concepts
With the help of Google's glossary for machine learning [5], I will briefly explain the terms used in this document.
Epoch
Training is usually done over multiple iterations over the dataset. An epoch is one such full iteration over the dataset.
Batch
This is the number of samples analyzed for one gradient update.
Overfitting
An overfit model is a model that performs well on the training dataset but fails to generalize
well. Generalization is the measure of how well a model performs on unseen data. Usually, to
measure at training time how overfit a model is, a validation dataset is extracted from training
data. We keep a record of model performance on this dataset, but never train directly on it.
Regularization
Adding penalties to the loss function when trying to learn a more complex model. The purpose
of regularization techniques is to reduce the degree of overfitting.
Early stopping
The practice of stopping training when a metric is not improving anymore.
Embedding
An embedding is the representation of a categorical feature, usually in a lower-dimensional
space.
Activation function
Usually, a non-linear function which is applied element-wise to the outputs of a layer.
Fully-connected layer
The operation performed by a fully-connected layer is:

output = activation(dot(input, weights) + bias)

for n units in the output space (neurons), where "weights" and "bias" are parameters learned by the model, "dot" represents the dot product and "activation" is an activation function. The "input" is the whole previous layer in the form of a vector. The activation function is optional, at least in most machine learning frameworks.
Synonym for Dense layer.
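For reference, the same computation written out directly; a toy sketch with made-up sizes, matching the formula above:

```python
import numpy as np

def dense(inputs, weights, bias, activation=np.tanh):
    # output = activation(dot(input, weights) + bias)
    return activation(np.dot(inputs, weights) + bias)

x = np.random.rand(16)        # previous layer as a vector
w = np.random.rand(16, 4)     # 4 units in the output space
b = np.zeros(4)
print(dense(x, w, b).shape)   # (4,)
```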
Dropout layer
A layer which randomly discards inputs from the previous layer with a certain probability.
Batch normalization
A layer which normalizes the output of the previous layer, mapping the values from their original distribution to one with zero mean and unit variance (estimated per batch).
Artificial neural network
A network that is based upon artificial neurons. From an engineering perspective, one neuron is
one of the outputs of a fully-connected layer. I use this term to address networks that are mostly
comprised of fully-connected layers.
Convolutional neural network
A network that is made mostly of convolutional layers. More about how convolutional layers
work can be found in section 6.1.
Hyperparameter
Parameters that cannot be learned by the model and must be provided prior to training.
Grid search
The search refers to finding the best hyperparameters for model training. The grid is related to
how you choose values for tuning. A grid search would imply that you provide a set of values
for each parameter and try every possible combination for those values.
10
Random search
Similar to grid search, but the values are chosen randomly from an interval. It is best to begin
the tuning with a random search followed by a grid search.
2.2 Metrics
Throughout the document, positives are considered malware samples and negatives benign samples. Therefore a true positive is malware correctly identified, a true negative a benign file correctly classified, a false positive a benign file predicted as malicious, and a false negative undetected malware.
True positive rate
Also known as sensitivity, recall or hit rate, describes the proportion of positives correctly clas-
sified out of all positives.
Sensitivity = TruePositives / Positives = TruePositives / (TruePositives + FalseNegatives) = 1 - FalseNegativeRate
True negative rate
Also known as specificity or selectivity, describes the proportion of negatives correctly classified
out of all negatives.
Specificity = TrueNegatives / Negatives = TrueNegatives / (TrueNegatives + FalsePositives) = 1 - FalsePositiveRate
False positive rate
Also known as fall-out or false alarm rate, describes the ratio between the number of negative samples misclassified and the total number of negative samples.

FalsePositiveRate = FalsePositives / (FalsePositives + TrueNegatives) = 1 - TrueNegativeRate
False negative rate
Also known as miss rate, describes the ratio of misclassified positives to all positive samples.
FalseNegativeRate = FalseNegatives / (FalseNegatives + TruePositives) = 1 - TruePositiveRate
Positive predictive value
Also known as precision, describes how often a predicted positive is a true positive. It is specifically relevant for imbalanced datasets.

Precision = TruePositives / (TruePositives + FalsePositives) = 1 - FalseDiscoveryRate
Accuracy
Accuracy is an overall estimate of performance.
Accuracy = (TruePositives + TrueNegatives) / (Positives + Negatives)
F1 score
The harmonic mean of precision and sensitivity. Often used to compare performance among different models.

F1 = 2 · PositivePredictiveValue · TruePositiveRate / (PositivePredictiveValue + TruePositiveRate)
F1 = 2 · TruePositives / (2 · TruePositives + FalsePositives + FalseNegatives)
PR curve
Short for precision-recall curve. It is the curve described by the graph of precision vs recall
plotted against all possible classification thresholds. The classification threshold is the threshold
above which a sample is predicted to be positive.
ROC curve
Short for receiver operating characteristic curve. It is the curve described by the graph of true
positive rate vs false positive rate plotted against all possible classification thresholds.
AUC
Short for area under curve. Most of the time, it is the value of the area under the ROC curve.
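As a compact reference, all of the rates above follow directly from the four confusion-matrix counts; a minimal sketch (the counts passed in at the end are arbitrary):

```python
def binary_metrics(tp, tn, fp, fn):
    # Positives = tp + fn, Negatives = tn + fp
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "fpr": fp / (fp + tn),                       # 1 - specificity
        "fnr": fn / (fn + tp),                       # 1 - sensitivity
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

print(binary_metrics(tp=93, tn=96, fp=4, fn=7))
```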
3 The framework
I chose to experiment with TensorFlow as it is a popular framework for deep learning, is backed by many developers and has good analytics. I was mostly interested in using the Keras API for its speed when designing models and its flexibility.
I was able to monitor and log the training metrics inside Tensorboard, analyse the data and extract some post-training metrics using fairness indicators, and log results when tuning hyperparameters (or adjusting architectures). I also took a look at the structure of my data, viewed some plots (like the ROC curve) and observed how modifying data affects predictions using the What-If Tool.
3.1 Building an input pipeline
I also wanted the data to be in a format that is easy to use with the framework and ready for a multi-input model, with each file section having its own processing flow.
That format is tfrecord. It was designed to make it easy to build an input pipeline that can train on a distributed system and reduce I/O latency, but most importantly it enables access to many tools.
The tf.data API makes it easy to read the dataset from multiple files, shuffle the data at training time, and batch and prefetch the data (prefetch to a device like a GPU).
When building a tfrecord dataset, the data is serialized to a protocol buffer format. At training time the data is deserialized, and the task of interpreting it, more precisely establishing which parts are the inputs, the labels or the sample weight, is assigned to the programmer, who must provide a mapping function.
It is also possible to use this API with data taken from other sources, but the data I was using was loaded by default as a numpy.memmap, which is not efficient to read and takes a while to load into RAM.
Using this API made training much faster, especially for simpler models, and gave me easy access to other helpful tools.
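A minimal sketch of such a pipeline; the file names, feature names and shapes below are placeholders, not the actual Ember layout:

```python
import tensorflow as tf

feature_spec = {
    # Hypothetical per-section features and a label; the real spec
    # depends on how the tfrecords were written.
    "histogram": tf.io.FixedLenFeature([256], tf.float32),
    "strings": tf.io.FixedLenFeature([104], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.float32),
}

def parse(serialized):
    # The programmer-provided mapping function: deserialize one record
    # and split it into (inputs, label).
    example = tf.io.parse_single_example(serialized, feature_spec)
    label = example.pop("label")
    return example, label

dataset = (
    tf.data.TFRecordDataset(["train-0.tfrecord", "train-1.tfrecord"])
    .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(buffer_size=10_000)                 # shuffle at training time
    .batch(512)                                  # samples per gradient update
    .prefetch(tf.data.experimental.AUTOTUNE)     # overlap I/O with training
)
```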
3.2 Tensorboard
Tensorboard is a visualization toolkit for machine learning, which is designed to work easily with TensorFlow but can also be used with other machine learning frameworks. It supports displaying graphs of scalar values over time, profiling model performance, displaying images, playing audio and displaying histograms, and it serves as an interface for a few other tools like the following ones.
Figure 3.1: Viewing scalars inside Tensorboard
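For Keras models, scalars like the ones in figure 3.1 come from a logging callback; a minimal sketch with a tiny stand-in model and random data (the log directory name is arbitrary):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Scalars (loss, accuracy) are written per epoch and viewed with:
#   tensorboard --logdir logs
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/run-1", histogram_freq=1)
x, y = np.random.rand(64, 4), np.random.randint(0, 2, 64)
model.fit(x, y, epochs=3, callbacks=[tb], verbose=0)
```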
3.3 What-If Tool
The What-If Tool is helpful for getting a quick overview of the data and an idea of how different features might affect model predictions, and it can display a few fairness statistics.
For each feature in the dataset it displays the range in which it resides, the average value, the standard deviation and the percentage of zero values.
When analyzing how a feature change would affect a prediction, the model has to be served using TensorFlow Serving. TensorFlow Serving sets up a server which serves model predictions over a REST API or gRPC, the latter being used by this tool.
(a) Prediction visualization
(b) Metrics
(c) Data
Figure 3.2: What-If Tool
3.4 Fairness Indicators Dashboard
This allows the computation of common fairness metrics for classification problems. The API is especially useful when comparing model performance under different thresholds and when trying to analyze performance over different slices of data.
Figure 3.3: Fairness Indicators
3.5 HParams Dashboard
Using the tensorboard.plugins.hparams API, the results of different experiments can be logged along with their specific parameters. This allows for better documentation and easier interpretation of random and grid searches.
(a) Table view
(b) Parallel view
Figure 3.4: HParams Dashboard
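A minimal sketch of logging one run's hyperparameters and result with this API (the parameter names, ranges and values are arbitrary examples):

```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_UNITS = hp.HParam("units", hp.Discrete([128, 256]))
HP_DROPOUT = hp.HParam("dropout", hp.RealInterval(0.1, 0.5))

def log_run(run_dir, hparams, accuracy):
    # One directory per experiment; the HParams dashboard aggregates them.
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)                             # record the configuration
        tf.summary.scalar("accuracy", accuracy, step=1)  # record the result

log_run("logs/hparam-run-0", {HP_UNITS: 256, HP_DROPOUT: 0.3}, accuracy=0.944)
```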
4 The dataset
I used the Ember [2] (2018) open dataset published by Endgame. The dataset consists of PE features extracted from 1M binary files: 800K training samples (300K malicious, 300K benign, 200K unlabeled) and 200K test samples (100K malicious, 100K benign), all from 2018 or before.
The dataset comes with a pre-trained model for reference. The model was trained with LightGBM (a gradient boosting framework) and reaches an accuracy of 97.8% on the dataset.
The publishers also offer an API to extract features from executables in the same way they did when building the dataset.
4.1 Inputs
The dataset provides the PE features of the files, i.e. file sections related to: imports, exports, data directories, strings, general information, header, sections, byte histogram and byte entropy. Although the publishers provide them as one big list of floats, I chose to break that list into several lists, each related to a file section, and used those lists as my model inputs, as sketched below. This gave me more flexibility in choosing the architecture.
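A minimal sketch of this split; the section names and lengths below are placeholders of my own, the real boundaries come from Ember's feature layout:

```python
import numpy as np

# Hypothetical (name, length) layout of the flat feature vector.
SECTION_LENGTHS = [("histogram", 256), ("byteentropy", 256),
                   ("strings", 104), ("general", 10), ("header", 62)]

def split_features(flat_vector):
    """Break the flat float list into one array per file section."""
    sections, offset = {}, 0
    for name, length in SECTION_LENGTHS:
        sections[name] = np.asarray(flat_vector[offset:offset + length])
        offset += length
    return sections  # each entry feeds a separate model input

inputs = split_features(np.random.rand(688))
print({k: v.shape for k, v in inputs.items()})
```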
5 Preprocessing
The data was converted from the format provided by Ember to tfrecords.
The input data was scaled in such a way that the values of different sections would be numerically closer to each other.
I shuffled the data before training and used a slice of 5k samples from the beginning of the shuffled dataset as validation data.
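A sketch of the validation split, assuming the data is already shuffled and held as arrays; the scaling function is a stand-in for the section-balancing scaling described above, and the array sizes are reduced for illustration:

```python
import numpy as np

def scale(x):
    # Stand-in scaling: bring values into a comparable numeric range.
    return np.log1p(np.abs(x)) * np.sign(x)

data = scale(np.random.rand(50_000, 688))   # stand-in for the 800K training matrix
labels = np.random.randint(0, 2, 50_000)

val_x, val_y = data[:5_000], labels[:5_000]        # 5k validation slice
train_x, train_y = data[5_000:], labels[5_000:]    # never trained on val_x
```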
6 Convolutional neural networks
Hoping that increasing the dimensionality of the data and applying different transformations at the same layer might reveal some patterns, I tried two CNN architectures.
Some details about the implementation are discussed in the following chapter, as they are mostly the same as those used for the ANN.
6.1 Basic operations
Convolutional layer
Figure 6.1: Convolution operation 1D

A convolution operation consists of an element-wise multiplication of part of the input with a convolutional filter. The convolutional filter is an array with the same rank as the input but of a smaller total dimension.
The input of a convolutional layer consists of multiple channels and the output of multiple feature maps. The number of feature maps is equal to the number of kernels used. A feature map is generated by applying convolution operations on each channel and then combining their results by addition.
Some parameters decide how the convolutional layer will behave. The stride is the delta in each dimension by which we move the convolution to the next slice of data. We can use padding schemes like same-padding, which pads the input with 0's such that the dimensionality is preserved. When using valid padding, some data might get dropped.
Figure 6.2: 1D convolutional layer with valid padding and a stride of 2
Pooling layer
A pooling layer serves to compress the data by reducing the dimension of each channel of the input. We move a frame of pool-size length over each channel and output the result of a pooling function (like the maximum of the selected elements). We move this frame by a chosen number of strides and use a padding scheme.
A pooling layer usually follows a convolutional layer.
Figure 6.3: Max-pooling over one channel of 1D data with a stride of one and valid padding
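A minimal Keras sketch of these two operations stacked; the filter counts, kernel sizes and input shape are arbitrary, not taken from the architectures below:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # 16 kernels of width 3 -> 16 feature maps; valid padding, stride 2.
    tf.keras.layers.Conv1D(filters=16, kernel_size=3, strides=2,
                           padding="valid", activation="relu",
                           input_shape=(256, 1)),   # 256 steps, 1 channel
    # Compress each feature map: max over a window of 2, stride 1.
    tf.keras.layers.MaxPooling1D(pool_size=2, strides=1, padding="valid"),
])
model.summary()  # shows how each layer shrinks the length dimension
```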
6.2 A simple CNN
(a) CNN block
(b) Legend
Figure 6.4: Legend and proposed CNN building block
Figure 6.5: A simple CNN architecture
This model reached only 78.44% test accuracy, 81.94% training accuracy and 87.22% validation accuracy. It took 1 epoch of training to reach this result (a few minutes). Other metrics:
• False positive rate: 0.06
• False negative rate: 0.37
• True positive rate: 0.63
• True negative rate: 0.94
• AUC: 0.90
* These metrics were extracted after model export.
We can observe that this model performs very badly at identifying malware (poor sensitivity/true positive rate) and well when presented with benign files (good specificity/true negative rate).
6.3 A ResNeXt inspired architecture
When trying to design an architecture I thought about using one that had already been used with success in another area, like image classification. Unfortunately, a network similar to the original one exceeded my hardware capabilities and training was very slow, so I ended up using only one block similar to the one in the original paper [12].
I say similar because, due to bad performance, I had to give up batch normalization and use SELU activations instead.
This architecture is rather a hybrid between a CNN and an ANN.
Figure 6.6: ResNeXt inspired block
Figure 6.7: ResNeXt inspired architecture
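To make the grouped-branch idea concrete, here is a minimal Keras sketch of a ResNeXt-style block on 1D inputs. The SELU substitution follows the text above; the cardinality, branch widths and input shape are my own simplifications, not the exact block from figure 6.6:

```python
import tensorflow as tf
from tensorflow.keras import layers

def resnext_like_block(x, cardinality=8, branch_filters=4):
    """Aggregate `cardinality` parallel conv branches, then add a shortcut."""
    branches = []
    for _ in range(cardinality):
        b = layers.Conv1D(branch_filters, 1, activation="selu")(x)   # reduce
        b = layers.Conv1D(branch_filters, 3, padding="same",
                          activation="selu")(b)                      # transform
        branches.append(b)
    merged = layers.Concatenate()(branches)
    # Project back to the input width so the residual addition is valid.
    merged = layers.Conv1D(x.shape[-1], 1, activation="selu")(merged)
    return layers.Add()([x, merged])

inputs = tf.keras.Input(shape=(256, 32))
outputs = resnext_like_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```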
The results were not impressive in this case either: 79.95% test accuracy, 83.11% training accuracy and 84.74% validation accuracy. It took 4 epochs of training to reach this result (about 25 minutes).
7 Artificial neural networks
Unlike CNNs, ANNs have been used with a degree of success in malware detection. My results are close to those of a previously mentioned experiment [7] done on a different dataset (95.2% accuracy, 0.1 FPR), but much worse than another one on the same dataset (98.9% accuracy).
After doing a few grid searches I was able to decide upon an adjusted architecture.
(a) First grid search ANN
(b) Second grid search ANN
Figure 7.1: Grid searches, each search is an epoch
Figure 7.2: Resulting ANN architecture
7.1 Dropout
Using dropout layers [6] usually improves generalization capability and helps prevent overfitting. To a certain degree it helped my model, but I still had to stop training early; otherwise the model would have overfitted to the training data, even though validation accuracy would have kept improving.
7.2 Activation functions
I was inspired by another paper [7] to use PReLU activations, as they do not saturate (do not squeeze the output for large numbers) and handle negative values better than ReLU activations. Also, I did not find any great improvement in choosing another activation function.
7.3 Learning rate
I opted to use a decaying learning rate, for which I set the rate and decay following another grid search:

learning_rate = rate / (1 + decay_rate · step / decay_step)

where rate and decay_rate are hyperparameters, step is the index of the current batch, and decay_step was set to the number of steps per epoch.
After running a grid search (full training) I found the optimal values to be 10⁻⁴ for rate and 0.75 for decay_rate.
7.4 Loss and optimizer
The loss function was binary cross-entropy and the optimizer was Adam with β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁵ (default values).
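Putting the last two subsections together, a minimal Keras sketch of this configuration; the steps-per-epoch value and the model body are stand-ins (the real architecture is the one in figure 7.2):

```python
import tensorflow as tf

steps_per_epoch = 1_555  # placeholder; set to len(train_data) // batch_size

# learning_rate = rate / (1 + decay_rate * step / decay_step)
schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=1e-4,
    decay_steps=steps_per_epoch,
    decay_rate=0.75,
)

optimizer = tf.keras.optimizers.Adam(
    learning_rate=schedule, beta_1=0.9, beta_2=0.999, epsilon=1e-5)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, input_shape=(688,)),
    tf.keras.layers.PReLU(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=optimizer, loss="binary_crossentropy",
              metrics=["accuracy"])
```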
7.5 Results
This model reached an accuracy of 94.4% (96.36% training, 96.85% validation). Another advantage of this model is that it can be trained in a few minutes. Training ran for 3 epochs, with early stopping triggered once validation accuracy improved by less than 0.5% per epoch.
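The stopping rule expressed in Keras terms; one hedged reading of the criterion above, using the built-in callback:

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",    # watch validation accuracy...
    min_delta=0.005,           # ...and require at least 0.5% improvement
    patience=0,                # stop as soon as an epoch fails the bar
    restore_best_weights=True,
)
# model.fit(train_dataset, validation_data=val_dataset,
#           epochs=50, callbacks=[early_stop])
```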
Using TensorFlow model analysis I was able to extract other relevant metrics:
• False positive rate: 0.04
• False negative rate: 0.07
• True positive rate: 0.93
• True negative rate: 0.96
• AUC: 0.98
7.6 Attempts for improvement
Having a decently performing model, I attempted to augment the dataset using the unlabeled samples. We can observe from figures 7.3 and 7.4 that at a low certainty threshold the model is very accurate in its predictions, so using its predictions at a threshold of 1% certainty would introduce very little noise into the data; a sketch of this pseudo-labelling step follows. Unfortunately, no improvement was observed after augmentation, neither at a low threshold nor when labelling all the data.
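A minimal sketch of that augmentation step, assuming sigmoid outputs, so a 1% certainty band means keeping only predictions below 0.01 or above 0.99:

```python
import numpy as np

def pseudo_label(model, unlabeled_x, certainty=0.01):
    """Keep only unlabeled samples the model is very sure about."""
    probs = model.predict(unlabeled_x).ravel()
    confident = (probs < certainty) | (probs > 1.0 - certainty)
    new_x = unlabeled_x[confident]
    new_y = (probs[confident] > 0.5).astype(np.float32)  # model's own labels
    return new_x, new_y  # append these to the training set and retrain
```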
Looking at the ROC curve in figure 7.4 also suggests that using another classification threshold would not yield any significant improvement in accuracy, and that using the threshold to lower the false positives or false negatives would not come without penalties.
The model suffers from overfitting in the sense that training until there is no improvement in validation accuracy results in an overfit model with an accuracy of around 0.9 on the test set and 0.97 on the training and validation datasets. I tried to use L2 regularization [10] and batch normalization [9] to improve the generalization ability, but I did not observe any substantial improvement in test accuracy.
Figure 7.3: Predictions on part of test dataset
Figure 7.4: Approximate ROC curve
Final thoughts
Overall, boosted trees [8] and neural networks are good malware classifiers, which is why they are used in the industry.
There are so many things to learn about deep models, how they are currently developed and deployed in production, and how data and results are analyzed, that I could spend years exploring which models work better than others for this problem. It is worth mentioning that I tried a few other "exotic" techniques developed in recent years, but I did not cover them here because they seemed to perform worse than those presented.
Although a very popular and powerful framework, TensorFlow still has years of maturing ahead. A big step forward was the release of API version r2.0.0 one year ago, bringing eager execution (vs manually built graphs), abandoning the tradition of using global namespaces and switching from sessions to tf.functions. Yet when custom functionality is needed, it still offers only a low-level API, and diving deeper into the source code is often necessary to find solutions. Still, the framework allows advanced model analysis, offers the best performance on the market, allows for distributed training and has support for building full training pipelines, but I would also say that the workflow for achieving this is quite complex.
I think the decision to treat each file section as a separate input capped the performance I was able to obtain from my models, but I could not have known that until I had made every effort to get the best performance.
The results obtained are not conclusive: there are neural networks that perform better than those I trained, so there is still a lot of room for improvement. The way PE features are embedded deserves more attention, testing should be done on models that take all features as inputs to the first layer, and more complex architectures should be thoroughly analyzed. Also, building a full training pipeline with TensorFlow Extended would be an interesting experience.
References
[1] AO Kaspersky Lab. Machine learning methods for malware detection. 2020.
[2] H. S. Anderson and P. Roth. EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models. ArXiv e-prints, Apr. 2018.
[3] I. Goodfellow, Y. Bengio, and A. Courville. 6.5 Back-Propagation and Other Differentiation Algorithms. MIT Press, 2016. http://www.deeplearningbook.org.
[4] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[5] Google. Machine learning glossary. https://developers.google.com/machine-learning/glossary.
[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[7] Joshua Saxe and Konstantin Berlin. Deep neural network based malware detection using two dimensional binary program features. ArXiv e-prints, 2015.
[8] M. Saberian, P. Delgado, and Y. Raimond. Gradient boosted decision tree neural network. arXiv:1910.09340v2, 5 Nov 2019.
[9] P. Luo, X. Wang, W. Shao, and Z. Peng. Towards understanding regularization in batch normalization. ICLR, 2019.
[10] T. Tanay and L. Griffin. A new angle on L2 regularization. arXiv:1806.11186v1, 2018.
[11] Vinayakumar R. and Soman K. P. DeepMalNet: Evaluating shallow and deep networks for static PE malware detection. https://www.sciencedirect.com/science/article/pii/S2405959518304636?via%3Dihub.
[12] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.