
Identifying the Visitors with Data Mining Methods
from Web Log Files

Uğur Gürtürk
Department of Software Engineering
Firat University
Elazig, Turkey
ugurgurturk23@gmail.com

Muhammet Baykara
Department of Software Engineering
Firat University
Elazig, Turkey
muhammetbaykara23@gmail.com

Murat Karabatak
Department of Software Engineering
Firat University
Elazig, Turkey
[anonymized]

Abstract—The data stored in web search engine and website transaction logs can provide researchers with valuable information about the searched content and user behavior. Within this context, information that is important for a network structure, such as access time and access type, can be obtained. This is especially beneficial in designing information systems, developing interfaces, and improving the information architecture of content collections. In this paper, a set of samples covering one month of access records of Firat University's website is collected and used. The sample set is cleaned with a log parser application developed for the data cleansing phase, which lies at the core of data mining. The cleaned data were converted to CSV format and analyzed using the BayesNet classifier, which provided the best performance among the classifiers in the WEKA software.
Keywords—Behavior Analysis; Data Mining; Information Extraction; Internet Access Log.
I. INTRODUCTION
The use of computers and the Internet in today's information age has spread rapidly in parallel with the vast development of technology. This widespread use has led many processes to be implemented electronically, which in turn causes the data in digital environments to grow rapidly on a daily basis and brings about increasingly complex data. Such complex digital data generally need to be analyzed. For this reason, the necessity of applying data mining techniques to web technologies has increased, and an increasing number of researchers have focused on this issue.
Data mining can be described as a process used to transform raw data into useful information [1]. It is a powerful technology that helps businesses, organizations, and individuals focus on the most important information in their data warehouses, anticipate future trends and behaviors, and make useful and sound business decisions [2]. Data mining essentially uses sophisticated mathematical algorithms to partition data, evaluate the likelihood of future events, and support efficient decisions and actions [3]. Data mining techniques can be implemented quickly on existing software and hardware platforms to enhance the value of existing information resources, and they can be integrated with new products and systems. The basic features of data mining are [4]:
• Automatic detection of patterns,
• Estimation of potential results,
• Creation of usable information,
• Focus on large datasets and databases,
• Answering questions that cannot be resolved with simple query and reporting techniques.

Fig. 1. Data mining process steps (data samples → pre-processed data → reduced data → patterns → knowledge)

Figure 1 shows the data mining process steps. Among these, the first and most important part of data mining is the data selection process, which prepares the data to be analyzed for decision making. The data selection step is one of the most time-consuming steps of data mining and processing. In this step, the data generated in the system should be well chosen, and a good analysis should be applied to ensure the correctness of the decision to be taken [5].

Another important step in implementing a successful data mining method is preprocessing. In preprocessing, data should be presented in a form convenient for future use [6]. The success achieved at this phase strongly affects the quality of the final result. When preprocessing is performed correctly and efficiently, a clear and precise result can be obtained.

After the preprocessing phase, the step of obtaining useful and realistic information is data reduction. This step reduces data that is not suitable for use in subsequent processing steps, even for data that has already passed through certain preprocessing steps and been transformed into the required format [7].

In order to properly implement a data mining method, it is necessary to apply certain approaches to obtain the reduced data [8]. In this step, one or more known data mining techniques can be applied to the reduced data. More accurate and clear information can be obtained by combining different data mining methods.

Once data mining techniques have been applied to the obtained data, the produced results can be interpreted. Whether the resulting interpretations are correct can also be determined by the results of other data mining techniques applied to the same data [8]. In other words, it should be determined which of the methods applied to the database yields the more accurate result. Comparing the success rate achieved by the applied methods with the success rates of other methods found in the literature ensures that the best results are obtained and that these results can be verified.
II. LOG FILES
Network devices used in information systems and computer networks keep track of the events occurring on them. The records stored on these devices, called logs, capture the behavior of the users accessing the system. By means of these records, it is possible to detect security incidents in the network and to take various precautions. The transaction log is a file that contains the communications between a system and its users. It can also be described as a data collection method that automatically captures the type, content, and time of transactions between a person or terminal and a system. More explicitly, a transaction log is an electronic record of interactions between a web search engine and the users searching for information on that search engine [9].
Logging is the process of saving activity information to a web log file when a web user sends a request to the web server. The main source of raw data is the web access log, hereafter referred to as the log file. These log files were initially kept for error analysis, yet they have spread across a wide range of applications with the increased size and complexity of electronic processes.
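As a concrete illustration (our own sketch, not the paper's parser code): the access records used later in this study are in the NCSA log format, whose common form can be decomposed line by line with a simple pattern. The sample line below is fabricated.

using System;
using System.Text.RegularExpressions;

class NcsaLogLine
{
    // NCSA common log format: host ident authuser [date] "request" status bytes
    static readonly Regex Pattern = new Regex(
        @"^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] " +
        @"""(?<request>[^""]*)"" (?<status>\d{3}) (?<bytes>\S+)");

    static void Main()
    {
        // Fabricated sample line, for illustration only.
        string line = "192.0.2.10 - - [12/Mar/2017:10:15:32 +0300] " +
                      "\"GET /index.php HTTP/1.1\" 200 5120";

        Match m = Pattern.Match(line);
        if (m.Success)
        {
            Console.WriteLine("Host:    " + m.Groups["host"].Value);
            Console.WriteLine("Time:    " + m.Groups["time"].Value);
            Console.WriteLine("Request: " + m.Groups["request"].Value);
            Console.WriteLine("Status:  " + m.Groups["status"].Value);
        }
    }
}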
Network and security components running in information systems also generate many logs every day. When the logs of servers and clients are added, it becomes almost impossible to report manually this amount of potentially valuable information about visitors' movements and traffic. These actions can be associated with other data movements and traffic information and analyzed in order to produce meaningful results. When the information cannot be assessed effectively, the monitoring process cannot be achieved in a real and complete sense, and investments in uncontrollable information systems are not satisfactory.
The format of the log files, which are sensitive for management purposes and required by law to be kept under restricted access, may vary according to the type of log source. The text-based log files, which are independent of the server platform, can be found in three different places [10]:
1) Web servers,
2) Web proxy servers,
3) Client browsers.

Server log files: These usually provide the most complete and accurate usage data. However, they have two important deficiencies:
• These logs contain private individual information. For this reason, server owners usually keep them restricted.
• These log files do not record visits to cached pages. Cached pages are served from the local store of browsers or proxy servers, not from web servers.
Proxy log files: Proxy servers receive HTTP requests from users, forward them to a web server, and relay the results returned by the web server back to the user. Proxy logs have three major disadvantages:
• The proxy server is hard to deal with; doing so requires advanced network knowledge and programming skills, such as TCP/IP know-how.
• The interception of requests is largely limited.
• Weblogging system performance is degraded because each page request must be processed by a proxy simulator.
Client log files: Participants remotely test a website by downloading special software that records web usage or by modifying the source code of an existing browser. HTTP cookies can also be used for this purpose; such a log is a piece of information generated by a web server and stored on the user's computer for future access. The drawbacks of this approach are:
• The design team needs to deploy custom software and have end users install it.
• This technique makes it difficult to achieve compatibility with a range of operating systems and web browsers.
A great deal of valuable information can be gathered from the log records generated by the server and the client, which hold much potentially valuable information about visitors' movements and traffic. Some of the statistics that can be obtained are [10]:
• Who logs in at a certain time interval,
• Whether there is a change in the hostname, IP address, MAC address, etc. of the connecting computer,
• Information about who gets which IP address,
• Which pages these IP addresses access,
• Whether a remote connection to the system occurs,
• Who established a VPN connection to the system and when,
• Information on who accesses which file,
• Whether any of the accessed files have been deleted,
• Information on successful password changes,
• Information about whether a computer account or user account has been created.
The fields of the weblog access file that can provide such information are given in Table 1; a brief sketch of how some of these statistics can be computed follows the table.
TABLE 1. LOG FILE FIELDS
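Several of the statistics listed above reduce to simple aggregations over the parsed log fields. The following C# sketch is purely illustrative; the Entry shape and the sample values are our assumptions rather than the exact fields of Table 1.

using System;
using System.Linq;

class AccessStatistics
{
    // Illustrative parsed-entry shape; not the paper's actual schema.
    class Entry
    {
        public string RemoteHost;
        public DateTime Time;
        public string Page;
    }

    static void Main()
    {
        var entries = new[]
        {
            new Entry { RemoteHost = "192.0.2.1", Time = new DateTime(2017, 3, 12, 10, 0, 0), Page = "/index.php" },
            new Entry { RemoteHost = "192.0.2.1", Time = new DateTime(2017, 3, 12, 10, 5, 0), Page = "/news.php" },
            new Entry { RemoteHost = "192.0.2.2", Time = new DateTime(2017, 3, 12, 11, 0, 0), Page = "/index.php" }
        };

        // How many requests did each host make?
        foreach (var g in entries.GroupBy(e => e.RemoteHost))
            Console.WriteLine(g.Key + ": " + g.Count() + " request(s)");

        // Who was active within a given time interval?
        var from = new DateTime(2017, 3, 12, 10, 0, 0);
        var to = new DateTime(2017, 3, 12, 10, 30, 0);
        var active = entries.Where(e => e.Time >= from && e.Time < to)
                            .Select(e => e.RemoteHost)
                            .Distinct();
        Console.WriteLine("Active in the window: " + string.Join(", ", active));
    }
}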

III. MATERIALS AND METHODS
Classification can be defined as the prediction of a specific outcome from certain attributes, starting from training data. To estimate the outcome, a classification algorithm operates on a set of attributes and a training set that includes the relevant outcome, often called the target or predicted attribute. The algorithm tries to discover relationships between attributes that are likely to predict the outcome. The algorithm is then given a previously unseen dataset, called the prediction set, which contains the same set of attributes except for the unknown target attribute. The algorithm analyzes the input and generates a prediction. The accuracy of the prediction indicates how "good" the used algorithm is [11].
After the preliminary data mining phases, the selection of parameters and of the dataset to be tested will affect the performance of the resulting model. Therefore, the result of the comparison depends on the chosen classification algorithm. The dataset used in this study consists of a log file resulting from accesses to Firat University's website. In our work, the proxy server stores the log data in a text format with the .log file extension, consisting of Internet users' access logs. This file, in the NCSA log format, is more than 1 GB in size; therefore, data mining methods were applied to a selected set of samples. As a result of the operations performed on this dataset, analyzed with the WEKA program, the highest performance was obtained from the BayesNet classification method. In the following sections, these operations are discussed in detail.
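To make the train/predict contract described at the start of this section concrete, the following deliberately trivial C# sketch implements a majority-class baseline in the spirit of WEKA's ZeroR. It is our illustration only, not one of the methods compared in this study.

using System;
using System.Collections.Generic;
using System.Linq;

class MajorityClassifier
{
    private string majority;

    // "Training": learn the most frequent target value in the training set.
    public void Train(IEnumerable<string> targets)
    {
        majority = targets.GroupBy(t => t)
                          .OrderByDescending(g => g.Count())
                          .First().Key;
    }

    // "Prediction": every unseen instance receives the majority class.
    public string Predict()
    {
        return majority;
    }

    static void Main()
    {
        var clf = new MajorityClassifier();
        clf.Train(new[] { "user-a", "user-b", "user-a" });
        Console.WriteLine(clf.Predict()); // prints "user-a"
    }
}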
A. Bayesian Network Classifiers
Bayesian network classifiers are a special type of Bayesian network designed for classification problems. A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) [12]. Characterized as a graphical model that effectively encodes the joint probability distribution of a large set of variables, the Bayesian classifier provides an efficient representation of the multivariate probability distribution of a set of random variables and supports various calculations based on this representation [13].
BayesNet is a model class used extensively to represent probabilistic information. In a BayesNet, edges represent conditional dependencies, and unconnected nodes represent conditionally independent variables.

p(X) = \prod_{i=0}^{n-1} p(X_i \mid \Pi_{X_i})    (1)
In formula (1), X = (X_0, ..., X_{n-1}) is the vector of variables, \Pi_{X_i} denotes the set of parents of X_i in the network, and p(X_i \mid \Pi_{X_i}) is the conditional probability of X_i given its parents. This distribution can be used to create new examples from conditional and marginal probabilities; a short worked example is given after the list below. The benefits of Bayesian networks are as follows [14]:
• They do not need prior information about the problem.
• They have the ability to preserve a high level of interaction among variables.
• They enable architectural blocks to be efficiently combined and fused together in accordance with a specified layout.
• They use data modeling to estimate the joint distribution of solutions that appear favorable in terms of the outcome.
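As a brief worked illustration of formula (1) (the network structure and the numbers here are our own, chosen purely for illustration), consider a three-variable chain X_0 \to X_1 \to X_2, where the only parent of X_1 is X_0 and the only parent of X_2 is X_1. Formula (1) then factorizes the joint distribution as

p(X_0, X_1, X_2) = p(X_0)\, p(X_1 \mid X_0)\, p(X_2 \mid X_1).

Assuming p(X_0{=}1) = 0.3, p(X_1{=}1 \mid X_0{=}1) = 0.8, and p(X_2{=}1 \mid X_1{=}1) = 0.5, the joint probability is

p(X_0{=}1, X_1{=}1, X_2{=}1) = 0.3 \times 0.8 \times 0.5 = 0.12.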
IV. DEVELOPED APPLICATION AND ANALYSIS
All operations made by visitors on the site are recorded by the server, where the user access information, called the log, is stored. These recorded data are turned into useful information through specific analyses; this phase is known in the literature as web mining. These activity logs are used by system administrators to record noteworthy events that occur on the system. The type of information stored in any activity log is often a function of the purpose of the monitoring application or tool used to create and update the log. In other words, creating activity logs for different types of system activity is achieved using different types of monitoring tools [15].
These daily event data, which are held by different monitoring tools, are kept in different file types. Significant data can be obtained from the stored access records. The acquisition of these data has gained importance both in terms of statutory requirements and standards and in terms of system performance, and has come to be regarded as a responsibility. In this regard, in accordance with Turkish law no. 5651, log analysis should be performed professionally in order to keep the log files together with the time stamps of the obtained files. The first step of these analyses, the log parsing process, is important in extracting meaningful data from log records, because these records are cluttered and contain irrelevant parts [16].
In order to obtain meaningful data from the existing log records, the first step is to parse the log file. For this purpose, a Log Parser application was implemented in the C# programming language on Visual Studio 2015. Using this application, the log file is first loaded into the system, then parsed with the Parse Library and transferred into a form suitable for analysis. The parsed log data can be saved with the .XLS extension for the file to be exported by the users. Figure 2 shows a general diagram of the system developed in this study.
(Pipeline: Log Files → Parsing the Log File → Sample Selection → Export to Excel → Conversion to CSV Format → Transfer to WEKA and Analysis)

Fig. 2. Developed application steps
A screenshot of this software is shown in Figure 3.

Fig. 3. Screenshot of the developed log parser application
(Flowchart: Begin → display the log file path → specify the path of the file to be exported in Excel format → parse with the LogParser library, extracting fields such as date and time → convert to an Excel table → save to the specified file → End)
Fig. 4. Flowchart of the developed log parser application
The existing dataset is cleaned with the developed log parser software. Since the data required by the applied data mining techniques is represented only in some parts of the access records, the consistent and necessary data should be taken from the log records and the remnants should be cleaned up. Figure 4 shows the flow chart of the developed log parser software.
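A simplified sketch of the cleaning, sample-selection, and CSV-export steps is given below. This is our illustration, not the actual parser code; the file names, field positions, and exported columns are assumptions, and only the more-than-20-requests threshold is taken from this study.

using System.IO;
using System.Linq;

class LogCleaner
{
    static void Main()
    {
        // Read the raw access log (the file name is illustrative).
        string[] raw = File.ReadAllLines("access.log");

        // Data cleansing: keep only well-formed lines. In the NCSA common
        // format, the status code is the second-to-last space-separated field.
        var rows = raw.Select(l => l.Split(' '))
                      .Where(f => f.Length >= 7)
                      .Select(f => new { Host = f[0], Status = f[f.Length - 2] })
                      .ToList();

        // Sample selection: drop hosts with 20 or fewer requests, mirroring
        // the threshold used in this study.
        var kept = rows.GroupBy(r => r.Host)
                       .Where(g => g.Count() > 20)
                       .SelectMany(g => g);

        // Export to CSV for WEKA (the header names are illustrative).
        using (var w = new StreamWriter("access.csv"))
        {
            w.WriteLine("RemoteHostName,StatusCode");
            foreach (var r in kept)
                w.WriteLine(r.Host + "," + r.Status);
        }
    }
}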
With this application, as in similar studies, data reduction and processing have been realized as initial steps of data mining in order to provide important information to website and system administrators [17].

Fig. 5. Opening of the log dataset in WEKA and distributions of the RemoteHostName class
The log data transferred into the appropriate file format is converted to .CSV format online. Then, the .CSV data file is opened with the WEKA software. All available log data instances are classified with all applicable algorithms in WEKA, and the results of these classifications are compared according to their performance values. Table 2 shows the performance results. The most successful result of this comparison is provided by the BayesNet classifier presented in Section III. With the BayesNet classifier, the predictions, which will be mentioned in the conclusion section, are made regarding certain information in the accesses to the Firat University website on certain days and dates. In WEKA, the RemoteHostName attribute is used as the class during the classification phase with BayesNet, and the estimates are made accordingly. Table 2 gives the results of classification with the other classifiers.
TABLE 2. THE PERFORMANCE OF DIFFERENT CLASSIFICATION
ALGORITHMS

As shown in Figure 5, the classification used the following fields of the log file: DataTime, the date-time information; Request, the page the user requested; StatusCode, the status of the response to the request; BytesSent, the number of bytes sent by the server; Referer, the page the user visited before arriving at the current page; and User-Agent, information about the browser the user was using. 1514 rows of data were used in this analysis; these data instances were subjected to a cleaning process to discard information that could not be used during the analysis phase. The analysis was performed on the 36 different IP addresses that made more than 20 accesses in the dataset; the records of users with 20 or fewer accesses were deleted. According to the information contained in the log file, this analysis achieved a 79.0621% performance value over the 36 different user classes. The overall results of the analysis are shown in Table 3.

TABLE 3. THE PERFORMANCE RATE OF THE ANALYSIS

Table 4 shows a confusion matrix, also known as the error matrix in machine learning and statistical classification, obtained from the classification result. The confusion matrix is a table structure that allows the performance of the algorithm to be visualized. Each column of the matrix represents the instances of a predicted class and each row represents the instances of an actual class [18], or vice versa. Since this matrix covers 36 different IP addresses and it is not possible to reproduce it in its entirety, a specific part of it is shown in Table 4.
TABLE 4. CONFUSION MATRIX
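Since only a part of the 36×36 matrix can be reproduced, the following C# sketch shows how an overall accuracy such as the one reported above is derived from a confusion matrix; the 3×3 values are made up for illustration.

using System;

class ConfusionMatrixAccuracy
{
    static void Main()
    {
        // Made-up 3-class confusion matrix (rows = actual, columns = predicted);
        // the matrix in this study has 36 classes, one per IP address.
        int[,] cm =
        {
            { 50,  3,  2 },
            {  4, 45,  6 },
            {  1,  5, 40 }
        };

        int correct = 0, total = 0;
        for (int i = 0; i < cm.GetLength(0); i++)
            for (int j = 0; j < cm.GetLength(1); j++)
            {
                total += cm[i, j];
                if (i == j) correct += cm[i, j]; // diagonal holds correct predictions
            }

        // Overall accuracy = sum of the diagonal / total number of instances.
        Console.WriteLine("Accuracy: " + (100.0 * correct / total).ToString("F4") + "%");
    }
}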

V. CONCLUSION AND EVALUATION
As a result of the widespread use of the Internet, the analysis of the log files stored on web servers has become important. The increasing use of the Internet makes it necessary to analyze log files in order to understand and resolve many events, especially security issues, performance problems, judicial processes, and so on. By analyzing log files we are informed about many issues, such as detecting security violations and collecting evidence, monitoring performance, and detecting successful and unsuccessful accesses. This shows how important log files are for system administrators. Data mining techniques such as merging, clustering, and classification can be applied only to the regular user groups of interest to find frequently accessed patterns, resulting in less time and memory usage, which in turn produces higher accuracy and performance.

In this study, an analysis of collected log files was carried out. A sample set was selected from the access log data; with the developed parser application, it was segmented according to its columns and transferred into the appropriate format. This data was converted to CSV format over the Internet and then transferred to the WEKA software. Access records from addresses with 20 or fewer requests at different times were removed from the system, and addresses with more than 20 requests were included in the study. According to the information contained in the log file, the analysis achieved a 79.0621% performance value over 36 different user classes. Accordingly, it is predicted that in the case of any malicious activity carried out in the future, the user who performs this activity can be detected using the stored log records. For example, it may be possible to determine which user is performing a malicious activity, rather than merely which computer in a laboratory environment it originates from. As future work, it is suggested to enlarge the size of the log file that can be uploaded to the system when the Log Parser application parses it and transfers it into the appropriate form. The basic log file used in this study was more than 1 GB in size; however, the developed parser software does not support the decomposition of data of such a large size. Therefore, further work will be needed to eliminate this deficiency and achieve higher performance with more data.
REFERENCES

[1] B. J. Jansen, "Search log analysis: What it is, what's been done, how to do it," Library & Information Science Research 28.3, pp. 407-432, 2006.
[2] M. Spiliopoulou, "The laborious way from data mining to web log mining," Computer Systems Science and Engineering 14.2, pp. 113-126, 1999.
[3] I. Çınar, M. S. Çınar, H. Ş. Bilge, "Web Sunucu Loglarının Web Madenciliği Yöntemleri ile Analizi," Akademik Bilişim'14 – XIV. Akademik Bilişim Konferansı Bildirileri, 2014.
[4] G. G. Emel, Ç. Taşkın, "Veri madenciliğinde karar ağaçları ve bir satış analizi uygulaması," Eskişehir Osmangazi Üniversitesi Sosyal Bilimler Dergisi 6(2), 2005.
[5] M. Çağlayan, B. Çekirge, D. Birant, P. Yıldırım, "Mobil Uygulama ile Görüntü İşleme ve Veri Madenciliği Tekniklerine Dayalı Melanom Tahmin Desteği Sağlanması," Akıllı Sistemlerde Yenilikler ve Uygulamaları Sempozyumu (ASYU), 2014.
[6] U. Ekim, "Veri madenciliği algoritmalarını kullanarak öğrenci verilerinden birliktelik kurallarının çıkarılması," doctoral dissertation, Selçuk Üniversitesi Fen Bilimleri Enstitüsü, 2011.
[7] M. S. Aktas, O. Kalıpsız, "Veri Madenciliğinde Özellik Seçim Tekniklerinin Bankacılık Verisine Uygulanması Üzerine Araştırma ve Karşılaştırmalı Uygulama," Proceedings of the 9th Turkish National Software Engineering Symposium (UYMS 2015), Yasar University, Izmir, Turkey, September 9-11, 2015.
[8] G. Akçapınar, "Çevrimiçi Öğrenme Ortamındaki Etkileşim Verilerine Göre Öğrencilerin Akademik Performanslarının Veri Madenciliği Yaklaşımı İle Modellenmesi," 2014.
[9] B. Baesens, G. Verstraeten, D. Van den Poel, M. Egmont-Petersen, P. Van Kenhove, J. Vanthienen, "Bayesian network classifiers for identifying the slope of the customer lifecycle of long-life customers," European Journal of Operational Research 156.2, pp. 508-523, 2004.
[10] S. Çalışkan Kırmızıgül, İ. Soğukpınar, "K×KNN: K-Means ve K En Yakın Komşu Yöntemleri İle Ağlarda Nüfuz Tespiti," EMO Yayınları, pp. 120-124, 2008.
[11] N. Friedman, D. Geiger, M. Goldszmidt, "Bayesian network classifiers," Machine Learning 29.2-3, pp. 131-163, 1997.
[12] V. Muralidharan, V. Sugumaran, "A comparative study of Naïve Bayes classifier and Bayes net classifier for fault diagnosis of monoblock centrifugal pump using wavelet analysis," Applied Soft Computing 12.8, pp. 2023-2029, 2012.
[13] R. R. Bouckaert, Bayesian Network Classifiers in Weka, Hamilton: Department of Computer Science, University of Waikato, 2007.
[14] E. Ardıl, "Esnek hesaplama yaklaşımı ile yazılım hata kestirimi," 2009.
[15] G. Giuseppini, "Log parser," U.S. Patent Application No. 10/461,672, 2003.
[16] M. Baykara, R. Daş, "Web Sunucu Erişim Kütüklerinden Web Ataklarının Tespitine Yönelik Web Tabanlı Log Analiz Platformu," Fırat Üniversitesi Mühendislik Bilimleri Dergisi, Cilt 28, Sayı 2, 2016.
[17] T. Özseven, M. Düğenci, "Log Analiz: Erişim Kayıt Dosyaları Analiz Yazılımı ve GOP Üniversitesi Uygulaması," International Journal of Informatics Technologies 4.2, 2011.
[18] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment 62.1, pp. 77-89, 1997.
