"ALEXANDRU IOAN CUZA" UNIVERSITY, IAȘI
FACULTY OF COMPUTER SCIENCE

FINAL PAPER

PORTABLE PERFORMANCE ANALYSIS FRAMEWORK

proposed by
Cosmin-Adrian Poenaru

Session: July 2019

Scientific coordinator
Conf. Dr. Gavriluț Dragoș Teodor
Approved,
Bachelor's thesis supervisor,
Conf. Dr. Gavriluț Dragoș Teodor.
Date: ………………………. Signature: ……………………….
DECLARATION regarding the originality of the content of the bachelor's thesis

I, the undersigned, Poenaru Cosmin-Adrian, residing in Romania, Iași county, Iași municipality, Bulevardul Chimiei, no. 4, bl. C1, sc. B, 6th floor, ap. 65, born on 4 May 1997, identified by CNP [anonimizat], graduate: [anonimizat] of the "Alexandru Ioan Cuza" University of Iași, Faculty of Computer Science, computer science specialization, class of 2019, declare on my own responsibility, being aware of the consequences of false statements in the sense of art. 326 of the New Penal Code and of the provisions of the National Education Law no. 1/2011, art. 143 par. 4 and 5 regarding plagiarism, that the bachelor's thesis entitled PORTABLE PERFORMANCE ANALYSIS FRAMEWORK, written under the supervision of Conf. Dr. Gavriluț Dragoș Teodor, which I am going to defend before the committee, is original, belongs to me and that I take full responsibility for its entire content.

I also declare that I agree that my bachelor's thesis may be verified by any legal means in order to confirm its originality, consenting also to the inclusion of its content in a database for this purpose.

I have been informed that the sale of scientific works for the purpose of helping the buyer falsely claim authorship of a bachelor's, diploma or dissertation thesis is forbidden, and in this respect I declare on my own responsibility that this thesis has not been copied but is the result of my own research.

Date: ………………………. Student signature:
……………………….
DECLARATION OF CONSENT

I hereby declare that I agree that the bachelor's thesis entitled "PORTABLE PERFORMANCE ANALYSIS FRAMEWORK", the source code of the programs and the other contents (graphics, multimedia, test data, etc.) accompanying this thesis may be used within the Faculty of Computer Science.

I also agree that the Faculty of Computer Science of the "Alexandru Ioan Cuza" University may use, modify, reproduce and distribute, for non-commercial purposes, the computer programs, in executable and source form, created by me as part of this bachelor's thesis.

Graduate: [anonimizat]-Adrian Poenaru
Iași, Date: ………………………. Signature:
……………………….
AGREEMENT REGARDING COPYRIGHT OWNERSHIP

The Faculty of Computer Science agrees that the copyright on the computer programs, in executable and source form, shall belong to the author of this thesis, Poenaru Cosmin-Adrian.

This agreement is necessary for the following reasons:
This thesis was written using confidential data from the Bitdefender company. The agreement is necessary because the source code falls under a confidentiality clause between the author of the thesis and Bitdefender SRL.

Iași, Date ……………………….
Graduate Cosmin-Adrian Poenaru
Dean Iftene Adrian: ………………………. Signature:
……………………….
Contents

Motivation
Introduction
1 Context – loading a webpage
1.1 Address translation
1.2 TCP connection
1.3 Web requests
1.4 Displaying the webpage
2 Performance
2.1 Intercepting and modifying traffic
3 Solution
3.1 Browsers – selenium, webdriver
3.2 Loading phase – errors, used resources, time difference
3.3 Results phase – HTML, JSON
Conclusions
Bibliography
Figure reference
Motivation
A tool that provides detailed information about the loading time of a given website (or a list of websites) is a necessity in an online-driven world, where any delay is frowned upon by users and any timeout makes software look bad enough to be uninstalled the moment it becomes a problem.
Measuring the difference between loading times for a given website can be difficult: two main factors have to be taken into account:
user-side – the user's hardware, installed software (including the current file versions), geolocation
server-side – the server's hardware, the web server's load balancing, the server's geolocation (vs. the user's geolocation)
How can you provide an objective result, without neglecting any external factor
that might influence it?
Moreover, can you be sure that the given result truly represents what you are
looking for?
“A computer program does what you tell it to do, not what you want it to do.”
— Arthur Bloch, Murphy’s Law and Other Reasons Why Things Go Wrong
The most remarkable 1-second statistics:
a 1 second delay reduces page views by 11%
a 1 second delay decreases customer satisfaction by 16%
a 1 second delay eats away 7% of the coveted conversion rate
(hostingtribunal.com)
Introduction
The web performance analysis tool addresses the problem of not being able to correctly check whether an application impacts the loading speed of a given website. You are in control of which website(s) you want to test, how accurate the result has to be and which browser to use, and you will be given smart infographics representing the data collected throughout the test.
“How can you provide an objective result, without neglecting any external factor that
might influence it?”
You cannot overlook any factor that might influence your result. One of the greatest problems is that a certain website might have a slow loading time at a given point, but only for a short amount of time – a server-side problem. These values are simply ignored when computing the impact and, if the test is run multiple times, the results will be as accurate as possible.
“Moreover, can you be sure that the given result truly represents what you are looking
for?”
You can be sure that the impact on a website's loading time is accurately computed, but you can never know all the external factors that came into the equation. The goal is to minimize the risk of your data being altered, and for the results to be computed to the best of the script's "knowledge".
as page load time goes from 1 to 3 seconds, the probability of a bounce increases by 32%
as page load time goes from 1 to 5 seconds, the probability of a bounce increases by 90%
as page load time goes from 1 to 6 seconds, the probability of a bounce increases by 106%
as page load time goes from 1 to 10 seconds, the probability of a bounce increases by 123%
(Find Out How You Stack Up to New Industry Benchmarks for Mobile Page Speed, Google, 2017)
Chapter 1
Context – loading a webpage
When entering a URL in the address bar, the browser sends a GET request to the
server that hosts the specified resource, basically asking it for the webpage the user
wants.
The GET method requests a representation of the specified resource. Requests using
GET should only retrieve data and should have no other effect.
(wikipedia.org )
But computers cannot understand that human-readable address, so the browser needs to transform it into a computer-readable one — the URL will be translated into an IP address so the computer can establish a connection with the server hosting the specified web address. Once the URL has been translated into an IP address, the browser will connect to that IP address (where the server is located) and make a set of GET requests, needed to load, render and display the webpage to the user.
Figure 1.1: domain name translation & request
1.1 Address translation
In order to find the location of the webpage, the computer needs to find the
IP address of the server that hosts the website. For the human-readable address to
computer-readable address translation, the browser takes the entered address and runs
a set of operations needed to render the HTML file, located at the desired address — it
executes a Domain Name Service lookup protocol .
The Domain Name Service (DNS) protocol helps Internet users and network de-
vices discover websites using human-readable hostnames, instead of numeric IP
addresses. (ns1.com )
First of all, the browser looks in its own DNS cache, checking whether the user has recently navigated to this website and therefore already knows the address. If not, the browser will ask the operating system to look in its hosts file, in case a manual translation is present. Manual translations can be entered by the user (e.g.: "10.1.2.3 google.com" – when the user navigates to google.com, the browser will be instructed to connect to 10.1.2.3).
The hosts file is one of several system facilities that assists in addressing network
nodes in a computer network. It is a common part of an operating system’s Inter-
net Protocol (IP) implementation, and serves the function of translating human-
friendly hostnames into numeric protocol addresses, called IP addresses, that iden-
tify and locate a host in an IP network. (wikipedia.org )
If the hosts file does not contain the address, the operating system looks in its own DNS cache – has any software installed on this computer tried to connect to the specified address recently? A negative response from this cache triggers a request to the Internet Service Provider's resolver, which checks whether it has seen that request recently from any of the computers it serves — drawing on a broader and broader pool of users that might have requested this translation. As a last step, if the translation has not been completed by now — all of the caches missed — the request is resolved through the top-level domain's name servers.
A top-level domain (TLD) is one of the domains at the highest level in the hierar-
chical Domain Name System of the Internet. (wikipedia.org )
The top-level domain's name server is queried by the ISP's resolver, which caches the response and returns it to the operating system – which, in its turn, caches it too. The translation is finally transmitted to the browser, which also keeps a copy of it.
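The fall-through between the caches described above can be sketched in a few lines of Python. This is a simplified model, not real resolver code: the cache contents and the `resolve_via_tld()` stand-in (with its placeholder IP address) are hypothetical.

```python
# Minimal sketch of the DNS cache fall-through described above.
# The caches are plain dictionaries; resolve_via_tld() stands in for
# the authoritative lookup through the top-level domain's name servers.

def resolve_via_tld(hostname):
    # Hypothetical authoritative answer.
    return "93.184.216.34"

def resolve(hostname, browser_cache, os_cache, hosts_file, isp_cache):
    # 1. Browser's own DNS cache.
    if hostname in browser_cache:
        return browser_cache[hostname]
    # 2. Manual translations from the hosts file.
    if hostname in hosts_file:
        return hosts_file[hostname]
    # 3. Operating system's DNS cache.
    if hostname in os_cache:
        return os_cache[hostname]
    # 4. ISP resolver's cache (shared by many users).
    if hostname in isp_cache:
        return isp_cache[hostname]
    # 5. All caches missed: resolve through the top-level domain.
    ip = resolve_via_tld(hostname)
    # On the way back, every layer caches the answer.
    isp_cache[hostname] = os_cache[hostname] = browser_cache[hostname] = ip
    return ip
```

A second call to `resolve()` for the same hostname is then answered from the browser cache, which is exactly why repeated visits to a website feel faster.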
Figure 1.2: How does a DNS query work
All of this caching is done so that further connections to the same address can be made faster, without having to go out to the internet and ask the ISP or DNS servers to translate the address again. However, the caching is only temporary, because a server's IP address might change over time, and, if cached for too long, a stale entry might redirect you to a wrong address.
A DNS cache […] is a temporary database, maintained by a computer’s operating
system, that contains records of all the recent visits and attempted visits to websites
and other internet domains.
(lifewire.com )
Right now, the browser knows the translation of the address requested by the user, as a computer-readable IP address. All it needs to do now is connect to the server and request the files needed for the webpage to be displayed. This implies a TCP connection to the server: the browser sends a series of packets to the destination in order to create a connection — both sides need to know that the other side wants to communicate.
1.2 TCP connection
The Transmission Control Protocol provides a communication service at an inter-
mediate level between an application program and the Internet Protocol. It provides
host-to-host connectivity at the transport layer of the Internet model.
(wikipedia.org )
The TCP protocol lets you set up a connection to your destination so that a request for the wanted resource can be sent to the server. The computer (client) sends a packet to the translated IP address, the server replies with another packet, and then the computer sends yet another packet. These packets are called SYN (SYNCHRONIZE) — carrying a random sequence number, SYN-ACK (SYNCHRONIZE-ACKNOWLEDGE) — with its own random sequence number and a number acknowledging the client's, and ACK (ACKNOWLEDGE) — a number acknowledging the server's. Together they form the three-way handshake — the foundation of every connection established over TCP.
Figure 1.3: three-way handshake
TCP is the most widely used protocol for data transmission in communication networks such as the internet. It provides process-to-process communication using port numbers. Using a numbering system, it keeps track of the segments that are received and transmitted. It also provides flow control, congestion control and error control, which make it a reliable protocol.

After the three-way handshake has been successfully completed, both sides know that the other side exists, is trying to communicate with them, where it is and how to reach it — a connection is created.
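The sequence-number bookkeeping of the handshake can be sketched with pseudo-packets — dictionaries standing in for real TCP segments. This is an illustration of the numbering rules only, not an implementation of TCP:

```python
import random

def three_way_handshake():
    # Client -> server: SYN carrying a random client sequence number.
    client_seq = random.randrange(2**32)
    syn = {"flags": "SYN", "seq": client_seq}

    # Server -> client: SYN-ACK with its own random sequence number,
    # acknowledging the client's number (client_seq + 1).
    server_seq = random.randrange(2**32)
    syn_ack = {"flags": "SYN-ACK", "seq": server_seq, "ack": syn["seq"] + 1}

    # Client -> server: ACK acknowledging the server's number.
    ack = {"flags": "ACK", "ack": syn_ack["seq"] + 1}
    return syn, syn_ack, ack
```

Each side proves it received the other's packet by echoing the sequence number plus one — which is what makes the handshake a mutual acknowledgement rather than two independent announcements.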
The Hypertext Transfer Protocol (HTTP) is an application-level protocol for dis-
tributed, collaborative, hypermedia information systems. It is a generic, stateless,
protocol which can be used for many tasks beyond its use for hypertext, such as
name servers and distributed object management systems, through extension of its
request methods, error codes and headers
(Hypertext Transfer Protocol – HTTP/1.1, IETF )
Figure 1.4: Hypertext Transfer Protocol
1.3 Web requests
Figure 1.5: HTTP request
HTTP is a connection-less protocol, meaning that after every request, the two parties that communicate disconnect from each other; after the response is ready, the two re-connect to communicate, and once the response has been sent, they disconnect again.
Requests represent a way for the client and server to exchange data. The client
sends the request and the server responds to that request. All requests are sent to
a specific URL. Every time a request is sent, the server is going to take that request,
process it and send the client a response back.
Typical HTTP messages consist of 3 parts: the start line, headers and body — carrying either requests or responses — all of which are, generally, written in plain text. The start line contains the method, the URI and the HTTP version, while the headers allow additional information to be sent along with the request. This might include Host, Accept, Accept-Language and so on. The body may or may not be needed.
Figure 1.6: GET request
A request message is a message sent from the client to a server. As stated before, the start line contains a method — which tells the server what it should do — either a GET command (asking the server to send some data to the client) or a POST command (asking the server to store or process some data sent by the client). There are more HTTP methods, like PUT, DELETE etc., but they are not needed for the sake of this topic.
The next thing in the start line is the path to the requested resource (the URI), including the file extension. The client will send an HTTP command over the created connection, by transmitting a packet using TCP — the GET request — which will instruct the server to send the requested resource. A 200 OK response from the server is a status response indicating that the request has succeeded.
A response message is a message sent to the client, by the server, after receiving the client's request and processing it. The response contains multiple components, like data (if the client requested a file, the data will be that file), a status code and headers.

The status line is composed of the HTTP version (the HTTP version supported by the server) and a status code — a 3-digit number which tells the client whether the request has succeeded (a "200 OK" response) or not, in which case an error code is sent (e.g.: 404 Not Found). These details tell the client how the request has been processed by the server.
Figure 1.7: HTTP response
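The structure described above — start line, headers, empty line — can be made concrete by building a request and parsing a status line by hand. A minimal sketch; the host and path are examples only:

```python
def build_get_request(host, path):
    # Start line (method, URI, HTTP version), then headers, then an
    # empty line; a GET request carries no body.
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Accept: text/html\r\n"
            f"\r\n")

def parse_status_line(status_line):
    # e.g. "HTTP/1.1 200 OK" -> ("HTTP/1.1", 200, "OK")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason

print(build_get_request("example.com", "/index.html"))
print(parse_status_line("HTTP/1.1 404 Not Found"))
```

Because the whole exchange is plain text, a raw HTTP/1.1 conversation can be read (and written) by hand — which is also what makes the protocol so easy to intercept, as discussed later.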
Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext
Transfer Protocol (HTTP). It is used for secure communication over a computer
network, and is widely used on the Internet. In HTTPS, the communication proto-
col is encrypted using Transport Layer Security (TLS), or, formerly, its predecessor,
Secure Sockets Layer (SSL).
(wikipedia.org )
The problem with the original protocols used for the web is that they are insecure: they are not confidential – anyone can intercept and read the transmitted data – and they are unauthenticated – even if the data were encrypted, there would be no guarantee that the right person receives it. As a result, a new, more secure protocol, TLS, was created.
Transport Layer Security (TLS), and its now-deprecated predecessor, Secure Sock-
ets Layer (SSL), are cryptographic protocols designed to provide communications
security over a computer network.
(wikipedia.org )
The TLS protocol encrypts the data in order to secure it — a certificate is used to authenticate the server. This encryption might cause performance issues, especially if installed software tries to intercept the traffic.
Figure 1.8: SSL Certificate
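In Python, the standard library's `ssl` module applies this scheme: the default context requires a valid certificate and checks the hostname, so an interceptor presenting an untrusted certificate is rejected. The `fetch_cipher` helper below is an illustrative sketch (it needs network access to actually run):

```python
import socket
import ssl

# The default context enforces certificate validation and hostname
# checking -- the authentication guarantees that plain HTTP lacks.
context = ssl.create_default_context()

def fetch_cipher(hostname, port=443):
    # Wrap a plain TCP socket in TLS; the handshake (and its
    # performance cost) happens inside wrap_socket.
    with socket.create_connection((hostname, port)) as raw:
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            return tls.version(), tls.cipher()
```

The extra round trips and cryptographic work of the TLS handshake are part of the loading time that the analysis tool has to account for.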
1.4 Displaying the webpage
The machine that hosts the webpage receives a GET request from the browser, takes the path part of the URL and looks for it in its file system. Once the resource is found, the server sends it to the browser, which starts processing it. First of all, the browser parses the HTML file, which might contain links to other resources (URLs), and fetches those resources too (e.g.: stylesheets, JavaScript), using the same kind of GET request. Those resources include information about how to draw the webpage, images, fonts and so on.
A web style sheet is a form of separation of presentation and content for web
design in which the markup (i.e., HTML or XHTML) of a webpage contains the
page’s semantic content and structure, but does not define its visual layout (style).
Instead, the style is defined in an external style sheet file using a style sheet lan-
guage such as CSS or XSLT. This design approach is identified as a ”separation”
because it largely supersedes the antecedent methodology in which a page’s markup
defined both style and structure.
Alongside HTML and CSS, JavaScript is one of the core technologies of the World
Wide Web. JavaScript enables interactive web pages and is an essential part of
web applications. The vast majority of websites use it, and major web browsers
have a dedicated JavaScript engine to execute it.
(wikipedia.org )
As the server sends the requested resources, the browser starts to assemble them into the document, in order to render it as a visible page in the browser's window. It also runs the JavaScript, which adds interactivity to the webpage. The document is exposed as a DOM, a standard way of describing web pages, allowing it to be traversed and manipulated.
The Document Object Model (DOM) is a cross-platform and language-independent
interface that treats an XML or HTML document as a tree structure wherein each
node is an object representing a part of the document. The DOM represents a doc-
ument with a logical tree. Each branch of the tree ends in a node, and each node
contains objects. DOM methods allow programmatic access to the tree
(wikipedia.org )
Those JavaScript files or CSS stylesheets might be loaded from different web servers — in this case, an address translation needs to be completed again for those servers, a TCP handshake will be performed and all of the above steps will be repeated for each and every one of the external servers that hold data which needs to be loaded.
From a performance point of view, all scripts should be placed at the end of a webpage, because they can block the HTML parser from constructing the DOM tree; moving them makes the webpage load a bit faster.
In the layout process, the positions of all the resources that appear on the page are calculated, so, performance-wise, some patterns are to be avoided — for example, layout thrashing. However, most modern frameworks solve this problem internally.
Layout Thrashing is where a web browser has to reflow or repaint a web page many
times before the page is ‘loaded’. […] Depending on the number of reflows and
the complexity of the web page, there is potential to cause significant delay when
loading the page.
(idrsolutions.com )
The final step in loading the webpage is the paint process, which takes all the information from the render tree and produces the visual output in the browser. The best way to make a website load faster is to let the most important parts of the page load first, deferring all the other scripts and stylesheets. After all of these steps, the webpage is displayed in the browser, so the user can see the requested content.
Figure 1.9: The rendering process of a webpage
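The first processing step — parsing the HTML and collecting the external resources the browser must fetch with further GET requests — can be sketched with Python's standard `html.parser`. The sample page below is made up for illustration:

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collects the URLs of external resources referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])

collector = ResourceCollector()
collector.feed('<html><head><link rel="stylesheet" href="style.css">'
               '<script src="app.js"></script></head>'
               '<body><img src="logo.png"></body></html>')
print(collector.resources)  # ['style.css', 'app.js', 'logo.png']
```

Each collected URL is another request — possibly to another server — which is why a page with many external resources loads so much more slowly than its size alone would suggest.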
Chapter 2
Performance
When you navigate to a URL, you do so from any number of potential starting
points. Depending on a number of conditions, such as connection quality and the
device you’re using, your experience could be quite different from another user’s.
(developers.google.com )
There are multiple factors that affect a website's performance, but from the user's point of view, all of them lead to the same unsatisfactory result: the page does not load fast enough, and if there are alternative versions of the webpage on the market, users might be inclined to use one that loads faster. If this is not the case, users will become frustrated and will look into the problem. Should a certain piece of software installed on the computer cause the increase in loading times, that software will be removed in a minute. This is why it is important for software developers to detect such problems and solve them before the users do. Some of these factors are:
Client's or server's internet speed — solvable only by switching to a better internet plan or a better hosting provider
Browser — older browser versions load data differently and more slowly (e.g.: no caching)
Site data — big files or inefficient data manipulation
Server's bandwidth — hosting servers might limit their clients' bandwidth
Client's bandwidth and installed software — applications might hog the bandwidth for their own purposes
2.1 Intercepting and modifying traffic
Operating systems and most browsers provide developers with ways of intercepting, reading and modifying all the network traffic that goes through the computer or browser. Should a developer choose to use those methods, the user's internet speed could decrease instantly. For example, Microsoft Windows exposes a platform called WFP to allow developers to filter network traffic.
Windows Filtering Platform (WFP) is a set of API and system services that pro-
vide a platform for creating network filtering applications. The WFP API allows
developers to write code that interacts with the packet processing that takes place at
several layers in the networking stack of the operating system. Network data can
be filtered and also modified before it reaches its destination.
(docs.microsoft.com )
WFP is used by any installed software that wants to intercept network traffic and modify it. For example, antivirus software could use it to intercept traffic going in and out of the computer, scan the packets for any type of threat (e.g.: malware) and then forward the packets to their intended destination, if no anomalies have been detected. However, until the packets are released to their destination, they are being scanned — increasing the performance impact. If scanning a URI takes 0.5 s, then, say, for a website that contains 10 URIs, a 5 second delay will impact the user.
Figure 2.1: Windows Filtering Platform Architecture Overview
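The arithmetic behind that example is simple but worth making explicit, since it is the core of the performance impact this thesis measures. A toy model, assuming the URIs are scanned one after another rather than in parallel:

```python
def scan_delay(uri_count, seconds_per_scan=0.5):
    # If every URI is held until it has been scanned, and the scans are
    # sequential, the per-URI delays simply add up.
    return uri_count * seconds_per_scan

print(scan_delay(10))  # 5.0 -- the worst case from the example above
```

Real interceptors scan many requests concurrently, so the observed delay is usually smaller — which is precisely why it must be measured rather than computed.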
Browsers, like Google Chrome, expose APIs which allow the same effect, for the browser alone. Chrome's API is called webRequest.
The web request API defines a set of events that follow the life cycle of a web request.
You can use these events to observe and analyze traffic. Certain synchronous events
will allow you to intercept, block, or modify a request.
(developer.chrome.com )
This is used, for example, by ad blocker extensions — they analyze the current webpage, listen for requests and block ads before they can load. The problem is that, for this method to work, all requests have to be intercepted by the ad blocker. Before the requests are released, the extension has to analyze the traffic — and how long this takes depends entirely on the developer. If one chooses to linearly compare the current request's URI against a database of ten million known ad-serving URIs, and does this for every request that pops up during the loading phase of the current webpage, the user's webpage will take forever to load. A quick comparison (be it a behavioral one or a filter-list lookup) should have no visible effect on the load performance of the webpage.
Figure 2.2: chrome.webRequest
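The difference between a slow and a fast filter-list check comes down to the data structure. A sketch in Python (the blocklist entries are made up): a hash-based set makes each per-request lookup constant time on average, whereas a linear scan over a list would touch every entry.

```python
# Hypothetical filter list; real ad blockers use far larger ones.
blocklist = {
    "ads.example.com/banner.js",
    "tracker.example.net/pixel.gif",
}

def should_block(uri, blocked=blocklist):
    # Set membership is O(1) on average, so even ten million known
    # URIs would add no visible per-request delay -- unlike a linear
    # scan, whose cost grows with the size of the list.
    return uri in blocked

print(should_block("ads.example.com/banner.js"))  # True
print(should_block("example.org/app.js"))         # False
```

This is the "quick comparison" mentioned above: the cost the user pays per request stays constant no matter how large the filter list grows.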
For HTTPS requests to be intercepted, a piece of software needs to decrypt the traffic, read it, modify it if desired, then re-encrypt it and send it on to its original destination, through a process called Man In The Middle.
In order to be able to sniff into the connection, mitmproxy acts as a certificate
authority, however, not a very trustworthy one: Instead of issuing certificates to
actual persons or organizations, mitmproxy dynamically generates certificates to
whatever hostname is needed for a connection.
(heckel.xyz )
This process evidently has an impact on the overall performance, since data is not sent directly from one participant to the other, but goes through a third participant which, besides intercepting the traffic and reading or modifying it, also needs to take care of the traffic encryption and decryption.
Figure 2.3: Man in the middle (MITM) attack
Browsers, however, have almost no protection against MITM attacks, so any application could perform such an attack, even on HTTPS websites; software developers should therefore not take the user's trust for granted and manipulate the traffic to their own ends.
Chapter 3
Solution
Implemented in Python 3.x, the solution is a desktop application that lets the user choose how most of the test is run. Users interact with the application via the Graphical User Interface or with command-line arguments. The application lets the user select a list of websites, an input file, the number of runs and the browser, and then starts the actual test run. After the test's initialization phase (which includes creating a result folder, an HTML file and a JSON file), the webpages are opened in the selected browser, and the following tasks are performed (in the background):
the load time for the current webpage is recorded and compared with the average of all the other load times (either from local files or from an online database) – the impact (the percentage difference between the current run's load time and the average) and the time difference are written to the result files (the aforementioned HTML and JSON files)
resource usage for a wanted list of processes is recorded (RAM, CPU, handles and threads) and updated
the current webpage is checked for any loading errors – divided into two categories:
– server-side errors — common errors caused by the server; there is nothing the user can do about them
– client-side errors — common errors caused by a piece of software installed on the client's machine
the data collected in the steps above is written to the result files
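The impact computation from the first task can be sketched as a pure function. This is an illustrative sketch of the formula described above (percentage difference between the current run and the average), not the tool's actual code; the sample load times are made up:

```python
def average(load_times_ms):
    """Average of the previous runs' load times, in milliseconds."""
    return sum(load_times_ms) / len(load_times_ms)

def impact(current_ms, baseline_ms):
    # Percentage difference between the current run's load time and the
    # baseline average; positive means the page got slower.
    return (current_ms - baseline_ms) / baseline_ms * 100

baseline = average([1200, 1300, 1250])   # previous runs: 1250 ms average
print(round(impact(1500, baseline), 1))  # 20.0 -- 20% slower than average
```

Running more iterations tightens the baseline average, which is why the tool lets the user trade test duration for accuracy.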
The tool exposes a Graphical User Interface which lets the user customize the test run, as follows:
Number of runs — how many times to run the test – more runs provide a more
accurate result
First page — first page, as a number, from the input file (noted as CSV path below)
– the test will start from this page number
Last page — last page, as a number, from the input file (noted as CSV path below)
– the test will end at this page number
CSV path — path on the disk (or SMB share ) of the CSV file – this file contains all
the websites that need to be navigated to
Start day — day of the month, as a number – which day the test will start
Start hour — hour of the day, as a number – at what hour the test will start
Browser — you can choose from 3 browsers: Google Chrome ,Mozilla Firefox or
Internet Explorer – this browser will be used to navigate on the selected webpages
Karma submit — a checkbox which, if marked, will submit the collected results
to an online database, for further reference
Karma retrieve — a checkbox which, if marked, will use the aforementioned
database to retrieve the collected results and compute the time average
Take screenshots — checkbox which, if marked, will take a screenshot of every
website that has been navigated to, in order to determine if the page is correctly
displayed
Start — takes the above options, verifies and applies them, then runs the test (if the options have been validated)
Figure 3.1: Main view of the application (GUI)
Tkinter is Python’s de-facto standard GUI (Graphical User Interface) package. It is
a thin object-oriented layer on top of Tcl/Tk. Tkinter is not the only GuiProgram-
ming toolkit for Python. It is however the most commonly used one. CameronLaird
calls the yearly decision to keep TkInter ‘one of the minor traditions of the Python
world.’
(python.org )
The Graphical User Interface is implemented in Tkinter, which allows programmers to build a simple but powerful GUI, exposing useful options to users.
All of these options are also available if the user chooses to run the tool from the
command line, as described below:
-N — number of test runs
-F and -L — first and last pages of the test
-D and -H — day and hour to start the test
-C — CSV path, containing the webpages
-S and -R — submit and retrieve results from the online database
-B — browser to use for navigating to the selected webpages
-P — whether or not to take a screenshot (Picture) of every website
Figure 3.2: Running the application with command-line arguments
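The command-line surface above can be sketched with Python's standard `argparse` module. The flag names follow the list above, but the types, defaults and help strings are assumptions, not the tool's actual definitions:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Portable performance analysis framework (sketch)")
parser.add_argument("-N", type=int, default=1, help="number of test runs")
parser.add_argument("-F", type=int, default=1, help="first page of the test")
parser.add_argument("-L", type=int, help="last page of the test")
parser.add_argument("-D", type=int, help="day of the month to start the test")
parser.add_argument("-H", type=int, help="hour of the day to start the test")
parser.add_argument("-C", help="path to the CSV file containing the webpages")
parser.add_argument("-S", action="store_true",
                    help="submit results to the online database")
parser.add_argument("-R", action="store_true",
                    help="retrieve results from the online database")
parser.add_argument("-B", default="chrome", help="browser to use")
parser.add_argument("-P", action="store_true",
                    help="take a screenshot of every website")

# Example invocation: 3 runs over pages.csv in Firefox, submitting results.
args = parser.parse_args(["-N", "3", "-C", "pages.csv", "-B", "firefox", "-S"])
print(args.N, args.C, args.B, args.S)  # 3 pages.csv firefox True
```

Keeping the CLI and the GUI backed by the same option set means a test configured interactively can be reproduced verbatim from a script or a scheduled task.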
“GUIs tend to impose a large overhead on every single piece of software, even the smallest,
and this overhead completely changes the programming environment. Small utility programs
are no longer worth writing. Their functions, instead, tend to get swallowed up into omnibus
software packages.”
— Neal Stephenson, In the Beginning…Was the Command Line
3.1 Browsers – selenium, webdriver
Selenium automates browsers. That’s it! What you do with that power is entirely
up to you. Primarily, it is for automating web applications for testing purposes,
but is certainly not limited to just that. Boring web-based administration tasks
can (and should!) be automated as well. Selenium has the support of some of the
largest browser vendors who have taken (or are taking) steps to make Selenium a
native part of their browser. It is also the core technology in countless other browser
automation tools, APIs and frameworks.
(seleniumhq.org )
Built in 2004 by Jason Huggins, Selenium started as a JavaScript library for manipulating browsers in an automated way, with minimal user input. It supports a wide range of browsers, like Google Chrome, Mozilla Firefox, Internet Explorer, Microsoft Edge, Safari and Opera; multiple operating systems, among which Microsoft Windows, Apple OS X and Linux; and most of the popular programming languages: C#, Haskell, Java, Python and so on. A portable testing framework, it only needs an executable file – the webdriver (provided by the browser's manufacturer) – and the automation can start right away. WebDriver refers to the language bindings and implementations of browser-controlling code; each browser has native support for automation, with Selenium making direct calls to it.
Figure 3.3: Selenium running Google Chrome , navigating on google.com
Open-source and designed specifically for web-testing, selenium is used in the
analysis tool in order to navigate to the desired webpages, using any browser the user
feels comfortable with, while gathering useful information about the websites, loading
times and errors.
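The navigate-and-time step described above could be sketched as follows. This is an illustrative helper, not the tool's actual code: it assumes only that the driver object exposes a Selenium-style get(url) method, which blocks until the page has loaded.

```python
import time

def timed_navigation(driver, url):
    """Navigate to `url` with a Selenium-style driver and return the
    elapsed wall-clock time in seconds.

    Illustrative sketch: `driver` is assumed to expose a `get(url)`
    method, as Selenium webdrivers do; real code would also catch
    selenium's WebDriverException around the call.
    """
    start = time.perf_counter()
    driver.get(url)  # blocks until the page has finished loading
    return time.perf_counter() - start

# Usage with a real browser (assuming chromedriver is on the PATH):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   seconds = timed_navigation(driver, "https://google.com")
#   driver.quit()
```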
3.2 Loading phase – errors, used resources, time difference
The loading phase begins as soon as all of the user's preferences are validated,
if the user chooses to start immediately. Otherwise, it waits for the specified day
and hour.
First of all, two result files are created: an HTML file and a JSON file, which will
hold any data collected throughout the test run. The HTML file will hold data about
the machine (operating system, IPv4, IPv6, public IP, RAM, CPU, hard disk space)
and about the tested product (product version, all files' versions, going recursively
through the product's directory) and a color legend. The JSON file contains the same
data, except for the color legend.
After the initialization of the result files, a dictionary is created which will hold
information about the current webpage (timestamp of when the page was loaded, load
time, sitename, navigation status, used resources) and overall information (number of
loaded webpages, test start time, test stop time).
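The two dictionaries described above might look roughly like this; the key names and helper function are illustrative assumptions, not the tool's actual schema.

```python
import datetime

def new_page_record(sitename):
    """Hypothetical per-page record, mirroring the fields listed above."""
    return {
        "timestamp": datetime.datetime.now().isoformat(),  # when the page loaded
        "sitename": sitename,
        "load_time": None,          # seconds, filled in after navigation
        "navigation_status": None,  # e.g. "OK", "timeout", "blocked"
        "used_resources": {},       # e.g. {"RAM": ..., "CPU": ...}
    }

# Hypothetical overall bookkeeping for the whole test run:
overall = {"loaded_webpages": 0, "test_start": None, "test_stop": None}

record = new_page_record("google.com")
record["load_time"] = 1.42
overall["loaded_webpages"] += 1
```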
The actual loading of the webpages starts after opening the CSV file, which contains
all the webpages that the script needs to navigate to. At every iteration, the application
checks if there is more than 100 MB free on the disk, and if not, the test stops.
The navigation starts after an instance of the selected browser (via a remote webdriver)
is opened (if the browser is installed on the system), and the stopwatch registers the
loading time. After the webpage is completely loaded, the app updates the dictionary
with the resources used by the product, in order to check for a memory leak, spike or
any other unwanted event.
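The per-iteration free-disk check could be implemented with Python's standard shutil module; a minimal sketch, in which the function name and default path are assumptions:

```python
import shutil

def enough_disk_space(path=".", minimum_mb=100):
    """Return True if the disk holding `path` has more than `minimum_mb`
    megabytes free. Sketch of the 100 MB check described above; the real
    tool may query the disk differently.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes > minimum_mb * 1024 * 1024

# At every iteration, the test would stop when this returns False:
#   if not enough_disk_space():
#       stop_test()
```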
A page’s performance gets progressively worse over time. This is possibly a symp-
tom of a memory leak. A memory leak is when a bug in the page causes the page
to progressively use more and more memory over time. A page’s performance is
consistently bad. This is possibly a symptom of memory bloat. Memory bloat is
when a page uses more memory than is necessary for optimal page speed. A page’s
performance is delayed or appears to pause frequently. This is possibly a symptom
of frequent garbage collections. Garbage collection is when the browser reclaims
memory. The browser decides when this happens. During collections, all script
execution is paused. So if the browser is garbage collecting a lot, script execution is
going to get paused a lot. (Fix memory problems, Kayce Basques )
Figure 3.4: for loop – loading the webpages
A time difference and an impact are computed using previous results (either
retrieved from an online database or from prior test runs) and are written
in the HTML result file. The time difference is computed by subtracting the current
load time from the average loading time, ignoring the marginal results. The impact is
calculated using the Relative change formula, as shown below:
RelativeChange(x, x_reference) = Actual change / x_reference = (x − x_reference) / x_reference    (3.1)
— Using and Understanding Mathematics: A Quantitative Reasoning Approach
Bennett, Jeffrey; Briggs, William
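Formula (3.1) translates directly into code; a small sketch with illustrative load-time values:

```python
def relative_change(x, x_reference):
    """Relative change of `x` against `x_reference`, as in formula (3.1):
    the actual change divided by the reference value."""
    return (x - x_reference) / x_reference

# e.g. a page that loaded in 3.0 s against a 2.4 s reference average:
impact = relative_change(3.0, 2.4)  # about 0.25, i.e. 25% slower
```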
Every navigation is handled by a method which, in the beginning, checks if the take
a screenshot on any website option has been checked, in which case a screenshot is saved
in the results path, using Python's Pillow (PIL fork) library.
The Python Imaging Library adds image processing capabilities to your Python
interpreter. This library provides extensive file format support, an efficient internal
representation, and fairly powerful image processing capabilities. The core image
library is designed for fast access to data stored in a few basic pixel formats. It
should provide a solid foundation for a general image processing tool.
(pillow.readthedocs.io )
The tool also checks if certain processes that might indicate a software problem are
running (e.g. werfault.exe); if software misbehavior is detected, it is also written
in the results files, recording the webpage number, name and timestamp, and a
screenshot is taken.
werfault.exe is used for Windows Error Reporting. It is a feature that allows Mi-
crosoft to track and address errors relating to the operating system, Windows fea-
tures, and applications. It gives you the option to send data about errors to Mi-
crosoft and to receive information about solutions.
(microsoft.com )
After the software check, the app checks the webpage for timeouts and errors.
First, if the navigation took longer than three minutes, the event is marked as unsuccessful:
any webpage should load in less than three minutes, otherwise several problems arise.
If the navigation is shorter than three minutes, the page's source
is checked for the website being blocked by antivirus software and for common
webpage loading errors, such as ERR_CONNECTION_CLOSED,
ERR_CONNECTION_RESET, ERR_EMPTY_RESPONSE,
ERR_INCOMPLETE_CHUNKED_ENCODING, ERR_SSL_PROTOCOL_ERROR,
ERR_INVALID_CHUNKED_ENCODING, ERR_CONTENT_DECODING_FAILED,
ERR_SSL_VERSION_OR_CIPHER_MISMATCH, ERR_CERT_AUTHORITY_INVALID,
ERR_NAME_NOT_RESOLVED, ERR_CONNECTION_TIMED_OUT,
ERR_CONNECTION_REFUSED, ERR_NAME_RESOLUTION_FAILED,
DNS_PROBE_FINISHED_NXDOMAIN, ERR_TOO_MANY_REDIRECTS and so on. Should
any error appear, it is marked in the final results files for further investigation.
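Scanning the page source for these error codes could be sketched as below; the codes are taken from the list above, while the function itself is an illustrative assumption, not the tool's actual code.

```python
# A subset of the Chromium-style error codes listed above:
COMMON_LOAD_ERRORS = (
    "ERR_CONNECTION_CLOSED", "ERR_CONNECTION_RESET", "ERR_EMPTY_RESPONSE",
    "ERR_SSL_PROTOCOL_ERROR", "ERR_NAME_NOT_RESOLVED",
    "ERR_CONNECTION_TIMED_OUT", "ERR_CONNECTION_REFUSED",
    "DNS_PROBE_FINISHED_NXDOMAIN", "ERR_TOO_MANY_REDIRECTS",
)

def find_load_errors(page_source):
    """Return every known error code found in the loaded page's source."""
    return [code for code in COMMON_LOAD_ERRORS if code in page_source]

# With Selenium, page_source would come from driver.page_source.
```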
Figure 3.5: Page error on ettoday.com – ERR_CONNECTION_RESET
Used resources are computed using psutil and Windows Management Instrumentation,
which provide a clear view of how a certain process occupies the system's resources.
These are later depicted in a graph, for a better understanding of memory management.
The monitoring happens on a separate thread, in order not to interfere with the actual
navigation, and can monitor any process the user wants, for multiple resource
metrics (RAM, Handles, CPU, Threads, Private bytes, Virtual size and so on). These provide
good insight into how the computer's resources are used when navigating to a certain
webpage (e.g. cryptominers can be detected if there is a spike in resource usage). Moreover,
the user can define a threshold for certain processes, and the thread will issue a warning
message if the process goes over the threshold. The graph also has a red line representing
the user-defined threshold.
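The threshold-warning logic could be sketched as follows, with illustrative sample values; the function name and message format are assumptions, not the tool's actual code.

```python
def check_thresholds(samples, threshold):
    """Given resource samples for a monitored process (e.g. RAM in MB),
    return a warning message for every sample over the user-defined
    threshold. Illustrative sketch of the warning logic described above.
    """
    return [
        f"sample {i}: {value} exceeds threshold {threshold}"
        for i, value in enumerate(samples)
        if value > threshold
    ]

# Hypothetical RAM samples (MB) against a 500 MB threshold:
warnings = check_thresholds([120, 480, 510, 495], threshold=500)
# one warning, for sample 2 (510 MB)
```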
Figure 3.6: Graph representing RAM used by the vsserv.exe process throughout the test
psutil (process and system utilities) is a cross-platform library for retrieving in-
formation on running processes and system utilization (CPU, memory, disks, net-
work, sensors) in Python. It is useful mainly for system monitoring, profiling and
limiting process resources and management of running processes. It implements
many functionalities offered by UNIX command line tools such as: ps, top, lsof…
(pypi.org )
Windows Management Instrumentation (WMI) is the infrastructure for manage-
ment data and operations on Windows-based operating systems. You can write
WMI scripts or applications to automate administrative tasks on remote comput-
ers but WMI also supplies management data to other parts of the operating system
and products, for example System Center Operations Manager, formerly Microsoft
Operations Manager (MOM), or Windows Remote Management (WinRM).
(microsoft.com )
Should any navigation error appear, for example a webdriver crash, an empty
page or a werfault-detected problem, it is considered a script error and is marked in the
final results files as such, should the user want to further investigate the issue. Any
error report in the HTML result file (user-side, server-side, script-caused etc.) will be
accompanied by a relevant screenshot of the moment it happened.
Figure 3.7: Script error on junbi-tracker.com – webdriver crash
While navigation errors could indicate a programmer error, the selenium library
might not handle some websites well, depending on the language of the
website (i.e. Unicode), geolocation, the website's resources and so on. Manual input is
required in those types of situations to understand why, how and under what conditions
the error occurs.
Some common exceptions that might occur during the test run:
InsecureCertificateException: navigation caused the user agent to hit a certificate
warning, which is caused by an invalid or expired TLS certificate
RemoteDriverServerException: thrown when the server does not respond because
the described capabilities are not proper
TimeoutException: thrown when there is not enough time for a command to be
completed
UnexpectedAlertPresentException: thrown when an unexpected alert appears
WebDriverException: base exception class; all other exceptions inherit from this
class
— as per katalon.com, Exceptions in Selenium: Have you ever “met” them?
While catching these exceptions is the right thing to do, it should be done only after an
investigation has been conducted to find the root cause of the problem.
“Catching System.Exception is nearly always the wrong thing to do as well.”
(Choosing the right type of exception to throw, Krzysztof Cwalina )
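The principle can be illustrated with Python's built-in exceptions standing in for Selenium's TimeoutException (the pattern is identical): catch the narrow, investigated exception class instead of the base class. The navigate stub below is hypothetical and always times out, purely for illustration.

```python
def navigate(url):
    """Hypothetical stand-in for a Selenium navigation that times out."""
    raise TimeoutError(f"navigation to {url} timed out")

def navigate_with_retry(url, attempts=2):
    """Retry only the investigated failure mode; let unknown errors surface."""
    last_error = "no attempts made"
    for _ in range(attempts):
        try:
            return navigate(url)
        except TimeoutError as exc:  # narrow, investigated failure mode
            last_error = str(exc)
        # deliberately NOT `except Exception`: catching the base class
        # would silently swallow bugs that were never investigated
    return f"gave up after {attempts} attempts: {last_error}"

result = navigate_with_retry("https://example.com")
```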
3.3 Results phase – HTML, JSON
Built on top of d3.js and stack.gl, plotly.js is a high-level, declarative charting
library. plotly.js ships with over 40 chart types, including scientific charts, 3D
graphs, statistical charts, SVG maps, financial charts, and more.
(github.com )
After all the selected webpages have been navigated to and everything has been recorded
in the results files, the process monitor runs one more time, in order to check the
resources used after the test has finished. The script then reviews the results and
creates an HTML table with all the necessary data (page number, sitename, actual
sitename, load time, load time difference, load status, current time and a screenshot,
if needed).
Figure 3.8: First 5 webpages in the HTML results file
Final statistics are also available, like number of loaded websites, total load time,
average load time (per webpage), number of problems and test start and end time.
Figure 3.9: Final statistics in the HTML results file
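The final statistics above reduce to simple aggregation; a sketch with hypothetical per-page load times (seconds):

```python
# Hypothetical load times collected during a run:
load_times = [1.0, 2.0, 3.0, 2.0]

loaded_websites = len(load_times)
total_load_time = sum(load_times)
average_load_time = total_load_time / loaded_websites

print(loaded_websites, total_load_time, average_load_time)  # 4 8.0 2.0
```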
In the end, the HTML results file is populated with a few graphs, containing
used resources and the performance evolution of the tested webpages. These graphs are
created from two kinds of sources:
files on the hard disk: in-depth process monitoring writes the collected results
(every 5 seconds) to the hard disk as JSON files, in order to load them into memory
as a dictionary, for better data manipulation
virtual files: test data (number of pages, load times) is kept in memory, because
of the small amount of space used and for code simplicity
In the final step, these datasets are read and a JavaScript-like array is constructed
for the plotly graph. The x-axis represents the timestamp, while the
y-axis represents the used resources for the resources' graph, respectively the load time
for the websites' graph. These graphs give the user more insight into the evolution
of used resources and load time, signaling problems if values spike.
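Building a plotly trace from the collected samples could be sketched as follows; the sample timestamps, values and key names are illustrative, not the tool's actual data.

```python
import json

# Hypothetical samples written by the monitor: timestamp -> RAM used (MB)
samples = {
    "2019-06-01 12:00:00": 512,
    "2019-06-01 12:00:05": 530,
    "2019-06-01 12:00:10": 524,
}

# The x/y structure that a plotly.js scatter trace expects:
trace = {
    "x": list(samples.keys()),    # timestamps on the x-axis
    "y": list(samples.values()),  # used resource on the y-axis
    "type": "scatter",
    "name": "RAM (MB)",
}
trace_json = json.dumps(trace)    # embedded into the HTML results file
```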
Figure 3.10: Graph representing load time evolution for google.com
Every resource and every website has a separate graph, to make the data easier to
understand. For example, CPU usage may rise in the case of complex websites, because the
browser will try to process certain scripts, whereas RAM usage could rise when the
browser renders the page. The load time graph can be used to check whether the current
impact (be it a better or a worse one) is a temporary spike, a server-side problem,
a client-side problem, or whether it follows a pattern – in which case, the root cause can be
determined quicker.
Figure 3.11: Load time pattern for pixabay.com
Patterns can be caused by browser caching, search engine caching and so on.
A Web cache is an information technology for the temporary storage of Web docu-
ments, such as Web pages, images, and other types of Web multimedia, to reduce
server lag. A Web cache system stores copies of documents passing through it.
(wikipedia.org )
Conclusions
Webpage loading is a long process from a computational point of view – however,
from the user's point of view, it is, as it should be and remain, a quick one. The
end user does not see all that happens when he clicks a link – neither the hostname-to-IP-address
translation, nor the DNS lookup protocol, and surely he cannot see (or maybe
even understand) the TCP three-way handshake. The web requests and responses are
mostly hidden from the user (should he choose to look them up, they are not a secret
and can be easily discovered), as is the DOM. The user just enters a URL, and the
computer does the rest.
However, this process can turn into an unpleasant one. Should the user
experience a performance delay caused by software installed on his
computer, be it a browser, a browser extension, an antivirus or any other
kind of software, the user will surely remove the faulty application.
Operating systems and browsers expose methods for developers to intercept and
modify network traffic – and if this is not done in the right way, the fast way, a
significant performance drop will surface.
Developers should pay close attention when creating applications that
might have an impact on how fast webpages load for the user.
This application constitutes a portable and fast solution to this problem. Users,
developers, testers and anyone interested can run it to check the impact
of certain software installed on the computer on their loading times – be it on one
website or a list of websites, with any browser they want and so on. Created with a graphical
user interface for the home user and with command-line argument support
for the technical user, the proposed solution provides great insight into website loading
performance – generated graphs, crafted statistics and saving the results to an online
database are all in there – a portable performance-analysis framework.
Bibliography
Arthur Bloch, Murphy’s Law and Other Reasons Why Things Go Wrong, 1977
How Speed Affects Your Website, hostingtribunal.com/blog/how-speed-affects-website, 2018
Find Out How You Stack Up to New Industry Benchmarks for Mobile Page Speed, Google, 2017
Hypertext Transfer Protocol, ro.wikipedia.org/wiki/Hypertext_Transfer_Protocol, 2004
DNS Protocol Explained, ns1.com/resources/dns-protocol, 2018
hosts (file), en.wikipedia.org/wiki/Hosts_(file), 2004
Top-level domain, en.wikipedia.org/wiki/Top-level_domain, 2004
DNS Caching and How It Makes Your Internet Better, www.lifewire.com/what-is-a-dns-cache-817514, 2019
Transmission Control Protocol, en.wikipedia.org/wiki/Transmission_Control_Protocol, 2004
Hypertext Transfer Protocol – HTTP/1.1, IETF, 1999
HTTPS, en.wikipedia.org/wiki/HTTPS, 2004
Transport Layer Security, en.wikipedia.org/wiki/Transport_Layer_Security, 2004
Style sheet (web development), en.wikipedia.org/wiki/Style_sheet_(web_development), 2004
JavaScript, en.wikipedia.org/wiki/JavaScript, 2004
Document Object Model, en.wikipedia.org/wiki/Document_Object_Model, 2004
Beware Javascript thrashing!, blog.idrsolutions.com/2014/08/beware-javascript-layout-thrashing/, 2014
Why performance matters, developers.google.com/web/fundamentals/performance/why-performance-matters/, 2019
Windows Filtering Platform, docs.microsoft.com/en-us/windows/desktop/fwp/windows-filtering-platform-start-page, 2018
chrome.webRequest, developer.chrome.com/extensions/webRequest, ????
How To: Use mitmproxy to read and modify HTTPS traffic, blog.heckel.xyz/2013/07/01/how-to-use-mitmproxy-to-read-and-modify-https-traffic-of-your-phone/, 2013
TkInter, wiki.python.org/moin/TkInter, 2019
In the Beginning… Was the Command Line, Neal Stephenson, 1999
Selenium – Web Browser Automation, www.seleniumhq.org/, 2004
Fix Memory Problems, Kayce Basques, 2019
Using and Understanding Mathematics: A Quantitative Reasoning Approach, Bennett, Jeffrey; Briggs, William, 2005
Pillow (PIL Fork), pillow.readthedocs.io, 2016
Windows Error Reporting, social.technet.microsoft.com/Forums/windows/en-US/4032dc41-c813-4058-bba2-27317b38bf63/werfaultexe, 2013
psutil, pypi.org/project/psutil/, 2009
Windows Management Instrumentation, docs.microsoft.com/en-us/windows/desktop/wmisdk/wmi-start-page, 2018
Exceptions in Selenium: Have you ever “met” them?, katalon.com/resources-center/blog/selenium-exceptions/, 2018
Choosing the Right Type of Exception to Throw, Krzysztof Cwalina, 2006
plotly, github.com/plotly/plotly.js, 2016
Web caching, en.wikipedia.org/wiki/Web_cache, 2004
Figure reference
Fig. 1.1 – Domain name translation & request, ibm.com . . . 4
Fig. 1.2 – How does a DNS query work, totaluptime.com . . . 6
Fig. 1.3 – Three-way handshake, wikipedia.org . . . 7
Fig. 1.4 – Hypertext transfer protocol, University of Southampton . . . 8
Fig. 1.5 – HTTP request, ubidots.com . . . 9
Fig. 1.6 – GET request, tutorialspoint.com . . . 9
Fig. 1.7 – HTTP response, mozilla.org . . . 10
Fig. 1.8 – SSL Certificate, exabytes.my . . . 11
Fig. 1.9 – The rendering process of a webpage, medium.com . . . 13
Fig. 2.1 – Windows Filtering Platform Architecture Overview, docs.microsoft.com . . . 15
Fig. 2.2 – chrome.webRequest, developer.chrome.com . . . 16
Fig. 2.3 – Man in the middle (MITM) attack, imperva.com . . . 17
Fig. 3.1 – Main view of the application (GUI) . . . 20
Fig. 3.2 – Running the application with command-line arguments . . . 21
Fig. 3.3 – Selenium running Google Chrome, navigating on google.com . . . 22
Fig. 3.4 – for loop – loading the webpages . . . 24
Fig. 3.5 – Page error on ettoday.com – ERR_CONNECTION_RESET . . . 25
Fig. 3.6 – Graph representing RAM used by vsserv.exe process . . . 26
Fig. 3.7 – Script error on junbi-tracker.com – webdriver crash . . . 27
Fig. 3.8 – First 5 webpages in the HTML results file . . . 28
Fig. 3.9 – Final statistics in the HTML results file . . . 28
Fig. 3.10 – Graph representing load time evolution for google.com . . . 29
Fig. 3.11 – Load time pattern for pixabay.com . . . 29